FSG: Fast String Graph Construction for De Novo Assembly of Reads Data

  • Paola Bonizzoni
  • Gianluca Della Vedova
  • Yuri Pirola
  • Marco Previtali
  • Raffaella Rizzi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9683)

Abstract

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform (BWT). We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph.

The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
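
As a rough illustration of the idea in the abstract, the following Python sketch builds a toy BWT-based index over a small read collection and uses backward search to report suffix-prefix overlaps, the relations from which string graph edges are derived. It is not the FSG algorithm. The naive suffix sorting, the full suffix array kept in memory, the function names `build_index` and `overlaps`, and the `min_overlap` parameter are all assumptions made for this example. Unlike FSG, the query also reads the characters of the read being extended.

```python
from bisect import bisect_right


def build_index(reads):
    """Naively build a BWT-based index over reads, each terminated by '$'.

    Illustrative only: sorting all suffixes directly is far too slow and
    memory-hungry for real sequencing data.
    """
    text = "".join(r + "$" for r in reads)
    # Suffix array by direct sorting ('$' sorts before A, C, G, T in ASCII).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index 0 wraps to the final '$').
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k] = number of occurrences of c in bwt[:k].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # starts[j] = offset of read j in the text, used to map a suffix to its read.
    starts, pos = [], 0
    for r in reads:
        starts.append(pos)
        pos += len(r) + 1
    return sa, bwt, first, occ, starts


def overlaps(read_id, reads, index, min_overlap):
    """Map each other read to the longest proper suffix of reads[read_id]
    (of length >= min_overlap) that is a prefix of that other read."""
    sa, bwt, first, occ, starts = index
    read = reads[read_id]
    lo, hi = 0, len(bwt)  # half-open interval of suffixes prefixed by the query
    found = {}
    for length in range(1, len(read)):  # proper suffixes only
        c = read[-length]
        # Backward-search step: extend the query by one character to the left.
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            break
        if length >= min_overlap:
            for row in range(lo, hi):
                if bwt[row] == "$":  # the match sits at the start of a read
                    other = bisect_right(starts, sa[row]) - 1
                    if other != read_id:
                        found[other] = length  # later (longer) matches overwrite
    return found


reads = ["ACGTAC", "TACGGA", "GGATTT"]
index = build_index(reads)
for i in range(len(reads)):
    print(i, overlaps(i, reads, index, min_overlap=3))
```

On this toy input the sketch reports that read 0 overlaps read 1 by 3 characters (`TAC`) and read 1 overlaps read 2 by 3 characters (`GGA`). A real string graph construction would additionally remove contained reads and transitive edges, which this example does not attempt.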

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Paola Bonizzoni (1)
  • Gianluca Della Vedova (1)
  • Yuri Pirola (1)
  • Marco Previtali (1)
  • Raffaella Rizzi (1)
  1. DISCo, University of Milano-Bicocca, Milan, Italy
