Volume 78, Issue 2, pp 394–424

An External-Memory Algorithm for String Graph Construction

  • Paola Bonizzoni
  • Gianluca Della Vedova
  • Yuri Pirola
  • Marco Previtali
  • Raffaella Rizzi


Some recent results (Bauer et al. in Algorithms in bioinformatics, Springer, Berlin, pp 326–337, 2012; Cox et al. in Algorithms in bioinformatics, Springer, Berlin, pp 214–224, 2012; Rosone and Sciortino in The nature of computation. Logic, algorithms, applications, Springer, Berlin, pp 353–364, 2013) have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly by computing the Burrows–Wheeler transform of the input strings. The motivation for those results stems from bioinformatics, where large numbers of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches currently used to tackle this problem are memory-intensive, which does not bode well given the ongoing increase in the availability of genomic data. One data structure used in genome assembly is the string graph, where vertices correspond to samples and arcs represent overlaps between pairs of samples. In this paper we address an open problem of Simpson and Durbin (Bioinformatics 26(12):i367–i373, 2010): to design an external-memory algorithm to compute the string graph.
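
To make the overlap relation concrete, the following is a minimal in-memory sketch in Python, not the external-memory algorithm of the paper. It connects read i to read j whenever a suffix of read i equals a prefix of read j. The function names, the `min_len` threshold, and the toy reads are assumptions made for illustration. The sketch uses a quadratic all-pairs comparison and omits the removal of transitive arcs that the string graph of Myers applies.

```python
from collections import defaultdict


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Return the length of the longest proper suffix of `a` that is also
    a proper prefix of `b`, provided it is at least `min_len`; else 0."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # Proper overlaps only: neither read may be entirely covered.
    max_len = min(len(a), len(b)) - 1
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def naive_overlap_graph(reads, min_len=3):
    """Build {i: {j: overlap_length}} over all ordered pairs of reads.

    Illustrative quadratic construction held fully in memory; it does not
    remove transitive arcs.
    """
    graph = defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_length(a, b, min_len)
                if k:
                    graph[i][j] = k
    return dict(graph)


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # hypothetical reads
    print(naive_overlap_graph(toy_reads, min_len=3))
```

On the toy reads, read 0 overlaps read 1 on `GTAC` and read 1 overlaps read 2 on `ACGG`. Because every pair of reads is compared in main memory, this approach does not scale to large read sets, which is the limitation that motivates an external-memory construction.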


External memory algorithms · Burrows–Wheeler transform · String graphs · Genome assembly



The authors acknowledge the support of the MIUR PRIN 2010-2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi” code 2010LYA9RH, of the Cariplo Foundation grant 2013-0955 (Modulation of anti cancer immune response by regulatory non-coding RNAs), of the FA 2013 grant “Metodi algoritmici e modelli: aspetti teorici e applicazioni in bioinformatica” code 2013-ATE-0281, and of the FA 2014 grant “Algoritmi e modelli computazionali: aspetti teorici e applicazioni nelle scienze della vita” code 2014-ATE-0382. The authors would like to thank the anonymous reviewers for their insightful comments.


  1. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
  2. Aggarwal, A., Vitter, J.: The Input/Output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
  3. Alizadeh, F., Karp, R., Newberg, L., Weisser, D.: Physical mapping of chromosomes: a combinatorial problem in molecular biology. Algorithmica 13, 52–76 (1995)
  4. Alizadeh, F., Karp, R., Weisser, D., Zweig, G.: Physical mapping of chromosomes using unique probes. J. Comput. Biol. 2, 159–184 (1995)
  5. Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
  6. Bauer, M., Cox, A., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. 483, 134–148 (2013)
  7. Bauer, M., Cox, A., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 326–337. Springer, Berlin, Germany (2012)
  8. Beerenwinkel, N., Beretta, S., Bonizzoni, P., Dondi, R., Pirola, Y.: Covering pairs in directed acyclic graphs. Comput. J. 58(7), 1673–1686 (2015)
  9. Benson, D., Clark, K., Karsch-Mizrachi, I., et al.: GenBank. Nucleic Acids Research 42(D1), D32–D37 (2014)
  10. Beretta, S., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R.: Modeling alternative splicing variants from RNA-Seq data with isoform graphs. J. Comput. Biol. 16(1), 16–40 (2014)
  11. Blum, A., Jiang, T., Li, M., Tromp, J., Yannakakis, M.: Linear approximation of shortest superstrings. J. ACM 41, 630–647 (1994)
  12. Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: Constructing string graphs in external memory. In: Algorithms in Bioinformatics, LNCS, vol. 8701, pp. 311–325. Springer, Berlin, Germany (2014)
  13. Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: LSG: An external-memory tool to compute string graphs for NGS data assembly. J. Comput. Biol. 23(3), 137–149 (2016). doi:10.1089/cmb.2015.0172
  14. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital Systems Research Center (1994)
  15. Chen, Y., Dong, G., Han, J., Wah, B., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 323–334. VLDB Endowment (2002)
  16. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013)
  17. Cox, A., Bauer, M., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
  18. Cox, A., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.: Comparing DNA sequence collections by direct comparison of compressed text indexes. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 214–224. Springer, Berlin, Germany (2012)
  19. Demetrescu, C., Finocchi, I., Ribichini, A.: Trading off space for passes in graph streaming problems. ACM Trans. Algorithms 6(1), 6 (2009)
  20. Diestel, R.: Graph Theory. Graduate Texts in Mathematics, 3rd edn. Springer, Heidelberg (2005)
  21. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
  22. Henzinger, M., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 50, pp. 107–118. AMS, Boston, MA, USA (1999)
  23. Lacroix, V., Sammeth, M., Guigo, R., Bergeron, A.: Exact transcriptome reconstruction from short sequence reads. In: Algorithms in Bioinformatics, LNCS, vol. 5251, pp. 50–63. Springer, Berlin, Heidelberg (2008)
  24. Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read alignment via bi-directional BWT. In: Bioinformatics and Biomedicine (BIBM ’09), pp. 31–36. IEEE Computer Society, Washington, DC, USA (2009)
  25. McKenna, A., Hanna, M., Banks, E., et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9), 1297–1303 (2010)
  26. Myers, E.: The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005)
  27. Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J., Brown, C.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. PNAS 109(33), 13272–13277 (2012)
  28. Peng, Y., Leung, H.C., Yiu, S.-M., Chin, F.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
  29. Rosone, G., Sciortino, M.: The Burrows–Wheeler transform between data compression and combinatorics on words. In: The Nature of Computation. Logic, Algorithms, Applications, LNCS, vol. 7921, pp. 353–364. Springer, Berlin, Heidelberg (2013)
  30. Sedgewick, R.: Algorithms in Java. Addison-Wesley Professional, Reading (2002)
  31. Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, LNCS, vol. 1179, pp. 11–22. Springer, Berlin, Heidelberg (1996)
  32. Simpson, J., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
  33. Simpson, J., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)
  34. Simpson, J., Wong, K., Jackman, S., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
  35. Valiant, L.: General purpose parallel architectures. In: Handbook of Theoretical Computer Science, vol. A, pp. 943–973. MIT Press, Cambridge, MA, USA (1990)
  36. Vitter, J.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001)
  37. Vitter, J., Shriver, E.: Algorithms for parallel memory, I: two-level memories. Algorithmica 12(2), 110–147 (1994)

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. DISCo, Università degli Studi di Milano-Bicocca, Milan, Italy
