An External-Memory Algorithm for String Graph Construction

Abstract

Some recent results (Bauer et al. in Algorithms in bioinformatics, Springer, Berlin, pp. 326–337, 2012; Cox et al. in Algorithms in bioinformatics, Springer, Berlin, pp. 214–224, 2012; Rosone and Sciortino in The nature of computation. Logic, algorithms, applications, Springer, Berlin, pp. 353–364, 2013) have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows–Wheeler transform of the input strings. The motivations for those results stem from bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches currently used to tackle this problem are memory-intensive, which does not bode well given the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and each arc connects two overlapping samples. In this paper we address an open problem of Simpson and Durbin (Bioinformatics 26(12):i367–i373, 2010): to design an external-memory algorithm to compute the string graph.
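
To make the notion of arcs between overlapping samples concrete, the following is a minimal in-memory sketch that connects read i to read j whenever a suffix of read i equals a prefix of read j. It is a naive quadratic illustration only, not the external-memory algorithm of the paper; the function names and the `min_len` threshold are assumptions introduced here, and a full string graph in the sense of Myers (2005) additionally removes transitive arcs, which this sketch does not do.

```python
from itertools import permutations


def longest_overlap(a: str, b: str, min_len: int) -> int:
    """Return the length of the longest proper suffix of `a` that is also a
    prefix of `b`, or 0 if no such overlap has at least `min_len` characters."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # Try the longest candidate first so the first match is the longest one.
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_arcs(reads, min_len=3):
    """Map each ordered pair (i, j) of overlapping reads to its overlap length.

    Hypothetical helper: compares every ordered pair of reads, so it only
    suits tiny inputs that fit in main memory.
    """
    arcs = {}
    for i, j in permutations(range(len(reads)), 2):
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            arcs[(i, j)] = k
    return arcs
```

For the three reads `ACGTAC`, `GTACGG` and `ACGGTT` with `min_len=3`, this yields the arcs (0, 1) and (1, 2), each with an overlap of length 4. The all-pairs comparison is exactly the kind of main-memory cost that motivates index-based and external-memory approaches.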

References

  1. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
  2. Aggarwal, A., Vitter, J.: The Input/Output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
  3. Alizadeh, F., Karp, R., Newberg, L., Weisser, D.: Physical mapping of chromosomes: a combinatorial problem in molecular biology. Algorithmica 13, 52–76 (1995)
  4. Alizadeh, F., Karp, R., Weisser, D., Zweig, G.: Physical mapping of chromosomes using unique probes. J. Comput. Biol. 2, 159–184 (1995)
  5. Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
  6. Bauer, M., Cox, A., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. 483, 134–148 (2013)
  7. Bauer, M., Cox, A., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 326–337. Springer, Berlin, Germany (2012)
  8. Beerenwinkel, N., Beretta, S., Bonizzoni, P., Dondi, R., Pirola, Y.: Covering pairs in directed acyclic graphs. Comput. J. 58(7), 1673–1686 (2015)
  9. Benson, D., Clark, K., Karsch-Mizrachi, I., et al.: GenBank. Nucleic Acids Res. 42(D1), D32–D37 (2014)
  10. Beretta, S., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R.: Modeling alternative splicing variants from RNA-Seq data with isoform graphs. J. Comput. Biol. 16(1), 16–40 (2014)
  11. Blum, A., Jiang, T., Li, M., Tromp, J., Yannakakis, M.: Linear approximation of shortest superstrings. J. ACM 41, 630–647 (1994)
  12. Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: Constructing string graphs in external memory. In: Algorithms in Bioinformatics, LNCS, vol. 8701, pp. 311–325. Springer, Berlin, Germany (2014)
  13. Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: LSG: An external-memory tool to compute string graphs for NGS data assembly. J. Comput. Biol. 23(3), 137–149 (2016). doi:10.1089/cmb.2015.0172
  14. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital Systems Research Center (1994)
  15. Chen, Y., Dong, G., Han, J., Wah, B., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 323–334. VLDB Endowment (2002)
  16. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013)
  17. Cox, A., Bauer, M., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
  18. Cox, A., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.: Comparing DNA sequence collections by direct comparison of compressed text indexes. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 214–224. Springer, Berlin, Germany (2012)
  19. Demetrescu, C., Finocchi, I., Ribichini, A.: Trading off space for passes in graph streaming problems. ACM Trans. Algorithms 6(1), 6 (2009)
  20. Diestel, R.: Graph Theory. Graduate Texts in Mathematics, 3rd edn. Springer, Heidelberg (2005)
  21. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
  22. Henzinger, M., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 50, pp. 107–118. AMS, Boston, MA, USA (1999)
  23. Lacroix, V., Sammeth, M., Guigo, R., Bergeron, A.: Exact transcriptome reconstruction from short sequence reads. In: Algorithms in Bioinformatics, LNCS, vol. 5251, pp. 50–63. Springer, Berlin, Heidelberg (2008)
  24. Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read alignment via bi-directional BWT. In: Bioinformatics and Biomedicine (BIBM ’09), pp. 31–36. IEEE Computer Society, Washington, DC, USA (2009)
  25. McKenna, A., Hanna, M., Banks, E., et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9), 1297–1303 (2010)
  26. Myers, E.: The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005)
  27. Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J., Brown, C.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. PNAS 109(33), 13272–13277 (2012)
  28. Peng, Y., Leung, H.C., Yiu, S.-M., Chin, F.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
  29. Rosone, G., Sciortino, M.: The Burrows–Wheeler transform between data compression and combinatorics on words. In: The Nature of Computation. Logic, Algorithms, Applications, LNCS, vol. 7921, pp. 353–364. Springer, Berlin, Heidelberg (2013)
  30. Sedgewick, R.: Algorithms in Java. Addison-Wesley Professional, Reading (2002)
  31. Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, LNCS, vol. 1179, pp. 11–22. Springer, Berlin, Heidelberg (1996)
  32. Simpson, J., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
  33. Simpson, J., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)
  34. Simpson, J., Wong, K., Jackman, S., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
  35. Valiant, L.: General purpose parallel architectures. In: Handbook of Theoretical Computer Science, vol. A, pp. 943–973. MIT Press, Cambridge, MA, USA (1990)
  36. Vitter, J.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001)
  37. Vitter, J., Shriver, E.: Algorithms for parallel memory, I: two-level memories. Algorithmica 12(2), 110–147 (1994)

Acknowledgments

The authors acknowledge the support of the MIUR PRIN 2010-2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi” code 2010LYA9RH, of the Cariplo Foundation grant 2013-0955 (Modulation of anti cancer immune response by regulatory non-coding RNAs), of the FA 2013 grant “Metodi algoritmici e modelli: aspetti teorici e applicazioni in bioinformatica” code 2013-ATE-0281, and of the FA 2014 grant “Algoritmi e modelli computazionali: aspetti teorici e applicazioni nelle scienze della vita” code 2014-ATE-0382. The authors would like to thank the anonymous reviewers for their insightful comments.

Author information

Corresponding author

Correspondence to Marco Previtali.

About this article

Cite this article

Bonizzoni, P., Della Vedova, G., Pirola, Y. et al. An External-Memory Algorithm for String Graph Construction. Algorithmica 78, 394–424 (2017). https://doi.org/10.1007/s00453-016-0165-4

Keywords

  • External memory algorithms
  • Burrows–Wheeler transform
  • String graphs
  • Genome assembly