Abstract
Some recent results (Bauer et al. in Algorithms in bioinformatics, Springer, Berlin, pp. 326–337, 2012; Cox et al. in Algorithms in bioinformatics, Springer, Berlin, pp. 214–224, 2012; Rosone and Sciortino in The nature of computation. Logic, algorithms, applications, Springer, Berlin, pp. 353–364, 2013) have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly by computing the Burrows–Wheeler transform of the input strings. The motivation for those results comes from bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches currently used to tackle this problem are memory-intensive, which does not bode well given the ongoing growth in available genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs connect pairs of overlapping samples. In this paper we address an open problem of Simpson and Durbin (Bioinformatics 26(12):i367–i373, 2010): to design an external-memory algorithm to compute the string graph.
References
Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
Aggarwal, A., Vitter, J.: The Input/Output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Alizadeh, F., Karp, R., Newberg, L., Weisser, D.: Physical mapping of chromosomes: a combinatorial problem in molecular biology. Algorithmica 13, 52–76 (1995)
Alizadeh, F., Karp, R., Weisser, D., Zweig, G.: Physical mapping of chromosomes using unique probes. J. Comput. Biol. 2, 159–184 (1995)
Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
Bauer, M., Cox, A., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. 483, 134–148 (2013)
Bauer, M., Cox, A., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 326–337. Springer, Berlin, Germany (2012)
Beerenwinkel, N., Beretta, S., Bonizzoni, P., Dondi, R., Pirola, Y.: Covering pairs in directed acyclic graphs. Comput. J. 58(7), 1673–1686 (2015)
Benson, D., Clark, K., Karsch-Mizrachi, I., et al.: GenBank. Nucleic Acids Res. 42(D1), D32–D37 (2014)
Beretta, S., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R.: Modeling alternative splicing variants from RNA-Seq data with isoform graphs. J. Comput. Biol. 16(1), 16–40 (2014)
Blum, A., Jiang, T., Li, M., Tromp, J., Yannakakis, M.: Linear approximation of shortest superstrings. J. ACM 41, 630–647 (1994)
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: Constructing string graphs in external memory. In: Algorithms in Bioinformatics, LNCS, vol. 8701, pp. 311–325. Springer, Berlin, Germany (2014)
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: LSG: An external-memory tool to compute string graphs for NGS data assembly. J. Comput. Biol. 23(3), 137–149 (2016). doi:10.1089/cmb.2015.0172
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital Systems Research Center (1994)
Chen, Y., Dong, G., Han, J., Wah, B., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 323–334. VLDB Endowment (2002)
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013)
Cox, A., Bauer, M., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
Cox, A., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.: Comparing DNA sequence collections by direct comparison of compressed text indexes. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 214–224. Springer, Berlin, Germany (2012)
Demetrescu, C., Finocchi, I., Ribichini, A.: Trading off space for passes in graph streaming problems. ACM Trans. Algorithms 6(1), 6 (2009)
Diestel, R.: Graph Theory. Graduate Texts in Mathematics, 3rd edn. Springer, Heidelberg (2005)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Henzinger, M., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 50, pp. 107–118. AMS, Boston, MA, USA (1999)
Lacroix, V., Sammeth, M., Guigo, R., Bergeron, A.: Exact transcriptome reconstruction from short sequence reads. In: Algorithms in Bioinformatics, LNCS, vol. 5251, pp. 50–63. Springer, Berlin, Heidelberg (2008)
Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read alignment via bi-directional BWT. In: Bioinformatics and Biomedicine (BIBM ’09), pp. 31–36. IEEE Computer Society, Washington, DC, USA (2009)
McKenna, A., Hanna, M., Banks, E., et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9), 1297–1303 (2010)
Myers, E.: The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005)
Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J., Brown, C.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. PNAS 109(33), 13272–13277 (2012)
Peng, Y., Leung, H.C., Yiu, S.-M., Chin, F.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
Rosone, G., Sciortino, M.: The Burrows–Wheeler transform between data compression and combinatorics on words. In: The Nature of Computation. Logic, Algorithms, Applications, LNCS, vol. 7921, pp. 353–364. Springer, Berlin, Heidelberg (2013)
Sedgewick, R.: Algorithms in Java. Addison-Wesley Professional, Reading (2002)
Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, LNCS, vol. 1179, pp. 11–22. Springer, Berlin, Heidelberg (1996)
Simpson, J., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
Simpson, J., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)
Simpson, J., Wong, K., Jackman, S., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Valiant, L.: General purpose parallel architectures. In: Handbook of Theoretical Computer Science, vol. A, pp. 943–973. MIT Press, Cambridge, MA, USA (1990)
Vitter, J.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001)
Vitter, J., Shriver, E.: Algorithms for parallel memory, I: two-level memories. Algorithmica 12(2), 110–147 (1994)
Acknowledgments
The authors acknowledge the support of the MIUR PRIN 2010-2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi” code 2010LYA9RH, of the Cariplo Foundation grant 2013-0955 (Modulation of anti cancer immune response by regulatory non-coding RNAs), of the FA 2013 grant “Metodi algoritmici e modelli: aspetti teorici e applicazioni in bioinformatica” code 2013-ATE-0281, and of the FA 2014 grant “Algoritmi e modelli computazionali: aspetti teorici e applicazioni nelle scienze della vita” code 2014-ATE-0382. The authors would like to thank the anonymous reviewers for their insightful comments.
Cite this article
Bonizzoni, P., Della Vedova, G., Pirola, Y. et al. An External-Memory Algorithm for String Graph Construction. Algorithmica 78, 394–424 (2017). https://doi.org/10.1007/s00453-016-0165-4
DOI: https://doi.org/10.1007/s00453-016-0165-4