Abstract
Despite recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose order and/or orientation (i.e., strand) in the genome are unknown. There exist various scaffold assembly methods, which attempt to determine the order and orientation of scaffolds along the genome chromosomes. Some of these methods (e.g., those based on FISH physical mapping or chromatin conformation capture) can infer the order of scaffolds, but not necessarily their orientation. This leads to a special case of the scaffold orientation problem (i.e., deducing the orientation of each scaffold) with a known order of the scaffolds. We address the problem of orienting ordered scaffolds as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations). We formalize this problem using the notion of a scaffold graph (i.e., a graph whose vertices correspond to the assembled contigs or scaffolds and whose edges represent connections between them). We prove that this problem is \(\textsf {NP}\)-hard, and present a polynomial-time algorithm for solving its special case, where the orientation of each scaffold is imposed relative to at most two other scaffolds. We further develop a fixed-parameter tractable algorithm for the general case of the orientation of ordered scaffolds problem.
Notes
We remark that contigs can be viewed as a special type of scaffolds with no gaps.
It will be seen later that any assembly realization in this case is conflicting.
It can be easily seen that a realization of \({\mathbb {A}}\) may exist only if \({\mathbb {A}}\) is proper.
More generally, \({\mathbb {O}}\) may be a multiset whose elements have real positive multiplicities (weights).
Recall that a vertex is an articulation vertex if its removal from the graph increases the number of connected components.
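The definition of an articulation vertex in the last note can be checked directly, as in the following minimal sketch. It assumes an undirected graph stored as an adjacency dictionary; the names `count_components` and `is_articulation` are illustrative and do not come from the article. This brute-force check only mirrors the definition and is not meant as an efficient method.

```python
def count_components(adj, removed=None):
    """Count connected components of an undirected graph, optionally ignoring one vertex."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        # Depth-first traversal of the component containing `start`.
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
    return components


def is_articulation(adj, v):
    """A vertex is an articulation vertex if removing it increases the component count."""
    return count_components(adj, removed=v) > count_components(adj)


# Hypothetical example: the path a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

In the path example, only the middle vertex `b` separates the graph when removed.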
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This article is part of the topical collection “String Processing and Combinatorial Algorithms” guest edited by Simone Faro.
Appendix: Pseudocodes
In the algorithms below we do not explicitly describe the function OrConsCount, which takes 4 arguments:
1. a subgraph c from \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) with 1 or 2 vertices;
2. a hash table so with scaffolds as keys and their orientations as values;
3. a set of orientation imposing assembly points \({\mathbb {O}}\);
4. an assembly \({\mathbb {A}}\)
and counts the assembly points from \({\mathbb {O}}\) that have consistent orientation with \({\mathbb {A}}\) in the case where the scaffold(s) corresponding to vertices from c have the orientation given by so in \({\mathbb {A}}\). With simple hash-table-based preprocessing of \({\mathbb {A}}\) and \({\mathbb {O}}\), this function runs in \({\mathcal {O}}\left( n\right)\) time, where n is the number of assembly points in \({\mathbb {O}}\) involving scaffolds that correspond to vertices in c. Thus, the total running time for all invocations of this function is \({\mathcal {O}}\left( |{\mathbb {O}}|\right)\) (i.e., \({\mathcal {O}}\left( |{\mathbb {S}}({\mathbb {A}})|^2\right)\)).
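One possible shape for such a function is sketched below in Python. It is a simplified illustration under several assumptions that are not taken from the article: the assembly is encoded as a list of scaffold names in their known order, each assembly point is a tuple `(s1, o1, s2, o2)` with orientations `"+"`, `"-"`, or `None` (no orientation imposed on that scaffold), and a point read in the reverse direction swaps its scaffolds and flips both strands. The helper `preprocess` and the two lookup tables it returns are hypothetical and stand in for the hash-table-based preprocessing, so the sketch takes them in place of \({\mathbb {O}}\) and \({\mathbb {A}}\) directly.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+", None: None}


def preprocess(assembly, points):
    """Build the lookup tables used by or_cons_count (hypothetical encoding of A and O)."""
    position = {s: i for i, s in enumerate(assembly)}
    index = defaultdict(list)
    for s1, o1, s2, o2 in points:
        # Key each point by the scaffolds whose orientation it imposes.
        key = frozenset(s for s, o in ((s1, o1), (s2, o2)) if o is not None)
        index[key].append((s1, o1, s2, o2))
    return position, index


def or_cons_count(c, so, index, position):
    """Count points on the scaffolds in c that agree with the orientations in so."""
    count = 0
    for s1, o1, s2, o2 in index.get(frozenset(c), []):
        # Read the point in the direction in which its scaffolds occur in the
        # assembly; reversing a point swaps the scaffolds and flips both strands.
        if position[s1] > position[s2]:
            s1, o1, s2, o2 = s2, FLIP[o2], s1, FLIP[o1]
        if all(o is None or so[s] == o for s, o in ((s1, o1), (s2, o2))):
            count += 1
    return count


# Hypothetical toy input: three ordered scaffolds and four assembly points.
assembly = ["a", "b", "c"]
points = [
    ("a", "+", "b", "+"),
    ("b", "-", "a", "-"),   # the first point read from the opposite strand
    ("b", "+", "c", "-"),
    ("c", "+", "a", None),  # imposes an orientation on c only
]
position, index = preprocess(assembly, points)
```

Under this encoding, each call only scans the points stored under the key for c, which is in line with the per-call cost proportional to the number of points involving the scaffolds of c.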
Cite this article
Aganezov, S., Avdeyev, P., Alexeev, N. et al. Orienting Ordered Scaffolds: Complexity and Algorithms. SN COMPUT. SCI. 3, 308 (2022). https://doi.org/10.1007/s42979-022-01198-7
DOI: https://doi.org/10.1007/s42979-022-01198-7