
Orienting Ordered Scaffolds: Complexity and Algorithms

  • Original Research
  • SN Computer Science

Abstract

Despite recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds) whose order and/or orientation (i.e., strand) in the genome are unknown. Various scaffold assembly methods attempt to determine the order and orientation of scaffolds along the genome chromosomes. Some of these methods (e.g., those based on FISH physical mapping or chromatin conformation capture) can infer the order of scaffolds, but not necessarily their orientation. This leads to a special case of the scaffold orientation problem (i.e., deducing the orientation of each scaffold) with a known order of the scaffolds. We address the problem of orienting ordered scaffolds as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations). We formalize this problem using the notion of a scaffold graph (i.e., a graph where vertices correspond to the assembled contigs or scaffolds and edges represent connections between them). We prove that this problem is \(\textsf {NP}\)-hard and present a polynomial-time algorithm for solving its special case, where the orientation of each scaffold is imposed relative to at most two other scaffolds. We further develop a fixed-parameter tractable algorithm for the general case of the orientation of ordered scaffolds problem.
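To make the optimization objective concrete, the following is a minimal sketch under simplifying assumptions. It is not the formalization or any of the algorithms developed in the article: the evidence tuples, the scoring rule (sum of the weights of all satisfied pieces of evidence), and the name `best_orientation` are hypothetical, and pair evidence is assumed to be already expressed relative to the known scaffold order. The search is exhaustive over all 2^n orientation assignments, so it only illustrates what is being maximized.

```python
from itertools import product


def best_orientation(scaffolds, single_evidence, pair_evidence):
    """Exhaustively search all 2^n orientation assignments (illustration only).

    single_evidence: iterable of (scaffold, orientation, weight)
    pair_evidence:   iterable of (scaffold_a, orient_a, scaffold_b, orient_b, weight)
    Orientations are the strings "+" and "-". The scoring model is an
    assumption made for this sketch, not the article's definition.
    """
    best_score, best_assignment = float("-inf"), None
    for signs in product("+-", repeat=len(scaffolds)):
        assignment = dict(zip(scaffolds, signs))
        # Weight of single-scaffold evidence that agrees with the assignment.
        score = sum(w for s, o, w in single_evidence if assignment[s] == o)
        # Weight of pair evidence whose both orientations agree.
        score += sum(
            w
            for a, oa, b, ob, w in pair_evidence
            if assignment[a] == oa and assignment[b] == ob
        )
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


if __name__ == "__main__":
    scaffolds = ["s1", "s2", "s3"]  # known order, unknown orientation
    single = [("s1", "+", 1.0)]
    pairs = [
        ("s1", "+", "s2", "-", 2.0),
        ("s2", "+", "s3", "+", 1.5),
        ("s2", "-", "s3", "+", 1.0),
    ]
    print(best_orientation(scaffolds, single, pairs))
```

In this toy instance the two pieces of evidence about `s2` and `s3` conflict, so no assignment satisfies everything and the weights decide which evidence is kept.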


Notes

  1. We remark that contigs can be viewed as a special type of scaffold with no gaps.

  2. It will be seen later that any assembly realization in this case is conflicting.

  3. It can be easily seen that a realization of \({\mathbb {A}}\) may exist only if \({\mathbb {A}}\) is proper.

  4. More generally, \({\mathbb {O}}\) may be a multiset whose elements have real positive multiplicities (weights).

  5. Recall that a vertex is an articulation vertex if its removal from the graph increases the number of connected components.
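The definition in note 5 can be checked directly, as in the sketch below. It assumes an undirected graph given as a dictionary mapping each vertex to its neighbors; the function names are illustrative. It removes each vertex in turn and recounts components, which takes quadratic time; this is a naive reading of the definition, not the linear-time depth-first search method usually used in practice.

```python
def count_components(adj, removed=None):
    """Count connected components of an undirected graph, ignoring `removed`."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # iterative depth-first traversal
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
    return components


def articulation_vertices(adj):
    """Vertices whose removal increases the number of connected components."""
    base = count_components(adj)
    return {v for v in adj if count_components(adj, removed=v) > base}


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(articulation_vertices(path))
```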


Author information

Corresponding author

Correspondence to Max A. Alekseyev.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “String Processing and Combinatorial Algorithms” guest edited by Simone Faro.

Appendix: Pseudocodes

In the algorithms below we do not explicitly describe the function OrConsCount, which takes four arguments:

  1. a subgraph c from \({{\,\mathrm{\mathsf {COG}}\,}}({\mathbb {O}})\) with 1 or 2 vertices;

  2. a hash table so with scaffolds as keys and their orientations as values;

  3. a set of orientation imposing assembly points \({\mathbb {O}}\);

  4. an assembly \({\mathbb {A}}\)

and counts the assembly points from \({\mathbb {O}}\) that have consistent orientation with \({\mathbb {A}}\) in the case where the scaffold(s) corresponding to vertices from c were to have the orientation from so in \({\mathbb {A}}\). With simple hash-table-based preprocessing of \({\mathbb {A}}\) and \({\mathbb {O}}\), this function runs in \({\mathcal {O}}\left( n\right)\) time, where n is the number of assembly points in \({\mathbb {O}}\) involving scaffolds that correspond to vertices in c. So, the total running time for all invocations of this function is \({\mathcal {O}}\left( |{\mathbb {O}}|\right)\) (i.e., \({\mathcal {O}}\left( |{\mathbb {S}}({\mathbb {A}})|^2\right)\)).
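Since OrConsCount is left undescribed, the following is only one possible reading of it, sketched under strong assumptions. Here an assembly point is a tuple of `(scaffold, orientation)` pairs, the assembly is reduced to a dictionary from scaffolds to orientations, and a point counts as consistent when every scaffold it mentions has the required orientation after the orientations from `so` are substituted for the scaffolds in `c`. These representations, the helper `index_points`, and the extra `index` argument (the result of the hash-table preprocessing) are hypothetical and do not come from the article.

```python
from collections import defaultdict


def index_points(points):
    """Preprocessing: map each scaffold to the indices of points mentioning it."""
    index = defaultdict(list)
    for i, point in enumerate(points):
        for scaffold, _ in point:
            index[scaffold].append(i)
    return index


def or_cons_count(c, so, points, index, assembly):
    """Count points touching scaffolds in `c` that are consistent with `assembly`
    when the scaffolds in `c` take their orientations from `so`.

    Assumption of this sketch: a scaffold absent from `assembly` (and not in
    `c`) makes every point that mentions it inconsistent.
    """
    # Only points involving scaffolds of c are visited; a set removes duplicates
    # when one point mentions both scaffolds of a two-vertex c.
    touched = {i for scaffold in c for i in index.get(scaffold, ())}
    count = 0
    for i in touched:
        if all(
            (so[s] if s in c else assembly.get(s)) == o
            for s, o in points[i]
        ):
            count += 1
    return count


if __name__ == "__main__":
    points = [
        (("s1", "+"), ("s2", "-")),
        (("s2", "-"), ("s3", "+")),
    ]
    assembly = {"s1": "+", "s2": "+", "s3": "+"}
    index = index_points(points)
    print(or_cons_count(("s2",), {"s2": "-"}, points, index, assembly))
```

Because each call visits only the points indexed under the scaffolds of `c`, its cost is proportional to the number of such points, which matches the running time stated above.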

Figures a to d (pseudocode listings)

Cite this article

Aganezov, S., Avdeyev, P., Alexeev, N. et al. Orienting Ordered Scaffolds: Complexity and Algorithms. SN COMPUT. SCI. 3, 308 (2022). https://doi.org/10.1007/s42979-022-01198-7

  • DOI: https://doi.org/10.1007/s42979-022-01198-7
