Orientation of Ordered Scaffolds

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10562)

Abstract

Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose order and/or orientation (i.e., strand) in the genome are unknown. There exist various scaffold assembly methods, which attempt to determine the order and orientation of scaffolds along the genome chromosomes. Some of these methods (e.g., based on FISH physical mapping, chromatin conformation capture, etc.) can infer the order of scaffolds, but not necessarily their orientation. This leads to a special case of the scaffold orientation problem (i.e., deducing the orientation of each scaffold) with a known order of the scaffolds.

We address the problem of orientation of ordered scaffolds as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations). We formalize this problem within the previously introduced framework for comparative analysis and merging of scaffold assemblies (CAMSA). We prove that this problem is \(\mathsf {NP}\)-hard, and further present a polynomial-time algorithm for solving its special case, where the orientation of each scaffold is imposed relative to at most two other scaffolds. This lays the foundation for a follow-up FPT algorithm for the general case. The proposed algorithms are implemented in the CAMSA software version 2.
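To make the optimization objective concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and not part of CAMSA. The function name `best_orientation` and the tuple layouts for evidence are hypothetical, and the sketch assumes that every pairwise evidence item is already normalized so that its first scaffold precedes its second in the known order, with consistency meaning that both orientations match. It enumerates all \(2^n\) orientations, so it is only usable for tiny inputs, in line with the \(\mathsf {NP}\)-hardness of the general problem.

```python
from itertools import product


def best_orientation(order, single_evidence, pair_evidence):
    """Exhaustively search orientations ('+' or '-') of ordered scaffolds.

    Hypothetical input layout (an assumption of this sketch):
      single_evidence: list of (scaffold, orientation, weight)
      pair_evidence:   list of (scaffold_a, orient_a, scaffold_b, orient_b, weight)
    Returns (best_score, best_assignment), where best_assignment maps each
    scaffold to '+' or '-'.
    """
    best_score, best_assignment = float("-inf"), None
    for signs in product("+-", repeat=len(order)):
        assignment = dict(zip(order, signs))
        # Weight of single-scaffold evidence that agrees with the assignment.
        score = sum(w for s, o, w in single_evidence if assignment[s] == o)
        # Weight of pairwise evidence for which both orientations agree.
        score += sum(
            w
            for a, oa, b, ob, w in pair_evidence
            if assignment[a] == oa and assignment[b] == ob
        )
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


if __name__ == "__main__":
    order = ["s1", "s2", "s3"]
    single = [("s1", "+", 1.0)]
    pairs = [
        ("s1", "+", "s2", "-", 2.0),
        ("s2", "+", "s3", "+", 1.5),
        ("s2", "-", "s3", "+", 1.0),
    ]
    print(best_orientation(order, single, pairs))
```

In the toy instance, the two evidence items that orient s2 negatively outweigh the single item that orients it positively, so the search settles on s2 being reversed.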

Notes

  1. We remark that contigs can be viewed as a special type of scaffolds with no gaps.

  2. It can be easily seen that a realization of \(\mathbb {A}\) may exist only if \(\mathbb {A}\) is proper.

  3. \(\deg (v)\) denotes the degree of a vertex v, i.e., the number of edges (counted with multiplicity) incident to v.

  4. More generally, \(\mathbb {O}\) may be a multiset whose elements have real positive multiplicities (weights).

  5. Recall that a vertex is an articulation vertex if its removal from the graph increases the number of connected components.
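As an illustration of the last note, articulation vertices can be found directly from the definition: remove each vertex in turn and compare the number of connected components. The Python sketch below does exactly that. The names `count_components` and `articulation_vertices` and the adjacency-dictionary representation are assumptions of this example, not notation from the paper. It runs in quadratic time; the classical depth-first-search low-point method achieves linear time.

```python
def count_components(adj, removed=None):
    """Count connected components of an undirected graph given as
    {vertex: iterable of neighbours}, optionally ignoring one vertex."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # iterative depth-first traversal
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
    return components


def articulation_vertices(adj):
    """Return the vertices whose removal increases the component count."""
    base = count_components(adj)
    return {v for v in adj if count_components(adj, removed=v) > base}


if __name__ == "__main__":
    # A path a - b - c: only the middle vertex is an articulation vertex.
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(articulation_vertices(path))
```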

Acknowledgements

The authors thank the anonymous reviewers for their suggestions and comments that helped to improve the exposition.

The work is supported by the National Science Foundation under grant No. IIS-1462107. The work of SA is also partially supported by the National Science Foundation under grant No. CCF-1053753 and by the National Institutes of Health under grant No. U24CA211000.

Author information

Corresponding author

Correspondence to Sergey Aganezov.

Appendix. Pseudocodes

In the algorithms below we do not explicitly describe the function OrConsCount, which takes 4 arguments:

  1. a subgraph c from \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) with 1 or 2 vertices;

  2. a hash table so with scaffolds as keys and their orientations as values;

  3. a set of orientation imposing assembly points \(\mathbb {O}_o\);

  4. an assembly \(\mathbb {A}\)

and counts the assembly points from \(\mathbb {O}\) that have consistent orientation with \(\mathbb {A}\) in the case where the scaffold(s) corresponding to vertices from c were to have the orientation from so in \(\mathbb {A}\). With simple hash-table based preprocessing of \(\mathbb {A}\) and \(\mathbb {O}\) (which can be done in \(\mathcal {O}\left( k\log (k)\right) \) time, where \(k=\max \{|\mathbb {O}|, |\mathbb {S}(\mathbb {A})|\}\)), this function runs in \(\mathcal {O}\left( n\right) \) time, where n is the number of assembly points in \(\mathbb {O}\) involving scaffolds that correspond to vertices in c. Hence, the total running time for all invocations of this function will be \(\mathcal {O}\left( |\mathbb {O}|\right) \).
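Since OrConsCount is not spelled out, the following Python sketch shows one way such a counting function could look. It is an illustration under stated assumptions, not the CAMSA implementation: the names `build_index` and `or_cons_count`, the representation of an assembly point as a tuple `(scaffold_a, orient_a, scaffold_b, orient_b)`, the representation of the assembly as a dictionary from scaffold to orientation, and the rule that a point is consistent when both orientations match are all hypothetical. The hash-table preprocessing is made explicit as a fifth argument `index`, so each call only touches the points that involve scaffolds from c.

```python
from collections import defaultdict


def build_index(points):
    """Hash-table preprocessing: map each scaffold to the positions (in the
    list `points`) of the assembly points that involve it."""
    index = defaultdict(list)
    for pid, (a, _, b, _) in enumerate(points):
        index[a].append(pid)
        if b != a:
            index[b].append(pid)
    return index


def or_cons_count(c, so, points, assembly, index):
    """Count the points touching scaffolds in c that agree with `assembly`
    once the scaffolds in c take their orientations from `so`.

    c:        list of 1 or 2 scaffolds (the vertices of the subgraph)
    so:       dict scaffold -> orientation to try ('+' or '-')
    points:   list of (scaffold_a, orient_a, scaffold_b, orient_b)
    assembly: dict scaffold -> orientation currently fixed in the assembly
    index:    result of build_index(points)
    """
    assert 1 <= len(c) <= 2
    trial = {s: so[s] for s in c}  # orientations overriding the assembly
    seen = set()                   # avoid counting a shared point twice
    count = 0
    for s in c:
        for pid in index.get(s, ()):
            if pid in seen:
                continue
            seen.add(pid)
            a, oa, b, ob = points[pid]
            if (trial.get(a, assembly.get(a)) == oa
                    and trial.get(b, assembly.get(b)) == ob):
                count += 1
    return count


if __name__ == "__main__":
    points = [("s1", "+", "s2", "-"), ("s2", "-", "s3", "+"), ("s3", "+", "s4", "+")]
    assembly = {"s1": "+", "s2": "+", "s3": "+", "s4": "-"}
    index = build_index(points)
    print(or_cons_count(["s2"], {"s2": "-"}, points, assembly, index))
```

Indexing points by position rather than by value keeps repeated points distinct, which matters if the set of assembly points is treated as a multiset.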

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Aganezov, S., Alekseyev, M.A. (2017). Orientation of Ordered Scaffolds. In: Meidanis, J., Nakhleh, L. (eds) Comparative Genomics. RECOMB-CG 2017. Lecture Notes in Computer Science, vol 10562. Springer, Cham. https://doi.org/10.1007/978-3-319-67979-2_10

  • DOI: https://doi.org/10.1007/978-3-319-67979-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67978-5

  • Online ISBN: 978-3-319-67979-2

  • eBook Packages: Computer Science, Computer Science (R0)
