Co-linear Chaining with Overlaps and Gap Costs

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2022)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13278)

Abstract

Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, no subquadratic-time algorithms are currently known for the general case where chains support anchor overlaps and gap costs. We present algorithms that solve the co-linear chaining problem with anchor overlaps and gap costs in \(\tilde{O}(n)\) time, where \(n\) denotes the number of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that, for a fixed set of anchors and a carefully designed chaining cost function, the optimal ‘anchored’ edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that the optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and that it achieves a correlation coefficient above 0.9 with edit distance for both closely and distantly related sequences.
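To make the chaining problem concrete, the following is a minimal quadratic-time sketch, not the \(\tilde{O}(n)\) algorithm of the paper. It rests on simplifying assumptions that are ours and not taken from the abstract: anchors are exact-match intervals given as hypothetical tuples (q_start, q_end, r_start, r_end) with inclusive coordinates, consecutive anchors in a chain may not overlap, and the gap cost between two anchors is the larger of the two gaps on the query and the reference. The function name chain_cost and the sentinel anchors are illustrative only.

```python
import math
from typing import List, Tuple

# Hypothetical anchor layout: (q_start, q_end, r_start, r_end), inclusive.
Anchor = Tuple[int, int, int, int]


def chain_cost(anchors: List[Anchor], q_len: int, r_len: int) -> int:
    """Minimum total gap cost over all co-linear chains (O(n^2) sketch).

    Assumptions (not from the paper): no anchor overlaps are allowed,
    and the gap cost between consecutive anchors is max(query gap,
    reference gap).
    """
    # Sentinel anchors mark the start and end of both sequences.
    start = (-1, -1, -1, -1)
    end = (q_len, q_len, r_len, r_len)
    # Sorting by q_start guarantees every valid predecessor comes earlier.
    pts = [start] + sorted(anchors) + [end]

    best = [math.inf] * len(pts)
    best[0] = 0
    for i in range(1, len(pts)):
        qs, _, rs, _ = pts[i]
        for j in range(i):
            _, qe, _, re = pts[j]
            # Anchor j may precede anchor i only if it ends strictly
            # before i starts on both the query and the reference.
            if best[j] < math.inf and qe < qs and re < rs:
                gap = max(qs - qe - 1, rs - re - 1)
                best[i] = min(best[i], best[j] + gap)
    # The start sentinel always precedes the end sentinel, so this is finite.
    return int(best[-1])


if __name__ == "__main__":
    # Two co-linear anchors on a query of length 10 and a reference of length 11.
    print(chain_cost([(0, 2, 0, 2), (5, 7, 6, 8)], 10, 11))
```

With no anchors at all, the sketch returns max(q_len, r_len), the cost of bridging both sequences with a single gap. The inner loop over all predecessors is what makes it quadratic; removing that loop while also permitting overlaps is the problem the abstract addresses.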

Notes

  1. \(\widetilde{O}(\cdot )\) hides poly-logarithmic factors.

  2. https://github.com/Martinsos/edlib/tree/master/test_data.

Acknowledgements

This research is supported in part by the U.S. National Science Foundation (NSF) grants CCF-1704552, CCF-1816027, CCF-2112643, CCF-2146003, and funding from the Indian Institute of Science.

Author information

Corresponding author

Correspondence to Daniel Gibney.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jain, C., Gibney, D., Thankachan, S.V. (2022). Co-linear Chaining with Overlaps and Gap Costs. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_15

  • DOI: https://doi.org/10.1007/978-3-031-04749-7_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04748-0

  • Online ISBN: 978-3-031-04749-7

  • eBook Packages: Computer Science, Computer Science (R0)
