Abstract
Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in \(\widetilde{O}(n)\) time, where \(n\) denotes the number of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal ‘anchored’ edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that the optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves a correlation coefficient above 0.9 with edit distance for both closely and distantly related sequences.
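To make the problem statement concrete, the following is a minimal sketch of a quadratic-time chaining baseline. It is not the \(\widetilde{O}(n)\) algorithm presented in the paper, it does not allow anchor overlaps, and its gap cost (the larger of the two unmatched gaps between consecutive anchors) is an illustrative assumption rather than the cost function the paper proposes. The anchor layout and the names `gap_cost` and `chain_cost` are likewise hypothetical.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (qs, qe, rs, re_) is an exact match between
# query[qs..qe] and reference[rs..re_], using inclusive 0-based coordinates.
Anchor = Tuple[int, int, int, int]


def gap_cost(a: Anchor, b: Anchor) -> int:
    """Assumed gap cost: the larger of the two unmatched gaps between a and b."""
    return max(b[0] - a[1] - 1, b[2] - a[3] - 1)


def chain_cost(anchors: List[Anchor], qlen: int, rlen: int) -> int:
    """Minimum cost over all chains of non-overlapping anchors, in O(n^2) time."""
    anchors = sorted(anchors)  # any valid predecessor sorts before its successor
    best = max(qlen, rlen)     # cost of the empty chain
    dp: List[int] = []         # dp[j]: minimum cost of a chain ending at anchors[j]
    for j, b in enumerate(anchors):
        cost = max(b[0], b[2])  # b is the first anchor: pay for the prefix gap
        for i in range(j):
            a = anchors[i]
            # a may precede b only if it ends strictly before b starts
            # in both the query and the reference
            if a[1] < b[0] and a[3] < b[2]:
                cost = min(cost, dp[i] + gap_cost(a, b))
        dp.append(cost)
        # close the chain at b: pay for the suffix gap
        best = min(best, cost + max(qlen - 1 - b[1], rlen - 1 - b[3]))
    return best
```

For example, with anchors `(0, 2, 0, 2)` and `(5, 7, 4, 6)`, a query of length 8 and a reference of length 7, chaining both anchors costs 2, which is cheaper than using either anchor alone (cost 5) or no anchor at all (cost 8). Under this assumed cost the chain cost is an upper bound on edit distance, since each gap region can be aligned with at most as many edits as the longer of its two sides.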
Notes
- 1. \(\widetilde{O}(\cdot )\) hides poly-logarithmic factors.
Acknowledgements
This research is supported in part by the U.S. National Science Foundation (NSF) grants CCF-1704552, CCF-1816027, CCF-2112643, CCF-2146003, and funding from the Indian Institute of Science.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jain, C., Gibney, D., Thankachan, S.V. (2022). Co-linear Chaining with Overlaps and Gap Costs. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_15
DOI: https://doi.org/10.1007/978-3-031-04749-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)