Abstract
We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework transforms an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the number of such edited suffixes so that efficient algorithms can be designed. For a total input size of n, our framework limits the number of generated edited suffixes to at most an \(O(\log ^k n)\) factor of the input size (for any constant k), and restricts the algorithm to linear space by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of \(n^2 k^{1.5}/ 2^{\varOmega (\sqrt{{\log n}/{k}})}\) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015], yielding the first strictly subquadratic time algorithm, which runs in \(O(n\log ^k n)\) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (as well as suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool that may also inspire the design of practical heuristics and software.
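To make the k-edit longest common substring problem concrete, the following is a minimal brute-force sketch. It is not the framework described in the abstract: it is a plain memoized dynamic program with roughly \(O(n^2 k)\) states, intended only as a reference for tiny inputs. The function name `k_edit_lcs_length` and the convention that the length is measured on the first sequence are assumptions made for illustration.

```python
from functools import lru_cache


def k_edit_lcs_length(x: str, y: str, k: int) -> int:
    """Length of the longest substring of x that matches some substring
    of y with at most k edits (substitutions, insertions, deletions).

    Quadratic-size baseline for small inputs; hypothetical helper,
    not the subquadratic algorithm from the paper.
    """

    @lru_cache(maxsize=None)
    def extend(i: int, j: int, e: int) -> int:
        # Longest prefix of x[i:] alignable to some prefix of y[j:]
        # using at most e edits.
        if i == len(x):
            return 0
        best = 0  # the empty prefix always matches
        if j < len(y) and x[i] == y[j]:
            best = 1 + extend(i + 1, j + 1, e)  # free match
        if e > 0:
            # x[i] has no counterpart in y (one edit).
            best = max(best, 1 + extend(i + 1, j, e - 1))
            if j < len(y):
                # Substitute x[i] for y[j], or skip y[j] (one edit each).
                best = max(
                    best,
                    1 + extend(i + 1, j + 1, e - 1),
                    extend(i, j + 1, e - 1),
                )
        return best

    # Try every pair of starting positions in x and y.
    return max(
        (extend(i, j, k) for i in range(len(x)) for j in range(len(y) + 1)),
        default=0,
    )
```

For example, `k_edit_lcs_length("GATTACA", "GATACA", 0)` finds the exact common substring "TACA", while allowing one edit lets all of "GATTACA" align to "GATACA". Recursion depth grows with the input length, so this sketch is suitable only for short strings.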
Notes
1. In this paper, we use LCS to denote the longest common substring. Note that LCS is frequently used in the literature to refer to the longest common subsequence instead.
2. Find the longest substring of a sequence that matches a substring of another sequence, allowing \(\le k\) edits.
3. Throughout the analysis, we treat k as a constant for brevity. However, with a tighter analysis (deferred to the full version), we can bound the time and space by \(O(n(c \log n)^k/k!)\) and \(O(c^k n)\), respectively, for a constant c, without making any such assumption on the value of k.
4. The child with the largest number of leaves in its subtree among its siblings (ties broken arbitrarily).
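The heavy-child definition in the last note can be sketched as follows. This is an illustrative helper only: the tree representation (a dictionary mapping each node to a list of its children) and the name `heavy_children` are assumptions, and this page does not show how the paper's algorithm actually uses the heavy child.

```python
def heavy_children(children, root):
    """Return (heavy, leaves) for a rooted tree.

    children: dict mapping a node to a list of its child nodes
              (hypothetical representation; absent or empty = leaf).
    heavy[v]:  the child of v with the most leaves in its subtree.
    leaves[v]: the number of leaves in the subtree rooted at v.
    """
    # Iterative traversal: every parent is recorded before its children.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    leaves = {}
    heavy = {}
    # Process in reverse so children are finished before their parent.
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            leaves[node] = 1
        else:
            leaves[node] = sum(leaves[c] for c in kids)
            # max() keeps the first maximum, one arbitrary tie-break.
            heavy[node] = max(kids, key=lambda c: leaves[c])
    return heavy, leaves
```

The traversal is iterative so that deep trees (such as suffix trees of long sequences) do not hit Python's recursion limit.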
References
Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 218–230 (2015)
Abboud, A., Williams, V.V., Weimann, O.: Consequences of faster alignment of sequences. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 39–51. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43948-7_4
Aluru, S., Apostolico, A., Thankachan, S.V.: Efficient alignment free sequence comparison with bounded mismatches. In: Przytycka, T.M. (ed.) RECOMB 2015. LNCS, vol. 9029, pp. 1–12. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16706-0_1
Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications. LNCS, vol. 6060, pp. 34–44. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12476-1_2
Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded Hamming distance. Theoret. Comput. Sci. 638, 76–90 (2016)
Bonham-Carter, O., Steele, J., Bastola, D.: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Briefings Bioinform. 15(6), 890–905 (2013)
Brown, M.R., Tarjan, R.E.: A fast merging algorithm. J. ACM 26(2), 211–226 (1979)
Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundam. Inform. 56(1–2), 51–70 (2003)
Burstein, D., Ulitsky, I., Tuller, T., Chor, B.: Information theoretic approaches to whole genome phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS, vol. 3500, pp. 283–295. Springer, Heidelberg (2005). https://doi.org/10.1007/11415770_22
Chang, G., Wang, T.: Phylogenetic analysis of protein sequences based on distribution of length about common substring. Protein J. 30(3), 167–172 (2011)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100. ACM (2004)
Comin, M., Verzotto, D.: Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol. Biol. 7(1), 1 (2012)
Domazet-Lošo, M., Haubold, B.: Efficient estimation of pairwise distances between genomes. Bioinformatics 25(24), 3221–3227 (2009)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Guyon, F., Brochier-Armanet, C., Guénoche, A.: Comparison of alignment free string distances for complete genome phylogeny. Adv. Data Anal. Classif. 3(2), 95–108 (2009)
Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)
Leimeister, C.-A., Morgenstern, B.: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Briefings Bioinform. 11(5), 473–483 (2010)
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Pizzi, C.: A filtering approach for alignment-free biosequences comparison with mismatches. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 231–242. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48221-6_17
Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)
Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst. Sci. 26(3), 362–391 (1983)
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol. 23(6), 452–460 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Krishnan, A., Aluru, S.: A greedy alignment-free distance estimator for phylogenetic inference. In: Proceedings of 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) (2015)
Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory (SWAT), pp. 1–11 (1973)
Acknowledgments
This research is supported in part by the U.S. National Science Foundation under CCF-1704552 and CCF-1703489.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S. (2018). Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis. In: Raphael, B. (ed.) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham. https://doi.org/10.1007/978-3-319-89929-9_14
DOI: https://doi.org/10.1007/978-3-319-89929-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89928-2
Online ISBN: 978-3-319-89929-9
eBook Packages: Computer Science, Computer Science (R0)