Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis

Thankachan, Sharma V.; Aluru, Chaitanya; Chockalingam, Sriram P.; Aluru, Srinivas

doi:10.1007/978-3-319-89929-9_14

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10812)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework transforms an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the number of such edited suffixes so that efficient algorithms can be designed. For a total input size of n, our framework limits the number of generated edited suffixes to at most a factor of \(O(\log ^k n)\) of the input size (for any constant k), and restricts the algorithm to linear space by overlapping the generation and processing of edited suffixes. Our framework improves on the best known upper bound of \(n^2 k^{1.5}/ 2^{\varOmega (\sqrt{{\log n}/{k}})}\) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015], yielding the first strictly subquadratic algorithm, which runs in \(O(n\log ^k n)\) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool, and we hope it may inspire the design of practical heuristics and software.
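To make the k-edit longest common substring problem concrete, the following is a minimal brute-force sketch in Python. It is not the framework described in the abstract: it is a cubic-time baseline that only illustrates the problem statement, using a standard semi-global edit-distance recurrence. The function name `k_edit_lcs_length` and the choice to measure length on the first sequence are assumptions made for illustration.

```python
def k_edit_lcs_length(x: str, y: str, k: int) -> int:
    """Length of the longest substring of x that is within k edits
    (substitutions, insertions, deletions) of some substring of y.

    Brute-force baseline for illustration only; runs in O(|x|^2 * |y|) time.
    """
    best = 0
    m = len(y)
    for i in range(len(x)):
        # prev[b]: minimum edit distance between x[i:i+a-1] and any
        # substring of y ending at position b. For the empty pattern
        # (a = 0) this is 0 everywhere, since the match may start anywhere.
        prev = [0] * (m + 1)
        for a in range(1, len(x) - i + 1):
            ch = x[i + a - 1]
            # Matching a characters of x against the empty string costs a.
            cur = [a] + [0] * m
            for b in range(1, m + 1):
                cur[b] = min(
                    prev[b - 1] + (ch != y[b - 1]),  # match or substitution
                    prev[b] + 1,                     # skip a character of x
                    cur[b - 1] + 1,                  # skip a character of y
                )
            # The minimum over b never decreases as the pattern grows,
            # so once it exceeds k no longer extension from i can succeed.
            if min(cur) > k:
                break
            best = max(best, a)
            prev = cur
    return best
```

With k = 0 the function reduces to the ordinary longest common substring length, which gives a quick sanity check on small inputs before comparing against any faster method.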

Notes

1. In this paper, we use LCS to denote the longest common substring. Note that LCS is frequently used in the literature to refer to the longest common subsequence instead.
2. Find the longest substring of a sequence that matches a substring of another sequence, allowing \(\le k\) edits.
3. Throughout the analysis, we treat k as a constant for brevity. However, with a tighter analysis (deferred to the full version), we can bound the time and space by \(O(n(c \log n)^k/k!)\) and \(O(c^k n)\), respectively, for a constant c, without making any such assumption on the value of k.
4. The child with the largest number of leaves in its subtree (ties broken arbitrarily) among its siblings.
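The definition in note 4 is the usual notion of a heavy child from heavy path decomposition. As a minimal sketch, assuming the tree is given as a dictionary from each node to its list of children (a hypothetical representation, not one taken from the paper), the heavy child of every internal node can be computed with one post-order pass:

```python
def heavy_children(children: dict, root):
    """Return (heavy, leaves): heavy maps each internal node to the child
    whose subtree contains the most leaves; leaves maps each node to the
    number of leaves in its subtree.
    """
    leaves = {}
    heavy = {}
    # Iterative post-order traversal, so deep trees do not hit the
    # recursion limit. Each internal node is pushed twice: once to expand
    # its children, once to combine their results.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaves[node] = 1  # a node without children is a leaf
        elif not expanded:
            stack.append((node, True))
            stack.extend((child, False) for child in kids)
        else:
            leaves[node] = sum(leaves[child] for child in kids)
            # max() returns the first maximal item, so ties go to the
            # earliest child in the list (any tie-breaking rule is allowed).
            heavy[node] = max(kids, key=lambda child: leaves[child])
    return heavy, leaves


# Small usage example on a hypothetical tree with four leaves.
tree = {"r": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1"]}
heavy, leaves = heavy_children(tree, "r")
```

A child that is not heavy has at most half of its parent's leaves, so any root-to-leaf path passes through at most about log2(n) non-heavy children in a tree with n leaves.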

References

Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 218–230 (2015)
Abboud, A., Williams, V.V., Weimann, O.: Consequences of faster alignment of sequences. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 39–51. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43948-7_4
Aluru, S., Apostolico, A., Thankachan, S.V.: Efficient alignment free sequence comparison with bounded mismatches. In: Przytycka, T.M. (ed.) RECOMB 2015. LNCS, vol. 9029, pp. 1–12. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16706-0_1
Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications. LNCS, vol. 6060, pp. 34–44. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12476-1_2
Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded Hamming distance. Theoret. Comput. Sci. 638, 76–90 (2016)
Bonham-Carter, O., Steele, J., Bastola, D.: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Briefings Bioinform. 15(6), 890–905 (2013)
Brown, M.R., Tarjan, R.E.: A fast merging algorithm. J. ACM 26(2), 211–226 (1979)
Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundam. Inform. 56(1–2), 51–70 (2003)
Burstein, D., Ulitsky, I., Tuller, T., Chor, B.: Information theoretic approaches to whole genome phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS, vol. 3500, pp. 283–295. Springer, Heidelberg (2005). https://doi.org/10.1007/11415770_22
Chang, G., Wang, T.: Phylogenetic analysis of protein sequences based on distribution of length about common substring. Protein J. 30(3), 167–172 (2011)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100. ACM (2004)
Comin, M., Verzotto, D.: Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol. Biol. 7(1), 1 (2012)
Domazet-Lošo, M., Haubold, B.: Efficient estimation of pairwise distances between genomes. Bioinformatics 25(24), 3221–3227 (2009)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Guyon, F., Brochier-Armanet, C., Guénoche, A.: Comparison of alignment free string distances for complete genome phylogeny. Adv. Data Anal. Classif. 3(2), 95–108 (2009)
Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)
Leimeister, C.-A., Morgenstern, B.: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Briefings Bioinform. 11(5), 473–483 (2010)
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Pizzi, C.: A filtering approach for alignment-free biosequences comparison with mismatches. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 231–242. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48221-6_17
Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)
Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst. Sci. 26(3), 362–391 (1983)
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol. 23(6), 452–460 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Krishnan, A., Aluru, S.: A greedy alignment-free distance estimator for phylogenetic inference. In: Proceedings of 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) (2015)
Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory (SWAT), pp. 1–11 (1973)

Acknowledgments

This research is supported in part by the U.S. National Science Foundation under CCF-1704552 and CCF-1703489.

Author information

Authors and Affiliations

Department of Computer Science, University of Central Florida, Orlando, FL, USA
Sharma V. Thankachan
Department of Computer Science, Princeton University, Princeton, NJ, USA
Chaitanya Aluru
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Sriram P. Chockalingam & Srinivas Aluru
Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, GA, USA
Srinivas Aluru

Corresponding author

Correspondence to Sharma V. Thankachan.

Editor information

Editors and Affiliations

Computer Science Department, Princeton University, Princeton, New Jersey, USA
Benjamin J. Raphael

Cite this paper

Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S. (2018). Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis. In: Raphael, B. (ed.) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham. https://doi.org/10.1007/978-3-319-89929-9_14

DOI: https://doi.org/10.1007/978-3-319-89929-9_14
Published: 18 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89928-2
Online ISBN: 978-3-319-89929-9
eBook Packages: Computer Science, Computer Science (R0)
