Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis

  • Sharma V. Thankachan
  • Chaitanya Aluru
  • Sriram P. Chockalingam
  • Srinivas Aluru
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10812)

Abstract

We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework transforms an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the number of such edited suffixes so that efficient algorithms can be designed. For a total input size of n, our framework limits the number of generated edited suffixes to at most a factor of \(O(\log ^k n)\) times the input size (for any constant k), and restricts the algorithm to linear space by overlapping the generation and processing of edited suffixes. Our framework improves on the best known upper bound of \(n^2 k^{1.5}/ 2^{\varOmega (\sqrt{{\log n}/{k}})}\) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015], yielding the first strictly subquadratic time algorithm, which runs in \(O(n\log ^k n)\) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool, and we hope it may inspire the design of practical heuristics and software.
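To make the k-edit longest common substring problem concrete, the following minimal sketch computes its answer with a straightforward quadratic-time dynamic program. It is a reference baseline only, not the framework described in the abstract, and it rests on two assumptions that the abstract does not spell out: the answer is taken to be the length of the longest substring of x that lies within edit distance k of some substring of y, and length is measured on the x side. The function name `k_edit_lcs_length` is hypothetical.

```python
def k_edit_lcs_length(x: str, y: str, k: int) -> int:
    """Longest substring of x within edit distance k of some substring of y.

    Illustrative O(k * |x| * |y|) baseline; NOT the subquadratic framework.
    Assumption: length is measured on the x side of the alignment.
    """
    m = len(y)
    # prev[e][j]: longest x-side length of an alignment that ends at
    # (previous row of x, position j of y) and uses at most e edits.
    prev = [[0] * (m + 1) for _ in range(k + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        cur = [[0] * (m + 1) for _ in range(k + 1)]
        for e in range(k + 1):
            for j in range(m + 1):
                v = 0  # the empty alignment is always allowed
                if j > 0 and x[i - 1] == y[j - 1]:
                    v = max(v, prev[e][j - 1] + 1)  # exact match, no edit spent
                if e > 0:
                    v = max(v, prev[e - 1][j] + 1)  # delete x[i-1]
                    if j > 0:
                        v = max(v, prev[e - 1][j - 1] + 1)  # substitute
                        v = max(v, cur[e - 1][j - 1])  # insert y[j-1]
                cur[e][j] = v
                best = max(best, v)
        prev = cur
    return best


if __name__ == "__main__":
    print(k_edit_lcs_length("banana", "bandana", 0))
    print(k_edit_lcs_length("banana", "bandana", 1))
```

With k = 0 the recurrence reduces to the classic longest common substring table. Each extra edit adds one layer to the table, which is why this baseline stays quadratic in the input size.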

Notes

Acknowledgments

This research is supported in part by the U.S. National Science Foundation under CCF-1704552 and CCF-1703489.

References

  1. Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 218–230 (2015)
  2. Abboud, A., Williams, V.V., Weimann, O.: Consequences of faster alignment of sequences. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 39–51. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43948-7_4
  3. Aluru, S., Apostolico, A., Thankachan, S.V.: Efficient alignment free sequence comparison with bounded mismatches. In: Przytycka, T.M. (ed.) RECOMB 2015. LNCS, vol. 9029, pp. 1–12. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16706-0_1
  4. Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Algorithms and Applications. LNCS, vol. 6060, pp. 34–44. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12476-1_2
  5. Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded Hamming distance. Theoret. Comput. Sci. 638, 76–90 (2016)
  6. Bonham-Carter, O., Steele, J., Bastola, D.: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Briefings Bioinform. 15(6), 890–905 (2013)
  7. Brown, M.R., Tarjan, R.E.: A fast merging algorithm. J. ACM 26(2), 211–226 (1979)
  8. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundam. Inform. 56(1–2), 51–70 (2003)
  9. Burstein, D., Ulitsky, I., Tuller, T., Chor, B.: Information theoretic approaches to whole genome phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS, vol. 3500, pp. 283–295. Springer, Heidelberg (2005). https://doi.org/10.1007/11415770_22
  10. Chang, G., Wang, T.: Phylogenetic analysis of protein sequences based on distribution of length about common substring. Protein J. 30(3), 167–172 (2011)
  11. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100. ACM (2004)
  12. Comin, M., Verzotto, D.: Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol. Biol. 7(1), 1 (2012)
  13. Domazet-Lošo, M., Haubold, B.: Efficient estimation of pairwise distances between genomes. Bioinformatics 25(24), 3221–3227 (2009)
  14. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
  15. Guyon, F., Brochier-Armanet, C., Guénoche, A.: Comparison of alignment free string distances for complete genome phylogeny. Adv. Data Anal. Classif. 3(2), 95–108 (2009)
  16. Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14
  17. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)
  18. Leimeister, C.-A., Morgenstern, B.: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)
  19. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
  20. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Briefings Bioinform. 11(5), 473–483 (2010)
  21. Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
  22. Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
  23. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM (JACM) 23(2), 262–272 (1976)
  24. Pizzi, C.: A filtering approach for alignment-free biosequences comparison with mismatches. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 231–242. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48221-6_17
  25. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)
  26. Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst. Sci. 26(3), 362–391 (1983)
  27. Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
  28. Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol. 23(6), 452–460 (2016)
  29. Thankachan, S.V., Chockalingam, S.P., Liu, Y., Krishnan, A., Aluru, S.: A greedy alignment-free distance estimator for phylogenetic inference. In: Proceedings of 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) (2015)
  30. Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)
  31. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory (SWAT), pp. 1–11 (1973)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Sharma V. Thankachan (1)
  • Chaitanya Aluru (2)
  • Sriram P. Chockalingam (3)
  • Srinivas Aluru (3, 4)

  1. Department of Computer Science, University of Central Florida, Orlando, USA
  2. Department of Computer Science, Princeton University, Princeton, USA
  3. School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA
  4. Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, USA
