Least Random Suffix/Prefix Matches in Output-Sensitive Time

  • Niko Välimäki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7354)

Abstract

We study the problem of finding suffix/prefix matches (overlaps) when given a set of \(r\) strings of total length \(n\). Gusfield et al. (1992) gave an algorithm to find the longest exact overlaps between all string pairs in the optimal \(O(n + \mathit{output})\) time, where \(\mathit{output} \le r^2\) is the number of non-zero length overlaps found. So far the best worst-case time for finding approximate overlaps within edit distance \(k\) has been \(O(knr)\) (Landau et al. 1998), which gives \(\Omega(r^2)\) time regardless of the output size. We propose the first output-sensitive algorithm to find either the longest or the least random approximate overlaps. Given the maximum edit distance \(k\) allowed in an overlap, the approximate overlaps can be found in linear space and in \(O((n + \mathit{output})\,\mathrm{polylog}(n))\) time for any constant \(k\). If all input strings are shorter than \(\log n/(k^\frac{1}{k}\sigma)\), we achieve the time complexity \(O(n \log^k n + \mathit{output})\) for any \(k\). For strings longer than ε log k r, we improve the previous best worst-case time from \(O(knr)\) to \(O(\frac{c^k}{k!}nr)\) for moderate \(k\) and constants \(c > 1\) and \(\epsilon > 0\).
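To make the problem statement concrete, the following is a minimal brute-force sketch in Python of the two kinds of overlap the abstract describes. It is not the algorithm of this paper, nor that of Gusfield et al. or Landau et al.; it only illustrates the definitions and runs in time far above the bounds quoted above. The function names are hypothetical, and the tie-breaking rule for "longest" approximate overlap (longest suffix first, then longest prefix) is an assumption made for the example.

```python
from typing import Dict, List, Tuple


def longest_exact_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_pairs_exact(strings: List[str]) -> Dict[Tuple[int, int], int]:
    """Non-zero longest exact overlaps for every ordered pair (i, j), i != j."""
    result = {}
    for i, a in enumerate(strings):
        for j, b in enumerate(strings):
            if i != j:
                length = longest_exact_overlap(a, b)
                if length > 0:
                    result[(i, j)] = length
    return result


def edit_distance(x: str, y: str) -> int:
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or substitute
        prev = cur
    return prev[-1]


def longest_approx_overlap(a: str, b: str, k: int) -> Tuple[int, int]:
    """(suffix_len, prefix_len) of a longest suffix of `a` within edit
    distance k of some prefix of `b`; (0, 0) if there is none.

    Assumption: "longest" means the longest suffix, ties broken by the
    longest prefix. Lengths differing by more than k cannot be within
    distance k, so only prefix lengths in [suffix_len - k, suffix_len + k]
    are tried.
    """
    for ls in range(len(a), 0, -1):
        suffix = a[len(a) - ls:]
        for lp in range(min(len(b), ls + k), max(1, ls - k) - 1, -1):
            if edit_distance(suffix, b[:lp]) <= k:
                return ls, lp
    return 0, 0
```

With `k = 0` the approximate routine reduces to the exact one. The number of non-zero entries returned by `all_pairs_exact` corresponds to the quantity called *output* in the abstract.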

Keywords

Edit Distance, Input String, Output Size, Lower Common Ancestor, Short String

References

  1. Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: A Linear Size Index for Approximate Pattern Matching. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 49–59. Springer, Heidelberg (2006)
  2. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. STOC 2004, pp. 91–100. ACM (2004)
  3. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20–44 (2007)
  4. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
  5. Gusfield, D., Landau, G.M., Schieber, B.: An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett. 41(4), 181–185 (1992)
  6. Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. Algorithms 41(1), 69–85 (2001)
  7. Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching in Static Texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)
  8. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1/2), 7–51 (1995)
  9. Landau, G.M., Myers, E.W., Schmidt, J.P.: Incremental string comparison. SIAM J. Comput. 27(2), 557–582 (1998)
  10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
  11. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4, 32:1–32:38 (2008)
  12. Metzker, M.L.: Sequencing technologies - the next generation. Nature Reviews Genetics 11(1), 31–46 (2010)
  13. Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24(4), 19–27 (2001)
  14. Ohlebusch, E., Gog, S.: Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett. 110(3), 123–128 (2010)
  15. Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all e-matches over a given length. J. of Computational Biology 13(2), 296–308 (2006)
  16. Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)
  17. Schieber, B., Vishkin, U.: On finding lowest common ancestors: simplification and parallelization. SIAM Journal on Computing 17(6), 1253–1262 (1988)
  18. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27(1), 379–423 (1948)
  19. Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64(1-3), 100–118 (1985)
  20. Ukkonen, E.: Finding approximate patterns in strings. J. Algorithms 6(1), 132–137 (1985)
  21. Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Information and Computation 213, 49–58 (2012); CPM 2010 Special Issue
  22. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Theta(N). Inf. Process. Lett. 17(2), 81–84 (1983)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Niko Välimäki (1)
  1. Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Finland
