Least Random Suffix/Prefix Matches in Output-Sensitive Time
Abstract
We study the problem of finding suffix/prefix matches (overlaps) in a set of r strings of total length n. Gusfield et al. (1992) gave an algorithm that finds the longest exact overlaps between all string pairs in the optimal \(O(n + t_{\mathrm{output}})\) time, where \(t_{\mathrm{output}} \le r^2\) is the number of non-zero-length overlaps found. So far, the best worst-case time for finding approximate overlaps within edit distance k has been O(knr) (Landau et al. 1998), which gives \(\Omega(r^2)\) time regardless of the output size. We propose the first output-sensitive algorithm to find either the longest or the least random approximate overlaps. Given the maximum edit distance k allowed in an overlap, the approximate overlaps can be found in linear space and in \(O((n + t_{\mathrm{output}})\,\mathrm{polylog}(n))\) time for any constant k. If all input strings are shorter than \(\log n/(k^\frac{1}{k}\sigma)\), we achieve the time complexity \(O(n \log^k n + t_{\mathrm{output}})\) for any k. For strings longer than \(\varepsilon \log^k r\), we improve the previous best worst-case time from O(knr) to \(O(\frac{c^k}{k!}nr)\) for moderate k and constants c > 1 and ε > 0.
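To make the problem statement concrete, the following is a minimal Python sketch of a naive per-pair baseline, not the output-sensitive algorithm proposed here. It uses a standard dynamic program with a free start position in the first string to compute, for each prefix of b, the smallest edit distance to any suffix of a. The function names and the `min_len` parameter are assumptions introduced only for illustration.

```python
def approx_overlaps(a, b, k):
    """Map each prefix length j >= 1 of b to the smallest edit distance d
    between b[:j] and some suffix of a, keeping only entries with d <= k."""
    # prev[j] = min edit distance between b[:j] and a suffix of a[:i].
    # For i = 0 only the empty suffix exists, so the cost is j insertions.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)  # cur[0] = 0: the empty suffix matches the empty prefix
        for j in range(1, len(b) + 1):
            cur[j] = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                prev[j] + 1,                           # skip a character of a
                cur[j - 1] + 1,                        # skip a character of b
            )
        prev = cur
    # After the last row, the matched region ends at the end of a,
    # so it is a suffix of a.
    return {j: d for j, d in enumerate(prev) if j > 0 and d <= k}


def longest_overlap(a, b, k, min_len=1):
    """Length of the longest prefix of b within edit distance k of a suffix of a.
    min_len (hypothetical parameter) filters out trivially short overlaps."""
    hits = [j for j in approx_overlaps(a, b, k) if j >= min_len]
    return max(hits, default=0)
```

For a = "ACGTTAGC" and b = "TAGCCA", the only exact overlap (k = 0) is "TAGC" of length 4, while k = 1 extends the longest overlap to the prefix "TAGCC" of length 5. Each pair costs O(|a||b|) time in this sketch, and every pair is examined whether or not it overlaps, which is the kind of output-independent cost the abstract contrasts with. Note also that any prefix of length at most k trivially qualifies against the empty suffix, which is why a minimum length filter is included.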
Keywords
Edit Distance, Input String, Output Size, Lower Common Ancestor, Short String
References
- 1. Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: A Linear Size Index for Approximate Pattern Matching. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 49–59. Springer, Heidelberg (2006)
- 2. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. STOC 2004, pp. 91–100. ACM (2004)
- 3. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20–44 (2007)
- 4. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
- 5. Gusfield, D., Landau, G.M., Schieber, B.: An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett. 41(4), 181–185 (1992)
- 6. Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. Algorithms 41(1), 69–85 (2001)
- 7. Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching in Static Texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)
- 8. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1/2), 7–51 (1995)
- 9. Landau, G.M., Myers, E.W., Schmidt, J.P.: Incremental string comparison. SIAM J. Comput. 27(2), 557–582 (1998)
- 10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
- 11. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4, 32:1–32:38 (2008)
- 12. Metzker, M.L.: Sequencing technologies - the next generation. Nature Reviews Genetics 11(1), 31–46 (2010)
- 13. Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24(4), 19–27 (2001)
- 14. Ohlebusch, E., Gog, S.: Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett. 110(3), 123–128 (2010)
- 15. Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all e-matches over a given length. J. of Computational Biology 13(2), 296–308 (2006)
- 16. Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)
- 17. Schieber, B., Vishkin, U.: On finding lowest common ancestors: simplification and parallelization. SIAM Journal on Computing 17(6), 1253–1262 (1988)
- 18. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27(1), 379–423 (1948)
- 19. Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64(1-3), 100–118 (1985)
- 20. Ukkonen, E.: Finding approximate patterns in strings. J. Algorithms 6(1), 132–137 (1985)
- 21. Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Information and Computation 213, 49–58 (2012); CPM 2010 Special Issue
- 22. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Theta(N). Inf. Process. Lett. 17(2), 81–84 (1983)