Improved Space-Time Tradeoffs for Approximate Full-Text Indexing with One Edit Error

Abstract

This paper studies the problem of indexing texts for substring matching queries with one edit error. That is, given a text T of n characters over an alphabet of size σ, we are asked to build a data structure that answers the following query: find all the occ substrings of the text that are at edit distance at most 1 from a given string q of length m. We show two new results for this problem. The first result, suitable for an unbounded alphabet, uses O(n log^ε n) words of space (where ε is any constant such that 0 < ε < 1) and answers queries in time O(m + occ). This improves simultaneously in space and time over the result of Cole et al. The second result, suitable only for a constant alphabet, relies on compressed text indices and comes in two variants: the first variant uses O(n log^ε n) bits of space and answers queries in time O(m + occ), while the second variant uses O(n log log n) bits of space and answers queries in time O((m + occ) log log n). This second result improves on the previously best results for constant alphabets achieved by Lam et al. and Chan et al.
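
To make the query concrete, the following is a minimal brute-force sketch in Python of what such an index is asked to report. It is not the data structure of the paper: it scans every substring of length m − 1, m, or m + 1 and is far slower than O(m + occ). The function names and the (start, length) reporting convention are assumptions made for illustration only.

```python
def within_one_edit(s: str, q: str) -> bool:
    """Return True if the edit distance between s and q is at most 1."""
    if abs(len(s) - len(q)) > 1:
        return False
    # Skip the longest common prefix of the two strings.
    i = 0
    while i < len(s) and i < len(q) and s[i] == q[i]:
        i += 1
    if len(s) == len(q):
        # Equal lengths: identical, or exactly one substitution at index i.
        return s[i + 1:] == q[i + 1:]
    if len(s) < len(q):
        s, q = q, s  # make s the longer string
    # Lengths differ by one: deleting s[i] must yield q.
    return s[i + 1:] == q[i:]


def naive_query(text: str, q: str) -> list[tuple[int, int]]:
    """Report (start, length) of every substring of text within one edit of q."""
    m = len(q)
    hits = []
    for length in (m - 1, m, m + 1):
        if length < 1:
            continue
        for start in range(len(text) - length + 1):
            if within_one_edit(text[start:start + length], q):
                hits.append((start, length))
    return hits
```

For example, with text "abc" and query "abd", the sketch reports the substrings "ab" (one deletion away) and "abc" (one substitution away).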

Notes

  1. In this paper log x stands for ⌈log₂(max(x, 2))⌉.

  2. A suffix link connects a suffix tree node associated with a factor cp (with c being a character) to the suffix tree node associated with the factor p.

  3. PA can be represented using the compressed suffix array representation of [15], since PA is actually the suffix array of the reverse of the text.

  4. Actually the result in Belazzougui et al. [4] states a space usage of O(n b^(1/c) log b) but assumes a constant alphabet size. However, it is easy to see that the same data structure works for arbitrary σ, in which case it uses O(n b^(1/c) (log b + log log σ)) bits of space.

  5. Note that insertion before position 1 is equivalent to insertion after position 0, in which case q[1,i] will be the empty string.

  6. Note that the query time bound uses the essential fact that the function H can be computed on any prefix p′ of p in constant time.

  7. This is evident for substitutions. It is also true for deletion, since only the last character in a run of equal characters is deleted. For insertions, the only problematic case is when inserting the same character in a run of equal characters of length at least 1, and this case is avoided since we only insert such a character at the end of the run.

  8. The term 4n+o(n) comes from the use of a succinct index for range minimum queries (RMQ). The term can be reduced to 2n+o(n) if the recent optimal solution of Fischer and Heun [16] is used. The term σ comes from the use of a bitvector of σ bits that needs to be writable. The bitvector is used to avoid reporting a single color more than once.

  9. Note that a query on all but the first non-empty substitution store takes constant time per reported character, since it involves querying the prefix-sum and the 1D colored range reporting data structures, which answer in, respectively, constant time and constant time per element.

  10. Each character could occupy up to log n bits, which means that we need at least Ω(1) time to read each character in our RAM model with w=Θ(log n).
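
The duplicate-avoidance rule of notes 5 and 7 can be illustrated with a small Python sketch. It only enumerates the candidate strings at edit distance exactly 1 from q so that each appears once; it does not reproduce the index or its query procedure. The function name and the explicit `alphabet` parameter are assumptions made for illustration.

```python
def one_edit_candidates(q: str, alphabet: str) -> list[str]:
    """List every string at edit distance exactly 1 from q, each exactly once."""
    m = len(q)
    out = []
    # Substitutions: replacing q[i] by a different character is always unique.
    for i in range(m):
        for c in alphabet:
            if c != q[i]:
                out.append(q[:i] + c + q[i + 1:])
    # Deletions: delete only the last character of each run of equal characters.
    for i in range(m):
        if i + 1 == m or q[i + 1] != q[i]:
            out.append(q[:i] + q[i + 1:])
    # Insertions: insert c after the first i characters (i = 0 is "after position 0").
    # A character equal to the next one is skipped, so a character that extends
    # a run is inserted only at the end of that run.
    for i in range(m + 1):
        for c in alphabet:
            if i < m and q[i] == c:
                continue
            out.append(q[:i] + c + q[i:])
    return out
```

Because substitutions, deletions, and insertions yield strings of lengths m, m − 1, and m + 1 respectively, candidates of different types can never coincide, so uniqueness only needs to hold within each type.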

References

  1. Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)

  2. Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 154–167 (2009)

  3. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proceedings of the 19th Annual European Symposium on Algorithms (ESA), pp. 748–759 (2011)

  4. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of the 18th Annual European Symposium on Algorithms (ESA), pp. 427–438 (2010)

  5. Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. In: Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 293–305 (2012)

  6. Brodal, G.S., Gąsieniec, L.: Approximate dictionary queries. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 65–74 (1996)

  7. Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.: Range searching over tree cross products. In: Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 120–131 (2000)

  8. Chan, H.-L., Lam, T.W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: Compressed indexes for approximate string matching. Algorithmica 58(2), 263–281 (2010)

  9. Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: A linear size index for approximate pattern matching. J. Discrete Algorithms 9(4), 358–364 (2011)

  10. Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage (extended abstract). In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 383–391 (1996)

  11. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)

  12. Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)

  13. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)

  14. Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, Mass., n.d. (1971)

  15. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  16. Fischer, J.: Optimal succinctness for range minimum queries. In: Proceedings of the 9th Latin American Theoretical Informatics Symposium (LATIN), pp. 158–169 (2010)

  17. Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)

  18. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  19. Gusfield, D.: Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  20. Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)

  21. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), pp. 549–554 (1989)

  22. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)

  23. Knuth, D.E.: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, Reading (1973)

  24. Lam, T.W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. In: Proceedings of the 16th International Symposium on Algorithms and Computation (ISAAC), pp. 339–348 (2005)

  25. Lam, T.W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008)

  26. Maaß, M.G., Nowak, J.: Text indexing with errors. In: Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 21–32 (2005)

  27. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  28. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  29. Munro, J.I.: Tables. In: Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pp. 37–42 (1996)

  30. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 657–666 (2002)

  31. Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4) (2007)

  32. Rao, S.S.: Time-space trade-offs for compressed suffix arrays. Inf. Process. Lett. 82(6), 307–311 (2002)

  33. Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), 12–22 (2007)

  34. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory (FOCS), pp. 1–11 (1973)

Acknowledgements

The author wishes to thank the anonymous reviewers for their helpful comments and corrections and Travis and Meg Gagie for their many helpful corrections and suggestions.

Author information

Correspondence to Djamal Belazzougui.

Additional information

Most of this work was done when the author was a student at LIAFA, University Paris Diderot, Paris 7. The work was partially supported by the French ANR project MAPPI (project number ANR-2010-COSI-004).

About this article

Cite this article

Belazzougui, D. Improved Space-Time Tradeoffs for Approximate Full-Text Indexing with One Edit Error. Algorithmica 72, 791–817 (2015). https://doi.org/10.1007/s00453-014-9873-9

Keywords

  • Compressed index
  • Edit distance
  • Approximate string matching