Improved Space-Time Tradeoffs for Approximate Full-Text Indexing with One Edit Error

Belazzougui, Djamal

doi:10.1007/s00453-014-9873-9


Published: 29 January 2014

Algorithmica, Volume 72, pages 791–817 (2015)


Abstract

We study the problem of indexing a text for substring matching queries with one edit error. Given a text T of n characters over an alphabet of size σ, the task is to build a data structure that answers the following query: find all occ substrings of T that are at edit distance at most 1 from a given string q of length m. We show two new results for this problem. The first, suitable for an unbounded alphabet, uses O(n log^ε n) words of space (where ε is any constant such that 0 < ε < 1) and answers queries in time O(m + occ). This improves simultaneously on the space and the time of the result of Cole et al. The second, suitable only for a constant-size alphabet, relies on compressed text indices and comes in two variants: the first variant uses O(n log^ε n) bits of space and answers queries in time O(m + occ), while the second uses O(n log log n) bits of space and answers queries in time O((m + occ) log log n). This second result improves on the best previous results for constant alphabets, due to Lam et al. and Chan et al.
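To make the query semantics concrete, the following is a minimal brute-force baseline, not the index described in the abstract. It scans every substring of length m − 1, m, or m + 1 and checks it against q, so it runs in roughly O(nm) time per query instead of O(m + occ). The function names `within_one_edit` and `naive_query`, and the choice to report occurrences as half-open (start, end) intervals, are assumptions made for illustration.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True if the edit distance between a and b is at most 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure len(a) <= len(b)
    # Skip the longest common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Equal lengths: allow at most one substitution at position i.
        return a[i + 1:] == b[i + 1:]
    # Lengths differ by one: b has one extra character at position i.
    return a[i:] == b[i + 1:]


def naive_query(text: str, q: str) -> list[tuple[int, int]]:
    """Report every (start, end) such that text[start:end] is within one edit of q."""
    m = len(q)
    hits = []
    for length in (m - 1, m, m + 1):
        if length < 1:
            continue  # assumption: empty substrings are not reported
        for start in range(len(text) - length + 1):
            if within_one_edit(text[start:start + length], q):
                hits.append((start, start + length))
    return sorted(hits)


if __name__ == "__main__":
    for start, end in naive_query("banana", "nan"):
        print(start, end, "banana"[start:end])
```

A baseline of this kind is mainly useful as a correctness oracle when testing a real index on small inputs.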


Notes

In this paper log x stands for ⌈log₂(max(x, 2))⌉.
A suffix link connects a suffix tree node associated with a factor cp (with c being a character) to the suffix tree node associated with the factor p.
PA can be represented using the compressed suffix array representation of [15], since PA is actually the suffix array of the reverse of the text.
The result of Belazzougui et al. [4] actually states a space usage of O(n b^(1/c) log b) but assumes a constant alphabet size. However, it is easy to see that the same data structure also works for arbitrary σ, in which case it uses O(n b^(1/c) (log b + log log σ)) bits of space.
Note that insertion before position 1 is equivalent to insertion after position 0 in which case q[1,i] will be the empty string.
Note that the query time bound uses the essential fact that the function H can be computed on any prefix p′ of p in constant time.
This is evident for substitutions. It is also true for deletion, since only the last character in a run of equal characters is deleted. For insertions, the only problematic case is when inserting the same character in a run of equal characters of length at least 1, and this case is avoided since we only insert such a character at the end of the run.
The term 4n + o(n) comes from the use of a succinct index for range minimum queries (RMQ). The term can be reduced to 2n + o(n) if the recent optimal solution of Fischer and Heun [16] is used. The term σ comes from the use of a bitvector of σ bits that needs to be writable. The bitvector is used to avoid reporting a single color more than once.
Note that a query on all but the first non-empty substitution store takes constant time per reported character, since it involves querying the prefix-sum and the 1D colored range reporting data structures which answer in, respectively, constant time and constant time per element.
Each character could occupy up to log n bits, which means that we need at least Ω(1) time to read each character in our RAM model with w = Θ(log n).
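The note above on substitutions, deletions and insertions in runs of equal characters can be read as a rule for generating each one-edit variant of q exactly once. The sketch below follows that reading: it deletes only the last character of each run, and it inserts a character c only where the next character is not c, which places it at the end of a run of c's. The function name `one_edit_candidates` and the explicit `alphabet` parameter are assumptions for illustration, and the code enumerates candidates naively rather than through the paper's data structures.

```python
def one_edit_candidates(q: str, alphabet: str) -> list[str]:
    """List every string at edit distance exactly 1 from q, each exactly once.

    Assumes `alphabet` contains distinct characters and covers q.
    """
    m = len(q)
    out = []

    # Substitutions: distinct (position, character) pairs give distinct strings.
    for i in range(m):
        for c in alphabet:
            if c != q[i]:
                out.append(q[:i] + c + q[i + 1:])

    # Deletions: deleting any character of a run yields the same string,
    # so delete only the last character of each run.
    for i in range(m):
        if i == m - 1 or q[i + 1] != q[i]:
            out.append(q[:i] + q[i + 1:])

    # Insertions: place c between q[:i] and q[i:]. Skip the case where the
    # next character equals c, so that a repeated character is only ever
    # added at the end of its run.
    for i in range(m + 1):
        for c in alphabet:
            if i < m and q[i] == c:
                continue
            out.append(q[:i] + c + q[i:])

    return out


if __name__ == "__main__":
    candidates = one_edit_candidates("aab", "ab")
    print(len(candidates), len(set(candidates)))
```

For a query of length m over an alphabet of size σ this yields m(σ − 1) substitutions, one deletion per run, and m(σ − 1) + σ insertions, with no string repeated.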

References

[1] Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)
[2] Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 154–167 (2009)
[3] Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proceedings of the 19th Annual European Symposium on Algorithms (ESA), pp. 748–759 (2011)
[4] Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of the 18th Annual European Symposium on Algorithms (ESA), pp. 427–438 (2010)
[5] Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. In: Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 293–305 (2012)
[6] Brodal, G.S., Gąsieniec, L.: Approximate dictionary queries. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 65–74 (1996)
[7] Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.: Range searching over tree cross products. In: Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 120–131 (2000)
[8] Chan, H.-L., Lam, T.W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: Compressed indexes for approximate string matching. Algorithmica 58(2), 263–281 (2010)
[9] Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: A linear size index for approximate pattern matching. J. Discrete Algorithms 9(4), 358–364 (2011)
[10] Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage (extended abstract). In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 383–391 (1996)
[11] Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)
[12] Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)
[13] Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)
[14] Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, Mass., n.d. (1971)
[15] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
[16] Fischer, J.: Optimal succinctness for range minimum queries. In: Proceedings of the 9th Latin American Theoretical Informatics Symposium (LATIN), pp. 158–169 (2010)
[17] Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)
[18] Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
[19] Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
[20] Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)
[21] Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), pp. 549–554 (1989)
[22] Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
[23] Knuth, D.E.: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, Reading (1973)
[24] Lam, T.W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. In: Proceedings of the 16th International Symposium on Algorithms and Computation (ISAAC), pp. 339–348 (2005)
[25] Lam, T.W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008)
[26] Maaß, M.G., Nowak, J.: Text indexing with errors. In: Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 21–32 (2005)
[27] Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
[28] McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
[29] Munro, J.I.: Tables. In: Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pp. 37–42 (1996)
[30] Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 657–666 (2002)
[31] Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4) (2007)
[32] Rao, S.S.: Time-space trade-offs for compressed suffix arrays. Inf. Process. Lett. 82(6), 307–311 (2002)
[33] Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), 12–22 (2007)
[34] Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory (FOCS), pp. 1–11 (1973)

Acknowledgements

The author wishes to thank the anonymous reviewers for their helpful comments and corrections and Travis and Meg Gagie for their many helpful corrections and suggestions.

Author information

Authors and Affiliations

Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki, Finland
Djamal Belazzougui

Corresponding author

Correspondence to Djamal Belazzougui.

Additional information

Most of this work was done when the author was a student at LIAFA, University Paris Diderot, Paris 7. The work was partially supported by the French ANR project MAPPI (project number ANR-2010-COSI-004).

About this article

Cite this article

Belazzougui, D. Improved Space-Time Tradeoffs for Approximate Full-Text Indexing with One Edit Error. Algorithmica 72, 791–817 (2015). https://doi.org/10.1007/s00453-014-9873-9

Received: 13 March 2011
Accepted: 20 January 2014
Published: 29 January 2014
Issue Date: July 2015
DOI: https://doi.org/10.1007/s00453-014-9873-9
