Compressed String Dictionary Search with Edit Distance One

Algorithmica

Abstract

In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern \(P\), the index has to report all the strings in the dictionary having edit distance at most one with \(P\). Our first solution is able to solve queries in (almost optimal) \(O(|P|+occ)\) time where \(occ\) is the number of strings in the dictionary having edit distance at most one with \(P\). The space complexity of this solution is bounded in terms of the \(k\)th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of increasing the query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) which achieve simultaneously the time complexity of the first solution and the space complexity of the second one.
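
The query described above can be pinned down with a brute-force baseline. The following is a minimal sketch of the query semantics only, not of any index from the paper: it scans the whole dictionary, which is exactly what the proposed solutions avoid. The function names `within_one_edit` and `report_matches` are hypothetical and chosen here for illustration.

```python
def within_one_edit(a, b):
    """Return True if a and b are at edit distance at most one
    (one substitution, insertion, or deletion, or no edit at all)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equally long) string
    # Skip the longest common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Same length: at most one substitution at position i.
        return a[i + 1:] == b[i + 1:]
    # b is one symbol longer: b[i] is the extra symbol.
    return a[i:] == b[i + 1:]


def report_matches(dictionary, pattern):
    """Report all dictionary strings within edit distance one of pattern.
    Runs in time linear in the total dictionary length (brute force)."""
    return sorted(s for s in dictionary if within_one_edit(s, pattern))


words = ["cat", "cart", "dog", "cot", "at", "cast"]
print(report_matches(words, "cat"))
```

In this baseline every query touches every dictionary string, whereas the abstract's first solution answers a query in \(O(|P|+occ)\) time.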

Notes

  1. However, they can be easily extended to deal with the more general edit distance.

  2. Actually, the paper [11] described a solution only for a binary alphabet. However, it is not hard to obtain the claimed space and time complexities also for non-constant alphabet sizes.

  3. Notice that just accessing each symbol of these candidate strings would cost \(O(p + p \cdot occ)\) time in total, which is much higher than our claimed complexity.

  4. Recall that we are still assuming that we can check in \(O(1)\) whether a candidate string belongs to \( D \).

  5. Observe that similar considerations hold also for substitutions with the difference that we skip the \(i\)th symbol in factorizations of the form \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\).

  6. Checks for other types of errors are done in a similar way.

  7. We notice that the number of distinct lengths and, thus, of compressed permuterm indexes is \(O(\sqrt{n})\).

  8. Notice that this case occurs only when \(P_iP[i]\in D \). In order to properly deal with this case, the value of \(\ell \) is not increased after a successful backward step if it has already reached the maximal value \(p+2\).

  9. If we have false positives, then the same character may be checked and thus potentially reported twice. To avoid this case, we can use a dynamic hash table at query time which stores all the characters reported so far. Whenever we find that a character has already been reported, the query stops and does not report more characters, since a correct query answer cannot return the same character twice at the same position.
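
Notes 3 and 5 concern candidate strings obtained from factorizations of the form \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\), and the cost of reading each candidate symbol by symbol. The following is a minimal sketch of one way to avoid that cost, assuming a Karp-Rabin style polynomial fingerprint [25]: after \(O(p)\) preprocessing of \(P\), the fingerprint of each substitution or deletion candidate is assembled in constant time from the fingerprints of the prefix and the suffix. The modulus, the base, and all function names are illustrative assumptions and are not taken from the paper. Positions are 0-based in the code, while the notes use 1-based positions.

```python
MOD = (1 << 61) - 1  # a Mersenne prime (illustrative choice)
BASE = 911382323     # illustrative base, smaller than MOD


def prefix_hashes(s):
    """h[i] is the fingerprint of s[:i] (most significant symbol first)."""
    h = [0] * (len(s) + 1)
    for i, ch in enumerate(s):
        h[i + 1] = (h[i] * BASE + ord(ch)) % MOD
    return h


def powers(n):
    """pw[k] = BASE**k mod MOD for k = 0..n."""
    pw = [1] * (n + 1)
    for k in range(n):
        pw[k + 1] = pw[k] * BASE % MOD
    return pw


def full_hash(s):
    """Fingerprint of a whole string, computed symbol by symbol."""
    return prefix_hashes(s)[-1]


def substring_hash(h, pw, i, j):
    """Fingerprint of s[i:j], in O(1) from the prefix fingerprints."""
    return (h[j] - h[i] * pw[j - i]) % MOD


def concat_hash(left, right, right_len, pw):
    """Fingerprint of the concatenation of two strings."""
    return (left * pw[right_len] + right) % MOD


def substitution_hash(h, pw, p, i, c):
    """Fingerprint of P[:i] + c + P[i+1:], skipping the symbol P[i]."""
    left = (h[i] * BASE + ord(c)) % MOD
    return concat_hash(left, substring_hash(h, pw, i + 1, p), p - i - 1, pw)


def deletion_hash(h, pw, p, i):
    """Fingerprint of P[:i] + P[i+1:], that is, P with P[i] deleted."""
    return concat_hash(h[i], substring_hash(h, pw, i + 1, p), p - i - 1, pw)
```

With `h = prefix_hashes(P)` and `pw = powers(len(P))`, the call `substitution_hash(h, pw, len(P), i, c)` equals `full_hash(P[:i] + c + P[i+1:])` without touching the symbols of the candidate. Fingerprints of distinct strings can collide, so a membership test based on them alone may yield false positives.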

References

  1. Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)

  2. Barbay, J., He, M., Munro, J.I., Satti, S.R.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)

  3. Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)

  4. Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 154–167 (2009)

  5. Belazzougui, D.: Improved space-time tradeoffs for approximate full-text indexing with one edit error. Algorithmica (2014). doi:10.1007/s00453-014-9873-9

  6. Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA), pp. 181–192 (2012)

  7. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), 23 (2014)

  8. Belazzougui, D., Venturini, R.: Compressed string dictionary look-up with edit distance one. In: Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 280–292 (2012)

  9. Belazzougui, D., Venturini, R.: Compressed static functions with applications. In: Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 229–240 (2013)

  10. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  11. Brodal, G.S., Gąsieniec, L.: Approximate dictionary queries. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, pp. 65–74. Springer (1996)

  12. Brodal, G.S., Srinivasan, V.: Improved bounds for dictionary look-up with one error. Inf. Process. Lett. 75(1–2), 57–59 (2000)

  13. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  14. Chazelle, B., Kilian, J., Rubinfeld, R., Tal, A.: The Bloomier filter: an efficient data structure for static support lookup tables. In: Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 30–39 (2004)

  15. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)

  16. Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial hash functions are reliable (extended abstract). In: Proceedings of the 19th International Colloquium on Automata, Languages and Programming (ICALP), pp. 235–246 (1992)

  17. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21, 246–260 (1974)

  18. Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC (1971)

  19. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 12 (2008)

  20. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  21. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)

  22. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10 (2010)

  23. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  24. Hagerup, T., Tholey, T.: Efficient minimal perfect hashing in nearly minimal space. In: Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pp. 317–326 (2001)

  25. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)

  26. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

  27. Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31(3), 762–776 (2001)

  28. Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Comput. Surv. 39(1), 2 (2007)

  29. Ohlebusch, E., Gog, S., Kügel, A.: Computing matching statistics and maximal exact matches on compressed full-text indexes. In: SPIRE, pp. 347–358 (2010)

  30. Pagh, A., Pagh, R., Rao, S.S.: An optimal Bloom filter replacement. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 823–829 (2005)

  31. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)

  32. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)

  33. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Burlington (1999)

  34. Yao, A.C.-C., Yao, F.F.: Dictionary look-up with one error. J. Algorithms 25(1), 194–202 (1997)

Author information

Corresponding author

Correspondence to Rossano Venturini.

Additional information

This work is an extended version of the paper [8], which appeared in the Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, 2012. This work has been partially supported by the Academy of Finland under Grant 250345 (CoECGR), the French ANR-2010-COSI-004 MAPPI project, PRIN ARS Technomedia 2012, the Midas EU Project, Grant Agreement No. 318786, and the eCloud EU Project, Grant Agreement No. 325091.

About this article

Cite this article

Belazzougui, D., Venturini, R. Compressed String Dictionary Search with Edit Distance One. Algorithmica 74, 1099–1122 (2016). https://doi.org/10.1007/s00453-015-9990-0

  • DOI: https://doi.org/10.1007/s00453-015-9990-0
