Super-Linear Indices for Approximate Dictionary Searching

  • Leonid Boytsov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7404)

Abstract

We present an experimental analysis of approximate search algorithms that index deletion neighborhoods. These methods require huge indices whose sizes grow exponentially with respect to the maximum allowable number of errors k. Despite their extraordinary space requirements, the super-linear indices are of great interest because they provide some of the shortest retrieval times.
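To make the notion of a deletion neighborhood concrete, the following minimal Python sketch enumerates every string obtainable from a word by deleting at most k characters and reports how the number of distinct strings grows with k. The function names `deletion_neighborhood` and `neighborhood_sizes` are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations


def deletion_neighborhood(word, k):
    """Return all distinct strings obtained by deleting at most k characters."""
    residuals = set()
    for d in range(min(k, len(word)) + 1):
        # Choose which d positions to delete.
        for positions in combinations(range(len(word)), d):
            skip = set(positions)
            residuals.add("".join(c for i, c in enumerate(word) if i not in skip))
    return residuals


def neighborhood_sizes(word, max_k):
    """Size of the deletion neighborhood for k = 0, 1, ..., max_k."""
    return [len(deletion_neighborhood(word, k)) for k in range(max_k + 1)]


if __name__ == "__main__":
    print(sorted(deletion_neighborhood("abc", 1)))
    print(neighborhood_sizes("abcdef", 3))
```

An index over deletion neighborhoods stores such a set for every dictionary word, which is why its size rises so quickly as k increases.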

A straightforward implementation that creates a hash index directly over residual strings (obtained by deletions from dictionary words) is not space-efficient. Rather than memorizing complete residual strings, we record only the deleted characters and their respective positions. These data are indexed using a perfect hash function computed for the set of residual dictionary strings [2].
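The idea of storing only deleted characters and their positions can be sketched for k = 1 as follows. This is an illustration under stated assumptions, not the implementation evaluated in the paper: an ordinary Python dict (`slot_of`) stands in for the perfect hash function, and all names (`build_index`, `search`, `levenshtein`) are hypothetical.

```python
def residuals_with_deletions(word):
    """Yield (residual, deleted_char, position) for each single-character deletion."""
    for i, c in enumerate(word):
        yield word[:i] + word[i + 1:], c, i


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(dictionary):
    # Assumption: slot_of stands in for a perfect hash function. A real one
    # would map residuals to slots without keeping the strings themselves.
    slot_of = {}
    table = []  # table[slot] holds (deleted_char, position) pairs only

    def slot(s):
        if s not in slot_of:
            slot_of[s] = len(table)
            table.append([])
        return slot_of[s]

    for w in dictionary:
        table[slot(w)].append(("", 0))  # the word itself, nothing deleted
        for r, c, i in residuals_with_deletions(w):
            table[slot(r)].append((c, i))
    return slot_of, table


def search(query, slot_of, table):
    """Return dictionary words within edit distance 1 of the query."""
    probes = {query} | {r for r, _, _ in residuals_with_deletions(query)}
    candidates = set()
    for p in probes:
        s = slot_of.get(p)
        if s is None:
            continue
        for c, i in table[s]:
            # Re-insert the deleted character to recover the dictionary word.
            candidates.add(p[:i] + c + p[i:])
    # Words sharing a residual may still be two edits apart, so verify.
    return sorted(w for w in candidates if levenshtein(query, w) <= 1)


if __name__ == "__main__":
    slot_of, table = build_index(["cat", "cart", "cot", "dog", "act"])
    print(search("cat", slot_of, table))
```

Each dictionary word is recovered at query time by re-inserting the recorded character into the probed residual, so the table itself never needs to hold residual strings. The final verification step matters for two reasons: two words that share a residual can be two edits apart, and a true perfect hash function generally returns an arbitrary slot for strings outside the set it was built for.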

We evaluate this approach experimentally against several well-known benchmarks (including FastSS, which stores residual strings directly [3]). The experiments show that our implementation performs comparably to, or better than, the fastest benchmarks. At the same time, it requires 4 to 8 times less space than FastSS.

Keywords

wildcard neighborhood generation; reduced alphabet neighborhood generation; Mor-Fraenkel method; perfect hashing; FastSS

References

  1. Behm, A., Vernica, R., Alsubaiee, S., Ji, S., Lu, J., Jin, L., Lu, Y., Li, C.: UCI Flamingo Package 4.0 (2010)
  2. Belazzougui, D.: Faster and Space-Optimal Edit Distance “1” Dictionary. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009. LNCS, vol. 5577, pp. 154–167. Springer, Heidelberg (2009)
  3. Bocek, T., Hunt, E., Stiller, B.: Fast similarity search in large dictionaries. Technical report No. ifi-2007.02, Department of Informatics (IFI), University of Zurich (2007)
  4. Botelho, F.C.: Near-Optimal Space Perfect Hashing Algorithms. PhD thesis, Graduate Program in Computer Science, Federal University of Minas Gerais, Brazil (2008)
  5. Boytsov, L.: Indexing methods for approximate dictionary searching: Comparative analysis. J. Exp. Algorithmics 16, 1.1:1.1–1.1:1.91 (2011)
  6. Brisaboa, N.R., Ladra, S., Navarro, G.: Directly Addressable Variable-Length Codes. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 122–130. Springer, Heidelberg (2009)
  7. Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC 2004: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 91–100. ACM (2004)
  8. Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
  9. Doster, W.: Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput. 26, 1090–1101 (1977)
  10. Gorin, R.E.: SPELL: Spelling check and correction program. Online documentation: Describes operation of PDP-10 SPELL program (1971), http://pdp-10.trailing-edge.com/decuslib10-03/01/43,50270/spell.doc.html (accessed May 28, 2012)
  11. Hall, P., Dowling, G.: Approximate string matching. ACM Computing Surveys 12(4), 381–402 (1980)
  12. Karch, D., Luxen, D., Sanders, P.: Improved fast similarity search in dictionaries. CoRR abs/1008.1191 (2010)
  13. Knuth, D.: The Art of Computer Programming. Sorting and Searching, 1st edn., vol. 3. Addison-Wesley (1973)
  14. Kukich, K.: Technique for automatically correcting words in text. ACM Computing Surveys 24(2), 377–439 (1992)
  15. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)
  16. Mihov, S., Schulz, K.U.: Fast approximate string search in large dictionaries. Computational Linguistics 30(4), 451–477 (2004)
  17. Mor, M., Fraenkel, A.S.: A hash code method for detecting and correcting spelling errors. Communications of the ACM 25(12), 935–938 (1982)
  18. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
  19. Owolabi, O.: Dictionary organizations for efficient similarity retrieval. Journal of Systems and Software 34(2), 127–132 (1996)
  20. Sankoff, D.: The early introduction of dynamic programming into computational biology. Bioinformatics 16(1), 41–47 (2000)
  21. Ukkonen, E.: Approximate String Matching Over Suffix Trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) CPM 1993. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Leonid Boytsov, North Bethesda, USA