Approximate String Matching with Lempel-Ziv Compressed Indexes

  • Luís M. S. Russo
  • Gonzalo Navarro
  • Arlindo L. Oliveira
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4726)

Abstract

A compressed full-text self-index for a text T is a data structure that requires reduced space and supports searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has been little progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in, for example, computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights into this type of index, which may be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.
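To make the ASM problem concrete, the following is a minimal sketch of the classical online (non-indexed) dynamic-programming search: it reports every position in T where some substring ending there is within edit distance k of P. This is only a baseline illustration of the problem, not the Lempel-Ziv index algorithm of the paper; the function name `approx_match_ends` and its parameters are hypothetical.

```python
def approx_match_ends(pattern: str, text: str, k: int) -> list[int]:
    """Return the 0-based end positions in `text` of substrings whose
    edit distance to `pattern` is at most `k` (online baseline, O(m*n))."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        # col[0] stays 0 because a match may start at any text position.
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]  # value from the previous text position
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_match_ends("abc", "xxabcxx", 0))  # [4]
print(approx_match_ends("abc", "xxabcxx", 1))  # [3, 4, 5]
```

With k = 1 the occurrences ending at positions 3, 4 and 5 correspond to the substrings "ab", "abc" and "abcx". An index-based method aims to avoid scanning all of T in this way.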

Keywords

Classical Index; Approximate Match; Reverse Trie; Approximate String Match; Text Context
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Luís M. S. Russo (1)
  • Gonzalo Navarro (2)
  • Arlindo L. Oliveira (1)
  1. INESC-ID, R. Alves Redol 9, 1000 Lisboa, Portugal
  2. Dept. of Computer Science, University of Chile
