SPIRE 2007: String Processing and Information Retrieval pp 264-275 | Cite as
Approximate String Matching with Lempel-Ziv Compressed Indexes
Abstract
A compressed full-text self-index for a text T is a data structure requiring reduced space and able of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest on self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has a competitive performance and provides a useful space-time tradeoff compared to classical indexes.
Keywords
Classical Index Approximate Match Reverse Trie Approximate String Match Text ContextPreview
Unable to display preview. Download preview PDF.
References
- 1.Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)CrossRefGoogle Scholar
- 2.Chang, W.I., Marr, T.G.: Approximate string matching and local similarity. In: Crochemore, M., Gusfield, D. (eds.) CPM 1994. LNCS, vol. 807, pp. 259–273. Springer, Heidelberg (1994)Google Scholar
- 3.Fredriksson, K., Navarro, G.: Average-optimal single and multiple approximate string matching. ACM Journal of Experimental Algorithmics 9(1.4) (2004)Google Scholar
- 4.Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24(4), 19–27 (2001)Google Scholar
- 5.Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC, pp. 91–100 (2004)Google Scholar
- 6.Maaß, M., Nowak, J.: Text indexing with errors. In: CPM, pp. 21–32 (2005)Google Scholar
- 7.Chan, H.L., Lam, T.W., Sung, W.K., Tam, S.L., Wong, S.S.: A linear size index for approximate pattern matching. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 49–59. Springer, Heidelberg (2006)CrossRefGoogle Scholar
- 8.Coelho, L., Oliveira, A.: Dotted suffix trees: a structure for approximate text indexing. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 329–336. Springer, Heidelberg (2006)CrossRefGoogle Scholar
- 9.Weiner, P.: Linear pattern matching algorithms. In: IEEE 14th Annual Symposium on Switching and Automata Theory, pp. 1–11. IEEE Computer Society Press, Los Alamitos (1973)Google Scholar
- 10.Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935–948 (1993)Google Scholar
- 11.Gonnet, G.: A tutorial introduction to Computational Biochemistry using Darwin. Technical report, Informatik E.T.H., Zuerich, Switzerland (1992)Google Scholar
- 12.Ukkonen, E.: Approximate string matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)CrossRefGoogle Scholar
- 13.Cobbs, A.: Fast approximate matching using suffix trees. In: Galil, Z., Ukkonen, E. (eds.) Combinatorial Pattern Matching. LNCS, vol. 937, pp. 41–54. Springer, Heidelberg (1995)Google Scholar
- 14.Sutinen, E., Tarhio, J.: Filtration with q-samples in approximate string matching. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 50–63. Springer, Heidelberg (1996)Google Scholar
- 15.Navarro, G., Baeza-Yates, R.: A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal 1(2) (1998)Google Scholar
- 16.Myers, E.W.: A sublinear algorithm for approximate keyword searching. Algorithmica 12(4/5), 345–374 (1994)MATHCrossRefMathSciNetGoogle Scholar
- 17.Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. Journal of Discrete Algorithms 1(1), 205–239 (2000)MathSciNetGoogle Scholar
- 18.Navarro, G., Sutinen, E., Tarhio, J.: Indexing text with approximate q-grams. J. Discrete Algorithms 3(2-4), 157–175 (2005)MATHCrossRefMathSciNetGoogle Scholar
- 19.Kurtz, S.: Reducing the space requirement of suffix trees. Pract. Exper. 29(13), 1149–1171 (1999)CrossRefGoogle Scholar
- 20.Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)MATHCrossRefMathSciNetGoogle Scholar
- 21.Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)CrossRefMathSciNetGoogle Scholar
- 22.Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discrete Algorithms 2(1), 87–114 (2004)MATHCrossRefMathSciNetGoogle Scholar
- 23.Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)MATHCrossRefMathSciNetGoogle Scholar
- 24.Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) article 2 (2007)Google Scholar
- 25.Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)CrossRefMathSciNetGoogle Scholar
- 26.Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: South American Workshop on String Processing, pp. 141–155. Carleton University Press (1996)Google Scholar
- 27.Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-Index. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 318–329. Springer, Heidelberg (2006)CrossRefGoogle Scholar
- 28.Russo, L.M.S., Oliveira, A.L.: A compressed self-index using a Ziv-Lempel dictionary. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 163–180. Springer, Heidelberg (2006)CrossRefGoogle Scholar
- 29.Huynh, T., Hon, W., Lam, T., Sung, W.: Approximate string matching using compressed suffix arrays. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 434–444. Springer, Heidelberg (2004)Google Scholar
- 30.Lam, T., Sung, W., Wong, S.: Improved approximate string matching using compressed suffix data structures. In: Deng, X., Du, D.-Z. (eds.) ISAAC 2005. LNCS, vol. 3827, pp. 339–348. Springer, Heidelberg (2005)CrossRefGoogle Scholar
- 31.Morales, P.: Solución de consultas complejas sobre un indice de texto comprimido (solving complex queries over a compressed text index). Undergraduate thesis, Dept. of Computer Science, University of Chile, G. Navarro, advisor (2005)Google Scholar
- 32.Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)MATHCrossRefMathSciNetGoogle Scholar
- 33.Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46(3), 395–415 (1999)MATHCrossRefMathSciNetGoogle Scholar
- 34.Navarro, G., Baeza-Yates, R.: Very fast and simple approximate string matching. Information Processing Letters 72, 65–70 (1999)CrossRefMathSciNetGoogle Scholar