A Lempel-Ziv Text Index on Secondary Storage

  • Diego Arroyuelo
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4580)

Abstract

Full-text searching consists of locating the occurrences of a given pattern $P[1..m]$ in a text $T[1..u]$, both sequences over an alphabet of size $\sigma$. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring $8uH_k + o(u\log\sigma)$ bits of space, where $H_k$ denotes the $k$-th order empirical entropy of $T$, for any $k = o(\log_\sigma u)$. Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
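To make the leading term of the space bound concrete, the following minimal sketch computes the k-th order empirical entropy H_k of a short string and the resulting value of 8uH_k. It assumes the standard definition of H_k (the average, over all length-k contexts, of the zero-order entropy of the symbols that follow each context); the function names `h0`, `hk`, and `leading_term_bits` are hypothetical and do not come from the paper, and the o(u log σ) term is ignored.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zero-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum(c / n * log2(n / c) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumed definition: (1/u) * sum over contexts w of length k of
    |w_T| * H_0(w_T), where w_T collects the symbols following w in text.
    """
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    total = sum(len(f) * h0(f) for f in followers.values())
    return total / len(text)


def leading_term_bits(text, k):
    """Leading term 8*u*H_k of the bound (lower-order term omitted)."""
    return 8 * len(text) * hk(text, k)


if __name__ == "__main__":
    sample = "abababab"
    print(h0(sample), hk(sample, 1), leading_term_bits(sample, 0))
```

On a highly repetitive string such as `"abababab"`, the first-order entropy drops to zero because each symbol is fully determined by its predecessor, which illustrates why a bound expressed in H_k can be much smaller than one expressed in log σ.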

Keywords

Main Memory, Space Requirement, Pattern Occurrence, Secondary Memory, Disk Access
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words. NATO ISI Series, pp. 85–96. Springer, Heidelberg (1985)
  2. Kurtz, S.: Reducing the space requirements of suffix trees. Softw. Pract. Exper. 29(13), 1149–1171 (1999)
  3. Manzini, G.: An analysis of the Burrows-Wheeler transform. JACM 48(3), 407–430 (2001)
  4. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear)
  5. Ferragina, P., Manzini, G.: Indexing compressed texts. JACM 54(4), 552–581 (2005)
  6. Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM TOIS 18(2), 113–139 (2000)
  7. Ferragina, P., Grossi, R.: The String B-tree: a new data structure for string search in external memory and its applications. JACM 46(2), 236–280 (1999)
  8. Ferragina, P., Grossi, R.: Fast string searching in secondary storage: theoretical developments and experimental results. In: Proc. SODA, pp. 373–382 (1996)
  9. Clark, D., Munro, J.I.: Efficient suffix trees on secondary storage. In: Proc. SODA, pp. 383–391 (1996)
  10. Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays. In: Proc. ISAAC, pp. 681–692 (2004)
  11. Sadakane, K.: Succinct representations of lcp information and improvements in the compressed suffix arrays. In: Proc. SODA, pp. 225–232 (2002)
  12. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms 2(1), 87–114 (2004)
  13. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE TIT 24(5), 530–536 (1978)
  14. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J. Comp. 29(3), 893–911 (1999)
  15. Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-index. In: Proc. CPM, pp. 319–330 (2006)
  16. Arroyuelo, D., Navarro, G.: Space-efficient construction of LZ-index. In: Proc. ISAAC, pp. 1143–1152 (2005)
  17. Munro, I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comp. 31(3), 762–776 (2001)
  18. Munro, I.: Tables. In: Chandru, V., Vinay, V. (eds.) Foundations of Software Technology and Theoretical Computer Science. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996)
  19. Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. Technical Report TR/DCC-2004, -4, Dept. of Computer Science, Universidad de Chile (2007), ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/lzidisk.ps.gz
  20. Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. JACM 15(4), 514–534 (1968)
  21. Harman, D.: Overview of the third text REtrieval conference. In: Proc. Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-207 (1995)
  22. Baeza-Yates, R., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Systems 21(6), 497–514 (1996)
  23. Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comp. 22(5), 935–948 (1993)
  24. González, R., Navarro, G.: Compressed text indexes with fast locate. In: Proc. of CPM’07. LNCS (to appear, 2007)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Diego Arroyuelo (1)
  • Gonzalo Navarro (1)
  1. Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile