Smaller Self-indexes for Natural Language

  • Nieves R. Brisaboa
  • Gonzalo Navarro
  • Alberto Ordóñez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7608)

Abstract

Self-indexes for natural-language texts, which treat the text as a sequence of tokens (words or separators), achieve very attractive space and search times. However, they pay a space penalty because of their large vocabulary. In this paper we show that replacing the Huffman encoding they implicitly use with the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, improves both their space and time.
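
The trade-off between the two encodings can be illustrated with a minimal sketch. It is not the authors' index or their implementation: the vocabulary and frequencies below are invented, and instead of the Hu-Tucker algorithm itself it uses a simple O(n^3) dynamic program, which should reach the same optimal cost among order-preserving (alphabetic) binary codes. The function names `huffman_lengths` and `alphabetic_codes` are hypothetical.

```python
import heapq
from functools import lru_cache


def huffman_lengths(freqs):
    """Return the Huffman code length of each symbol (freqs: list of weights)."""
    # Heap entries: (weight, unique tie-breaker, symbols under this subtree).
    heap = [(w, i, [i]) for i, w in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tie = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # every symbol below the new node gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def alphabetic_codes(freqs):
    """Optimal order-preserving binary codes via interval DP.

    Assumption: this DP stands in for Hu-Tucker; it is slower but simpler.
    """
    n = len(freqs)
    prefix = [0]
    for w in freqs:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def best(i, j):
        # (cost, split point) of the best tree over the contiguous leaves i..j.
        if i == j:
            return (0, -1)
        weight = prefix[j + 1] - prefix[i]
        cost, split = min(
            (best(i, k)[0] + best(k + 1, j)[0], k) for k in range(i, j)
        )
        return (cost + weight, split)

    codes = [""] * n

    def assign(i, j, code):
        if i == j:
            codes[i] = code or "0"  # a single-symbol vocabulary still needs a bit
            return
        k = best(i, j)[1]
        assign(i, k, code + "0")      # left subtree: lexically smaller words
        assign(k + 1, j, code + "1")  # right subtree: lexically larger words

    assign(0, n - 1, "")
    return codes


# Hypothetical vocabulary in lexical order, with made-up frequencies.
words = ["a", "and", "of", "the", "zebra"]
freqs = [5, 20, 15, 40, 1]

huff_cost = sum(f * l for f, l in zip(freqs, huffman_lengths(freqs)))
codes = alphabetic_codes(freqs)
alpha_cost = sum(f * len(c) for f, c in zip(freqs, codes))

print("Huffman total bits:   ", huff_cost)
print("Alphabetic total bits:", alpha_cost)
for word, code in zip(words, codes):
    print(word, code)
```

In this sketch the alphabetic code can never use fewer total bits than the Huffman code, which is the sense in which it is "slightly weaker". In exchange, the codewords sort in the same order as the words, so comparing codes is equivalent to comparing words.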

Keywords

Text Collection · Natural Language Text · Wavelet Tree · Large Alphabet · Source Code Repository
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Gonzalo Navarro (2)
  • Alberto Ordóñez (1)

  1. Database Lab., Univ. of A Coruña, Spain
  2. Dept. of Computer Science, Univ. of Chile, Chile
