Enhanced Byte Codes with Restricted Prefix Properties

  • J. Shane Culpepper
  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3772)

Abstract

Byte codes have a number of properties that make them attractive for practical compression systems: they are relatively easy to construct; they decode quickly; and they can be searched using standard byte-aligned string matching techniques. In this paper we describe a new type of byte code in which the first byte of each codeword completely specifies the number of bytes that comprise the suffix of the codeword. Our mechanism gives more flexible coding than previous constrained byte codes, and hence better compression. The structure of the code also suggests a heuristic approximation that allows savings to be made in the prelude that describes the code. We present experimental results that compare our new method with previous approaches to byte coding, in terms of both compression effectiveness and decoding throughput speeds.
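The central property described above, that the first byte alone determines how many suffix bytes follow, can be illustrated with a minimal sketch. It rests on an assumption the abstract does not spell out: the 256 possible first-byte values are split into consecutive ranges, and the tuple `v` gives the number of first-byte values reserved for codewords with 0, 1, 2, ... suffix bytes. The names `RADIX`, `encode`, `decode`, and `v` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from typing import List, Sequence

RADIX = 256  # number of distinct byte values


def encode(rank: int, v: Sequence[int]) -> bytes:
    """Encode a 0-based symbol rank (0 = most frequent symbol).

    Assumption: v[k] is the number of first-byte values reserved for
    codewords that carry k suffix bytes, with sum(v) <= RADIX.
    """
    if sum(v) > RADIX:
        raise ValueError("first-byte ranges exceed 256 values")
    first_base = 0  # first-byte value at which the current range starts
    for suffix_len, count in enumerate(v):
        capacity = count * RADIX ** suffix_len  # codewords in this range
        if rank < capacity:
            first, rest = divmod(rank, RADIX ** suffix_len)
            return bytes([first_base + first]) + rest.to_bytes(suffix_len, "big")
        rank -= capacity
        first_base += count
    raise ValueError("rank exceeds the capacity of the code")


def decode(data: bytes, v: Sequence[int]) -> List[int]:
    """Decode a byte string produced by encode() back into ranks."""
    # Per-first-byte lookup tables: suffix length and rank of the
    # first codeword that starts with that byte.
    suffix_of: List[int] = []
    base_of: List[int] = []
    offset = 0
    for suffix_len, count in enumerate(v):
        for i in range(count):
            suffix_of.append(suffix_len)
            base_of.append(offset + i * RADIX ** suffix_len)
        offset += count * RADIX ** suffix_len

    ranks: List[int] = []
    pos = 0
    while pos < len(data):
        first = data[pos]
        n = suffix_of[first]  # the first byte alone fixes the codeword length
        rest = int.from_bytes(data[pos + 1:pos + 1 + n], "big")
        ranks.append(base_of[first] + rest)
        pos += 1 + n
    return ranks


if __name__ == "__main__":
    v = (250, 5, 1)  # illustrative split only, not a tuned parameter choice
    message = [0, 249, 250, 1529, 1530, 67065]
    packed = b"".join(encode(r, v) for r in message)
    assert decode(packed, v) == message
```

Because the decoder needs only a table lookup on the first byte to find the codeword boundary, it never has to inspect continuation flags in later bytes. Under this layout, the choice of `v` trades the number of short codewords against the total number of symbols the code can represent.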

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • J. Shane Culpepper (1)
  • Alistair Moffat (1)
  1. NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Australia
