
A text compression scheme that allows fast searching directly in the compressed file

  • Udi Manber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 807)

Abstract

A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified, and only the outcome of matching the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster than a search in the original file in terms of both I/O time and processing time. For typical text files, we achieve about a 30% reduction in space and a slightly smaller reduction in search time. A 30% space saving is not competitive with good text compression schemes, and thus the scheme should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.
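The abstract does not spell out the encoding, so the following is only a toy sketch of the general idea, not the paper's algorithm. It assumes a byte-pair substitution in which a few common character pairs are replaced by unused byte values, and it assumes that no character is both the first member of one pair and the second member of another, so that a pattern compresses the same way wherever it occurs. The pair table and the names `compress`, `decompress`, and `search` are hypothetical. Python's built-in `bytes.find` stands in for the black-box string matcher.

```python
# Toy illustration only: the pair table and helper names are hypothetical.
# Assumption: inputs are 7-bit ASCII, so byte values >= 128 are free for pair codes.
# Assumption: no character is the first member of one pair and the second
# member of another, so pair occurrences in a text never overlap.
PAIRS = {b"th": 128, b"er": 129, b"in": 130, b"an": 131, b"e ": 132}
DECODE = {code: pair for pair, code in PAIRS.items()}
FIRST = {pair[0] for pair in PAIRS}    # byte values that can start a pair
SECOND = {pair[1] for pair in PAIRS}   # byte values that can end a pair
assert FIRST.isdisjoint(SECOND)


def compress(text: bytes) -> bytes:
    """Replace every listed pair with its one-byte code, scanning left to right."""
    out = bytearray()
    i = 0
    while i < len(text):
        code = PAIRS.get(text[i:i + 2])
        if code is None:
            out.append(text[i])   # literal character
            i += 1
        else:
            out.append(code)      # pair replaced by a single byte
            i += 2
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Expand pair codes back into the two characters they stand for."""
    out = bytearray()
    for b in data:
        out += DECODE.get(b, bytes([b]))
    return bytes(out)


def search(compressed: bytes, pattern: bytes) -> list:
    """Return offsets in `compressed` where a verified match of `pattern` begins
    (offset of the pattern's compressed core)."""
    # A boundary character may have been merged with a neighbour outside the
    # pattern, so it is trimmed before compressing and checked afterwards.
    lead = 1 if pattern and pattern[0] in SECOND else 0
    trail = 1 if pattern and pattern[-1] in FIRST else 0
    core = compress(pattern[lead:len(pattern) - trail])
    if not core:
        raise ValueError("pattern too short for this sketch")
    hits = []
    pos = compressed.find(core)   # unmodified, black-box matcher
    while pos != -1:
        # Decompress only a small window around the candidate match.
        lo = max(0, pos - 1)
        window = decompress(compressed[lo:pos + len(core) + 1])
        start = len(decompress(compressed[lo:pos])) - lead
        if start >= 0 and window[start:start + len(pattern)] == pattern:
            hits.append(pos)
        pos = compressed.find(core, pos + 1)
    return hits
```

In this sketch the matcher never sees uncompressed text: `search(compress(text), b"the")` scans the shorter compressed bytes and decompresses only a few bytes around each candidate to confirm it.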

Keywords

Compression Algorithm; Information Retrieval System; String Match; Inverted Index; Fast Search

Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  • Udi Manber, Department of Computer Science, University of Arizona, Tucson
