Acta Informatica, Volume 20, Issue 4, pp 371–389

Is text compression by prefixes and suffixes practical?

  • A. S. Fraenkel
  • M. Mor
  • Y. Perl
Article

Summary

One approach to text compression is to replace high-frequency, variable-length fragments of words with fixed-length codes that point into a compression table containing these fragments. It is shown that the problem of optimal fragment compression is NP-hard even if the fragments are restricted to prefixes and suffixes. This seems to be among the simplest fragment compression problems that are NP-hard, since a polynomial-time algorithm for compressing by prefixes only (or suffixes only) has recently been found. Various compression heuristics that use both prefixes and suffixes have been tested on large Hebrew and English texts. The best of these heuristics produce a net compression of some 37% for Hebrew and 45% for English with a prefix/suffix compression table of size 256.
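To make the idea of a prefix/suffix compression table concrete, here is a minimal Python sketch. It is not one of the heuristics tested in the paper, whose details are not given in this summary. It assumes a naive greedy rule: score each prefix and suffix by the characters it would save if coded in isolation, then keep the top-scoring fragments. The names `build_table`, `encode_word`, and `encoded_length`, the `min_len` parameter, and the "one unit per code" cost model are all hypothetical choices made for illustration.

```python
from collections import Counter


def build_table(words, table_size=256, min_len=2):
    """Pick the fragments with the largest naive savings estimate.

    Assumption: each coded fragment costs one unit, so replacing an
    n-character fragment saves n - 1 units. Interactions between
    fragments are ignored.
    """
    savings = Counter()
    for w in words:
        for n in range(min_len, len(w) + 1):
            savings[("P", w[:n])] += n - 1   # prefix of length n
            savings[("S", w[-n:])] += n - 1  # suffix of length n
    return {frag for frag, _ in savings.most_common(table_size)}


def encode_word(word, table, min_len=2):
    """Split a word into (prefix, middle, suffix).

    The prefix and suffix are the longest fragments found in the table;
    either is '' when no table entry applies.
    """
    prefix = ""
    for n in range(len(word), min_len - 1, -1):
        if ("P", word[:n]) in table:
            prefix = word[:n]
            break
    rest = word[len(prefix):]
    suffix = ""
    for n in range(len(rest), min_len - 1, -1):
        if ("S", rest[-n:]) in table:
            suffix = rest[-n:]
            break
    middle = rest[:len(rest) - len(suffix)]
    return prefix, middle, suffix


def encoded_length(word, table):
    """Cost of a word: one unit per code plus one per uncoded character."""
    prefix, middle, suffix = encode_word(word, table)
    return len(middle) + (1 if prefix else 0) + (1 if suffix else 0)


if __name__ == "__main__":
    table = {("P", "un"), ("S", "ing")}
    print(encode_word("undoing", table), encoded_length("undoing", table))
```

Scoring fragments independently is exactly what makes this a heuristic: choosing one fragment changes how much another can save, and the summary states that finding the optimal table for prefixes and suffixes together is NP-hard. A real encoder would also need a way to tell codes apart from literal characters, which this sketch leaves out.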

Keywords

Information System; Operating System; Data Structure; Communication Network; Information Theory
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Choueka, Y., Fraenkel, A.S., Perl, Y.: Polynomial construction of optimal prefix tables for text compression. Proc. 19th Annual Allerton Conference on Communication, Control and Computing, pp. 762–768, Oct. 1981
  2. Cooper, D., Lynch, M.F.: Text compression using variable-to-fixed-length encodings. Tech. Report, Postgraduate School of Librarianship and Information Science, University of Sheffield, Western Bank, Sheffield S10 2TN, England
  3. Fraenkel, A.S.: All about the Responsa Retrieval Project you always wanted to know but were afraid to ask, Expanded Summary. Proc. 3rd Symp. Legal Data Process. in Europe (Oslo 1975), pp. 131–141, Council of Europe, Strasbourg (1976). Reprinted in Jurimetrics J. 16 (3), 149–156 (1976); Informatica e Diritto II, 362–370 (1976)
  4. Gotlieb, D., Hagerth, S.A., Lehot, P.G.H., Rabinowitz, H.S.: A classification of compression methods and their usefulness for a large data processing center. National Comp. Conference 44, 453–458 (1975)
  5. Hagamen, W.D., Linden, D.J., Long, H.S., Weber, J.C.: Encoding verbal information as unique numbers. IBM Syst. J. 11, 278–315 (1972)
  6. Knuth, D.E.: The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Addison-Wesley, Reading, MA, Second Printing, 1973
  7. Lichtenstein, D.: Planar satisfiability and its uses. SIAM J. on Computing 11, 329–343 (1982)
  8. Lynch, M.F.: Compression of bibliographic files using an adaptation of run-length coding. Inform. Stor. Retr. 9, 207–214 (1973)
  9. Maier, D., Storer, J.A.: A note on the complexity of the superstring problem. Extended Abstract. Proc. Conference on Information Sciences and Systems, Dept. of Elect. Engr., The Johns Hopkins University, Baltimore, MD, pp. 52–56, 1978
  10. Mayne, A., James, E.B.: Information compression by factorising common strings. Computer J. 18, 157–160 (1975)
  11. McCarthy, J.P.: Automatic file compression. Intern. Computing Symp. 1973, North-Holland, Amsterdam, pp. 511–516, 1974
  12. Peterson, J.L.: Computer programs for detecting and correcting spelling errors. CACM 23, 676–687 (1980)
  13. Radhakrishnan, T.: Selection of prefix and postfix word fragments for data compression. Inform. Process. & Management 14, 97–106 (1978)
  14. Rodeh, M., Pratt, V.R., Even, S.: Linear algorithm for data compression via string matching. JACM 28, 16–24 (1981)
  15. Rubin, F.: Experiments in text file compression. CACM 19, 617–623 (1976)
  16. Schuegraf, E.J., Heaps, H.S.: Selection of equifrequent word fragments for information retrieval. Inform. Stor. Retr. 9, 697–711 (1973)
  17. Storer, J.A.: Toward an abstract theory of data compression. Extended Abstract. Proc. Conference on Information Sciences and Systems, Dept. of Elect. Engr., The Johns Hopkins University, Baltimore, MD, pp. 391–399, 1978
  18. Storer, J.A., Szymanski, T.G.: The macro model for data compression. Extended Abstract. Proc. Tenth Annual ACM Symposium on Theory of Computing, San Diego, CA, pp. 30–39, 1978
  19. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. JACM 29, 928–951 (1982)
  20. Wagner, R.A.: Common phrases and minimum-space text storage. CACM 16, 148–152 (1973)
  21. Walker, V.R.: Compaction of names by x-grams. Proc. Amer. Soc. Inform. Sci. 6, 129–135 (1969)
  22. Yannakoudakis, E.J., Goyal, P., Huggill, J.A.: The generation and use of text fragments for data compression. Inform. Process. & Management 18, 15–21 (1982)
  23. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory IT-23, 337–343 (1977)

Copyright information

© Springer-Verlag 1983

Authors and Affiliations

  • A. S. Fraenkel (1)
  • M. Mor (1)
  • Y. Perl (2)

  1. Department of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel
  2. Department of Mathematics and Computer Science, Bar-Ilan University, Ramat Gan, Israel
