Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Text compression methods

  • 59 Accesses

  • 1 Citations

Abstract

Methods and algorithms for physical compression of text are considered. Their comparative efficiency is analyzed, allowing for the tradeoff between volume compression and the coding/decoding time and for their robustness to various changes in the statistical characteristics of the texts being processed.

This is a preview of subscription content, log in to check access.

Literature cited

  1. 1.

    R. N. Ainon, “Storing text using integer codes,” Proc. 11th Int. Conf. on Comput. Ling. COLING '86, 25–29 Aug. 1986, Bonn (1986), pp. 418–420.

  2. 2.

    P. A. Alsberg, “Space and time savings through large data base compression and dynamic restructuring,” Proc. IEEE,63, No. 8, 1114–1112 (1975).

  3. 3.

    I. J. Barton, S. E. Creasy, M. F. Lynch, and M. J. Snell, “An information-theoretic approach to text searching in direct access systems,” Commun. ACM,17, 345–350 (1974).

  4. 4.

    I. J. Barton, M. F. Lynch, J. H. Petrie, and M. J. Snell, “Variable-length character string analysis of three databases and their application for file compression,” Informatics I, Proc. 1st Informatics Conf., Durham Univ., 1973, ASLIB, London (1974), pp. 154–162.

  5. 5.

    M. A. Bassiouni, “Data compression in scientific and statistical databases,” IEEE Trans. Software Eng.,11, No. 10, 1047–1058 (1985).

  6. 6.

    M. A. Bassiouni and B. Ok, “Double encoding — a technique for reducing storage requirements of text,” Inform. Syst.,11, No. 2, 117–184 (1986).

  7. 7.

    M. A. Bassiouni and K. A. Hazboun, “A multi-group technique for data compression,” Proc. ASM SIGMOD Int. Conf. on Management of Data (1982), pp. 284–292.

  8. 8.

    M. A. Bassiouni and K. A. Hazboun, “Utilization of character reference locality for efficient storage of databases,” Proc. 2nd Int. Workshop on Statistical Database Management, Los Altos, Ca. (1983), pp. 338–334.

  9. 9.

    D. Batory, “Index coding: a compression technique for large statistical databases,” Proc. 2nd Int. Workshop on Statistical Database Management (1983), pp. 306–314.

  10. 10.

    R. W. Bemer, “Do it by the numbers — digital shorthand,” Commun. ACM,3, No. 10, 530–536 (1960).

  11. 11.

    L. S. Bobrov and S. L. Hakimi, “Graph theoretic prefix codes and their synchronizing properties,” Inform. Contr.,16, No. 1, 70–94 (1969).

  12. 12.

    A. Bookstein and G. Faulty, “A mathematical model for estimating the effectiveness of bigram coding,” Inform. Process. Manag.,12, No. 2, 111–116 (1976).

  13. 13.

    A. D. Booth, “A ‘law’ of occurrences of words of low frequency,” Inform. Contr.,10, 386–393 (1967).

  14. 14.

    E. V. Brack, D. Cooper, and M. F. Lynch, “The stability of symbol sets produced by variety generation from bibliographic data,” Program,12, No. 2, 64–77 (1978).

  15. 15.

    S. D. Bradley, “Optimizing a scheme for run-length encoding,” Proc. IEEE (letters),57, No. 1, 108–109 (1969).

  16. 16.

    J. M. Bray, V. P. Nelson, B. A. deMain, and J. D. Irwin, “Data compression techniques ease storage problems,” Comput. Design,24, No. 14, 102–106 (1985).

  17. 17.

    T. Ch. Chen and I. T. Ho, “Storage efficient representation of decimal data,” Commun. ACM,18, No. 1, 49–52 (1975).

  18. 18.

    K. Choros, P. Majewska, and A. Sieminski, “Bibliographic data base comperssion,” Int. Forum on Information and Documentation,12, No. 1, 28–32 (1987).

  19. 19.

    Y. Choueka, A. S. Fraenkel, and Y. Perl, “Polynomial construction of optimal prefix tables for text compression,” Proc. 19th Annual Allerton Conf. on Communication, Control, and Computing (1981), pp. 762–768.

  20. 20.

    A. C. Clare, E. M. Cook, and M. F. Lynch, “The identification of varible-length equifrequent character strings in natural language data base,” Computer J.,15, No. 3, 259–262 (1972).

  21. 21.

    D. Colombo and J. E. Rush, “Use of word fragments in computer based retrieval systems,” J. Chem. Document.,9, No. 1, 47–50 (1969).

  22. 22.

    J. B. Conuel, “A Huffman-Shannon-Fano code,” Proc. IEEE, 1046–1047 (1973).

  23. 23.

    D. Cooper, M. A. Emily, M. F. Lynch, and A. R. Yeates, “Compression of continuous prose texts using variety generation,” J. ASIS,31, No. 3, 201–207 (1980).

  24. 24.

    D. Cooper and M. F. Lynch, “Compression of Wiswester Line Notation using variety generation,” J. Chem. Inform. Comput. Sci.,19, No. 3, 165–169 (1979).

  25. 25.

    D. Cooper and M. F. Lynch, “Text compression using variable-to-fixed length encodings,” J. ASIS,33, No. 1, 18–31 (1982).

  26. 26.

    D. Cooper, “The storage problem,” Mech. Transl.,5, No. 2, 74–83 (1958).

  27. 27.

    G. V. Cormack, “Data compression on a database system,” Commun. ACM,28, No. 12, 1336–1342 (1985).

  28. 28.

    D. Cortesi, “An effective text compression algorithm,” BYTE,7, No. 1, 397–403 (1982).

  29. 29.

    T. M. Cover, “Enumerative source coding,” IEEE Trans. Inform. Theory,19, 73 (1973).

  30. 30.

    S. E. Creasy, M. F. Lynch, and J. H. Petrie, “Compression of biblioraphic database using a variable-to-fixed length bit string transformations,” Program,8, No. 4, 191–195 (1974).

  31. 31.

    N. Faller, “An adaptive system for data compression,” 7th Assilomar Conf. on Circuits, Systems, and Computers (1973), pp. 593–597.

  32. 32.

    R. M. Fano, Transmission of Information, MIT Press (1961).

  33. 33.

    A. S. Fraenkel and M. Mor, “Combinatorial compression and partitioning of large dictionaries,” Comput. J.,26, No. 4, 336–343 (1983).

  34. 34.

    A. S. Fraenkel, M. Mor, and Y. Perl, “Is the compression by prefixes and suffixes practical?” Acta Inform.,20, No. 4, 371–379 (1983).

  35. 35.

    R. G. Gallager, “Variations on a theme by Huffman,” IEEE Trans. Inform. Theory,24, No. 6, 668–674 (1978).

  36. 36.

    F. Gey, F. McCarthy, D. Merril, and H. Holmes, “Computer independent data compression for large statistical databases,” Proc. 2nd Int. Workshop on Statistical Database Management (1983), pp. 296–305.

  37. 37.

    E. N. Gilbert and E. F. Moore, “Variable length binary encoding,” Bell Syst. Tech. J.,38, 913–967 (1959).

  38. 38.

    S. W. Golomb, “Run-length encoding,” IEEE Trans. Inform. Theory,12, 399–401 (1966).

  39. 39.

    D. Gotlieb, S. A. Hagerth, P. G. H. Lehot, and H. S. Rabinowitz, “A classification of compression methods and their usefulness for a large data processing center,” Proc. Nat. Comput. Conf., 1975, Vol. 44, 453–458, AFIPS Press, Montville, N.J. (1975).

  40. 40.

    P. Goyal, “Coding methods for text string search on compressed databases,” Inform. Syst.,8, No. 3, 231–233 (1983).

  41. 41.

    D. W. Grooms, Data Compression, NTIS, Springfield, Va. (1977).

  42. 42.

    M. Guazzo, “A general minimum redundancy source coding algorithm,” IEEE Trans. Inform. Theory,26, No. 1, 15–25 (1980).

  43. 43.

    W. D. Hagamen et al., “Encoding verbal information as unique numbers,” IBM Syst. J.,11, No. 4, 278–315 (1972).

  44. 44.

    B. Hahn, “A new technique for compression and storage of data,” Commun. ACM,17, No. 8, 434–436 (1974).

  45. 45.

    G. Hallman, “How to squeeze data into smaller spaces,” Can. Datasystems,10, No. 9, 27–29 (1978).

  46. 46.

    H. S. Heaps, “Storage analysis of compression coding for document data bases,” INFOR J.,10, No. 1, 47–61 (1972).

  47. 47.

    H. S. Heaps, “Data compression for large document data base,” Conf. on Large Data Bases Sponsored by NAS/NRC, Committee on Chemical Information, National Academy of Sciences (1974).

  48. 48.

    H. S. Heaps, “Data compression for large document data bases,” J. Chem. Inform. Comput. Sci.,15, No. 1, 32–39 (1975).

  49. 49.

    H. S. Heaps and T. Radhakrishnan, “Compaction of diagnosis messages for compilers,” Software Pract. Exper.,7, 139 (1977).

  50. 50.

    D. Huffman, “A method for the construction of minimum redundancy codes,” Proc. IRE,40, 1098–1101 (1952).

  51. 51.

    M. Jakobsson, “Huffman coding in bit-vector compression,” Inform. Process. Lett.,7, No. 6, 304–307 (1978).

  52. 52.

    J. Jelinek and K. S. Schneider, “On variable-length-to-block coding,” IEEE Trans. Inform. Theory,18, No. 6, 765–774 (1972).

  53. 53.

    G. C. Jewell, “Text compaction for information retrieval systems,” IEEE Systems, Man, and Cybernetics Society News-letters,6, 4–7 (1976).

  54. 54.

    C. B. Johnes, “An efficient coding system for long source sequences,” IEEE Trans. Inform. Theory,27, 280–291 (1981).

  55. 55.

    J. Katajainen, M. Penttonen, and J. Teuhola, “Syntax-directed compression of program files,” Software Pract. Exper.,16, No. 3, 269–276 (1986).

  56. 56.

    G. G. Langdon and J. Rissanen, “A double adaptive file compression algorithm,” IEEE Trans. Commun.,31, 1253–1255 (1983).

  57. 57.

    C. A. Lynch and E. B. Brownrigg, “Application of data compression technique to large bibliographic databases,” Proc. 7th Int. Conf. on Very Large Data Bases (1981), pp. 435–447.

  58. 58.

    M. F. Lynch, J. H. Pétrie, and M. J. Snell, “Analysis of the microstructure of titles in the INSPEC database,” Inform. Storage and Retrieval,9, No. 6, 331–337 (1973).

  59. 59.

    M. F. Lynch, “Variety generation — a reinterpretation of Shannon's mathematical theory of communication and its implications for information science,” J. ASIS,28, No. 1, 19–25 (1977).

  60. 60.

    M. F. Lynch and S. D. Rawson, “Equifrequent character strings — a novel text characterization method,” Proc. 3rd Int. Symp. on the Use of Computers in Linguistics and Linguistic Research, Cardiff 1974, Univ. of Wales Press (1976), pp. 47–58.

  61. 61.

    P. B. Maggs, “Compression of legal texts for more economical storage,” Jurimetrics J.,14, 254–261 (1974).

  62. 62.

    B. M. Mandelbrot, “On the theory of word frequencies and on related Markovian models of discourse,” AMS Symp. on Applied Math.,72, 190–219 (1960).

  63. 63.

    B. A. Marron and P. A. D. de Maine, “Automatic data compression,” Commun. ACM,10, No. 11, 711–715 (1967).

  64. 64.

    J. Martin, Data Compaction in Computer Data Base Organization, Prentice Hall, Englewood Cliffs, N.J. (1977), pp. 572–587.

  65. 65.

    A. Mayne and E. B. James, “Information compression by factorizing common strings,” Comput. J.,17, No. 2, 157–160 (1975).

  66. 66.

    J. P. McCarthy, “Automatic file compression,” Int. Computing Symp., Davos 1973, North-Holland, Amsterdam (1974), pp. 511–516.

  67. 67.

    D. R. McIntyre and M. A. Pechura, “Data compression using static Huffman code-decode tables,” Commun. ACM,28, No. 6, 612–616 (1985).

  68. 68.

    V. S. Miller, “Data compression algorithms,” Proc. Symp. Appl. Math.,34, 107–118 (1986).

  69. 69.

    J. E. Mulford and R. K. Riddall, “Data compression technique for economic processing of large commercial files,” ACM Symp. on Inform. Storage and Retrieval, 1971, College Park, Md. (1971), pp. 207–215.

  70. 70.

    O. Nevelainen, M. Jacobson, and R. Berg, “Compression of clustered inverted files,” in: Math. Foundations of Computer Sci., Springer, Berlin (1978), pp. 393–402.

  71. 71.

    W. R. Nugent, “Compression word coding techniques for information retrieval,” J. Lib. Autom.,1, No. 4, 250–260 (1968).

  72. 72.

    M. Pechura, “File archival techiques using data compression,” Commun. ACM,25, No. 9, 605–609 (1982).

  73. 73.

    T. Radhakrishnan, “Selection of prefix and postfix word fragments for data compression,” Inform. Process. Manag.,14, No. 2, 97–106 (1978).

  74. 74.

    C. Ramamoorthy, “Document compaction by variable length encoding,” Proc. ASIS,1, 507–513 (1964).

  75. 75.

    H. K. Reghbati, “An overview of data compression techniques,” Computer,14, 71–75 (1981).

  76. 76.

    J. Rissanen, “Generalized Kraft inequality and arithmetic coding,” IBM J. Res. Dev.,20, 198–203 (1976).

  77. 77.

    J. Rissanen, “Arithmetic coding as number representations,” Acta Pol. Scand. Math.,31, 44–51 (1979).

  78. 78.

    J. Rissanen and G. G. Langdon, “Arithmetic coding,” IBM J. Res. Dev.,23, 149–162 (1979).

  79. 79.

    J. Rissanen and G. G. Langdon, “Universal modeling and coding,” IEEE Trans. Inform. Theory,27, 12–23 (1981).

  80. 80.

    M. Rodeh, V. R. Pratt, and S. Even, “Linear algorithm for data compression via matching,” J. ACM,28, No. 1, 16–24 (1981).

  81. 81.

    F. Rubin, “Experiments in text file compression,” Commun. ACM,19, No. 11, 617–623 (1976).

  82. 82.

    S. S. Ruth and P. J. Kreutzer, “Data compression for large business files,” Datamation,18, No. 8, 62–66 (1972).

  83. 83.

    J. Schalwijk, “An algorithm for source coding,” IEEE Trans. Inform. Theory,18, 395 (1972).

  84. 84.

    W. D. Schieber and G. W. Thomas, “An algorithm for compaction of alphanumeric data,” J. Lib. Autom.,4, No. 4, 198–206 (1971).

  85. 85.

    E. J. Schuegraf, “A survey of data compression methods for nonnumeric records,” Can. J. Inform. Sci.,2, No. 1, 93–105 (1976).

  86. 86.

    E. J. Schuegraf and H. S. Heaps, “Selection of equifrequent word fragments for information retrieval,” Inform. Storage and Retrieval,9, No. 12, 697–711 (1973).

  87. 87.

    E. J. Schuegraf and H. S. Heaps, “A comparison of algorithms for database compression by use of fragments as language elements,” Inform. Storage and Retrieval,10, No. 9, 309–319 (1974).

  88. 88.

    E. S. Schwanz, “Dictionary for minimum redundancy coding,” J. ACM, No. 10, 413–439 (1963).

  89. 89.

    E. S. Schwartz and A. J. Kleinbomer, “A language element for compression coding,” Inform. Contr.,10, No. 3, 315–333 (1967).

  90. 90.

    D. G. Severance, “A practitioner's guide to data base compression,” Inform. Syst.,8, No. 1, 51–62 (1983).

  91. 91.

    C. E. Shannon, “A mathematical theory of communication,” Bell. Syst. Tech. J.,27, 379–423 (1948).

  92. 92.

    C. E. Shannon, “Prediction and entropy of printed English,” Bell. Syst. Tech. J.,30, 50–64 (1951).

  93. 93.

    M. Snyderman and B. Hunt, “The myriad virtues of text compaction,” Datamation, No. 16, 36–40 (1970).

  94. 94.

    J. Storer and T. Szymanski, “Data compression via textual substitution,” J. ACM,29, No. 4, 928–951 (1982).

  95. 95.

    H. Tanaka and S. Kaneku, “Data compression approach to cryptography,” Proc. 1979 Carnahan Conf. on Crime Countermeasures, Lexington, Kentucky (1979), pp. 159–163.

  96. 96.

    T. C. Ting, “Compacting homogeneous text for minimzing storage space,” Int. J. Comput. Inform. Sci.,6, No. 3, 211–221 (1977).

  97. 97.

    R. Tropper, “Binary-coded text, a text compression method,” BYTE,7, No. 4, 398–413 (1982).

  98. 98.

    B. F. Varn, “Optimal variable length codes,” Inform. Contr.,19, 289–301 (1971).

  99. 99.

    R. A. Wagner, “Common phrases and minimum text storage,” Commun. ACM,16, No. 3, 148–153 (1973).

  100. 100.

    R. A. Wagner, “An algorithm for extracting phrases in a space optimal fashion,” Commun. ACM,16, No. 4, 183–185 (1973).

  101. 101.

    V. R. Walker, “Compaction of names by X-grams,” Proc. ASIS,6, 129–135 (1969).

  102. 102.

    S. F. Weiss and R. L. Vernor, “A word-based compression technique for text files,” J. Lib. Autom.,11, No. 2, 97–105 (1978).

  103. 103.

    T. Welch, “A technique for high-performance data compression,” Computer,17, No. 6, 8–19 (1984).

  104. 104.

    M. Wells, “File compression using variable length encodings,” Comput. J.,15, No. 4, 308–313 (1972).

  105. 105.

    H. E. White, “Printed English compression by dictionary encodings,” Proc. IEEE,55, No. 3, 390–396 (1967).

  106. 106.

    P. W. Williams, “Criteria for choosing subsets to obtain maximum relative entropy,” Comput. J.,21, No. 1, 57–65 (1978).

  107. 107.

    J. L. Wisniewski, “Effective text compression with simultaneous bigram and trigram encoding,” J. Inform. Sci., No. 13, 159–164 (1987).

  108. 108.

    J. L. Wisniewski, “Compression of index-term dictionary in an inverted-file-oriented data base: some effective algorithms,” Inform. Process. Manag., No. 6, 493–501 (1986).

  109. 109.

    J. G. Wolf, “Recording of natural language for economy of transmission and storage,” Comput. J.,21, No. 1, 42–44 (1978).

  110. 110.

    J. G. Wolf, “An algorithm for the segmentation of an artificial language dialogue,” British J. Psychol.,66, 79–90 (1975).

  111. 111.

    D. Yavuz, “Zipf's law and entropy,” IEEE Trans. Inform. Theory,20, No. 5, 650–657 (1974).

  112. 112.

    T. Y. Young and P. S. Liu, “Overhead storage considerations and a multilinear method for data file compression,” IEEE Trans. Software Eng.,6, No. 4, 340–347 (1980).

  113. 113.

    G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison Wesley, Cambridge, Mass. (1949).

  114. 114.

    J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inform. Theory,23, No. 3, 337–343 (1977).

  115. 115.

    J. Ziv and A. Lempel, “Compression of individual sequences via variable rate coding,” IEEE Trans. Inform. Theory,24, No. 5, 530–536 (1978).

  116. 116.

    V. A. Amel'kin, Numeration Coding Methods [in Russian], Nauka, Novosibirsk (1986).

  117. 117.

    G. G. Belonogov and R. G. Kotov, Computerized Information-Retrieval Systems [in Russian], Sovet-skoe Radio, Moscow (1968).

  118. 118.

    E. V. Bondar', “Data compression methods for production MIS database organization,” in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 66–70.

  119. 119.

    E. E. Brizgal and L. N. Meierovich, “Compression of patent databases,” in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 58–61.

  120. 120.

    B. Ya. Kavalerchik and V. I. Grishkan, “On improving computer system productivity,” USiM, No. 4, 23–27 (1986).

  121. 121.

    G. S. Kolesnikov, V. I. Pakhomov, V. T. Perevertkin, and A. V. Shambazov, “One algorithm for text compression,” in: Computer Software [in Russian], Moscow (1984), pp. 107–117.

  122. 122.

    A. N. Kolmogorov, “Three approaches to the definition of the concept of ‘amount of information’,” Prob. Peredachi Inform.,1, No. 1, 11–14 (1965).

  123. 123.

    E. I. Korunets, “One approach to compression of binary data files,” in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 101–107.

Download references

Additional information

Translated from Itogi Nauki i Tekhniki, Seriya Teoriya Veroyatnostei, Matematicheskaya Statistika, Teoreticheskaya Kibernetika, Vol. 28, pp. 85–109, 1988.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Bol'shakov, I.A., Smirnov, A.V. Text compression methods. J Math Sci 56, 2249–2262 (1991). https://doi.org/10.1007/BF01099202

Download citation

Keywords

  • Statistical Characteristic
  • Compression Method
  • Comparative Efficiency
  • Volume Compression
  • Physical Compression