DZ A text compression algorithm for natural languages

Revuz, Dominique; Zipstein, Marc

doi:10.1007/3-540-56024-6_16


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 644)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

«Texts written in a natural language are essentially made of words of this language». We use this obvious fact, together with an extensive lexicon, to define a good model of the statistical behavior of letters in texts. This model is used with the arithmetic coding scheme to build an efficient universal data compression method. Initially, our method was specialized for the compression of French texts; however, it can easily be adapted to other languages. Tests show that the compression ratio obtained by our method is 30% on average on French texts, whereas Ziv & Lempel's method yields an average ratio of 40% on the same texts. On other kinds of test files (English text, executable files, source code), the use of an order 1 Markov chain leads to results of the same order as Ziv & Lempel's. We also present a new approach to dynamic dictionary construction for natural language compression. The fact, well known to linguists, that the number of different words is small makes a dynamic construction possible.
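
As a rough illustration of the order 1 Markov chain mentioned in the abstract, the Python sketch below estimates how many bits an ideal arithmetic coder would need when each byte is predicted from the byte before it. It is not the DZ method: the lexicon-based model is not described in this abstract, and the function names, the add-one smoothing, the 256-symbol alphabet and the initial context are all assumptions made for the example. The ratio is taken to mean compressed size over original size, which is also an assumption about the abstract's convention.

```python
import math
from collections import defaultdict


def order1_code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Ideal code length in bits of `data` under an adaptive order-1 model.

    Hypothetical helper: an arithmetic coder driven by the same
    probabilities would produce an output close to this length.
    """
    # counts[prev][cur] = how often byte `cur` has followed byte `prev` so far
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)  # totals[prev] = number of bytes seen after `prev`
    bits = 0.0
    prev = 0  # assumed context before the first byte
    for cur in data:
        # Add-one smoothing (an assumption) keeps unseen bytes at nonzero probability.
        p = (counts[prev][cur] + 1) / (totals[prev] + alphabet_size)
        bits -= math.log2(p)
        # Update the statistics after coding, so a decoder could mirror them.
        counts[prev][cur] += 1
        totals[prev] += 1
        prev = cur
    return bits


def compression_ratio(data: bytes) -> float:
    """Estimated compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("data must be non-empty")
    return order1_code_length_bits(data) / (8 * len(data))
```

Calling `compression_ratio` on a sample file's bytes gives an estimate that can be compared with the output size of a Ziv & Lempel style compressor on the same file. Because the model is adaptive, no table of statistics has to be stored with the compressed data.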

Bibliography

D. ABRAHAMSON, An adaptive dependency source model for data compression, Commun. ACM, 32,1 (1989), 77–83.
T.C. BELL, J.G. CLEARY, I.H. WITTEN, Text Compression, Prentice Hall advanced reference series, 1990, ISBN 0-13-911991-4.
P.F. BROWN, S.A. DELLA PIETRA, V.J. DELLA PIETRA, J.C. LAI, R.L. MERCER, An Estimate of an Upper Bound for the Entropy of English, Preprint (1991).
G.V. CORMACK, R.N.S. HORSPOOL, Data compression using dynamic Markov modelling, Comput. J., 30,6 (1987), 541–550.
M. GUAZZO, A general minimum-redundancy source-coding algorithm, IEEE Trans. Inform. Theory, 26,1 (1980), 15–25, January.
D.A. HUFFMAN, A method for the construction of minimum redundancy codes, Proc. IRE, 40 (1952), 1098–1101, September.
D. REVUZ, Dictionnaires et Lexiques Méthodes et Algorithmes, Thèse de Doctorat, Université Paris 7 (1991).
J. RISSANEN, Generalized Kraft inequality and Arithmetic coding, IBM J. Res. Dev., 20 (1976), 198–203, May.
J. RISSANEN, G.G. LANGDON Jr., Arithmetic coding, IBM J. Res. Dev., 23,2 (1979), 149–162, March.
T.A. WELCH, A technique for high-performance data compression, IEEE Computer, 17,6 (1984), 8–19, June.
I.H. WITTEN, R.M. NEAL, J.G. CLEARY, Arithmetic coding for data compression, Commun. ACM, 30,6 (1987), 520–540, June.
J. ZIV, A. LEMPEL, A Universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, 23,3 (1977), 337–343, May.
J. ZIV, A. LEMPEL, Compression of individual sequences via variable-rate coding, IEEE Trans. Inform. Theory, 24,5 (1978), 530–536, September.

Author information

Authors and Affiliations

Institut Gaspard Monge, Université de Marne La Vallée, 2, allée Jean-Renoir, 93160, Noisy le Grand, France
Dominique Revuz & Marc Zipstein

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

About this paper

Cite this paper

Revuz, D., Zipstein, M. (1992). DZ A text compression algorithm for natural languages. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_16

DOI: https://doi.org/10.1007/3-540-56024-6_16
Published: 04 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive
