Abstract
«Texts written in a natural language are essentially made of words of this language». We use this obvious fact, together with an extensive lexicon, to define a good model of the statistical behavior of letters in texts. This model is combined with arithmetic coding to build an efficient universal data compression method. Our method was initially specialized for the compression of French texts, but it can easily be adapted to other languages. Tests show that our method achieves an average compression ratio of 30% on French texts, whereas Ziv & Lempel's method yields an average ratio of 40% on the same texts. On other kinds of test files (English text, executable files, source code), the use of an order-1 Markov chain gives results of the same order as Ziv & Lempel's. We also present a new approach to dynamic dictionary construction for natural language compression. The fact, well known to linguists, that the number of different words is small makes such a dynamic construction possible.
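To make the order-1 Markov fallback mentioned in the abstract more concrete, here is a minimal sketch in Python. It is not the authors' DZ implementation and it uses no lexicon. It only estimates the number of bits an ideal arithmetic coder would spend when driven by an adaptive order-1 model, that is, a model that predicts each byte from the byte before it. The function names, the add-one smoothing, the 256-symbol alphabet, and the initial context `0` are all assumptions made for illustration. The ratio is taken here as compressed size divided by original size, which is how the 30% and 40% figures appear to be meant, since the lower figure is presented as the better one.

```python
import math
from collections import defaultdict


def order1_code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Ideal code length, in bits, of `data` under an adaptive order-1 model.

    An arithmetic coder driven by the same model would produce an output
    close to this length. Illustrative only; not the paper's model.
    """
    # counts[prev][sym] = number of times `sym` has followed `prev` so far
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)  # totals[prev] = number of symbols seen after `prev`
    bits = 0.0
    prev = 0  # assumed initial context before the first byte
    for sym in data:
        # Add-one (Laplace) smoothing keeps every probability non-zero,
        # so unseen symbols remain encodable.
        p = (counts[prev][sym] + 1) / (totals[prev] + alphabet_size)
        bits += -math.log2(p)
        # Update the model after coding, as a decoder could do in lockstep.
        counts[prev][sym] += 1
        totals[prev] += 1
        prev = sym
    return bits


def estimated_ratio(data: bytes) -> float:
    """Estimated compressed size divided by original size (lower is better)."""
    if not data:
        return 0.0
    return order1_code_length_bits(data) / (8 * len(data))
```

Calling `estimated_ratio` on the bytes of a text file gives a rough idea of what an order-1 model alone can achieve. A lexicon-based model such as the one described in the abstract aims to predict letters better than this baseline on natural language text.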
Bibliography
D. ABRAHAMSON, An adaptive dependency source model for data compression, Commun. ACM, 32, 1 (1989), 77–83.
T.C. BELL, J.G. CLEARY, I.H. WITTEN, Text Compression, Prentice Hall advanced reference series, 1990, ISBN 0-13-911991-4.
P.F. BROWN, S.A. DELLA PIETRA, V.J. DELLA PIETRA, J.C. LAI, R.L. MERCER, An Estimate of an Upper Bound for the Entropy of English, Preprint (1991).
G.V. CORMACK, R.N.S. HORSPOOL, Data compression using dynamic Markov modelling, Comput. J., 30, 6 (1987), 541–550.
M. GUAZZO, A general minimum-redundancy source-coding algorithm, IEEE Trans. Inform. Theory, 26, 1 (1980), 15–25, January.
D.A. HUFFMAN, A method for the construction of minimum redundancy codes, Proc. IRE, 40 (1952), 1098–1101, September.
D. REVUZ, Dictionnaires et Lexiques, Méthodes et Algorithmes, Thèse de Doctorat, Université Paris 7 (1991).
J. RISSANEN, Generalized Kraft inequality and Arithmetic coding, IBM J. Res. Dev., 20 (1976), 198–203, May.
J. RISSANEN, G.G. LANGDON Jr., Arithmetic coding, IBM J. Res. Dev., 23, 2 (1979), 149–162, March.
T.A. WELCH, A technique for high-performance data compression, IEEE Computer, 17, 6 (1984), 8–19, June.
I.H. WITTEN, R.M. NEAL, J.G. CLEARY, Arithmetic coding for data compression, Commun. ACM, 30, 6 (1987), 520–540, June.
J. ZIV, A. LEMPEL, A Universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, 23, 3 (1977), 337–343, May.
J. ZIV, A. LEMPEL, Compression of individual sequences via variable-rate coding, IEEE Trans. Inform. Theory, 24, 5 (1978), 530–536, September.
Copyright information
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Revuz, D., Zipstein, M. (1992). DZ A text compression algorithm for natural languages. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_16
DOI: https://doi.org/10.1007/3-540-56024-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive