Word-Based Fixed and Flexible List Compression
We present a dictionary-based lossless text compression scheme in which frequent words are kept in separate lists (list_n contains words of length n). We pursued two alternatives with respect to the sizes of the lists. In the "fixed" approach all lists hold an equal number of words, whereas in the "flexible" approach no such constraint is imposed. Results clearly show that the "flexible" scheme is much better in all test cases, possibly because it can accommodate short, medium, or long word lists that reflect the word-length distribution of a particular language. Our approach encodes a word as a prefix (the length of the word) followed by a body (the index of the word in the corresponding list). For prefix encoding we employed both a static encoding and a dynamic (Huffman) encoding based on the word-length statistics of the source language. Dynamic prefix encoding clearly outperformed its static counterpart in all cases. A language with a higher average word length can, theoretically, benefit more from a word-list-based compression approach than one with a lower average word length. We put this hypothesis to the test using Turkish and English, with average word lengths of 6.1 and 4.4, respectively. Our results strongly support the validity of this hypothesis.
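The word-to-(prefix, index) mapping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_lists`, `huffman_code_lengths`, `encode`, `decode`) and the `max_per_list` parameter are hypothetical, lists are ordered by word frequency as an assumption, and words that are missing from every list are not handled, since the abstract does not describe how such words are treated.

```python
from collections import Counter, defaultdict
import heapq


def build_lists(words, max_per_list=None):
    """Group distinct words by length, most frequent first.

    max_per_list=None models the "flexible" variant (no size limit);
    an integer caps every list at the same size ("fixed" variant).
    Both the cap and the frequency ordering are assumptions.
    """
    freq = Counter(words)
    lists = defaultdict(list)
    for word, _ in freq.most_common():
        if max_per_list is None or len(lists[len(word)]) < max_per_list:
            lists[len(word)].append(word)
    return dict(lists)


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} of a Huffman code.

    Here the symbols are word lengths and the weights are how often
    each word length occurs in the source text.
    """
    if len(weights) == 1:
        return {symbol: 1 for symbol in weights}
    # Each heap entry: (total weight, tie-breaker, symbols in the subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees puts every symbol in them one level deeper.
        for symbol in s1 + s2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def encode(word, lists):
    """Encode a listed word as (prefix, index): its length and its
    position in list_n. Raises KeyError/ValueError for unlisted words."""
    n = len(word)
    return n, lists[n].index(word)


def decode(code, lists):
    """Invert encode(): look the word up in the list named by the prefix."""
    n, index = code
    return lists[n][index]


if __name__ == "__main__":
    text = "the cat sat on the mat the end".split()
    lists = build_lists(text)                      # flexible lists
    prefix_bits = huffman_code_lengths(Counter(len(w) for w in text))
    for word in ("the", "on"):
        code = encode(word, lists)
        assert decode(code, lists) == word
        print(word, code, "prefix bits:", prefix_bits[code[0]])
```

In this sketch the Huffman code is built over word lengths only, so frequent lengths receive short prefixes, while the index would be written with a width determined by the size of the corresponding list. Passing an integer as `max_per_list` gives the equal-size lists of the "fixed" variant.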