Mapping Words into Codewords on PPM

  • Joaquín Adiego
  • Pablo de la Fuente
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)

Abstract

We describe a simple and efficient scheme for handling words in PPM modelling when compressing natural language text. The main idea is to assign each word a code, which makes words easier to manipulate. We achieve this with a general technique: a dictionary mapping on PPM modelling. To test the idea, we are implementing three prototypes: one implements the basic dictionary mapping on PPM, another combines the dictionary mapping with the separate alphabets model, and the last combines it with the spaceless words model. The technique can be applied directly or combined with a word compression model. For files of 1 MB and over, the results are better than those of the character-based PPM taken as a baseline. A comparison of the prototypes shows that the best option is a word-based PPM used in conjunction with the spaceless words concept.
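The dictionary mapping with spaceless words can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenisation rule, the variable-length byte codewords, and all function names (`tokenize_spaceless`, `to_codeword`, `encode`, `decode`) are assumptions made for illustration. The PPM coder itself is omitted; the codeword stream produced here is what would be handed to it.

```python
import re

def tokenize_spaceless(text):
    """Split text into words and separators.

    Spaceless words (as assumed here): a single space between two
    words is dropped, because the decoder can re-insert it.
    """
    raw = re.findall(r"\w+|\W+", text)  # words and separators alternate
    tokens = []
    for i, tok in enumerate(raw):
        if tok == " " and 0 < i < len(raw) - 1:
            continue  # implied single space between two words
        tokens.append(tok)
    return tokens

def detokenize_spaceless(tokens):
    """Rebuild the text, inserting a space between adjacent words."""
    out = []
    prev_was_word = False
    for tok in tokens:
        is_word = re.match(r"\w", tok) is not None
        if is_word and prev_was_word:
            out.append(" ")
        out.append(tok)
        prev_was_word = is_word
    return "".join(out)

def to_codeword(index):
    """Hypothetical codeword format: 7 payload bits per byte,
    high bit set when another byte follows."""
    out = bytearray()
    while True:
        byte = index & 0x7F
        index >>= 7
        if index:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode(text):
    """Map each token to a codeword; return (vocabulary, codeword stream)."""
    vocab = {}
    stream = bytearray()
    for tok in tokenize_spaceless(text):
        index = vocab.setdefault(tok, len(vocab))  # ids in order of first use
        stream += to_codeword(index)
    return list(vocab), bytes(stream)

def decode(vocab, stream):
    """Invert encode(): parse codewords, look up tokens, restore spaces."""
    tokens, index, shift = [], 0, 0
    for byte in stream:
        index |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            tokens.append(vocab[index])
            index, shift = 0, 0
    return detokenize_spaceless(tokens)

text = "the cat sat on the mat, the end. "
vocab, stream = encode(text)
assert decode(vocab, stream) == text
```

In this sketch a PPM model applied to `stream` would see one symbol per word for small vocabularies, so its context of order k would span roughly k words rather than k characters.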

Keywords

Text Compression, PPM, Dictionary Algorithms, Natural Language Processing

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Joaquín Adiego (1)
  • Pablo de la Fuente (1)
  1. Depto. de Informática, Universidad de Valladolid, Valladolid, Spain
