Visually Lossless HTML Compression

  • Przemysław Skibiński
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5802)

Abstract

The verbosity of the Hypertext Markup Language (HTML) remains one of its main weaknesses. This problem can be solved with the aid of HTML-specialized compression algorithms. In this work, we describe a visually lossless HTML transform that, combined with commonly used general-purpose compression algorithms, attains high compression ratios. Its core is a transform that substitutes words in an HTML document using a static English dictionary and efficiently encodes dictionary indexes, numbers, and specific patterns.
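The idea of dictionary word substitution can be illustrated with a minimal sketch. It is not the transform from the paper: the ten-word `DICTIONARY`, the one-byte index codes (`0x80` plus the index), and the restriction to pure ASCII input are all assumptions made here to keep the example short and exactly reversible.

```python
import re
import zlib

# Hypothetical toy dictionary; a real static English dictionary would be far larger.
DICTIONARY = ["the", "and", "of", "to", "in", "is", "that", "for", "with", "compression"]

# Assumption: each dictionary word is encoded as one byte, 0x80 + index.
WORD_TO_CODE = {w.encode("ascii"): bytes([0x80 + i]) for i, w in enumerate(DICTIONARY)}

# Split input into maximal runs of letters and maximal runs of everything else.
TOKEN = re.compile(rb"[A-Za-z]+|[^A-Za-z]+")


def encode(html: bytes) -> bytes:
    """Replace dictionary words with one-byte codes (sketch, ASCII input only)."""
    if any(b >= 0x80 for b in html):
        raise ValueError("this sketch assumes pure ASCII input")
    return b"".join(WORD_TO_CODE.get(tok, tok) for tok in TOKEN.findall(html))


def decode(data: bytes) -> bytes:
    """Invert encode(): expand every byte >= 0x80 back into its dictionary word."""
    out = bytearray()
    for b in data:
        if b >= 0x80:
            out += DICTIONARY[b - 0x80].encode("ascii")
        else:
            out.append(b)
    return bytes(out)


def transform_then_deflate(html: bytes) -> bytes:
    """Apply the word substitution, then a general-purpose compressor (zlib/DEFLATE)."""
    return zlib.compress(encode(html), 9)
```

Calling `decode(encode(page))` returns `page` unchanged for any ASCII input, so this toy substitution is fully lossless. Whether it actually helps the back-end compressor depends on the dictionary and on the document, which is what the experiments reported in the abstract measure for the real transform.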

Visually lossless compression means that the layout of the HTML document will be modified, but the document displayed in a browser will look exactly the same as the original. The experimental results show that the proposed transform improves the HTML compression efficiency of general-purpose compressors, on average by 21% in the case of gzip, while achieving comparable processing speed. Moreover, we show that the compression ratio of gzip can be improved by up to 32% at the price of higher memory requirements and much slower processing.
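As a rough illustration of a change that alters the source layout without normally changing what a browser shows, the sketch below collapses runs of whitespace outside `<pre>` blocks. That this resembles what the paper does is an assumption; the function names are hypothetical, and the sketch ignores cases such as `<textarea>`, `<script>`, attribute values, or CSS `white-space` rules, where whitespace can matter. The `relative_gain` helper assumes that "improvement" means relative reduction in compressed size, which the abstract does not spell out.

```python
import re
import zlib

# Matches whole <pre>...</pre> blocks, whose whitespace must be preserved.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)


def collapse_whitespace(html: str) -> str:
    """Collapse whitespace runs to one space, leaving <pre> blocks untouched."""
    parts = PRE_BLOCK.split(html)
    # With one capturing group, even indexes hold ordinary markup and
    # odd indexes hold the captured <pre> blocks.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\s+", " ", parts[i])
    return "".join(parts)


def relative_gain(baseline_size: int, new_size: int) -> float:
    """Size reduction relative to the baseline, in percent (assumed metric)."""
    return 100.0 * (baseline_size - new_size) / baseline_size


def gzip_gain(html: str) -> float:
    """Compare DEFLATE output size with and without the whitespace collapse."""
    baseline = len(zlib.compress(html.encode("utf-8"), 9))
    transformed = len(zlib.compress(collapse_whitespace(html).encode("utf-8"), 9))
    return relative_gain(baseline, transformed)
```

Unlike a strictly lossless transform, this one cannot restore the original bytes: the indentation is gone for good, and only the rendered result is intended to stay the same.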

Keywords

HTML compression, HTML transform, semi-structural data compression

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Przemysław Skibiński, Institute of Computer Science, University of Wrocław, Wrocław, Poland
