Comparative Analysis of XML Compression Technologies

World Wide Web

Abstract

XML provides flexibility in publishing and exchanging heterogeneous data on the Web. However, the language is verbose by nature, so XML documents are usually larger than other formats holding the same data content. As XML data proliferates on the Web, data sizes can be expected to keep growing. This size problem hinders the use of XML, since it substantially increases the costs of storing, processing and exchanging the data. The hindrance is more apparent in bandwidth- and memory-limited settings, such as mobile communication applications.

In this paper, we survey a range of recently proposed XML-specific compression technologies and study how well they overcome the size problem. First, we categorize XML compression technologies into queriable and unqueriable compressors and explain how the representative technologies exploit the structure information exposed by the input XML documents. Second, we discuss the importance of queriable XML compressors and assess whether the compressed documents these technologies generate can support direct querying of XML data. Finally, we present a comparative analysis of the state-of-the-art XML-conscious compression technologies in terms of compression ratio, compression and decompression times, memory consumption, and query performance.
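
As a rough illustration of the size problem and of three of the metrics named above (compression ratio, compression time and decompression time), the following minimal Python sketch compares a synthetic XML document with a CSV rendering of the same records, then compresses the XML with two general-purpose compressors from the Python standard library (zlib and bz2). It does not implement any of the XML-specific compressors surveyed in the paper. The record layout, the helper names (`make_xml`, `make_csv`, `measure`) and the definition of compression ratio as space saving, 1 - compressed size / original size, are assumptions made for this example; the paper may define the ratio differently.

```python
import bz2
import time
import zlib


def make_xml(n_records: int) -> bytes:
    """Build a synthetic XML document (hypothetical data, illustration only)."""
    rows = "".join(
        f"<record><id>{i}</id><name>user{i}</name><city>city{i % 7}</city></record>"
        for i in range(n_records)
    )
    return f"<records>{rows}</records>".encode("utf-8")


def make_csv(n_records: int) -> bytes:
    """Render the same records as CSV, to show how much the markup adds."""
    rows = "".join(f"{i},user{i},city{i % 7}\n" for i in range(n_records))
    return ("id,name,city\n" + rows).encode("utf-8")


def measure(name, compress, decompress, data: bytes) -> dict:
    """Time one compress/decompress round trip and compute the space saving."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError(f"{name}: round trip did not restore the input")
    return {
        "name": name,
        "original_bytes": len(data),
        "compressed_bytes": len(packed),
        # Assumed definition: fraction of the original size that is saved.
        "ratio": 1 - len(packed) / len(data),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


if __name__ == "__main__":
    xml_doc = make_xml(1000)
    csv_doc = make_csv(1000)
    print(f"XML: {len(xml_doc)} bytes, CSV: {len(csv_doc)} bytes")
    for result in (
        measure("zlib", zlib.compress, zlib.decompress, xml_doc),
        measure("bz2", bz2.compress, bz2.decompress, xml_doc),
    ):
        print(result)
```

Timings from a document this small are noisy, and memory consumption and query performance are not measured here; a real comparison would need larger inputs and repeated runs.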

Cite this article

Ng, W., Lam, WY. & Cheng, J. Comparative Analysis of XML Compression Technologies. World Wide Web 9, 5–33 (2006). https://doi.org/10.1007/s11280-005-1435-2

  • DOI: https://doi.org/10.1007/s11280-005-1435-2
