Abstract
XML provides flexibility in publishing and exchanging heterogeneous data on the Web. However, the language is verbose by nature, so XML documents are usually larger than other representations of the same data. The volume of data can be expected to keep growing as XML proliferates on the Web. This size problem hinders applications of XML, since it substantially increases the cost of storing, processing, and exchanging data. The hindrance is more apparent in bandwidth- and memory-limited settings such as mobile communication applications.
In this paper, we survey a range of recently proposed XML-specific compression technologies and study how, and how well, they address the size problem. First, we categorize XML compression technologies into queriable and unqueriable compressors and explain how the representative technologies exploit the structure information exposed by the input XML documents. Second, we discuss the importance of queriable XML compressors and assess whether the compressed documents they generate support direct querying of the XML data. Finally, we present a comparative analysis of state-of-the-art XML-conscious compression technologies in terms of compression ratio, compression and decompression times, memory consumption, and query performance.
Ng, W., Lam, WY. & Cheng, J. Comparative Analysis of XML Compression Technologies. World Wide Web 9, 5–33 (2006). https://doi.org/10.1007/s11280-005-1435-2
DOI: https://doi.org/10.1007/s11280-005-1435-2