World Wide Web, Volume 9, Issue 1, pp 5–33

Comparative Analysis of XML Compression Technologies

  • Wilfred Ng
  • Wai-Yeung Lam
  • James Cheng
Article

Abstract

XML provides flexibility in publishing and exchanging heterogeneous data on the Web. However, the language is verbose by nature, so XML documents are usually larger than documents in other formats that carry the same data content. The volume of XML data can be expected to keep growing as XML proliferates on the Web. This size problem hinders applications of XML, since it substantially increases the cost of storing, processing and exchanging the data. The hindrance is more apparent in bandwidth- and memory-limited settings, such as mobile communication applications.

In this paper, we survey a range of recently proposed XML-specific compression technologies and study how, and how well, they address the size problem. First, we categorize XML compression technologies into queriable and unqueriable compressors and explain how representative technologies exploit the structure information exposed by the input XML documents. Second, we discuss the importance of queriable XML compressors and assess whether the compressed documents they generate support direct querying of the XML data. Finally, we present a comparative analysis of state-of-the-art XML-conscious compression technologies in terms of compression ratio, compression and decompression times, memory consumption, and query performance.

Keywords

XML compression; query languages; query processing; Web applications

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Wilfred Ng (1)
  • Wai-Yeung Lam (1)
  • James Cheng (1)
  1. Department of Computer Science, The Hong Kong University of Science and Technology
