Keywords: Data Item, Query Processing, Query Path, Path Expression, XPath Query
XML is an extremely verbose data format with a high degree of redundant information, because the same tags are repeated over and over for multiple data items, and because both tags and data values are represented as strings. Viewed in relational database terms, XML stores the “schema” with each and every “record” in the repository. The size increase incurred by publishing data in XML format is estimated to be as much as 400%, making it a prime target for compression. While standard general-purpose compressors, such as zip, gzip or bzip, typically compress XML data reasonably well, specialized XML compressors have been developed over the last decade that exploit the specific structural aspects of XML data. These new techniques fall into two classes: (i) compression-oriented, where the goal is to maximize the compression ratio of the data, typically up to a factor of two better than the general-purpose compressors; and (ii) query-oriented, where the goal is to integrate the compression strategy with an XPath query processor such that queries can be processed directly on the compressed data, selectively decompressing only the data relevant to the query result.
Subsequently, the focus shifted to the development of query-oriented techniques intended to support query processing directly on the compressed data. This stream of research began in 2002, when Tolani and Haritsa [9] presented a system called XGrind, in which compression is carried out at the granularity of individual element/attribute values using a simple context-free compression scheme: tags are encoded by integers, while textual content is compressed using non-adaptive Huffman (or Arithmetic) coding. XGrind consciously maintains a homomorphic encoding, that is, the compressed document is still in XML format, the intention being that all existing XML-related tools (such as parsers, indexes, schema-checkers, etc.) can continue to be used on the compressed document.
In 2003, Min et al. [7] proposed a compressor called XPRESS that extended the homomorphic compression approach of XGrind to include effective evaluation of query path expressions and range queries on numerical attributes. Their scheme uses a reverse arithmetic path-encoding that encodes each path as an interval of real numbers between 0 and 1. The following year produced the XQueC system from Arion et al., which supports cost-based tradeoffs between compact storage and efficient processing.
An excellent survey of the state-of-the-art in XML compressors is available in .
Separate structure from data: The structure consists of XML tags and attributes, organized as a tree. The data consists of a sequence of items (strings) representing element text contents and attribute values. The structure and the data are compressed separately.
Group data items with related meaning: Data items are logically or physically grouped into containers, and each container is compressed separately. By exploiting similarities between the values in a container, the compression improves substantially. Typically, data items are grouped based on the element type, but some systems (e.g., ) chose more elaborate grouping criteria.
Apply specialized compressors to containers: Some data items are plain-text, while others are numbers, dates, etc., and for each of these different domains, the system uses a specialized compressor.
Compression-Oriented XML Compressors
The architecture of the XMill compressor [6], which is typical of several XML compressors, is depicted in Fig. 1. The XML file is parsed by a SAX parser that sends tokens to the path processor. The purpose of the path processor is to separate the structure from the data, and to further separate the data items according to their semantics.
Next, the structure container and all data containers are compressed with gzip, then written to disk. Optionally, the data items in certain containers may be compressed with a user-defined semantic compressor. For example, numerical values or IP addresses can be binary encoded, dates can be represented using specialized data structures, etc.
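As a rough illustration of what a semantic compressor buys, the sketch below binary-encodes IPv4 addresses into four bytes each, before any general-purpose compression is applied. It relies on Python's standard ipaddress module. The function names and the sample addresses are hypothetical and are not taken from XMill.

```python
import ipaddress

def encode_ipv4(text: str) -> bytes:
    """Pack a dotted-quad string into its 4-byte binary form."""
    return ipaddress.IPv4Address(text).packed

def decode_ipv4(blob: bytes) -> str:
    """Inverse of encode_ipv4: 4 bytes back to a dotted-quad string."""
    return str(ipaddress.IPv4Address(blob))

# Hypothetical data items that would all land in one container.
hosts = ["202.239.238.16", "10.0.0.7"]

# The container holds fixed-width binary records instead of strings.
container = b"".join(encode_ipv4(h) for h in hosts)

# Decoding walks the container four bytes at a time.
restored = [decode_ipv4(container[i:i + 4]) for i in range(0, len(container), 4)]
```

Each address shrinks from up to 15 characters to 4 bytes, and the fixed-width records are themselves regular input for a later gzip pass.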
By default, XMill groups data items based on their innermost element type. Users, however, can override this, by providing container expressions on the command line. The path processor uses these expressions to determine in which container to store each data item. Path expressions also determine which semantic compressor to apply (if any).
The amount of main memory holding all containers is fixed. When the limit is exhausted all containers are gzip-ed, written to disk, as one logical block, then the compression resumes. In effect, this partitions the input XML file into logical blocks that are compressed independently.
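The block mechanism can be sketched as follows, under simplifying assumptions: memory use is approximated by the total length of the buffered items, items within a container are joined with a zero byte, and a Python list stands in for the disk. The class and method names are hypothetical.

```python
import gzip
from collections import defaultdict

class BlockWriter:
    """Buffers data items per container and flushes them as one logical block."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.containers = defaultdict(list)  # container id -> buffered items
        self.used = 0                        # approximate memory in use
        self.blocks = []                     # stand-in for the output file

    def add(self, container_id: str, item: bytes) -> None:
        self.containers[container_id].append(item)
        self.used += len(item)
        if self.used >= self.limit:          # memory budget exhausted
            self.flush()

    def flush(self) -> None:
        if not self.containers:
            return
        # Every container is gzip-ed on its own; together they form one block.
        block = {cid: gzip.compress(b"\0".join(items))
                 for cid, items in self.containers.items()}
        self.blocks.append(block)
        self.containers.clear()
        self.used = 0

w = BlockWriter(limit_bytes=10)
for item in [b"aaaa", b"bbbb", b"cccc"]:
    w.add("host", item)          # the third item crosses the limit and flushes
w.add("status", b"200")
w.flush()                        # final partial block at end of input
```

Because each block is compressed independently, repetitions that span a block boundary are not exploited, which is the price paid for bounded memory.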
The decompressor, XDemill, is similar, but proceeds in reverse. It reads one block at a time in main memory, decompresses every container, then merges the XML tags with the data values to produce the XML output.
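The merge step can be illustrated with a small sketch. It assumes the structure has already been decompressed into a list of integer tokens (positive for a start-tag, negative for a container number, 0 for an end-tag) and that each container is a list of strings in document order. The function name, tag dictionary and sample values are hypothetical, and attributes are ignored.

```python
from xml.sax.saxutils import escape

def rebuild(tokens, tag_names, containers):
    """Merge structure tokens with container values to regenerate XML text."""
    # One cursor per container: values are consumed in document order.
    cursors = {cid: iter(items) for cid, items in containers.items()}
    open_tags, out = [], []
    for tok in tokens:
        if tok > 0:                      # start-tag
            name = tag_names[tok]
            open_tags.append(name)
            out.append(f"<{name}>")
        elif tok < 0:                    # next value from container -tok
            out.append(escape(next(cursors[-tok])))
        else:                            # end-tag closes the innermost element
            out.append(f"</{open_tags.pop()}>")
    return "".join(out)

xml_text = rebuild(
    tokens=[1, 2, -1, 0, 3, -2, 0, 0],
    tag_names={1: "entry", 2: "host", 3: "status"},
    containers={1: ["10.0.0.7"], 2: ["200"]},
)
```

A single generic end-tag token suffices because the stack of open elements determines which tag is being closed.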
To illustrate the working of XMill, consider the following snippet of Web-server log data, where each entry represents one HTTP request:
After the document is parsed, the path processor separates it into the structure and the content, and further separates the content into different containers. The structure is obtained by removing all text values and attribute values and replacing them with their container number. Start-tags are dictionary-encoded, i.e., assigned an integer value, while all end-tags are replaced by the same, unique token. For illustration purposes, start-tags are denoted with T1, T2, …, the unique end-tag with /, and container numbers with C1, C2, …. In this example the structure of the entry element is:
Here T1 = apache:entry, T2 = apache:host, and so on, while / represents any end-tag. Internally, each token is encoded as an integer: tags are positive, container numbers are negative, and / is 0. Numbers in the range −64 to 63 take one byte, while numbers outside this range take two or four bytes; the example string is coded in 26 bytes overall. The structure is compressed using gzip, which is based on Ziv-Lempel’s LZ77 algorithm. This results in excellent compression, because LZ77 exploits the frequent repetitions in the structure container very well: the compressed structure usually amounts to only 1–3% of the compressed file.
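The separation of structure and content can be sketched on a small hypothetical log entry (not the snippet referred to in the text). The sketch assigns integers to start-tags, routes each text value to a container chosen by its innermost element type, and emits 0 for every end-tag. It ignores attributes and mixed content, and does not attempt to reproduce XMill's byte-level integer encoding.

```python
import xml.etree.ElementTree as ET

def tokenize(elem, tag_ids, container_ids, data, tokens):
    """Append structure tokens for elem; move its text into a container."""
    tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
    tokens.append(tag_id)                       # start-tag: positive integer
    text = (elem.text or "").strip()
    if text:
        # Default grouping: one container per innermost element type.
        cid = container_ids.setdefault(elem.tag, len(container_ids) + 1)
        data.setdefault(cid, []).append(text)
        tokens.append(-cid)                     # container number: negative
    for child in elem:
        tokenize(child, tag_ids, container_ids, data, tokens)
    tokens.append(0)                            # any end-tag: 0

# Hypothetical sample document.
doc = ET.fromstring("<entry><host>10.0.0.7</host><status>200</status></entry>")
tag_ids, container_ids, data, tokens = {}, {}, {}, []
tokenize(doc, tag_ids, container_ids, data, tokens)

def show(tok):
    """Render a token in the T/C// notation used in the text."""
    return "/" if tok == 0 else (f"T{tok}" if tok > 0 else f"C{-tok}")

structure = " ".join(show(t) for t in tokens)
```

For a log with many entries, the token sequence repeats almost verbatim from one entry to the next, which is why an LZ77-based compressor handles the structure container so well.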
Next, data items are partitioned into containers, then compressed. Each container is associated with an XPath expression that defines which data items are stored in that container, and an optional semantic compressor for the items in that container. By default there is one container for each tag name occurring in the XML file: for a tag named tag, the associated XPath expression is //tag, and there is no associated semantic compressor. Users may override this on the command line. For example:
creates two separate containers for address elements: one for those occurring under shipping, and one for those occurring under billing. If there exist address elements occurring under elements other than shipping or billing, then their content is stored in a third container. Note that the container expressions only need to be specified to the compressor, not to the decompressor. Overriding the default grouping of data items usually results in only modest improvements in the compression ratio. Much better improvements are achieved, however, with semantic compressors.
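The routing decision made by the path processor can be sketched as a suffix match between the path of the current data item and the user-supplied container expressions, with the default //tag container as a fallback. The sketch assumes expressions consisting only of a leading // followed by child steps, and interprets "under" as "directly under". The function name and sample paths are hypothetical.

```python
def pick_container(path, expressions):
    """Return the container expression that should receive a data item.

    path        -- tag names from the root down to the innermost element
    expressions -- user-supplied expressions such as "//shipping/address",
                   tried in the order given
    """
    for expr in expressions:
        steps = expr.lstrip("/").split("/")
        # "//a/b" matches any path whose last steps are a, b.
        if path[-len(steps):] == steps:
            return expr
    # Default grouping: one container per innermost element type.
    return "//" + path[-1]

user_exprs = ["//shipping/address", "//billing/address"]

c1 = pick_container(["order", "shipping", "address"], user_exprs)
c2 = pick_container(["order", "billing", "address"], user_exprs)
c3 = pick_container(["order", "contact", "address"], user_exprs)
```

Here c3 falls through to the default //address container, which corresponds to the third container mentioned above.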
Query-Oriented XML Compressors
XGrind. The technique described in [9] is intended to simultaneously provide efficient query-processing performance and reasonable compression ratios. Basic requirements to achieve the former objective are (i) fine-grained compression at the element/attribute granularity of query predicates, and (ii) context-free compression, which assigns codes to data items independently of their location in the document. Algorithms such as LZ77 are not context-free, and therefore XGrind uses non-adaptive Huffman (or Arithmetic) coding, in which two passes are made over the XML document: the first to collect the statistics and the second to do the actual encoding. A separate character-frequency distribution table is used for each element and non-enumerated attribute, resulting in fine-grained characterization. The DTD is used to identify enumerated-type attributes, and their values are encoded using a simple binary encoding scheme, while the compression of XML tags is similar to that of XMill. With this scheme, exact-match and prefix-match queries can be carried out completely on the compressed document, while range or partial-match queries only require on-the-fly decompression of the element/attribute values that feature in the query predicates.
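Why a context-free code permits matching in the compressed domain can be seen in a small sketch. It builds a non-adaptive Huffman table for the values of one hypothetical element in a first pass, encodes the values in a second pass, and then answers exact-match and prefix-match predicates by encoding the query constant with the same table. Codes are kept as bit strings for readability. This is an illustration of the idea, not XGrind's actual implementation.

```python
import heapq
from collections import Counter

def build_codes(values):
    """Pass 1: character statistics for one element -> Huffman code table."""
    freq = Counter("".join(values))
    # Heap entries: (weight, tie-breaker, partial code table).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol alphabet
        return {ch: "0" for ch in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(value, codes):
    """Pass 2: the code of a value depends only on the value and the table."""
    return "".join(codes[ch] for ch in value)

# Hypothetical values of one element type.
values = ["GET", "POST", "GET", "PUT"]
codes = build_codes(values)
compressed = [encode(v, codes) for v in values]

# Exact match: compare compressed values against the compressed constant.
exact = [i for i, c in enumerate(compressed) if c == encode("GET", codes)]

# Prefix match: a prefix-free code maps string prefixes to bit-string prefixes.
prefix = [i for i, c in enumerate(compressed) if c.startswith(encode("P", codes))]
```

Neither predicate decompresses any stored value. A query constant containing a character absent from the table raises KeyError in this sketch; such a constant cannot match any stored value.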
A distinguishing feature of the XGrind compressor is that it ensures homomorphic compression, that is, its output, like its input, is semi-structured in nature. In fact, the compressed XML document can be viewed as the original XML document with its tags and element/attribute values replaced by their corresponding encodings. This has several advantages. First, the variety of efficient techniques available for parsing/querying XML documents can also be used to process the compressed document. Second, indexes can be built on the compressed document similar to those built on regular XML documents. Third, updates to the XML document can be directly executed on the compressed version. Finally, a compressed document can be directly checked for validity against the compressed version of its DTD.
As a specific example of the utility of homomorphic compression, consider repositories of genomic data (e.g., ), which allow registered users to upload new genetic information to their archives. With homomorphic compression, such information could be compressed by the user, then uploaded, checked for validity, and integrated with the existing archives, all operations taking place completely in the compressed domain.
XPRESS interval scheme
For each node in the tree, the associated interval is incrementally computed from the parent node. The intervals generated by reverse arithmetic encoding guarantee that “If a path P is represented by the interval I, then all intervals for suffixes of P contain I.” Therefore, the interval for subsection, which is (0.6, 0.9), contains the interval (0.69, 0.699) for the path book.section.subsection. Another feature of XPRESS is that it infers the data types of elements, and for those that turn out to be numbers over large domains, the values are compressed by first converting them to binary and then using differential encoding, instead of the default string encoding. Recently, XPRESS has been extended to handle updates such as insertions or deletions of XML fragments.
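The containment property can be checked numerically with a short sketch. It assumes one common formulation of the incremental step: the interval of the parent path is scaled into the interval of the current tag. The tag intervals for book and section are assumptions chosen so that the result reproduces the figures quoted above; only the subsection interval comes from the text. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction as F

# Per-tag base intervals. Only "subsection" is given in the text;
# "book" and "section" are assumed values for illustration.
TAG_INTERVALS = {
    "book": (F(0), F(1, 10)),
    "section": (F(3, 10), F(6, 10)),
    "subsection": (F(6, 10), F(9, 10)),
}

def path_interval(path):
    """Interval of a simple path, computed incrementally from the root."""
    lo, hi = TAG_INTERVALS[path[0]]
    for tag in path[1:]:
        t_lo, t_hi = TAG_INTERVALS[tag]
        width = t_hi - t_lo
        # Scale the parent's interval into the current tag's interval.
        lo, hi = t_lo + width * lo, t_lo + width * hi
    return lo, hi

def contains(outer, inner):
    """True if interval inner lies within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

full = path_interval(["book", "section", "subsection"])
suffix2 = path_interval(["section", "subsection"])
suffix1 = path_interval(["subsection"])
```

With these values, full evaluates to (69/100, 699/1000), and both suffix intervals contain it. A path query such as //section/subsection can therefore be answered by testing whether an element's interval lies inside the interval of the query path.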
Finally, an index-based compression approach that improves on both the compression ratio and the query processing speed has recently been proposed in [3].
Data archiving, data exchange, query processing.
- 2. Cheney J. Compressing XML with multiplexed hierarchical PPM models. In: Proceedings Data Compression Conference; 2001. p. 163–72.
- 3. Ferragina P, Luccio F, Manzini G, Muthukrishnan M. Compressing and searching XML data via two zips. In: Proceedings 15th International World Wide Web Conference; 2006. p. 751–60.
- 4. Girardot M, Sundaresan N. Millau: an encoding format for efficient representation and exchange of XML over the Web. In: Proceedings 9th International World Wide Web Conference; 2000.
- 5. Liefke H, Suciu D. An extensible compressor for XML data. ACM SIGMOD Rec. 2000;29(1):57–62.
- 6. Liefke H, Suciu D. XMill: an efficient compressor for XML data. In: Proceedings ACM SIGMOD International Conference on Management of Data; 2000. p. 153–64.
- 7. Min JK, Park M, Chung C. XPRESS: a queriable compression for XML data. In: Proceedings ACM SIGMOD International Conference on Management of Data; 2003. p. 122–33.
- 9. Tolani P, Haritsa J. XGRIND: a query-friendly XML compressor. In: Proceedings 18th International Conference on Data Engineering; 2002. p. 225–35.