Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

XML Compression

  • Dan Suciu
  • Jayant R. Haritsa
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_783-2

Keywords

Data item, query processing, query path, path expression, XPath query

Definition

XML is an extremely verbose data format, with a high degree of redundant information, due to the same tags being repeated over and over for multiple data items, and due to both tags and data values being represented as strings. Viewed in relational database terms, XML stores the “schema” with each and every “record” in the repository. The size increase incurred by publishing data in XML format is estimated to be as much as 400 % [14], making it a prime target for compression. While standard general-purpose compressors, such as zip, gzip or bzip, typically compress XML data reasonably well, specialized XML compressors have been developed over the last decade that exploit the specific structural aspects of XML data. These new techniques fall into two classes: (i) Compression-oriented, where the goal is to maximize the compression ratio of the data, typically up to a factor of two better than the general-purpose compressors; and (ii) Query-oriented, where the goal is to integrate the compression strategy with an XPath query processor such that queries can be processed directly on the compressed data, selectively decompressing only the data relevant to the query result.

Historical Background

Research into XML compression was initiated with Liefke and Suciu’s development in 2000 of a compressor called XMill [6]. It is based on three principles: separating the structure from the content of the XML document, bucketing the content items based on their tags, and compressing each bucket separately. XMill is a compression-oriented technique, focusing solely on achieving high compression ratios and fast compression/decompression, ignoring the query processing aspects. Other compression-oriented schemes that appeared around the same time include Millau [4], designed for efficient encoding and streaming of XML structures; and XMLPPM, which implements an extended SAX parser for online processing of documents [2]. There are also several commercial offerings that have been featured on the Internet (e.g., [11, 13, 15]).
Fig. 1

Architecture of the XMill compressor [6]

Subsequently, the focus shifted to the development of query-oriented techniques intended to support query processing directly on the compressed data. This stream of research began with Tolani and Haritsa [9] presenting in 2002 a system called XGrind, where compression is carried out at the granularity of individual element/attribute values using a simple context-free compression scheme – tags are encoded by integers while textual content is compressed using non-adaptive Huffman (or Arithmetic) coding. XGrind consciously maintains a homomorphic encoding, that is, the compressed document is still in XML format, the intention being that all existing XML-related tools (such as parsers, indexes, schema-checkers, etc.) could continue to be used on the compressed document.

In 2003, Min et al. [7] proposed a compressor called XPRESS that extended the homomorphic compression approach of XGrind to include effective evaluation of query path expressions and range queries on numerical attributes. Their scheme uses a reverse arithmetic path-encoding that encodes each path as an interval of real numbers between 0 and 1. The following year produced the XQueC system [1] from Arion et al., which supports cost-based tradeoffs between compact storage and efficient processing.

An excellent survey of the state-of-the-art in XML compressors is available in [1].

Foundations

The basic principles for compressing XML documents are the following:
  • Separate structure from data: The structure consists of XML tags and attributes, organized as a tree. The data consists of a sequence of items (strings) representing element text contents and attribute values. The structure and the data are compressed separately.

  • Group data items with related meaning: Data items are logically or physically grouped into containers, and each container is compressed separately. By exploiting similarities between the values in a container, the compression improves substantially. Typically, data items are grouped based on the element type, but some systems (e.g., [1]) chose more elaborate grouping criteria.

  • Apply specialized compressors to containers: Some data items are plain-text, while others are numbers, dates, etc., and for each of these different domains, the system uses a specialized compressor.
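The three principles above can be illustrated with a minimal Python sketch; the sample document, tag numbering, and use of zlib for every bucket are illustrative stand-ins for a real system's specialized compressors.

```python
# A minimal sketch of the three principles: separate structure from data,
# group data items into per-tag containers, and compress each part separately.
# The document and encoding conventions here are illustrative only.
import zlib
import xml.etree.ElementTree as ET

DOC = b"""<log>
<entry><host>202.239.238.16</host><status>200</status></entry>
<entry><host>202.239.238.17</host><status>404</status></entry>
</log>"""

def separate(doc):
    """Split an XML document into a structure token stream and
    per-tag data containers (principles 1 and 2)."""
    tags, structure, containers = {}, [], {}
    def walk(elem):
        tid = tags.setdefault(elem.tag, len(tags) + 1)
        structure.append(tid)                 # start-tag -> positive integer
        if elem.text and elem.text.strip():
            containers.setdefault(elem.tag, []).append(elem.text.strip())
            structure.append(-tid)            # container reference -> negative
        for child in elem:
            walk(child)
        structure.append(0)                   # shared end-tag token
    walk(ET.fromstring(doc))
    return structure, containers

structure, containers = separate(DOC)
# Principle 3: compress the structure and each container independently.
compressed = {name: zlib.compress("\n".join(vals).encode())
              for name, vals in containers.items()}
print(structure[:6])      # [1, 2, 3, -3, 0, 4]
print(sorted(compressed)) # ['host', 'status']
```

Because all `host` values share a common prefix, the per-container streams compress far better than the interleaved original would.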

Compression-Oriented XML Compressors

The architecture of the XMill compressor [6], which is typical of several XML compressors, is depicted in Fig. 1. The XML file is parsed by a SAX parser that sends tokens to the path processor. The purpose of the path processor is to separate the structure from the data, and to further separate the data items according to their semantics.

Next, the structure container and all data containers are compressed with gzip, then written to disk. Optionally, the data items in certain containers may be compressed with a user-defined semantic compressor. For example, numerical values or IP addresses can be binary encoded, dates can be represented using specialized data structures, etc.
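A semantic compressor of the kind mentioned above can be sketched as follows for IP addresses; the function names are illustrative, not XMill's actual API.

```python
# A hedged sketch of a "semantic compressor": an IP address is stored as
# 4 binary bytes instead of a dotted-decimal string, and restored exactly
# on decompression. Function names here are illustrative.
import socket

def compress_ip(value: str) -> bytes:
    return socket.inet_aton(value.strip())   # "202.239.238.16" -> 4 bytes

def decompress_ip(data: bytes) -> str:
    return socket.inet_ntoa(data)

packed = compress_ip("202.239.238.16")
print(len(packed))            # 4 bytes instead of 14 characters
print(decompress_ip(packed))  # 202.239.238.16
```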

By default, XMill groups data items based on their innermost element type. Users, however, can override this by providing container expressions on the command line. The path processor uses these expressions to determine in which container to store each data item. Path expressions also determine which semantic compressor to apply (if any).

The amount of main memory holding all containers is fixed. When the limit is reached, all containers are gzip-ed and written to disk as one logical block, and compression then resumes. In effect, this partitions the input XML file into logical blocks that are compressed independently.
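This fixed memory window can be sketched as follows; the tiny budget, container names, and in-memory list of blocks are illustrative stand-ins for XMill's actual buffer management and on-disk format.

```python
# A minimal sketch of the fixed memory window: containers accumulate data
# items until a size budget is exceeded, then everything is compressed and
# flushed as one independently decompressible block.
import zlib

class BlockWriter:
    def __init__(self, budget=64):
        self.budget, self.used = budget, 0
        self.containers = {}
        self.blocks = []                     # stands in for the output file

    def add(self, container, item):
        self.containers.setdefault(container, []).append(item)
        self.used += len(item)
        if self.used >= self.budget:
            self.flush()

    def flush(self):
        if not self.containers:
            return
        block = {c: zlib.compress("".join(v).encode())
                 for c, v in self.containers.items()}
        self.blocks.append(block)            # one logical block, then reset
        self.containers, self.used = {}, 0

w = BlockWriter(budget=20)
for i in range(6):
    w.add("host", f"202.239.238.{i} ")
w.flush()
print(len(w.blocks))  # 3 independently compressed blocks
```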

The decompressor, XDemill, is similar, but proceeds in reverse. It reads one block at a time in main memory, decompresses every container, then merges the XML tags with the data values to produce the XML output.

Example

To illustrate the working of XMill, consider the following snippet of Web-server log data, where each entry represents one HTTP request:

<apache:entry>
<apache:host> 202.239.238.16 </apache:host>
<apache:requestLine>GET / HTTP/1.0 </apache:requestLine>
<apache:contentType> text/html </apache:contentType>
<apache:statusCode> 200 </apache:statusCode>
<apache:date> 1997/10/01-00:00:02 </apache:date>
<apache:byteCount> 4478 </apache:byteCount>
<apache:referer> http://www.so-net.jp/ </apache:referer>
<apache:userAgent> Mozilla/3.0 [ja] </apache:userAgent>
</apache:entry>

After the document is parsed, the path processor separates it into the structure and the content, and further separates the content into different containers. The structure is obtained by removing all text values and attribute values and replacing them with their container number. Start-tags are dictionary-encoded, i.e., assigned an integer value, while all end-tags are replaced by the same, unique token. For illustration purposes, start-tags are denoted with T1, T2,…, the unique end-tag with /, and container numbers with C1, C2,…. In this example the structure of the entry element is:

T1 T2 C1 / T3 C2 / T4 C3 / T5 C4 / T6 C5 / T7 C6 / T8 C7 / T9 C8 / /

Here T1 = apache:entry, T2 = apache:host, and so on, while / represents any end-tag. Internally, each token is encoded as an integer: tags are positive, container numbers are negative, and / is 0. Numbers in the range (−64, 63) take one byte, while numbers outside this range take two or four bytes; the example string is overall coded in 26 bytes. The structure is compressed using gzip, which is based on Ziv-Lempel’s LZ77 algorithm [10]. This results in excellent compression, because LZ77 exploits very well the frequent repetitions in the structure container: the compressed structure usually amounts to only 1–3 % of the compressed file.
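A token encoding consistent with the byte counts above can be sketched as follows; the exact bit layout is illustrative, not XMill's actual wire format.

```python
# A sketch of a variable-length token encoding consistent with the sizes
# quoted above: values in (-64, 63) take one byte, larger values take two
# or four. The bit layout here is an assumption for illustration only.
def encode_token(v: int) -> bytes:
    if -64 < v < 63:
        return bytes([(v + 64) & 0x7F])                   # 1 byte
    if -8192 < v < 8191:
        return (0x8000 | (v + 8192)).to_bytes(2, "big")   # 2 bytes, flag bit
    return b"\xff\xff" + (v & 0xFFFFFFFF).to_bytes(6 - 2, "big")  # 4 bytes

# The entry element: one start-tag, then eight children each contributing a
# start-tag, a container reference, and an end-tag, then the final end-tag.
# Tags T1..T9 map to 1..9, containers C1..C8 to -1..-8, the end-tag to 0.
tokens = [1] + [t for i in range(1, 9) for t in (i + 1, -i, 0)] + [0]
encoded = b"".join(encode_token(t) for t in tokens)
print(len(tokens), len(encoded))  # 26 tokens -> 26 bytes
```

All 26 tokens fall in the one-byte range, which reproduces the 26-byte figure for the example.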

Next, data items are partitioned into containers, then compressed. Each container is associated with an XPath expression that defines which data items are stored in that container, and an optional semantic compressor for the items in that container. By default there is a container for each tag occurring in the XML file, the associated XPath expression is //tag, and there is no associated semantic compressor. Users may override this on the command line. For example:

xmill -p //shipping/address -p //billing/address file.xml

creates two separate containers for address elements: one for those occurring under shipping, and one for those occurring under billing. If there exist address elements occurring under elements other than shipping or billing, then their content is stored in a third container. Note that the container expressions only need to be specified to the compressor, not to the decompressor. Overriding the default grouping of data items usually results in only modest improvements in the compression ratio. Much better improvements are achieved, however, with semantic compressors.

Query-Oriented XML Compressors

XGrind. The technique described in [9] is intended to simultaneously provide efficient query-processing performance and reasonable compression ratios. Basic requirements to achieve the former objective are (i) fine-grained compression at the element/attribute granularity of query predicates, and (ii) context-free compression assigning codes to data items independent of their location in the document. Algorithms such as LZ77 are not context-free, and therefore XGrind uses non-adaptive Huffman (or Arithmetic) coding, in which two passes are made over the XML document – the first to collect the statistics and the second to do the actual encoding. A separate character-frequency distribution table is used for each element and non-enumerated attribute, resulting in fine-grained characterization. The DTD is used to identify enumerated-type attributes and their values are encoded using a simple binary encoding scheme, while the compression of XML tags is similar to that of XMill. With this scheme, exact-match and prefix-match queries can be completely carried out directly on the compressed document, while range or partial-match queries only require on-the-fly decompression of the element/attribute values that feature in the query predicates.
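The two-pass, context-free scheme can be sketched in Python; the sample values are invented, and a real implementation would keep one frequency table per element type and per non-enumerated attribute, as described above.

```python
# A hedged sketch of non-adaptive, context-free compression: pass 1 collects
# character statistics for one element type, pass 2 encodes each value with
# the resulting fixed Huffman code. Because the code does not depend on a
# value's location, an exact-match predicate can be compressed once and
# compared directly against compressed values.
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code {char: bitstring} from character counts."""
    heap = [(n, i, {c: ""}) for i, (c, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                      # degenerate single-symbol case
        return {c: "0" for c in heap[0][2]}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {c: "0" + b for c, b in c1.items()}
        merged.update({c: "1" + b for c, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

values = ["Kumar", "Rao", "Kumar", "Iyer"]        # all values of one element
table = huffman_code(Counter("".join(values)))    # pass 1: statistics
def encode(s):                                    # pass 2: context-free coding
    return "".join(table[ch] for ch in s)

compressed = [encode(v) for v in values]
query = encode("Kumar")                           # compress the predicate once
matches = [i for i, v in enumerate(compressed) if v == query]
print(matches)  # [0, 2]
```

The exact-match query never decompresses any value, which is precisely what context-free coding buys over adaptive schemes such as LZ77.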

A distinguishing feature of the XGrind compressor is that it ensures homomorphic compression – that is, its output, like its input, is semi-structured in nature. In fact, the compressed XML document can be viewed as the original XML document with its tags and element/attribute values replaced by their corresponding encodings. The advantages of doing so are the following: First, the variety of efficient techniques available for parsing/querying XML documents can also be used to process the compressed document. Second, indexes can be built on the compressed document similar to those built on regular XML documents. Third, updates to the XML document can be directly executed on the compressed version. Finally, a compressed document can be directly checked for validity against the compressed version of its DTD.

As a specific example of the utility of homomorphic compression, consider repositories of genomic data (e.g., [12]), which allow registered users to upload new genetic information to their archives. With homomorphic compression, such information could be compressed by the user, then uploaded, checked for validity, and integrated with the existing archives, all operations taking place completely in the compressed domain.

To illustrate the working of XGrind, consider the XML student document fragment along with its DTD shown in Figs. 2 and 3. An abstract view of its XGrind compressed version is shown in Fig. 4, in which nahuff(s) denotes the output of the Huffman-Compressor for an input data value s, while enum(s) denotes the output of the Enum-Encoder for an input data value s, which is an enumerated attribute. As is evident from Fig. 4, the compressed document output in the second pass is semi-structured in nature, and maintains the property of validity with respect to the compressed DTD.
Fig. 2

Fragment of student database

Fig. 3

DTD of student database

The compressed-domain query processing engine consists of a lexical analyzer that emits tokens for encoded tags, attributes, and data values, and a parser built on top of this lexical analyzer that does the matching and dumping of the matched tree fragments. The parser maintains information about its current path location in the XML document and the contents of the set of XML nodes that it is currently processing. For exact-match or prefix-match queries, the query path and the query predicate are converted to their compressed-domain equivalents. During parsing of the compressed XML document, when the parser detects that the current path matches the query path, and that the compressed data value matches the compressed query predicate, it outputs the matched XML fragment. An interesting side-effect is that the matching is more efficient in the compressed domain than in the original domain, since the number of bytes to be processed has considerably decreased.
Fig. 4

Abstract view of compressed XGrind database

XPRESS. While maintaining the homomorphic feature of XGrind, XPRESS [7] significantly extends its scope by supporting both path expressions and range queries (on numeric element types) directly on the compressed data. Here, instead of representing the tag of each element with a single identifier, the element label path is encoded as a distinct interval in (0.0, 1.0). The specific process, called reverse arithmetic encoding, is as follows: First, the entire interval (0.0, 1.0) is partitioned into disjoint sub-intervals, one for each distinct element. The size of the interval is proportional to the normalized frequency of the element in the data. In the second step, these element intervals are reduced by encoding the path leading to this element in a depth-first tree traversal from the root. An example from [7] is shown in Table 1, where the element intervals are computed from the first partitioning, and the corresponding reduced intervals by following the path labels from the root.
Table 1

XPRESS interval scheme

Element tag   Path label                 Element interval   Path interval
book          book                       (0.0, 0.1)         (0.0, 0.1)
section       book.section               (0.3, 0.6)         (0.3, 0.33)
subsection    book.section.subsection    (0.6, 0.9)         (0.69, 0.699)
For each node in the tree, the associated interval is incrementally computed from the parent node. The intervals generated by reverse arithmetic encoding guarantee that “If a path P is represented by the interval I, then all intervals for suffixes of P contain I.” Therefore, the interval for subsection, which is (0.6, 0.9), contains the interval (0.69, 0.699) for the path book.section.subsection. Another feature of XPRESS is that it infers the data types of elements, and for those that turn out to be numbers over large domains, the values are compressed by first converting them to binary and then using differential encoding, instead of the default string encoding. Recently, XPRESS has been extended in [8] to handle updates such as insertions or deletions of XML fragments.
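The interval computation can be reproduced from Table 1 with a short sketch: the recurrence below (narrowing the current element's interval by the parent path's interval) matches every row of the table, though XPRESS's actual implementation details may differ.

```python
# A sketch of reverse arithmetic encoding, reproducing Table 1. An element's
# interval is narrowed by the parent path's interval, so each path interval
# is contained in the intervals of all of its suffixes.
ELEMENT = {                       # intervals from the first partitioning step
    "book": (0.0, 0.1),
    "section": (0.3, 0.6),
    "subsection": (0.6, 0.9),
}

def path_interval(path):
    """Interval for a label path such as 'book.section.subsection'."""
    labels = path.split(".")
    lo, hi = ELEMENT[labels[0]]           # root path = element interval
    for label in labels[1:]:
        emin, emax = ELEMENT[label]
        width = emax - emin
        lo, hi = emin + width * lo, emin + width * hi
    return lo, hi

lo, hi = path_interval("book.section.subsection")
print(round(lo, 6), round(hi, 6))   # 0.69 0.699, the last row of Table 1
```

Note how the result (0.69, 0.699) lies inside subsection's element interval (0.6, 0.9), which is exactly the suffix-containment property that lets a query like //subsection be answered by an interval-containment test.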

Finally, an index-based compression approach that improves on both the compression ratio and the query processing speed has been recently proposed in [3].

Key Applications

Data archiving, data exchange, query processing.

Recommended Reading

  1. Arion A, Bonifati A, Manolescu I, Pugliese A. XQueC: a query-conscious compressed XML database. ACM Trans Internet Technol. 2007;7(2):1–35.
  2. Cheney J. Compressing XML with multiplexed hierarchical PPM models. In: Proceedings of the Data Compression Conference; 2001. p. 163–72.
  3. Ferragina P, Luccio F, Manzini G, Muthukrishnan S. Compressing and searching XML data via two zips. In: Proceedings of the 15th International World Wide Web Conference; 2006. p. 751–60.
  4. Girardot M, Sundaresan N. Millau: an encoding format for efficient representation and exchange of XML over the Web. In: Proceedings of the 9th International World Wide Web Conference; 2000.
  5. Liefke H, Suciu D. An extensible compressor for XML data. ACM SIGMOD Rec. 2000;29(1):57–62.
  6. Liefke H, Suciu D. XMill: an efficient compressor for XML data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2000. p. 153–64.
  7. Min JK, Park M, Chung C. XPRESS: a queriable compression for XML data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 122–33.
  8. Min JK, Park M, Chung C. XPRESS: a compressor for effective archiving, retrieval, and update of XML documents. ACM Trans Internet Technol. 2006;6(3):223–58.
  9. Tolani P, Haritsa J. XGRIND: a query-friendly XML compressor. In: Proceedings of the 18th International Conference on Data Engineering; 2002. p. 225–35.
  10. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Trans Inf Theory. 1977;23(3):337–43.
  11.
  12.
  13.
  14.
  15.

Copyright information

© Springer Science+Business Media LLC 2016

Authors and Affiliations

  1. University of Washington, Seattle, USA
  2. Indian Institute of Science, Bangalore, India

Section editors and affiliations

  • Sihem Amer-Yahia
  1. Laboratoire d'Informatique de Grenoble, CNRS and LIG, Grenoble, France