Abstract
XML has become the de facto standard for specifying and exchanging data on the Web. However, XML is verbose by nature, so XML documents are usually large, which hinders practical use by substantially increasing the costs of storing, processing, and exchanging data. To tackle this problem, many XML-specific compression systems, such as XMill, XGrind, XMLPPM, and Millau, have recently been proposed. However, these systems usually suffer from one of two inadequacies: they either sacrifice performance in terms of compression ratio and execution time in order to support a limited range of queries, or they require full decompression before queries can be processed over compressed documents.
In this paper, we address these problems by exploiting the information provided by a Document Type Definition (DTD) associated with an XML document. We show that a DTD facilitates better compression and yields more usable compressed data to support querying. We present the architecture of XCQ, a compression and querying tool for XML data. XCQ is based on a novel technique we have developed called DTD Tree and SAX Event Stream Parsing (DSP). Documents compressed by XCQ are stored in Partitioned Path-Based Grouping (PPG) data streams, which are equipped with a Block Statistics Signature (BSS) indexing scheme. The indexed PPG data streams support the processing of XML queries that involve selection and aggregation without the need for full decompression. To study the compression performance of XCQ, we carry out comprehensive experiments over a set of XML benchmark datasets.
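The idea of querying path-grouped, block-partitioned data without full decompression can be illustrated with a minimal Python sketch. It is not the XCQ implementation: it ignores the DTD entirely, uses a plain SAX handler to group text values by element path, uses zlib as a stand-in compressor, and assumes a per-block (min, max) summary as a simplified stand-in for a block statistics signature. The names `PathGrouper`, `build_blocks`, `select_greater_than`, `max_from_summaries`, and the `BLOCK_SIZE` value are all hypothetical.

```python
import xml.sax
import zlib
from collections import defaultdict

BLOCK_SIZE = 2  # hypothetical: values per block, kept tiny for illustration


class PathGrouper(xml.sax.ContentHandler):
    """Collects leaf text values keyed by their root-to-element path."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # names of the currently open elements
        self.text = []                   # text pieces seen since the last tag
        self.groups = defaultdict(list)  # path -> list of text values

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if value:
            self.groups["/" + "/".join(self.stack)].append(value)
        self.stack.pop()
        self.text = []


def build_blocks(values):
    """Split numeric values into compressed blocks with (min, max) summaries."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        chunk = [float(v) for v in values[i:i + BLOCK_SIZE]]
        payload = zlib.compress("\n".join(map(repr, chunk)).encode("utf-8"))
        blocks.append((min(chunk), max(chunk), payload))
    return blocks


def select_greater_than(blocks, threshold):
    """Return matching values and the number of blocks decompressed."""
    hits, opened = [], 0
    for lo, hi, payload in blocks:
        if hi <= threshold:
            continue  # the summary proves no value qualifies; skip this block
        opened += 1
        for line in zlib.decompress(payload).decode("utf-8").split("\n"):
            value = float(line)
            if value > threshold:
                hits.append(value)
    return hits, opened


def max_from_summaries(blocks):
    """Answer a MAX aggregate from the summaries alone, with no decompression."""
    return max(hi for _, hi, _ in blocks)


DOC = b"""<items>
  <item><price>5</price></item>
  <item><price>7</price></item>
  <item><price>40</price></item>
  <item><price>55</price></item>
  <item><price>9</price></item>
</items>"""

handler = PathGrouper()
xml.sax.parseString(DOC, handler)
price_blocks = build_blocks(handler.groups["/items/item/price"])
hits, opened = select_greater_than(price_blocks, 30)
print(hits, opened, len(price_blocks))  # [40.0, 55.0] 1 3
```

In this toy run, only one of the three price blocks is decompressed to answer the selection, and the maximum is read directly from the summaries. How XCQ itself derives its partitions from the DTD and what its signatures actually record are described in the paper, not in this sketch.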
Author information
Wilfred Ng obtained his M.Sc. (Distinction) and Ph.D. degrees from the University of London. His research interests are in the areas of databases and information systems, which include XML data, database query languages, web data management, and data mining. He is now an assistant professor in the Department of Computer Science at the Hong Kong University of Science and Technology (HKUST). Further information can be found at the following URL: http://www.cs.ust.hk/faculty/wilfred/index.html.
Wai-Yeung Lam obtained his M.Phil. degree from the Hong Kong University of Science and Technology (HKUST) in 2003. His research thesis was based on the project “XCQ: A Framework for Querying Compressed XML Data.” He is currently working in industry.
Peter Wood received his Ph.D. in Computer Science from the University of Toronto in 1989. He has previously studied at the University of Cape Town, South Africa, obtaining a B.Sc. degree in 1977 and an M.Sc. degree in Computer Science in 1982. Currently he is a senior lecturer at Birkbeck and a member of the Information Management and Web Technologies research group. His research interests include database and XML query languages, query optimisation, active and deductive rule languages, and graph algorithms.
Mark Levene received his Ph.D. in Computer Science in 1990 from Birkbeck College, University of London, having previously been awarded a B.Sc. in Computer Science from Auckland University, New Zealand in 1982. He is currently professor of Computer Science at Birkbeck College, where he is a member of the Information Management and Web Technologies research group. His main research interests are Web search and navigation, Web data mining and stochastic models for the evolution of the Web. He has published extensively in the areas of database theory and web technologies, and has recently published a book called ‘An Introduction to Search Engines and Web Navigation’.
Cite this article
Ng, W., Lam, WY., Wood, P.T. et al. XCQ: A queriable XML compression system. Knowl Inf Syst 10, 421–452 (2006). https://doi.org/10.1007/s10115-006-0012-z
DOI: https://doi.org/10.1007/s10115-006-0012-z