Divide, Compress and Conquer: Querying XML via Partitioned Path-Based Compressed Data Blocks

World Wide Web

Abstract

We propose a novel partition path-based (PPB) grouping strategy that stores compressed XML data in a stream of blocks. In addition, we employ a minimal indexing scheme called the block statistic signature (BSS) on the compressed data, a simple but effective technique that supports the evaluation of selection and aggregate XPath queries over the compressed data. We present a formal analysis and an empirical study of these techniques. BSS indexing is first extended into effective cluster statistic signature (CSS) and multiple-cluster statistic signature (MSS) indexing by establishing additional layers of indexes. We then analyze how the response time is affected by the parameters of our compression strategy, such as the data stream block size, the number of cluster layers, and the query selectivity. We gain further insight into compression and querying performance by studying the optimal block size in a stream, which yields the minimum processing cost for queries. The cost model analysis provides a solid foundation for predicting querying performance. Finally, we demonstrate that our PPB grouping and indexing strategies not only support path-based selection and aggregate queries over the compressed XML data efficiently, but also require relatively low computation time and storage space compared with other state-of-the-art compression strategies.
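The general idea of answering queries from per-block statistics can be illustrated with a minimal sketch. The abstract does not define the contents of a BSS, so the code below assumes, purely for illustration, that each block of numeric values for a single path carries its minimum, maximum, sum, and count. The names `Block`, `build_blocks`, `select_range`, and `total_sum` are hypothetical, and `zlib` stands in for whatever compressor is actually used.

```python
import zlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    payload: bytes  # zlib-compressed, newline-separated integers
    lo: int         # assumed signature field: minimum value in the block
    hi: int         # assumed signature field: maximum value in the block
    total: int      # assumed signature field: sum of the values
    count: int      # assumed signature field: number of values


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split the values of one path into fixed-size compressed blocks."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        raw = "\n".join(map(str, chunk)).encode("ascii")
        blocks.append(
            Block(zlib.compress(raw), min(chunk), max(chunk), sum(chunk), len(chunk))
        )
    return blocks


def decode(block: Block) -> List[int]:
    """Decompress one block back into its integer values."""
    text = zlib.decompress(block.payload).decode("ascii")
    return [int(s) for s in text.split("\n")]


def select_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks decompressed."""
    hits: List[int] = []
    opened = 0
    for b in blocks:
        if b.hi < low or b.lo > high:
            continue  # the signature proves this block holds no match
        opened += 1
        hits.extend(v for v in decode(b) if low <= v <= high)
    return hits, opened


def total_sum(blocks: List[Block]) -> int:
    """Answer an unfiltered sum from the signatures alone, with no decompression."""
    return sum(b.total for b in blocks)


blocks = build_blocks(list(range(1, 101)), block_size=10)
hits, opened = select_range(blocks, 35, 42)
```

In this sketch a selective range query decompresses only the blocks whose signature overlaps the predicate, and an unfiltered aggregate needs no decompression at all. Under these assumptions, the block size trades signature overhead against the amount of data decompressed per matching block, which is the kind of trade-off the abstract says the cost model analyzes.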



Author information

Corresponding author

Correspondence to Wilfred Ng.

About this article

Cite this article

Ng, W., Lau, HL. & Zhou, A. Divide, Compress and Conquer: Querying XML via Partitioned Path-Based Compressed Data Blocks. World Wide Web 11, 169–197 (2008). https://doi.org/10.1007/s11280-007-0037-6

  • DOI: https://doi.org/10.1007/s11280-007-0037-6
