Document Decomposition for XML Compression: A Heuristic Approach

  • Byron Choi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3882)

Abstract

Sharing of common subtrees has been reported to be useful not only for XML compression but also for main-memory XML query processing. This method compresses subtrees only when they exhibit identical structure, so even slight irregularities among subtrees dramatically reduce the performance of compression algorithms of this kind. Furthermore, when XML documents are large, the chance of having a large number of identical subtrees is inherently low. In this paper, we propose a method of decomposing XML documents for better compression. We propose a heuristic method for locating minor irregularities in XML documents. The irregularities are then projected out from the original XML document. We refer to this process as document decomposition. We demonstrate that better compression can be achieved by compressing the decomposed documents separately. Experimental results demonstrate that, for all real-world datasets known to us, the compressed skeletons fit comfortably into the main memory of today's commodity computers. Preliminary results on querying compressed skeletons validate the effectiveness of our approach.
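The effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's heuristic: the helper names `share_subtrees` and `project_out`, the toy documents, and the choice of `note` as the irregular tag are all assumptions made for illustration. The sketch shares structurally identical subtrees by hash-consing them into a table, shows that one extra child element prevents a subtree from being shared, and shows that removing the irregular element into a separate list restores the sharing.

```python
import xml.etree.ElementTree as ET


def share_subtrees(root):
    """Hash-cons the tag structure of a tree into a DAG (illustrative helper).

    Returns (root_id, table), where table maps a (tag, child_ids) key to a
    node id. Structurally identical subtrees receive the same id.
    """
    table = {}

    def visit(elem):
        key = (elem.tag, tuple(visit(child) for child in elem))
        return table.setdefault(key, len(table))

    return visit(root), table


def project_out(root, tag):
    """Detach every element named `tag` and return the detached elements.

    The tag is supplied by hand here; locating irregularities automatically
    is the subject of the paper and is not attempted in this sketch.
    """
    projected = []
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                projected.append(child)
    return projected


book = "<book><title/><author/></book>"
odd_book = "<book><title/><author/><note/></book>"  # assumed irregularity

regular = ET.fromstring("<db>" + book * 3 + "</db>")
irregular = ET.fromstring("<db>" + book + odd_book + book + "</db>")

_, table_regular = share_subtrees(regular)
_, table_irregular = share_subtrees(irregular)
print(len(table_regular))    # 4 distinct nodes for 10 elements
print(len(table_irregular))  # 6: the odd book cannot be shared

projected = project_out(irregular, "note")
_, table_decomposed = share_subtrees(irregular)
print(len(table_decomposed), len(projected))  # 4 1
```

In this toy case the decomposed document compresses to the same four shared nodes as the fully regular one, and the single projected element is kept separately.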


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Byron Choi (1)
  1. Nanyang Technological University, Singapore
