Flexible Workload-Aware Clustering of XML Documents

  • Rajesh Bordawekar
  • Oded Shmueli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3186)


We investigate workload-directed physical data clustering in native XML database and repository systems. We present a practical algorithm for clustering XML documents, called XC, which is based on Lukes’ tree partitioning algorithm. XC approximates certain aspects of Lukes’ algorithm so as to substantially reduce memory and time usage. XC can operate with varying degrees of precision, even in memory-constrained environments. Experimental results indicate that XC produces higher-quality partitions than a workload-directed depth-first scan-and-store scheme, with only a slight performance overhead. We demonstrate that XC is substantially faster than the exact Lukes’ algorithm, with only a minimal loss in clustering quality. Results also indicate that XC can exploit application workload information to generate XML clustering solutions that lead to a major reduction in page faults for the workload under consideration.
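
To make the underlying problem concrete, the following is a minimal sketch of an exact, Lukes-style dynamic program for weight-bounded tree partitioning. It is not XC and not the authors' code. The function name, the dictionary-based tree encoding, and the interpretation of node weights as storage sizes, edge values as workload traversal counts, and `capacity` as a page size are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Node = Hashable


def best_partition_value(
    children: Dict[Node, List[Node]],
    node_weight: Dict[Node, int],
    edge_value: Dict[Tuple[Node, Node], int],
    root: Node,
    capacity: int,
) -> int:
    """Maximum total value of edges kept inside clusters, where every
    cluster is a connected part of the tree with total weight <= capacity.
    Edge values are assumed to be non-negative."""

    def solve(v: Node) -> Dict[int, int]:
        # table[w] = best value in the subtree of v when the cluster
        # containing v weighs exactly w.
        if node_weight[v] > capacity:
            raise ValueError(f"node {v!r} does not fit in one cluster")
        table = {node_weight[v]: 0}
        for c in children.get(v, []):
            child = solve(c)
            best_child = max(child.values())
            merged: Dict[int, int] = {}
            for w, val in table.items():
                # Option 1: cut edge (v, c); the child subtree is stored
                # separately and contributes its own best value.
                cand = val + best_child
                if cand > merged.get(w, -1):
                    merged[w] = cand
                # Option 2: keep edge (v, c); the two clusters merge, so
                # their weights add and the edge value is gained.
                for cw, cval in child.items():
                    nw = w + cw
                    if nw <= capacity:
                        cand = val + cval + edge_value[(v, c)]
                        if cand > merged.get(nw, -1):
                            merged[nw] = cand
            table = merged
        return table

    return max(solve(root).values())


# Hypothetical four-node document tree: a -> (b -> d), c
children = {"a": ["b", "c"], "b": ["d"]}
node_weight = {"a": 2, "b": 2, "c": 1, "d": 1}
edge_value = {("a", "b"): 5, ("a", "c"): 1, ("b", "d"): 3}
print(best_partition_value(children, node_weight, edge_value, "a", 4))
```

Each node's table can hold up to `capacity` entries, which hints at why an exact approach of this kind becomes expensive in memory and time on large documents. The sketch uses recursion, so very deep trees would need an iterative traversal.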


Keywords (added by machine, not by the authors): Memory Usage; Optimal Partition; Chunk Size; Page Fault; XPath Query



References

  1. Bohannon, P., Freire, J., Roy, P., Simeon, J.: From XML Schema to Relations: A Cost-Based Approach to XML Storage. In: Proceedings of the 18th IEEE International Conference on Data Engineering, pp. 64–80 (2002)
  2. Bordawekar, R., Shmueli, O.: Flexible Workload-aware Clustering of XML Documents. Technical report, IBM T. J. Watson Research Center (May 2004)
  3. Fiebig, T., Helmer, S., Kanne, C., Mildenberger, J., Moerkotte, G., Schiele, R., Westmann, T.: Anatomy of a Native XML Database System. Technical Report, University of Mannheim (2002)
  4. Florescu, D., Kossmann, D.: Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin 22(3), 27–34 (1999)
  5. Garey, M.R., Johnson, D.S.: Computers and Intractability. W. H. Freeman and Co, New York (1979)
  6. Gerlhof, C.A., Kemper, A., Kilger, C., Moerkotte, G.: Partition-Based Clustering in Object Bases: From Theory to Practice. In: Lomet, D.B. (ed.) FODO 1993. LNCS, vol. 730, pp. 301–316. Springer, Heidelberg (1993)
  7. Johnson, D.S., Niemi, K.A.: On Knapsacks, Partitions, and a New Dynamic Programming Technique for Trees. Mathematics of Operations Research 8(1) (1983)
  8. Kanne, C., Moerkotte, G.: Efficient Storage of XML Data. In: Proceedings of the 16th International Conference on Data Engineering, IEEE Computer Society, Los Alamitos (March 2000)
  9. Lukes, J.A.: Efficient Algorithm for the Partitioning of Trees. IBM Journal of Research and Development 13(2), 163–178 (1974)
  10. Schkolnick, M.: A Clustering Algorithm for Hierarchical Structures. Transactions on Database Systems 2(1), 27–44 (1977)
  11. Schoning, H., Wasch, J.: Tamino - An Internet Database System. In: Zaniolo, C., Grust, T., Scholl, M.H., Lockemann, P.C. (eds.) EDBT 2000. LNCS, vol. 1777, pp. 383–387. Springer, Heidelberg (2000)
  12. Tsangaris, M.M., Naughton, J.F.: On the Performance of Object Clustering Techniques. In: Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pp. 144–153. ACM Press, New York (1992)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Rajesh Bordawekar (1)
  • Oded Shmueli (2)

  1. IBM T. J. Watson Research Center, Hawthorne, U.S.A.
  2. Computer Science Department, Technion, Haifa, Israel
