Advances in Databases: Concepts, Systems and Applications

Volume 4443 of the series Lecture Notes in Computer Science, pp. 905-911

Clustering XML Documents Based on Structural Similarity

  • Guangming Xing, Department of Computer Science, Western Kentucky University, Bowling Green, KY 42104
  • Zhonghang Xia, Department of Computer Science, Western Kentucky University, Bowling Green, KY 42104
  • Jinhua Guo, Computer and Information Science Department, University of Michigan - Dearborn, Dearborn, MI 48128

Abstract

In this paper, we present a framework for clustering XML documents based on their structural similarity. First, we establish the validity of using the edit distance between XML documents and schemata as a measure of structural similarity. Second, we give a novel solution for schema extraction. The solution is based on the minimum description length (MDL) principle and allows a tradeoff between schema simplicity and precision according to the user’s specification. Third, we discuss clustering XML documents based on the edit distance. The efficacy and efficiency of our methodology have been tested using both real and synthesized data.
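
To make the idea of clustering by structural edit distance concrete, here is a minimal Python sketch. It is not the method of the paper: the abstract measures distance between a document and an extracted schema, whereas this sketch substitutes a much simpler stand-in, the Levenshtein distance between the preorder tag sequences of two documents, followed by a greedy threshold clustering. The function names (`tag_sequence`, `edit_distance`, `cluster`) and the `threshold` parameter are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def tag_sequence(xml_text):
    """Return the element tags of a document in preorder (document order).

    Text content and attributes are ignored; only structure is kept.
    """
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost insert,
    delete, and substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster(docs, threshold):
    """Greedy clustering (an assumed, simplistic strategy): each document
    joins the first cluster whose representative is within `threshold`,
    otherwise it starts a new cluster. Returns lists of document indices."""
    clusters = []  # pairs of (representative tag sequence, member indices)
    for idx, doc in enumerate(docs):
        seq = tag_sequence(doc)
        for rep, members in clusters:
            if edit_distance(seq, rep) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((seq, [idx]))
    return [members for _, members in clusters]
```

For example, given the three documents `<a><b/><c/></a>`, `<a><b/><c/><c/></a>`, and `<x><y/></x>`, calling `cluster` with `threshold=1` groups the first two together and leaves the third on its own. Comparing each document to a schema rather than to a representative document, as the abstract describes, would let a single schema stand in for a whole cluster.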