The VLDB Journal, Volume 21, Issue 5, pp 677–702

Measuring structural similarity of semistructured data based on information-theoretic approaches

Regular Paper

DOI: 10.1007/s00778-012-0263-0

Cite this article as:
Helmer, S., Augsten, N. & Böhlen, M. The VLDB Journal (2012) 21: 677. doi:10.1007/s00778-012-0263-0

Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.
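The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the linearization or the exact distance formulas, so everything below is an assumption made for illustration: structure is linearized as the sequence of opening and closing tags in document order, and the Kolmogorov-based measure is approximated by the normalized compression distance (NCD), with zlib standing in for the uncomputable Kolmogorov complexity. The function names `linearize`, `ncd`, and `structural_distance` are hypothetical and not taken from the paper.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed form): flatten the element structure into a tag
    sequence in document order. Text content and attributes are ignored."""
    root = ET.fromstring(xml_text)
    parts = []

    def visit(node):
        parts.append(f"<{node.tag}>")
        for child in node:
            visit(child)
        parts.append(f"</{node.tag}>")

    visit(root)
    return "".join(parts).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Compressed length, used as a computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (assumed form): normalized compression distance.
    Values near 0 suggest similar inputs, values near 1 dissimilar ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def structural_distance(doc_a: str, doc_b: str) -> float:
    """Distance between two XML documents based on structure only."""
    return ncd(linearize(doc_a), linearize(doc_b))


if __name__ == "__main__":
    library = "<lib>" + "<book><title/><author/></book>" * 50 + "</lib>"
    records = "<db>" + "<rec><id/><val/><ts/></rec>" * 50 + "</db>"
    print(structural_distance(library, library))  # small: identical structure
    print(structural_distance(library, records))  # larger: different structure
```

Because only tags are kept, two documents with the same element layout but different text content get the same linearized form. A general-purpose compressor is a rough proxy, and very short documents are dominated by compressor overhead, so the distances are only meaningful for inputs of reasonable size.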

Keywords

Similarity measures, Semistructured documents, XML, Clustering

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Sven Helmer (1)
  • Nikolaus Augsten (2)
  • Michael Böhlen (3)
  1. Birkbeck, University of London, London, UK
  2. Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
  3. University of Zurich, Zurich, Switzerland
