XML Duplicate Detection Using Sorted Neighborhoods

  • Sven Puhlmann
  • Melanie Weis
  • Felix Naumann
Conference paper

DOI: 10.1007/11687238_46

Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)
Cite this paper as:
Puhlmann S., Weis M., Naumann F. (2006) XML Duplicate Detection Using Sorted Neighborhoods. In: Ioannidis Y. et al. (eds) Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg

Abstract

Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: first, define a suitable similarity measure; second, apply that measure efficiently to all pairs of objects. With the advent and spread of the XML data model, it is necessary to find new similarity measures and to develop efficient methods to detect duplicate elements in nested XML data.

A classical approach to duplicate detection in flat relational data is the sorted neighborhood method, which draws its efficiency from sliding a window over the sorted relation and comparing only tuples within that window. We extend the algorithm to cover not only a single relation but also nested XML elements. To compare objects, we make use of XML parent and child relationships. For efficiency, we apply the windowing technique in a bottom-up fashion, detecting duplicates at each level of the XML hierarchy. Experiments show a speedup comparable to that of the original method, and they show the high effectiveness of our algorithm in detecting XML duplicates.
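To make the windowing idea concrete, the following is a minimal sketch of the classical sorted neighborhood method on flat records. It is not the paper's implementation. The function names `similarity` and `sorted_neighborhood`, the string-ratio similarity measure, the three-character sort key, and the threshold value are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return index pairs (i, j), i < j, of records judged to be duplicates.

    records   : list of strings
    key       : function that maps a record to its sort key
    window    : number of consecutive sorted records compared with each other
    threshold : minimum similarity for a pair to count as a duplicate
    """
    # Sort record indices by key so that likely duplicates end up close together.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data for illustration only.
records = ["John Smith", "Alice Jones", "Jon Smith", "Alice Jonas", "Bob Brown"]
duplicates = sorted_neighborhood(records, key=lambda r: r.lower()[:3], window=2)
print(sorted(duplicates))  # prints [(0, 2), (1, 3)]
```

With n records and window size w, this performs roughly n · (w − 1) comparisons instead of the n · (n − 1) / 2 needed to compare all pairs. The XML extension described in the abstract would apply such a pass at each level of the element hierarchy, starting with the deepest elements, so that duplicates already found among children can inform the comparison of their parents; that part is not shown here.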


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Sven Puhlmann
    • 1
  • Melanie Weis
    • 1
  • Felix Naumann
    • 1
  1. Humboldt-Universität zu Berlin, Berlin, Germany
