SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

  • Wolfgang Gottesheim
  • Stefan Mitsch
  • Werner Retschitzegger
  • Wieland Schwinger
  • Norbert Baumgartner
Conference paper

DOI: 10.1007/978-3-642-20244-5_47

Part of the Lecture Notes in Computer Science book series (LNCS, volume 6637)
Cite this paper as:
Gottesheim W., Mitsch S., Retschitzegger W., Schwinger W., Baumgartner N. (2011) SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors. In: Xu J., Yu G., Zhou S., Unland R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs, as well as test data sets of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability is insufficiently configurable.

In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we propose how to define duplicate semantics for the domain of road traffic management. A discussion of lessons learned concludes the paper.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Wolfgang Gottesheim (1)
  • Stefan Mitsch (1)
  • Werner Retschitzegger (1)
  • Wieland Schwinger (1)
  • Norbert Baumgartner (2)

  1. Johannes Kepler University Linz, Linz, Austria
  2. team Communication Technology Mgt. Ltd., Vienna, Austria