Datenbank-Spektrum, Volume 13, Issue 1, pp 23–32

Parallel Entity Resolution with Dedoop

  • Lars Kolb
  • Erhard Rahm

Focus article (Schwerpunktbeitrag)

Abstract

We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
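
The idea of running blocking and matching as a MapReduce job can be illustrated with a minimal, single-process sketch. This is not Dedoop's implementation or API: the function names, the prefix-based blocking key, the `difflib` similarity measure, the 0.8 threshold, and the sample records are all assumptions chosen for illustration. The map phase assigns each record a blocking key, the shuffle groups records by key, and the reduce phase compares only the records within the same block.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three characters of the lowercased title.
    return record["title"].lower()[:3]


def map_phase(records):
    # Map: emit one (blocking key, record) pair per input record.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Group records by key, as a MapReduce framework does between map and reduce.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def similarity(a, b):
    # Illustrative string similarity in [0, 1]; a real strategy may combine several metrics.
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def reduce_phase(blocks, threshold=0.8):
    # Reduce: compare every pair of records inside a block and keep pairs above the threshold.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


# Made-up sample records, used only to exercise the sketch.
records = [
    {"id": 1, "title": "Canon PowerShot A95"},
    {"id": 2, "title": "Canon Powershot A-95"},
    {"id": 3, "title": "Nikon Coolpix 4300"},
    {"id": 4, "title": "Canon EOS 350D"},
]

matches = reduce_phase(shuffle(map_phase(records)))
print(matches)
```

With these sample records, only records 1 and 2 are reported as a match. The sketch also hints at why data skew matters: the number of comparisons in a block grows quadratically with its size, so a single large block can dominate the runtime of the reduce task that receives it, which is the problem that load balancing approaches address.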

Keywords

MapReduce · Hadoop · Entity resolution · Blocking · Data skew · Load balancing

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Institut für Informatik, Universität Leipzig, Leipzig, Germany
