
Parallel Entity Resolution with Dedoop

  • Schwerpunktbeitrag (focus article)
  • Datenbank-Spektrum

Abstract

We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
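The general pattern behind blocking-based ER on MapReduce can be sketched in a few lines. The following is a minimal, single-process illustration and not Dedoop's actual implementation: the blocking key (first three letters of a title), the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.8, and the sample records are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the title, lowercased.
    return record["title"][:3].lower()


def map_phase(records):
    # Map: emit (blocking key, record) so that similar records share a key.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Stand-in for the MapReduce shuffle: group records by blocking key.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.8):
    # Reduce: compare all pairs within a block and emit those above the threshold.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
            if sim >= threshold:
                yield a["id"], b["id"]


# Illustrative records (made up for this sketch).
records = [
    {"id": 1, "title": "Canon EOS 600D"},
    {"id": 2, "title": "Canon EOS 600 D"},
    {"id": 3, "title": "Nikon D3100"},
]

matches = list(reduce_phase(shuffle(map_phase(records))))
print(matches)
```

Only records that share a blocking key are compared, which is what makes the reduce step parallelizable per block. It is also why skewed block sizes call for the load balancing the abstract mentions.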




Notes

  1. http://xmlstar.sourceforge.net/.

  2. Internally, Dedoop prefixes each blocking key with its (zero-padded) pass number to force blocking keys of pass i to be lexicographically smaller than keys of pass j>i. For the sake of readability, this prefix was omitted in the previous sections.

  3. We do not have the perfect match result for this dataset, so we could not use it to evaluate match quality.
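The key prefixing described in note 2 can be illustrated with a short sketch. The function name `prefixed_key`, the prefix width of two digits, and the `.` separator are assumptions made for this example; the note only states that a zero-padded pass number is prepended.

```python
def prefixed_key(pass_number, key, width=2):
    # Prepend the zero-padded pass number so that string comparison is decided
    # by the pass number first. The "." separator is a hypothetical choice.
    return f"{pass_number:0{width}d}.{key}"


# Keys from an earlier pass sort before keys from a later pass,
# regardless of the key content itself.
early = prefixed_key(2, "zzz")
late = prefixed_key(10, "aaa")
print(early, late, early < late)

# Without zero-padding the ordering breaks, because "2" > "1" as characters.
unpadded_early = "2.zzz"
unpadded_late = "10.aaa"
print(unpadded_early < unpadded_late)
```

The fixed width matters: it makes the lexicographic order of the prefixes coincide with the numeric order of the pass numbers, as long as the number of passes fits in the chosen width.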


Author information

Corresponding author

Correspondence to Lars Kolb.


About this article

Cite this article

Kolb, L., Rahm, E. Parallel Entity Resolution with Dedoop. Datenbank Spektrum 13, 23–32 (2013). https://doi.org/10.1007/s13222-012-0110-x


  • DOI: https://doi.org/10.1007/s13222-012-0110-x
