Abstract
We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
Notes
Internally, Dedoop prefixes each blocking key with its zero-padded pass number, which forces the blocking keys of pass i to be lexicographically smaller than the keys of any pass j > i. For the sake of readability, this prefix has been omitted in the previous sections.
We do not have the perfect match result for this dataset, so we could not use it to evaluate match quality.
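The key-prefixing scheme from the first note can be illustrated with a minimal sketch. The function names, the two-digit padding width, and the "." separator are assumptions made for illustration; the source does not state how Dedoop formats the prefix. The sketch only shows why zero-padding is needed for the lexicographic ordering to follow the pass order.

```python
# Illustrative sketch only: the names, the padding width, and the "."
# separator are assumptions, not Dedoop's actual implementation.

PASS_WIDTH = 2  # assumed padding width; covers pass numbers 0..99


def prefix_blocking_key(pass_no: int, blocking_key: str,
                        width: int = PASS_WIDTH) -> str:
    """Prefix a blocking key with its zero-padded pass number."""
    if not 0 <= pass_no < 10 ** width:
        raise ValueError(
            f"pass number {pass_no} does not fit into {width} digits")
    # All prefixes have the same length, so comparing the strings
    # compares the pass numbers first and the blocking keys second.
    return f"{pass_no:0{width}d}.{blocking_key}"


def naive_key(pass_no: int, blocking_key: str) -> str:
    """Unpadded variant, shown only to illustrate the ordering problem."""
    return f"{pass_no}.{blocking_key}"


if __name__ == "__main__":
    pairs = [(10, "abc"), (2, "zzz"), (1, "mno")]
    # With padding, the sorted order follows the pass number.
    print(sorted(prefix_blocking_key(p, k) for p, k in pairs))
    # Without padding, pass 10 sorts before pass 2.
    print(sorted(naive_key(p, k) for p, k in pairs))
```

Without padding, the string "10.abc" sorts before "2.zzz" because the comparison is character by character, which is why a fixed-width prefix is used here.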
References
Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: KDD, pp 39–48
Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Data-centric systems and applications. Springer, Berlin
Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
Christen P, Churches T, Hegland M (2004) Febrl—a parallel open source data linkage system. In: PAKDD, pp 638–647
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI, pp 137–150
Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16
Gufler B, Augsten N, Reiser A, Kemper A (2012) Load balancing in MapReduce based on scalable cardinality estimates. In: ICDE, pp 522–533
Kim H, Lee D (2007) Parallel linkage. In: CIKM, pp 283–292
Kirsten T, Kolb L, Hartung M, Gross A, Köpcke H, Rahm E (2010) Data partitioning for parallel entity matching. In: QDB
Kolb L, Köpcke H, Thor A, Rahm E (2011) Learning-based entity resolution with MapReduce. In: CloudDB, pp 1–6
Kolb L, Thor A, Rahm E (2012) Dedoop: efficient deduplication with Hadoop. Proc VLDB Endow 5(12):1878–1881
Kolb L, Thor A, Rahm E (2012) Don’t match twice: redundancy-free similarity computation with MapReduce. Tech. rep. http://dbs.uni-leipzig.de/de/publication/redfree
Kolb L, Thor A, Rahm E (2012) Load balancing for MapReduce-based entity resolution. In: ICDE, pp 618–629
Kolb L, Thor A, Rahm E (2012) Multi-pass sorted neighborhood blocking with MapReduce. Comput Sci Res Dev 27(1):45–63
Köpcke H, Rahm E (2010) Frameworks for entity matching: a comparison. Data Knowl Eng 69(2):197–210
Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1):484–493
Köpcke H, Thor A, Rahm E (2010) Learning-based approaches for matching web data entities. IEEE Internet Comput 14(4):23–31
Kwon Y, Balazinska M, Howe B, Rolia JA (2010) Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: SoCC, pp 75–86
Kwon Y, Balazinska M, Howe B, Rolia JA (2012) SkewTune: mitigating skew in MapReduce applications. In: SIGMOD conference, pp 25–36
Lange D, Naumann F (2011) Frequency-aware similarity measures: why Arnold Schwarzenegger is always a duplicate. In: CIKM, pp 243–248
Lin J (2009) The curse of Zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Workshop on large-scale distributed systems for information retrieval
McNeill N, Kardes H, Borthwick A (2012) Dynamic record blocking: efficient linking of massive databases in MapReduce. In: QDB
Papadakis G, Ioannou E, Niederée C et al (2011) Eliminating the redundancy in blocking-based entity resolution methods. In: JCDL, pp 85–94
Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using MapReduce. In: SIGMOD conference, pp 495–506
Wang C, Wang J, Lin X, Wang W, Wang H, Li H, Tian W, Xu J, Li R (2010) MapDupReducer: detecting near duplicates over massive datasets. In: SIGMOD conference, pp 1119–1122
Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
Cite this article
Kolb, L., Rahm, E. Parallel Entity Resolution with Dedoop. Datenbank Spektrum 13, 23–32 (2013). https://doi.org/10.1007/s13222-012-0110-x
DOI: https://doi.org/10.1007/s13222-012-0110-x