Parallel Entity Resolution with Dedoop

Kolb, Lars; Rahm, Erhard

doi:10.1007/s13222-012-0110-x

Schwerpunktbeitrag (focus article)
Published: 29 December 2012

Volume 13, pages 23–32 (2013)

Datenbank-Spektrum

Lars Kolb¹ & Erhard Rahm¹

Abstract

We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
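The general idea of running blocking and matching as a map phase and a reduce phase can be sketched in a few lines of plain Python. This is only an illustration of blocking-based ER in MapReduce style, not Dedoop's implementation: the blocking key (first three letters of a title), the `difflib` similarity measure, the threshold of 0.8, and all function and field names are assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased title.
    return record["title"][:3].lower()


def map_phase(records):
    # Map: emit (blocking key, record) so similar records meet in one block.
    for record in records:
        yield blocking_key(record), record


def reduce_phase(block, threshold):
    # Reduce: compare all pairs within one block and emit likely matches.
    for a, b in combinations(block, 2):
        sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
        if sim >= threshold:
            yield a["id"], b["id"]


def resolve(records, threshold=0.8):
    # Group by key in memory; a real cluster would do this in the shuffle step.
    blocks = defaultdict(list)
    for key, record in map_phase(records):
        blocks[key].append(record)
    matches = []
    for key in sorted(blocks):
        matches.extend(reduce_phase(blocks[key], threshold))
    return matches


records = [
    {"id": 1, "title": "Canon PowerShot A480"},
    {"id": 2, "title": "Canon Powershot A-480"},
    {"id": 3, "title": "Nikon Coolpix L20"},
]
print(resolve(records))
```

Only records that share a blocking key are compared, which is what keeps the number of comparisons below the full Cartesian product. The same structure also shows why load balancing matters: a single large block ends up in one reduce call.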

Notes

http://xmlstar.sourceforge.net/.
Internally, Dedoop prefixes each blocking key with its (zero-padded) pass number to force blocking keys of pass i to be lexicographically smaller than keys of pass j>i. In favor of readability, this has been omitted in the previous sections.
We do not have the perfect match result for this dataset so we could not use it for the evaluation of match quality.
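The note on pass-number prefixes can be made concrete with a small sketch. It assumes passes are numbered from 1 and uses a `.` separator and the helper name `prefixed_key`, none of which come from the article; the point is only that zero-padding makes a plain string sort agree with the numeric pass order.

```python
def prefixed_key(pass_number, blocking_key, num_passes):
    # Pad the pass number to the width of the largest pass number so that
    # string comparison orders keys by pass first, then by blocking key.
    width = len(str(num_passes))
    return f"{pass_number:0{width}d}.{blocking_key}"


# Hypothetical (pass number, blocking key) pairs from a 10-pass strategy.
keys = [(2, "apple"), (10, "aaa"), (1, "zzz")]

padded = sorted(prefixed_key(p, k, num_passes=10) for p, k in keys)
unpadded = sorted(f"{p}.{k}" for p, k in keys)

print(padded)    # keys of pass 1, then pass 2, then pass 10
print(unpadded)  # pass 10 sorts before pass 2 without padding
```

Without the padding, the key of pass 10 would sort before the key of pass 2, because the strings are compared character by character.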

Author information

Authors and Affiliations

Institut für Informatik, Universität Leipzig, PF 100920, 04009, Leipzig, Germany
Lars Kolb & Erhard Rahm

Corresponding author

Correspondence to Lars Kolb.

About this article

Cite this article

Kolb, L., Rahm, E. Parallel Entity Resolution with Dedoop. Datenbank Spektrum 13, 23–32 (2013). https://doi.org/10.1007/s13222-012-0110-x

Received: 01 November 2012
Accepted: 15 December 2012
Published: 29 December 2012
Issue Date: March 2013
DOI: https://doi.org/10.1007/s13222-012-0110-x
