Computer Science - Research and Development, Volume 27, Issue 1, pp 45–63

Multi-pass sorted neighborhood blocking with MapReduce

Special Issue Paper

Abstract

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate the challenges of using the MapReduce programming model for parallel entity resolution with Sorted Neighborhood blocking (SN), along with possible solutions. We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.
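
For readers unfamiliar with Sorted Neighborhood blocking, the following is a minimal sequential sketch of the general technique, not the MapReduce implementations proposed in the paper. Records are sorted by a blocking key, and only records within a sliding window of size w are paired as match candidates; a multi-pass variant repeats this with several keys and takes the union of the candidates. The record layout, the function names `sorted_neighborhood_pairs` and `multi_pass_pairs`, and the two example blocking keys are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Record = Tuple[int, str]  # (record id, name): hypothetical record layout
Pair = Tuple[int, int]


def sorted_neighborhood_pairs(
    records: Iterable[Record],
    key: Callable[[Record], str],
    window: int,
) -> Set[Pair]:
    """Candidate id pairs whose records lie within `window` positions
    of each other after sorting by the blocking key."""
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered: List[Record] = sorted(records, key=key)
    pairs: Set[Pair] = set()
    for i, (left_id, _) in enumerate(ordered):
        # Pair each record only with the next (window - 1) records.
        for right_id, _ in ordered[i + 1 : i + window]:
            pairs.add((min(left_id, right_id), max(left_id, right_id)))
    return pairs


def multi_pass_pairs(
    records: Iterable[Record],
    keys: Iterable[Callable[[Record], str]],
    window: int,
) -> Set[Pair]:
    """Union of the candidate pairs from one pass per blocking key."""
    materialized = list(records)
    result: Set[Pair] = set()
    for key in keys:
        result |= sorted_neighborhood_pairs(materialized, key, window)
    return result


# Illustrative data and blocking keys (assumptions, not from the paper).
records = [
    (1, "smith john"),
    (2, "smyth john"),
    (3, "jones mary"),
    (4, "john smith"),
]
prefix_key = lambda r: r[1][:3]                       # first three characters
token_key = lambda r: " ".join(sorted(r[1].split()))  # order-insensitive name

single = sorted_neighborhood_pairs(records, prefix_key, window=2)
multi = multi_pass_pairs(records, [prefix_key, token_key], window=2)
```

In this toy data, records 1 and 4 describe the same person with the name parts swapped. The prefix key sorts them far apart, so a single pass with window 2 does not pair them, whereas the second pass with the token-sorted key does. This is the usual motivation for multi-pass blocking: each extra key costs additional comparisons but can recover matches that one sort order misses.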

Keywords

MapReduce, Hadoop, Cloud computing, Entity resolution, Blocking, Sorted Neighborhood

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, Leipzig, Germany