Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data

Abstract

Record matching refers to identifying pairs of records that relate to the same entity across different data sources. In many data mining applications, record matching has quadratic complexity, because every record may need to be compared with every other. In practice, the number of non-matching record pairs far exceeds the number of matching pairs, which is known as the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, no fast and effective blocking algorithm yet exists for big data. In this paper, we report on using big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed fashion, on partitions of the whole data set. To improve efficiency further, we adopt a probabilistic technique that balances the speed and the effectiveness of our proposed distributed blocking algorithm. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
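To make the role of blocking concrete, the following is a minimal Python sketch of conventional key-based blocking, not the unsupervised, probabilistic algorithm proposed in the paper. The record layout, the `blocking_key` function (first three letters of the surname), and the partitioning scheme are all hypothetical choices made only for illustration. It shows how blocking shrinks the quadratic set of candidate pairs, and what can happen when blocking runs independently on each partition.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key (an assumption): first three letters of the surname.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    """Group records by blocking key and pair records only within a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def partitioned_candidate_pairs(partitions):
    """Run blocking independently on each partition and merge the results."""
    pairs = set()
    for part in partitions:
        pairs |= candidate_pairs(part)
    return pairs


records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Smithe"},
    {"id": 4, "surname": "Jones"},
    {"id": 5, "surname": "Jonson"},
    {"id": 6, "surname": "Brown"},
]

# Without blocking, every record is compared with every other: n(n-1)/2 pairs.
all_pairs = len(records) * (len(records) - 1) // 2
blocked = candidate_pairs(records)

# Blocking each partition on its own is cheaper, but pairs whose records
# land in different partitions are never generated.
interleaved = partitioned_candidate_pairs([records[::2], records[1::2]])

print(all_pairs, sorted(blocked), sorted(interleaved))
```

In this toy data set, blocking cuts 15 possible comparisons down to two candidate pairs. The interleaved partitioning then loses one of them because records 4 and 5 fall into different partitions, which illustrates the kind of trade-off between speed and effectiveness that the abstract mentions.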

Notes

  1. http://nectar.org.au/.

  2. http://dbs.uni-leipzig.de/en/research/projects/object_matching.

Author information

Corresponding author

Correspondence to Daniel Sun.

Cite this article

Dou, C., Cui, Y., Sun, D. et al. Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data. J Supercomput 75, 623–645 (2019). https://doi.org/10.1007/s11227-017-2008-8

Keywords

  • Big data
  • Record matching
  • Blocking
  • Density
  • Parallelisation