Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 379)


Entity resolution (ER) is the task of identifying records that refer to the same real-world entity; it is also known as data object matching or deduplication. It has been a leading research topic in the field of structured databases. Because of its significance, entity resolution remains one of the most important challenges for heterogeneous distributed databases. Several methods have been proposed for entity resolution, but they have yielded unsatisfactory results. In this paper, we propose an efficient integrated solution to the entity resolution problem in heterogeneous distributed databases that combines Markov logic with the Jaccard similarity coefficient. The approach we have implemented achieves an overall success rate of about 98%, outperforming the previously implemented algorithms.
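For readers unfamiliar with the measure, the Jaccard similarity coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below is only an illustration of that definition applied to record matching, not the paper's method: the whitespace tokenizer, the function names, and the 0.5 threshold are all assumptions made for this example, and the Markov logic component is not shown.

```python
def tokenize(text: str) -> set[str]:
    """Split a record string into a set of lowercase word tokens (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity coefficient: |A intersect B| / |A union B|."""
    if not a and not b:
        # Convention chosen here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def same_entity(rec_a: str, rec_b: str, threshold: float = 0.5) -> bool:
    """Flag two records as a candidate match if their token sets are similar enough.

    The threshold is a hypothetical value for illustration only.
    """
    return jaccard(tokenize(rec_a), tokenize(rec_b)) >= threshold
```

With these assumptions, `same_entity("John A Smith", "john smith")` compares the token sets {john, a, smith} and {john, smith}, which share two of three distinct tokens, so the pair is flagged as a candidate match.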


Entity resolution (ER); Distributed database; Jaccard similarity coefficient; Markov logic



Copyright information

© Springer India 2016

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
