Sorted Neighborhood for Schema-Free RDF Data

  • Mayank Kejriwal
  • Daniel P. Miranker
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9341)

Abstract

Entity Resolution (ER) concerns identifying pairs of entities that refer to the same underlying entity. To avoid \(O(n^2)\) pairwise comparison of \(n\) entities, blocking methods are used. Sorted Neighborhood is an established blocking method for Relational Databases. It has not been applied to schema-free Resource Description Framework (RDF) data sources widely prevalent in the Linked Data ecosystem. This paper presents a Sorted Neighborhood workflow that may be applied to schema-free RDF data. The workflow is modular and makes minimal assumptions about its inputs. Empirical evaluations of the proposed algorithm on five real-world benchmarks demonstrate its utility compared to two state-of-the-art blocking baselines.
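To make the blocking idea concrete, the following is a minimal sketch of the classic Sorted Neighborhood method as it is commonly described for flat, relational-style records: sort the records by a blocking key, slide a fixed-size window over the sorted list, and compare only records that share a window. It is not the RDF workflow proposed in the paper, whose details are not given in this abstract. The function name `sorted_neighborhood_pairs`, the lower-cased-name blocking key, and the sample records are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sorted_neighborhood_pairs(
    records: Sequence[T],
    key: Callable[[T], str],
    window: int,
) -> List[Tuple[T, T]]:
    """Return candidate pairs using a sliding window over key-sorted records.

    Hypothetical helper for illustration: each record is paired with the
    next (window - 1) records in sorted order, so the number of candidate
    pairs grows roughly as n * (window - 1) rather than n * (n - 1) / 2.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to produce any pairs")

    # Step 1: sort by the blocking key so that likely duplicates end up adjacent.
    ordered = sorted(records, key=key)

    # Step 2: slide the window and emit every pair that falls inside it.
    pairs: List[Tuple[T, T]] = []
    for i, record in enumerate(ordered):
        upper = min(i + window, len(ordered))
        for j in range(i + 1, upper):
            pairs.append((record, ordered[j]))
    return pairs


# Illustrative data (assumed, not taken from the paper's benchmarks).
names = ["Jon Smith", "John Smith", "Mary Jones", "Marie Jones", "Alan Turing"]
candidates = sorted_neighborhood_pairs(names, key=str.lower, window=2)
for left, right in candidates:
    print(left, "<->", right)
```

With five records and a window of 2, the sketch yields 4 candidate pairs instead of the 10 required by exhaustive pairwise comparison. Both the quality of the blocking key and the window size determine whether true duplicates land in the same window, which is why the choice of key matters when no schema is available.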

Keywords

Entity resolution, Sorted neighborhood, Schema-free RDF

Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  1. University of Texas at Austin, Austin, USA
