It Pays to Be Certain: Unsupervised Record Linkage via Ambiguity Minimization

  • Anna Jurek
  • Deepak P.
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10939)

Abstract

Record linkage (RL) is the process of identifying records that refer to the same real-world entity. Many existing approaches to RL apply supervised machine learning (ML) techniques to generate a classification model that classifies a pair of records as either linked or non-linked. In such techniques, the labeled data helps guide the choice and relative importance of the similarity measures to be employed in RL. Unsupervised RL is therefore a more challenging problem, since the quality of similarity measures needs to be estimated in the absence of linkage labels. In this paper we propose a novel optimization approach to unsupervised RL. We define a scoring technique that aggregates similarities between two records along all attributes and all available similarity measures using a weighted sum formulation. The core idea behind our method is embodied in an objective function representing the overall ambiguity of the scoring across a dataset. Our goal is to iteratively optimize the objective function to progressively refine estimates of the scoring weights in the direction of lower overall ambiguity. We have evaluated our approach on multiple real-world datasets that are commonly used in the RL community. Our experimental results show that our proposed approach outperforms state-of-the-art techniques, while being orders of magnitude faster.
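
The abstract describes the method only at a high level, so the following is a minimal sketch of the general idea rather than the authors' formulation. It assumes each record pair is represented by a vector of similarity values in [0, 1] (one per attribute and similarity measure), that the score is a weighted sum with non-negative weights summing to 1, and that ambiguity is measured by the hypothetical function s(1 - s), which peaks at s = 0.5 and vanishes at 0 and 1. The function names, the ambiguity function, and the update rule (a normalized gradient step that is accepted only when it lowers the objective) are all illustrative assumptions.

```python
from typing import List, Sequence


def score(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the similarity values of one record pair."""
    return sum(w * s for w, s in zip(weights, sims))


def ambiguity(s: float) -> float:
    """Assumed ambiguity of a score in [0, 1]: maximal at 0.5, zero at 0 and 1."""
    return s * (1.0 - s)


def total_ambiguity(pairs: Sequence[Sequence[float]], weights: Sequence[float]) -> float:
    """Overall ambiguity of the scoring across all record pairs."""
    return sum(ambiguity(score(p, weights)) for p in pairs)


def normalize(weights: Sequence[float]) -> List[float]:
    """Clip negative weights to zero and rescale so the weights sum to 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)
    return [w / total for w in clipped]


def refine_weights(pairs: Sequence[Sequence[float]], weights: Sequence[float],
                   lr: float = 0.1, steps: int = 200) -> List[float]:
    """Iteratively move the weights toward lower overall ambiguity.

    A step is kept only if it reduces the objective; otherwise the
    step size is halved, so the objective never increases.
    """
    current = normalize(weights)
    best = total_ambiguity(pairs, current)
    for _ in range(steps):
        # Gradient of sum_i s_i * (1 - s_i) with respect to each weight.
        grad = [0.0] * len(current)
        for p in pairs:
            s = score(p, current)
            for k, sim in enumerate(p):
                grad[k] += (1.0 - 2.0 * s) * sim
        candidate = normalize([w - lr * g for w, g in zip(current, grad)])
        value = total_ambiguity(pairs, candidate)
        if value < best:
            current, best = candidate, value
        else:
            lr *= 0.5
    return current


# Toy data (made up): the first measure separates pairs clearly, the second does not.
pairs = [[0.9, 0.6], [0.1, 0.5], [0.95, 0.4], [0.05, 0.55]]
refined = refine_weights(pairs, [0.5, 0.5])
```

On this toy input the weight of the first, more discriminative measure grows, since scores driven by it sit closer to 0 or 1. A real system would also need blocking to limit the number of candidate pairs and a rule for turning scores into link decisions, neither of which is covered here.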

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Queen’s University Belfast, Belfast, UK
