Active Learning Based Entity Resolution Using Markov Logic

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9652)

Abstract

Entity resolution is a common data cleaning and data integration problem that involves determining which records in one or more data sets refer to the same real-world entities. It has numerous applications for commercial, academic and government organisations. For most practical entity resolution applications, training data does not exist, which limits the types of classification models that can be applied. This also prevents complex techniques such as Markov logic networks from being used on real-world problems. In this paper we apply an active learning based technique to generate training data for a Markov logic network based entity resolution model and to learn the weights for the formulae in the Markov logic network. We evaluate our technique on real-world data sets and show that we can generate balanced training data and also learn approximate weights for the formulae in the Markov logic network.
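
The general idea of using active learning to build training data for a weighted matching model can be illustrated with a small sketch. This is not the method of the paper, whose details are not given in the abstract. It is a generic pool-based uncertainty-sampling loop in which a logistic model over similarity features stands in for the weighted formulae of a Markov logic network. All names (`match_probability`, `most_uncertain`, `active_learning`, `oracle`), the feature values, the learning rate, the number of epochs and the labelling rule are assumptions made for illustration only.

```python
import math

def match_probability(weights, features):
    """Sigmoid of the weighted sum of similarity features (assumed model)."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def most_uncertain(weights, pool):
    """Index of the candidate pair whose match probability is closest to 0.5."""
    return min(range(len(pool)),
               key=lambda i: abs(match_probability(weights, pool[i]) - 0.5))

def update(weights, features, label, rate=0.5):
    """One gradient-ascent step on the log-likelihood of a single labelled pair."""
    error = label - match_probability(weights, features)
    return [w + rate * error * f for w, f in zip(weights, features)]

def active_learning(pool, oracle, budget, n_features, epochs=20):
    """Query the oracle on the most uncertain pairs and refit after each label."""
    weights = [0.0] * n_features
    pool = list(pool)
    labelled = []
    for _ in range(min(budget, len(pool))):
        features = pool.pop(most_uncertain(weights, pool))
        labelled.append((features, oracle(features)))
        # Replay every label collected so far for a few epochs.
        for _ in range(epochs):
            for f, y in labelled:
                weights = update(weights, f, y)
    return weights, labelled

# Hypothetical candidate record pairs: (bias, name similarity, address similarity).
pool = [
    (1.0, 0.95, 0.90),
    (1.0, 0.90, 0.85),
    (1.0, 0.10, 0.20),
    (1.0, 0.15, 0.05),
    (1.0, 0.45, 0.50),
    (1.0, 0.70, 0.75),
]

def oracle(features):
    """Stand-in for a human labeller: a match if the two similarities sum to >= 1.2."""
    return 1 if features[1] + features[2] >= 1.2 else 0

weights, labelled = active_learning(pool, oracle, budget=4, n_features=3)
print(weights)
print(labelled)
```

Querying the pair closest to the decision boundary tends to pull in examples from both classes early, which is one common way to end up with a more balanced labelled set than random sampling would give when matches are rare.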

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Research School of Computer Science, Australian National University, Canberra, Australia
