Efficient Interactive Training Selection for Large-Scale Entity Resolution

  • Qing Wang
  • Dinusha Vatsalan
  • Peter Christen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9078)


Entity resolution (ER) has widespread applications in many areas, including e-commerce, healthcare, the social sciences, and crime and fraud detection. A crucial step in ER is the accurate classification of pairs of records into matches (assumed to refer to the same entity) and non-matches (assumed to refer to different entities). In most practical ER applications it is difficult and costly to obtain training data of high quality and sufficient size, which impedes the learning of an ER classifier. We tackle this problem using an interactive learning algorithm that exploits the cluster structure in similarity vectors calculated from compared record pairs. We select informative training examples to assess the purity of clusters, and recursively split clusters until clusters pure enough for training are found. We consider two aspects of active learning that are significant in practical applications: a limited budget for the number of manual classifications that can be done, and a noisy oracle where manual labeling might be incorrect. Experiments using several real data sets show that manual labeling efforts can be significantly reduced for training an ER classifier without compromising matching quality.
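
The recursive idea described in the abstract (sample a cluster, ask the oracle, accept the cluster if it looks pure, otherwise split it) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names `split_two` and `select_training`, the seed-based split rule, the random sampling, and the parameters `sample_size` and `min_purity` are all assumptions made for illustration. The oracle is assumed to be a plain callable returning `True` for a match, and label noise is not modeled beyond allowing a purity threshold below 1.

```python
import random


def split_two(vectors, idx):
    """Split a cluster in two around its lowest- and highest-similarity members.

    Hypothetical split rule chosen for brevity; any 2-way clustering would do.
    """
    lo = min(idx, key=lambda i: sum(vectors[i]))
    hi = max(idx, key=lambda i: sum(vectors[i]))

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    left = [i for i in idx
            if dist(vectors[i], vectors[lo]) <= dist(vectors[i], vectors[hi])]
    left_set = set(left)
    right = [i for i in idx if i not in left_set]
    return left, right


def select_training(vectors, oracle, budget, sample_size=5, min_purity=0.9, seed=0):
    """Return ({index: label}, number_of_oracle_queries).

    vectors: list of similarity vectors, one per compared record pair.
    oracle:  callable index -> bool (True = match); assumed interface.
    budget:  maximum number of oracle queries allowed.
    """
    rng = random.Random(seed)
    labels = {}      # labels obtained from the oracle
    training = {}    # labels propagated to members of pure clusters
    queue = [list(range(len(vectors)))]
    while queue:
        cluster = queue.pop()
        sample = rng.sample(cluster, min(sample_size, len(cluster)))
        for i in sample:
            if i not in labels and len(labels) < budget:
                labels[i] = oracle(i)
        seen = [labels[i] for i in sample if i in labels]
        if not seen:
            continue  # budget exhausted before this cluster could be assessed
        matches = sum(seen)
        purity = max(matches, len(seen) - matches) / len(seen)
        if purity >= min_purity:
            majority = matches * 2 >= len(seen)
            for i in cluster:
                training[i] = majority
        else:
            left, right = split_two(vectors, cluster)
            if left and right:  # stop if the cluster cannot be split further
                queue += [left, right]
    return training, len(labels)
```

In this sketch, a cluster that cannot be assessed within the budget, or that is impure and cannot be split further, simply contributes no training examples.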


Data matching, Record linkage, Deduplication, Active learning, Noisy oracle, Hierarchical clustering, Interactive labeling





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Research School of Computer Science, The Australian National University, Canberra, Australia
