A Bag Reconstruction Method for Multiple Instance Classification and Group Record Linkage

  • Zhichun Fu
  • Jun Zhou
  • Furong Peng
  • Peter Christen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7713)


Record linkage is the task of detecting records in several databases that refer to the same entity. This task aims at exploring the relationship between entities, which normally lack common identifiers in heterogeneous datasets. When entities contain multiple relational records, linking them across datasets can be made more accurate by treating the records as groups, which leads to group linking methods. Even so, individual record links may still be needed for the final group linking step. This problem can be solved by multiple instance learning, in which group links are modelled as bags and record links as instances. In this paper, we propose a novel method for instance classification and group record linkage via bag reconstruction from instances. The bag reconstruction is based on modelling the distribution of negative instances in the training bags via kernel density estimation. We evaluate this approach on both synthetic and real-world data. Our results show that the proposed method can outperform several baseline methods.
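
The abstract does not give the details of the bag reconstruction step, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes one-dimensional instances (for example, a similarity score for a candidate record pair), a Gaussian kernel with a hand-picked bandwidth, and a hand-picked density threshold. The names `gaussian_kde`, `reconstruct_bag`, `bandwidth` and `threshold` are hypothetical and chosen for illustration. The sketch estimates the density of instances taken from negative bags, then rebuilds a positive bag by keeping only the instances that look unlikely under that negative density.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian kernels centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def reconstruct_bag(bag, neg_density, threshold):
    """Keep instances whose negative density is below `threshold`.

    Assumption for this sketch: a positive bag must keep at least one
    instance, so fall back to the least negative-looking instance.
    """
    kept = [x for x in bag if neg_density(x) < threshold]
    if not kept:
        kept = [min(bag, key=neg_density)]
    return kept


# Toy data (made up for illustration): each value is one instance.
negative_bags = [[0.10, 0.15, 0.20], [0.05, 0.25, 0.30]]
positive_bag = [0.12, 0.22, 0.90]

# Pool all instances from negative bags and model their distribution.
neg_instances = [x for bag in negative_bags for x in bag]
neg_density = gaussian_kde(neg_instances, bandwidth=0.1)

# Instances close to the negative mass are dropped from the positive bag.
rebuilt = reconstruct_bag(positive_bag, neg_density, threshold=0.5)
print(rebuilt)
```

In this toy setting the instance at 0.90 lies far from every negative instance and is the only one retained. In practice the bandwidth and threshold would need to be selected from data, and real instances would be multi-dimensional feature vectors.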


Keywords: Multiple instance learning; bag reconstruction; instance classification; record linkage; group linkage; historical census data





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Zhichun Fu
    • 1
  • Jun Zhou
    • 2
  • Furong Peng
    • 3
  • Peter Christen
    • 1
  1. Research School of Computer Science, The Australian National University, Canberra, Australia
  2. School of Information and Communication Technology, Griffith University, Nathan, Australia
  3. School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
