Multiple Instance Learning for Group Record Linkage

  • Zhichun Fu
  • Jun Zhou
  • Peter Christen
  • Mac Boot
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7301)


Record linkage is the process of identifying records from different data sources that refer to the same entity. While most research has concentrated on linking individual records, approaches have recently been proposed for linking groups of records across databases. Group record linkage aims to determine whether two groups of records in two databases refer to the same entity. One application where group record linkage is of high importance is the linking of census data that contain household information across time. In this paper we propose a novel method for group record linkage based on multiple instance learning. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag classification to instance classification in order to reconstruct bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of the linked data. We evaluate our method on both synthetic data and real historical census data.
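The bag and instance framing can be illustrated with a minimal sketch. It is not the method of the paper: it assumes the standard multiple instance learning rule that a bag is as positive as its best instance, and it uses a fixed linear scorer in place of a trained classifier. The names `GroupLink`, `instance_score`, `bag_score` and `resolve_multiple_links`, the weights, the threshold, and the household identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One instance = similarity features for one candidate record pair
# (for example name and age similarity). Values here are made up.
Instance = Tuple[float, ...]


@dataclass
class GroupLink:
    """A candidate link between two groups, i.e. a bag of instances."""
    source_group: str
    target_group: str
    instances: List[Instance]  # assumed non-empty


def instance_score(x: Instance, weights: Tuple[float, ...]) -> float:
    """Illustrative linear scorer; the weights are assumed, not learned."""
    return sum(w * v for w, v in zip(weights, x))


def bag_score(link: GroupLink, weights: Tuple[float, ...]) -> float:
    """Standard MIL assumption: score a bag by its best-scoring instance."""
    return max(instance_score(x, weights) for x in link.instances)


def resolve_multiple_links(
    links: List[GroupLink],
    weights: Tuple[float, ...],
    threshold: float = 0.5,
) -> Dict[str, str]:
    """For each source group keep the best candidate at or above threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for link in links:
        score = bag_score(link, weights)
        if score < threshold:
            continue
        current = best.get(link.source_group)
        if current is None or score > current[1]:
            best[link.source_group] = (link.target_group, score)
    return {src: tgt for src, (tgt, _) in best.items()}


# Hypothetical candidates: household "old_h1" has two competing group links.
links = [
    GroupLink("old_h1", "new_h7", [(0.9, 0.8), (0.4, 0.3)]),
    GroupLink("old_h1", "new_h9", [(0.5, 0.2), (0.3, 0.1)]),
    GroupLink("old_h2", "new_h9", [(0.2, 0.1)]),
]
weights = (0.5, 0.5)
resolved = resolve_multiple_links(links, weights)
```

In this toy input only the link from `old_h1` to `new_h7` scores above the threshold, so the competing link to `new_h9` is discarded. A real system would learn the scorer from labelled bags and would also need a rule for target groups claimed by several source groups.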


Keywords: multiple instance learning, record linkage, entity resolution, instance classification, historical census data





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Zhichun Fu (1)
  • Jun Zhou (1)
  • Peter Christen (1)
  • Mac Boot (2)
  1. Research School of Computer Science, College of Engineering and Computer Science, The Australian National University, Canberra, Australia
  2. Australian Demographic and Social Research Institute, College of Arts and Social Sciences, The Australian National University, Canberra, Australia
