A Machine Learning Approach for Instance Matching Based on Similarity Metrics

  • Shu Rong
  • Xing Niu
  • Evan Wei Xiang
  • Haofen Wang
  • Qiang Yang
  • Yong Yu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7649)


The Linking Open Data (LOD) project is an ongoing effort to construct a global data space, i.e. the Web of Data. One important part of this project is to establish owl:sameAs links among structured data sources. Such links indicate equivalent instances that refer to the same real-world object. The problem of discovering owl:sameAs links between a pair of data sources is called instance matching. Most existing approaches to this problem rely on the quality of prior schema matching, which is not always good enough in the LOD scenario. In this paper, we propose a schema-independent instance-pair similarity metric based on several general descriptive features. We transform the instance matching problem into a binary classification problem and solve it with machine learning algorithms. Furthermore, we employ transfer learning methods that use the existing owl:sameAs links in LOD to reduce the demand for labeled data. We carry out experiments on datasets from OAEI 2010. The results show that our method performs well on real-world LOD data and outperforms the participants of OAEI 2010.
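The idea of casting instance matching as binary classification over schema-independent pair features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not list the actual features or classifiers, so the two Jaccard features, the dictionary representation of an instance, the example data, and the decision-stump learner (`pair_features`, `train_stump`, `predict`) are hypothetical stand-ins rather than the method of the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: property name -> list of literal values.
Instance = Dict[str, List[str]]


def tokens(instance: Instance) -> Set[str]:
    """Collect lowercase tokens from all literal values, ignoring property names."""
    out: Set[str] = set()
    for values in instance.values():
        for value in values:
            out.update(value.lower().split())
    return out


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(x: Instance, y: Instance) -> List[float]:
    """Illustrative schema-independent features: values are compared
    without first aligning the properties of the two sources."""
    values_x = {v.lower() for vals in x.values() for v in vals}
    values_y = {v.lower() for vals in y.values() for v in vals}
    return [jaccard(tokens(x), tokens(y)), jaccard(values_x, values_y)]


def train_stump(X: List[List[float]], labels: List[bool]) -> Tuple[int, float]:
    """Pick the (feature index, threshold) pair with the best training accuracy."""
    best = (0, 0.0, -1)  # feature index, threshold, number of correct predictions
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            correct = sum((row[f] >= t) == lab for row, lab in zip(X, labels))
            if correct > best[2]:
                best = (f, t, correct)
    return best[0], best[1]


def predict(stump: Tuple[int, float], feats: List[float]) -> bool:
    """Classify a pair as a match when the chosen feature reaches the threshold."""
    f, t = stump
    return feats[f] >= t


# Made-up instances from two sources that use different vocabularies.
a = {"rdfs:label": ["Shanghai Jiao Tong University"], "ex:country": ["China"]}
b = {"foaf:name": ["Shanghai Jiao Tong University"], "ex:location": ["Shanghai China"]}
c = {"rdfs:label": ["Hong Kong University of Science and Technology"],
     "ex:country": ["China"]}

# Labeled pairs: True means the two instances would be linked by owl:sameAs.
pairs = [(a, b, True), (a, c, False), (b, c, False)]
X = [pair_features(x, y) for x, y, _ in pairs]
labels = [lab for _, _, lab in pairs]
stump = train_stump(X, labels)
```

Here `predict(stump, pair_features(a, b))` classifies the first pair as a match even though the two descriptions share no property names, which is the point of a schema-independent metric. A real system would use richer features and a stronger classifier than a single threshold.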


Linking Open Data, instance matching, similarity metric, machine learning, transfer learning



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Shu Rong (1)
  • Xing Niu (1)
  • Evan Wei Xiang (2)
  • Haofen Wang (1)
  • Qiang Yang (2)
  • Yong Yu (1)

  1. APEX Data & Knowledge Management Lab, Shanghai Jiao Tong University, China
  2. Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
