Journal of Intelligent Information Systems, Volume 48, Issue 3, pp 519–551

ScLink: supervised instance matching system for heterogeneous repositories

Abstract

Instance matching is the task of finding co-referent instances, that is, instances that describe the same real-world object, across two different repositories. Heterogeneity, meaning differences in object attributes and repository schemas, is a major challenge for this problem and limits the accuracy of existing solutions. To match the instances of heterogeneous repositories, a matching system can follow a configuration that specifies the equivalent properties, suitable similarity metrics, and other important parameters. This configuration can be created manually or learned automatically. We present ScLink, an instance matching system that generates a configuration automatically. ScLink implements two novel supervised learning algorithms, cLearn and minBlock. cLearn applies an apriori-like heuristic to find the optimal combination of matching properties and similarity metrics. minBlock learns a blocking model that aims to optimally reduce the number of pairwise alignments of instances between the input repositories. In addition, ScLink introduces other techniques that address scalability on large repositories. Experimental results on standard and very large datasets show that minBlock and cLearn are very effective and efficient. cLearn is also significantly better than existing configuration learning algorithms. It drastically boosts the accuracy of ScLink and makes the system outperform state-of-the-art systems, even when trained on a small amount of labeled data.
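To make the idea of a learned configuration concrete, the following is a minimal sketch of a generic apriori-style, level-wise search over combinations of (source property, target property, similarity metric) triples, scored by F1 on labeled pairs. It is not the cLearn algorithm itself, whose details are not given in this abstract. The property names, the two metrics, the averaging of similarities, the fixed threshold of 0.75, and the toy records are all assumptions chosen for illustration.

```python
from itertools import combinations


def exact(a, b):
    """1.0 when the two values are identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Token-set Jaccard similarity, case-insensitive."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def f1_of(config, labeled_pairs, threshold):
    """F1 of a configuration that averages its similarities (an assumption)
    and predicts a match when the mean reaches the threshold."""
    tp = fp = fn = 0
    for src, tgt, is_match in labeled_pairs:
        score = sum(m(src[p], tgt[q]) for p, q, m in config) / len(config)
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def learn_config(candidates, labeled_pairs, threshold=0.75, max_size=3):
    """Level-wise search: only configurations that beat the best F1 seen
    before their level are joined into larger candidate configurations."""
    level = [frozenset([c]) for c in candidates]
    best_cfg, best_f1 = None, 0.0
    size = 1
    while level and size <= max_size:
        bar = best_f1  # best score known before this level is evaluated
        scored = [(f1_of(cfg, labeled_pairs, threshold), cfg) for cfg in level]
        survivors = [cfg for f1, cfg in scored if f1 > bar]
        for f1, cfg in scored:
            if f1 > best_f1:
                best_cfg, best_f1 = cfg, f1
        # Apriori-style join: merge surviving sets that differ in one element.
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size + 1})
        size += 1
    return best_cfg, best_f1


# Hypothetical candidate alignments between two differently named schemas.
NAME_JACCARD = ("name", "label", jaccard)
NAME_EXACT = ("name", "label", exact)
CITY_EXACT = ("city", "location", exact)

# Toy labeled pairs: (source record, target record, is_match).
labeled_pairs = [
    ({"name": "Central Station", "city": "Tokyo"},
     {"label": "central station", "location": "Tokyo"}, True),
    ({"name": "Central Station", "city": "Tokyo"},
     {"label": "Central Station", "location": "Kyoto"}, False),
    ({"name": "City Hall", "city": "Kyoto"},
     {"label": "City Hall", "location": "Kyoto"}, True),
    ({"name": "City Hall", "city": "Kyoto"},
     {"label": "Central Station", "location": "Kyoto"}, False),
]

best_cfg, best_f1 = learn_config(
    [NAME_JACCARD, NAME_EXACT, CITY_EXACT], labeled_pairs)
print(sorted((p, q, m.__name__) for p, q, m in best_cfg), best_f1)
```

On this toy data no single property separates matches from non-matches, so the search settles on the combination of token overlap on names and exact equality on cities. Pruning by improvement over the previous level keeps the candidate space small, at the cost of possibly missing combinations whose parts look weak in isolation.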

Keywords

Instance matching · Blocking · Schema-independent · Supervised · Configuration

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. National Institute of Informatics, Tokyo, Japan
  2. SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
