Dragon: Decision Tree Learning for Link Discovery

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11496)

Abstract

The provision of links across RDF knowledge bases is regarded as fundamental to ensure that knowledge bases can be used jointly to address real-world needs of applications. The growth of knowledge bases, with respect to both their number and their size, demands the development of time-efficient and accurate approaches for the computation of such links. This is generally done with the aid of machine learning approaches such as decision trees. While decision trees are known to be fast, they are generally outperformed in the link discovery task by the state of the art in terms of quality, i.e., F-measure. In this work, we present Dragon, a fast decision-tree-based approach that is both efficient and accurate. Our approach was evaluated by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with respect to its F-measure while being 18 times faster on average than existing algorithms for link discovery on RDF knowledge bases. Furthermore, we investigate why Dragon significantly outperforms J48 in terms of link accuracy. We provide an open-source implementation of our algorithm in the LIMES framework.

Keywords

Link discovery, Decision trees, Machine learning, Entity resolution, Semantic web

Notes

Acknowledgements

This work has been supported by the BMVI projects LIMBO (project no. 19F2029C) and OPAL (project no. 19F2028A).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Leipzig, Leipzig, Germany
  2. University of Paderborn, Paderborn, Germany
