Dragon: Decision Tree Learning for Link Discovery

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11496)


The provision of links across RDF knowledge bases is regarded as fundamental to ensuring that knowledge bases can be used jointly to address the real-world needs of applications. The growth of knowledge bases, in both number and size, demands time-efficient and accurate approaches for computing such links. This is generally done with the aid of machine learning approaches such as decision trees. While decision trees are known to be fast, they are generally outperformed in the link discovery task by the state of the art in terms of quality, i.e., F-measure. In this work, we present Dragon, a fast decision-tree-based approach that is both efficient and accurate. We evaluated our approach by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with respect to its F-measure while being 18 times faster on average than existing algorithms for link discovery on RDF knowledge bases. Furthermore, we investigate why Dragon significantly outperforms J48 in terms of link accuracy. We provide an open-source implementation of our algorithm in the LIMES framework.
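Since the abstract measures link quality by F-measure, a minimal sketch of that metric may help. It assumes links are represented as (source, target) pairs and that a reference mapping is available; the function name `f_measure` and the example URIs are illustrative and not taken from the paper or from LIMES.

```python
def f_measure(predicted, reference):
    """Return (precision, recall, F1) for two collections of (source, target) links."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference mapping and predicted links (made-up identifiers).
reference = {("a:1", "b:1"), ("a:2", "b:2"), ("a:3", "b:3"), ("a:4", "b:4")}
predicted = {("a:1", "b:1"), ("a:2", "b:2"), ("a:3", "b:9")}

precision, recall, f1 = f_measure(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Here two of the three predicted links are correct and two of the four reference links are found, so precision and recall differ and the F-measure is their harmonic mean.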


Link discovery · Decision trees · Machine learning · Entity resolution · Semantic web



This work has been supported by the BMVI projects LIMBO (project no. 19F2029C) and OPAL (project no. 19F2028A).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Leipzig, Leipzig, Germany
  2. University of Paderborn, Paderborn, Germany
