EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming

  • Axel-Cyrille Ngonga Ngomo
  • Klaus Lyko
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7295)


With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Most Link Discovery frameworks implement approaches that require two main computational steps. First, the user has to specify a link specification explicitly. Then, this specification must be executed. While several approaches for the time-efficient execution of link specifications have been developed over the last few years, the discovery of accurate link specifications remains a tedious problem. In this paper, we present EAGLE, an active learning approach based on genetic programming. EAGLE generates highly accurate link specifications while reducing the annotation burden for the user. We evaluate EAGLE against batch learning on three different data sets and show that our algorithm can detect specifications with an F-measure above 90% while requiring only a small number of questions to the user.
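
To make the two steps and the evaluation metric concrete, the following is a minimal sketch and not EAGLE itself. It assumes a link specification can be reduced to a single similarity measure on one property plus a threshold, uses Python's `difflib.SequenceMatcher` as a stand-in similarity, and computes the F-measure of the resulting links against a reference mapping. The function names, the toy data, and the threshold of 0.9 are all illustrative assumptions; the genetic programming and active learning that EAGLE uses to find such specifications are not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def execute_spec(source, target, prop, threshold):
    """Execute a toy link specification: link every pair of resources
    whose similarity on property `prop` reaches `threshold`."""
    return {
        (s_id, t_id)
        for s_id, s in source.items()
        for t_id, t in target.items()
        if similarity(s[prop], t[prop]) >= threshold
    }


def f_measure(predicted, reference):
    """Harmonic mean of precision and recall of predicted links."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data sets and reference mapping.
source = {"s1": {"name": "Leipzig"}, "s2": {"name": "Berlin"}}
target = {"t1": {"name": "leipzig"}, "t2": {"name": "Bern"}}
reference = {("s1", "t1")}

links = execute_spec(source, target, "name", threshold=0.9)
score = f_measure(links, reference)
print(links, score)
```

In this toy setting, lowering the threshold would admit more candidate pairs and typically trade precision for recall; choosing the measure, property, and threshold is the kind of decision a learned link specification has to make.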


Keywords: Active Learning, Genetic Program, Record Linkage, Entity Resolution, Fuzzy Decision Tree



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Axel-Cyrille Ngonga Ngomo (1)
  • Klaus Lyko (1)
  1. Department of Computer Science, University of Leipzig, Leipzig, Germany
