Learning URI Selection Criteria to Improve the Crawling of Linked Open Data

  • Hai Huang
  • Fabien Gandon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11503)


As the Web of Linked Open Data grows, the problem of crawling that cloud becomes increasingly important. Unlike general Web crawlers, a Linked Data crawler is selective: it focuses on collecting linked RDF (including RDFa) data on the Web. From the perspectives of throughput and coverage, the key issue for a Linked Data crawler, given a newly discovered and targeted URI, is to decide whether this URI is likely to dereference into an RDF data source, and therefore whether the representation it points to is worth downloading. Current solutions adopt heuristic rules to filter out irrelevant URIs. Unfortunately, heuristics that are too restrictive hamper the coverage of the crawl. In this paper, we propose and compare approaches that learn strategies for crawling Linked Data on the Web by predicting whether a newly discovered URI will lead to an RDF data source. We detail the features used to predict relevance and the methods we evaluated, including a promising adaptation of the FTRL-proximal online learning algorithm. We compare several options through extensive experiments, using existing crawlers as baseline methods, to evaluate their efficacy.
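To make the prediction task concrete, here is a minimal sketch of an online classifier that scores a URI before it is dereferenced. It is not the authors' implementation. It assumes the standard per-coordinate FTRL-Proximal update for logistic regression (McMahan et al., 2013) and, purely for illustration, uses hashed URI tokens as binary features. The class name `FTRLProximal`, the tokenizer, the hash dimension, and all hyperparameter values are hypothetical choices, not taken from the paper.

```python
import math
import re
import zlib


class FTRLProximal:
    """Per-coordinate FTRL-Proximal logistic regression on binary hashed features.

    Illustrative only: the features and hyperparameters are assumptions.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0, dim=2 ** 18):
        self.alpha, self.beta, self.l1, self.l2, self.dim = alpha, beta, l1, l2, dim
        self.z = {}  # per-coordinate adjusted gradient sums
        self.n = {}  # per-coordinate sums of squared gradients

    def features(self, uri):
        # Hypothetical feature map: lowercase alphanumeric tokens of the URI,
        # hashed into a fixed-size index space (deterministic CRC32 hashing).
        tokens = [t for t in re.split(r"[^A-Za-z0-9]+", uri.lower()) if t]
        return sorted({zlib.crc32(t.encode("utf-8")) % self.dim for t in tokens})

    def _weight(self, i):
        # Closed-form weight; L1 regularization keeps small coordinates at zero.
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        rate = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / rate

    @staticmethod
    def _sigmoid(score):
        score = max(min(score, 35.0), -35.0)  # clip to avoid overflow
        return 1.0 / (1.0 + math.exp(-score))

    def predict(self, uri):
        # Estimated probability that the URI dereferences to RDF data.
        return self._sigmoid(sum(self._weight(i) for i in self.features(uri)))

    def update(self, uri, label):
        # label is 1 if the fetched URI yielded RDF, else 0.
        idx = self.features(uri)
        weights = {i: self._weight(i) for i in idx}
        p = self._sigmoid(sum(weights.values()))
        g = p - label  # gradient of the log loss for a binary feature
        for i in idx:
            n_old = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * weights[i]
            self.n[i] = n_old + g * g
        return p
```

In a crawler loop, `predict` would be called on each newly discovered URI to decide whether to fetch it, and `update` would be called once the fetch reveals whether the response actually contained RDF, so the model keeps adapting as the crawl proceeds.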


Keywords: Linked Data, Crawling strategy, Machine learning, Online prediction



This work is supported by the ANSWER project PIA FSN2 N°P159564-2661789/DOS0060094 between Inria and Qwant.



Copyright information

© Springer Nature Switzerland AG 2019

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  1. Inria, Université Côte d’Azur, CNRS, I3S, Sophia Antipolis, France
