Squirrel – Crawling RDF Knowledge Graphs on the Web

  • Conference paper
The Semantic Web – ISWC 2020 (ISWC 2020)

Abstract

The increasing number of applications relying on knowledge graphs from the Web leads to a heightened need for crawlers to gather such data. Only a limited number of these frameworks are available, and they often come with severe limitations on the type of data they are able to crawl. Hence, they are not suited to certain scenarios of practical relevance. We address this drawback by presenting Squirrel, an open-source distributed crawler for the RDF knowledge graphs on the Web, which supports a wide range of RDF serializations and additional structured and semi-structured data formats. Squirrel is being used in the extension of national data portals in Germany and is available at https://github.com/dice-group/squirrel under a permissive open license.

Notes

  1. See https://lod-cloud.net/ for an example of the growth.

  2. Examples include the European Union at https://ec.europa.eu/digital-single-market/en/open-data and the German Federal Ministry of Transport and Digital Infrastructure with data at https://www.mcloud.de/.

  3. See, e.g., https://www.mdm-portal.de/, where traffic data from the German Federal Ministry of Transport and Digital Infrastructure is made available.

  4. See, e.g., the German mFund funds at http://mfund.de.

  5. Our code is available at https://github.com/dice-group/squirrel and the documentation at https://w3id.org/dice-research/squirrel/documentation.

  6. https://www.docker.com/.

  7. https://github.com/ldspider/ldspider.

  8. http://nutch.apache.org/.

  9. The information has been gathered by an analysis of the plugin’s source code.

  10. A brief description of the plugin and its source code can be found at https://issues.apache.org/jira/browse/NUTCH-460.

  11. https://www.limbo-project.org/.

  12. http://projekt-opal.de/projektergebnisse/deliverables/.

  13. See http://projekt-opal.de/en/welcome-project-opal/ and https://www.bmvi.de/SharedDocs/DE/Artikel/DG/mfund-projekte/ope-data-portal-germany-opal.html.

  14. See https://www.limbo-project.org/ and https://www.bmvi.de/SharedDocs/DE/Artikel/DG/mfund-projekte/linked-data-services-for-mobility-limbo.html.

  15. https://spring.io/.

  16. Details about implementing a new fetcher can be found at https://dice-group.github.io/squirrel.github.io/tutorials/fetcher.html.

  17. Details regarding the compressions can be found at https://pkware.cachefly.net/Webdocs/APPNOTE/APPNOTE-6.3.5.TXT, https://www.gnu.org/software/gzip/ and http://sourceware.org/bzip2/, respectively.

  18. https://jena.apache.org.

  19. https://github.com/semarglproject/semargl.

  20. https://jsoup.org/.

  21. Details about implementing a new analyzer can be found at https://dice-group.github.io/squirrel.github.io/tutorials/analyzer.html.

  22. https://www.w3.org/TR/turtle/.

  23. Details about implementing a new sink can be found at https://dice-group.github.io/squirrel.github.io/tutorials/sink.html.

  24. https://github.com/dice-group/orca.

  25. The detailed results can be seen at https://w3id.org/hobbit/experiments#1585403645660,1584545072279,1585230107697,1584962226404,1584962243223,1585574894994,1585574924888,1585532668155,1585574716469.

  26. Detailed results can be found at https://w3id.org/hobbit/experiments#1586886425879,1587151926893,1587284972402,1588111671515,1587121394160,1586886364444,1586424067908,1586374166710,1586374133562.

  27. The details of the hardware setup that underlies the HOBBIT platform can be found at https://hobbit-project.github.io/master#hardware-of-the-cluster.

  28. http://projekt-opal.de/.

  29. https://dice-group.github.io/squirrel.github.io/tutorials.html.

Acknowledgments

This work has been supported by the BMVI (Bundesministerium für Verkehr und digitale Infrastruktur) projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).

Author information

Corresponding author

Correspondence to Michael Röder.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Röder, M., de Souza Jr, G., Ngomo, AC.N. (2020). Squirrel – Crawling RDF Knowledge Graphs on the Web. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_3

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8

  • eBook Packages: Computer Science, Computer Science (R0)
