Abstract
The increasing number of applications relying on knowledge graphs from the Web leads to a heightened need for crawlers to gather such data. Only a limited number of such crawlers are available, and they often come with severe limitations on the types of data they are able to crawl. Hence, they are not suited to certain scenarios of practical relevance. We address this drawback by presenting Squirrel, an open-source distributed crawler for RDF knowledge graphs on the Web, which supports a wide range of RDF serializations and additional structured and semi-structured data formats. Squirrel is being used in the extension of national data portals in Germany and is available at https://github.com/dice-group/squirrel under a permissive open license.
Notes
1. See https://lod-cloud.net/ for an example of the growth.
2. Examples include the European Union at https://ec.europa.eu/digital-single-market/en/open-data and the German Federal Ministry of Transport and Digital Infrastructure with data at https://www.mcloud.de/.
3. See, e.g., https://www.mdm-portal.de/, where traffic data from the German Federal Ministry of Transport and Digital Infrastructure is made available.
4. See, e.g., the German mFund funds at http://mfund.de.
5. Our code is available at https://github.com/dice-group/squirrel and the documentation at https://w3id.org/dice-research/squirrel/documentation.
9. The information has been gathered by an analysis of the plugin’s source code.
10. A brief description of the plugin and its source code can be found at https://issues.apache.org/jira/browse/NUTCH-460.
16. Details about implementing a new fetcher can be found at https://dice-group.github.io/squirrel.github.io/tutorials/fetcher.html.
17. Details regarding the compressions can be found at https://pkware.cachefly.net/Webdocs/APPNOTE/APPNOTE-6.3.5.TXT, https://www.gnu.org/software/gzip/ and http://sourceware.org/bzip2/, respectively.
21. Details about implementing a new analyzer can be found at https://dice-group.github.io/squirrel.github.io/tutorials/analyzer.html.
23. Details about implementing a new sink can be found at https://dice-group.github.io/squirrel.github.io/tutorials/sink.html.
27. The details of the hardware setup that underlies the HOBBIT platform can be found at https://hobbit-project.github.io/master#hardware-of-the-cluster.
Acknowledgments
This work has been supported by the BMVI (Bundesministerium für Verkehr und digitale Infrastruktur) projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Röder, M., de Souza Jr, G., Ngomo, AC.N. (2020). Squirrel – Crawling RDF Knowledge Graphs on the Web. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_3
DOI: https://doi.org/10.1007/978-3-030-62466-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62465-1
Online ISBN: 978-3-030-62466-8
eBook Packages: Computer Science, Computer Science (R0)