DistLODStats: Distributed Computation of RDF Dataset Statistics

  • Gezim Sejdiu
  • Ivan Ermilov
  • Jens Lehmann
  • Mohamed Nadjib Mami
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11137)


Over the last years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of the data without a priori statistical information about its internal structure and coverage. There are already a number of tools that offer such statistics, providing basic information about RDF datasets and vocabularies. However, those tools usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software component for statistical calculations on large RDF datasets that scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon the previous centralized approach we compare against and provides approximately linear horizontal scale-up. The approach is extensible beyond the 32 default criteria, is integrated into the larger SANSA framework, and is employed in at least four major usage scenarios beyond the SANSA community.
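To make the idea of a distributed statistical criterion concrete, the following is a minimal sketch in plain Python, not the authors' Spark implementation. It assumes that a criterion can be expressed as a per-partition filter-and-count step followed by a merge of partial results. The toy triples, the two example criteria (class usage and property usage), and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:acme", "rdf:type", "foaf:Organization"),
]


def partition(items, n):
    """Split items into n chunks, standing in for partitions on a cluster."""
    return [items[i::n] for i in range(n)]


def class_usage_local(chunk):
    """Per-partition step: count the objects of rdf:type triples."""
    return Counter(o for (_, p, o) in chunk if p == "rdf:type")


def property_usage_local(chunk):
    """Per-partition step: count how often each predicate occurs."""
    return Counter(p for (_, p, _) in chunk)


def run(criterion, triples, n_partitions=2):
    """Apply a criterion to every partition, then merge the partial counts."""
    partials = [criterion(chunk) for chunk in partition(triples, n_partitions)]
    return reduce(lambda a, b: a + b, partials, Counter())


class_usage = run(class_usage_local, TRIPLES)
property_usage = run(property_usage_local, TRIPLES)
```

Because each partial result is a counter and merging is plain addition, the outcome does not depend on how the triples are split, which is the property that lets such a computation scale out across machines.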



This work was partly supported by the EU Horizon2020 projects BigDataEurope (GA no. 644564), QROWD (GA no. 723088), WDAqua (GA no. 642795) and BigDataOcean (GA no. 732310).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Gezim Sejdiu (1, corresponding author)
  • Ivan Ermilov (2)
  • Jens Lehmann (1, 3)
  • Mohamed Nadjib Mami (1, 3)

  1. Smart Data Analytics, University of Bonn, Bonn, Germany
  2. Department of Computer Science, University of Leipzig, Leipzig, Germany
  3. Fraunhofer IAIS, Sankt Augustin, Germany
