Incremental Data Partitioning of RDF Data in SPARK

  • Giannis Agathangelos
  • Georgia Troullinou
  • Haridimos Kondylakis
  • Kostas Stefanidis
  • Dimitris Plexousakis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11155)


Significant effort has recently been dedicated to developing architectures for storing and querying RDF data in distributed environments. Several approaches focus on data partitioning and can answer queries efficiently using only a small number of computational nodes. However, such approaches produce static data partitions. Given the continuous and rapidly growing flow of data, there is now a clear need to handle streaming data. In this work, we propose a framework for incremental data partitioning that exploits machine learning techniques. Specifically, we present a method to learn the structure of a partitioned database, and we employ two machine learning algorithms, namely Logistic Regression and Random Forest, to classify new streaming data.
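To make the idea concrete, the following is a minimal sketch of classifying an incoming triple into an existing partition with multinomial logistic regression. It is not the implementation described in the paper: it runs in plain Python rather than on Spark, it covers Logistic Regression only, and the feature encoding (one indicator per subject, predicate, and object term), the toy triples, and all function names (`featurize`, `train`, `assign_partition`) are assumptions made purely for illustration.

```python
import math

def featurize(triple, vocab, grow=False):
    """Map a (subject, predicate, object) triple to a list of feature indices.
    Each term is tagged with its role; terms unseen in training are ignored."""
    indices = []
    for role, term in zip("spo", triple):
        key = f"{role}={term}"
        if grow and key not in vocab:
            vocab[key] = len(vocab)
        if key in vocab:
            indices.append(vocab[key])
    return indices

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(labelled_triples, num_partitions, epochs=200, lr=0.5):
    """Fit softmax regression by batch gradient ascent on the log-likelihood.
    labelled_triples: list of (triple, partition_id) taken from the static partitioning."""
    vocab = {}
    data = [(featurize(t, vocab, grow=True), part) for t, part in labelled_triples]
    weights = [[0.0] * len(vocab) for _ in range(num_partitions)]
    bias = [0.0] * num_partitions
    for _ in range(epochs):
        grad_w = [[0.0] * len(vocab) for _ in range(num_partitions)]
        grad_b = [0.0] * num_partitions
        for feats, part in data:
            scores = [bias[k] + sum(weights[k][j] for j in feats)
                      for k in range(num_partitions)]
            probs = softmax(scores)
            for k in range(num_partitions):
                err = (1.0 if k == part else 0.0) - probs[k]
                grad_b[k] += err
                for j in feats:
                    grad_w[k][j] += err
        for k in range(num_partitions):
            bias[k] += lr * grad_b[k] / len(data)
            for j in range(len(vocab)):
                weights[k][j] += lr * grad_w[k][j] / len(data)
    return vocab, weights, bias

def assign_partition(triple, vocab, weights, bias):
    """Return the partition id with the highest score for a new triple."""
    feats = featurize(triple, vocab)
    scores = [bias[k] + sum(weights[k][j] for j in feats) for k in range(len(bias))]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical static partitioning used as training data (partition ids 0 and 1).
partitioned = [
    (("ex:alice", "foaf:knows", "ex:bob"), 0),
    (("ex:bob", "foaf:knows", "ex:carol"), 0),
    (("ex:alice", "foaf:name", "Alice"), 0),
    (("ex:paper1", "dc:creator", "ex:dave"), 1),
    (("ex:paper2", "dc:creator", "ex:erin"), 1),
    (("ex:paper1", "dc:title", "RDF"), 1),
]
vocab, weights, bias = train(partitioned, num_partitions=2)

# Newly arriving triples are routed to the partition the model predicts.
social = assign_partition(("ex:carol", "foaf:knows", "ex:frank"), vocab, weights, bias)
biblio = assign_partition(("ex:paper2", "dc:title", "Spark"), vocab, weights, bias)
```

In this toy setting each new triple is routed to the partition whose training triples share its predicate or subject. A realistic pipeline would need a richer encoding of the learned partition structure and a distributed learner.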



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Giannis Agathangelos (1)
  • Georgia Troullinou (1)
  • Haridimos Kondylakis (1, corresponding author)
  • Kostas Stefanidis (2)
  • Dimitris Plexousakis (1)

  1. FORTH-ICS, Heraklion, Greece
  2. University of Tampere, Tampere, Finland
