Incremental Data Partitioning of RDF Data in SPARK
Significant effort has recently been devoted to architectures for storing and querying RDF data in distributed environments. Several approaches focus on data partitioning, so that queries can be answered efficiently using only a small number of computational nodes. However, such approaches produce static data partitions. Given the continuous and rapid flow of data, there is now a clear need to handle streaming data. In this work, we propose a framework for incremental data partitioning that exploits machine learning techniques. Specifically, we present a method to learn the structure of a partitioned database, and we employ two machine learning algorithms, namely Logistic Regression and Random Forest, to classify new streaming data.
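The idea of learning the structure of an existing partitioning and then classifying incoming triples can be illustrated with a minimal sketch. It is not the implementation used in this work: the feature encoding (one binary feature per role and term), the toy triples, the hyperparameters, and the helper names `features`, `train`, and `assign` are all assumptions made for illustration. It uses plain Python rather than Spark, and it shows only a one-vs-rest logistic regression, not Random Forest.

```python
import math
from collections import defaultdict

def features(triple):
    """Assumed encoding: one binary feature per (role, term) pair."""
    s, p, o = triple
    return [("s", s), ("p", p), ("o", o)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled_triples, epochs=50, lr=0.5):
    """One-vs-rest logistic regression fitted with plain SGD.

    labeled_triples: list of ((s, p, o), partition_id) pairs taken
    from an already partitioned dataset.
    """
    partitions = sorted({part for _, part in labeled_triples})
    weights = {part: defaultdict(float) for part in partitions}
    for _ in range(epochs):
        for triple, label in labeled_triples:
            feats = features(triple)
            for part in partitions:
                w = weights[part]
                pred = sigmoid(w["bias"] + sum(w[f] for f in feats))
                err = (1.0 if part == label else 0.0) - pred
                w["bias"] += lr * err
                for f in feats:
                    w[f] += lr * err
    return weights

def assign(weights, triple):
    """Return the partition whose classifier scores the triple highest."""
    feats = features(triple)

    def score(part):
        w = weights[part]
        return w.get("bias", 0.0) + sum(w.get(f, 0.0) for f in feats)

    return max(sorted(weights), key=score)

# Toy "partitioned database" (hypothetical data): partition 0 holds
# social triples, partition 1 holds publication triples.
partitioned = [
    (("ex:alice", "foaf:knows", "ex:bob"), 0),
    (("ex:alice", "foaf:name", '"Alice"'), 0),
    (("ex:bob", "foaf:knows", "ex:alice"), 0),
    (("ex:paper1", "dc:creator", "ex:carol"), 1),
    (("ex:paper1", "dc:title", '"RDF"'), 1),
    (("ex:paper2", "dc:creator", "ex:dave"), 1),
]

model = train(partitioned)

# Triples arriving from the stream are routed to a learned partition.
stream = [
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:paper2", "dc:title", '"Spark"'),
]
for triple in stream:
    print(triple, "->", assign(model, triple))
```

In this sketch a new triple is routed to the partition that already holds triples sharing its subject or predicate, which is one plausible reading of "learning the structure" of a partitioning. A distributed setting would presumably replace the hand-written trainer with a library classifier.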