
Distributed Graph Analytics with Datalog Queries in Flink

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1281)

Abstract

Large-scale, parallel graph processing has been in demand over the past decade. Succinct program structure and efficient execution are among the essential requirements of graph processing frameworks. In this paper, we present Cog, which executes Datalog programs on the Apache Flink distributed dataflow system. We chose Datalog for its compact program structure and Flink for its efficiency. We implemented a parallel semi-naive evaluation algorithm that exploits Flink’s delta iteration to propagate to subsequent iterations only the tuples that need further processing. Flink’s delta iteration feature reduces the overhead that acyclic dataflow systems, such as Spark, incur when evaluating recursive queries, and hence makes their evaluation more efficient. Our experiments demonstrate that Cog outperformed BigDatalog, the state-of-the-art distributed Datalog evaluation system, in most of the tests.

Keywords

Datalog · Recursive queries · Graph processing · Cyclic dataflows

1 Introduction

Graphs can represent numerous real-world problems. With the growth of the web and its vast number of users, efficiently processing massive graphs is becoming essential. Efficiency can be achieved by scaling out computations to a cluster and thereby reducing computation times. Existing state-of-the-art systems that use Datalog as their language, such as BigDatalog [18] and Myria [20], suffer from either significant scheduling overhead or significant shuffling overhead in each iteration of a graph computation.

From the users’ perspective, having concise programming constructs that are easy to learn is also an essential factor to consider. Existing large-scale graph processing systems, such as Gelly [25] of Flink [6] or GraphX [11] of Spark [21], are not concise and require significant effort even for simple analytics. These systems are complex and verbose because their APIs are embedded in general-purpose languages, such as Java or Scala. In contrast, Datalog offers more conciseness [13], i.e., shorter programs, and therefore makes it easier to implement graph analytics or artificial intelligence algorithms.

This paper presents Cog, a Flink-based evaluation system for positive Datalog programs that do not contain aggregates. The core feature of Cog is the efficient evaluation of Datalog’s linear recursive queries by exploiting Flink’s native delta iterations [8]. Flink is particularly suitable for evaluating Datalog programs because of its ability to evaluate iterative algorithms efficiently using cyclic dataflows. Programs ranging from relational queries (e.g., joins, unions, recursive queries) to graph processing algorithms (e.g., transitive closure) can be conveniently implemented in Cog and executed on a cluster in a scalable way.

Contributions. We made the following contributions:
  • We created logical plans for Datalog programs to be executed on a distributed dataflow engine. The logical plan also includes an explicit representation of recursive queries.

  • We implemented a Datalog query execution engine that exploits Flink’s delta iteration feature, which we found to be particularly well-suited for the classic semi-naive Datalog evaluation algorithm.

  • We experimentally confirmed that evaluating recursive queries of Datalog using Flink’s delta iteration performs better than the Spark-based BigDatalog [18] system, which is the state of the art in scalable Datalog execution.

2 Preliminaries

We will now briefly review Datalog and Apache Flink. We will also show why Flink’s delta iteration is suitable to evaluate Datalog programs efficiently.

2.1 Datalog

Datalog [7] is a rule-based query language. Each rule is expressed as a function-free Horn clause, such as h :- \(b_1, \ldots, b_n\), where h is the head predicate of the rule, and each \(b_i\) is a body predicate; the separating comma “,” represents logical AND (\(\wedge \)). A predicate is also known as a relation, and a fact is a tuple in a relation. A Datalog rule is recursive if the head predicate of the rule also appears in the body of the rule. After evaluating all body predicates, the produced facts are assigned to the head predicate of the rule. A relation that comes into existence as a result of rule execution is called an intensional database (IDB) relation, whereas a stored relation is called an extensional database (EDB) relation. The transitive closure (TC) program in Datalog is given in Listing 1 as an example. In the example, the predicate arc is an EDB relation, whereas the predicate tc is an IDB relation. Rule \(r_2\) is a recursive rule, as it has the predicate tc in both its head and its body. Rule \(r_2\) joins the tc and arc predicates, and the resulting facts are assigned to the head predicate tc.
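Since Listing 1 is not reproduced here, the following is a reconstruction of the TC program based on the description above, embedded as a Java string constant for consistency with the later code sketches; Cog’s exact surface syntax may differ slightly.

public class Listing1Sketch {
  // Reconstruction of Listing 1 (transitive closure in Datalog); not Cog's verbatim listing.
  static final String TC_PROGRAM =
      "tc(X, Y) :- arc(X, Y).\n"             // r1: every arc is in the transitive closure
    + "tc(X, Y) :- tc(X, Z), arc(Z, Y).\n";  // r2: extend an existing tc path by one arc
}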

2.2 Apache Flink

Apache Flink [6] is a distributed dataflow system. While nowadays Flink is mostly known for efficient stream processing, it initially focused on iterative dataflows in batch computations [8]. In this paper, we rely only on its batch-processing capabilities for translating Datalog programs to iterative dataflows.

Flink’s batch API is centered around the DataSet class, which represents a scalable collection of tuples. DataSets offer numerous data processing operators (such as map, filter, join), which create new DataSets. From a Flink program written using DataSet operators, Flink creates a dataflow job, a directed graph where nodes represent data processing operators and edges represent data transfers. Flink executes these dataflow jobs in a scalable way, by parallelizing the execution of each dataflow node on the available worker machines in a cluster. Flink executes all operators lazily, i.e., the operator is first only added to the dataflow job as a node, and then later executed as part of the dataflow job execution. The dataflow job is executed when Flink encounters an action operator (such as counting the elements in a DataSet, or printing its elements), or when the user explicitly triggers the execution of the dataflow job that was built up so far. Flink provides libraries and APIs to perform relational querying, graph processing [25], and machine learning.
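As an illustration of this programming model (a minimal sketch, not taken from Cog’s code base), the following program builds a small DataSet pipeline; the transformation operators only add nodes to the dataflow job, and print() is the action that triggers its execution.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class DataSetApiSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // A small DataSet of graph edges (source, target).
    DataSet<Tuple2<Long, Long>> arc = env.fromElements(
        new Tuple2<>(1L, 2L), new Tuple2<>(2L, 3L), new Tuple2<>(3L, 4L));

    // Operators only add nodes to the dataflow job; nothing is executed yet.
    DataSet<Long> sources = arc
        .map(new MapFunction<Tuple2<Long, Long>, Long>() {
          @Override
          public Long map(Tuple2<Long, Long> edge) {
            return edge.f0;
          }
        })
        .distinct();

    // print() is an action: it triggers execution of the dataflow job built so far.
    sources.print();
  }
}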

Iteration APIs. Flink supports two types of iterations: bulk and delta. Bulk iterations are general-purpose iterations, where the result of each iteration is a completely new solution set computed from the previous iteration’s solution set [8]. Delta iterations, on the other hand, are a form of incremental iteration, suitable for iterative algorithms with sparse computational dependencies, i.e., where each iteration’s result differs only partially from the previous iteration’s. In the context of Datalog evaluation, the semantics of delta iteration match well with the principles of the classic semi-naive evaluation algorithm [3], making it suitable for executing recursive Datalog programs: applying a recursive Datalog rule once often adds only a small number of tuples compared to the total result size.
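For illustration, the following sketch computes transitive closure with a bulk iteration in the DataSet API; it rebuilds the complete path set in every round, which is exactly the redundancy that delta iterations and semi-naive evaluation avoid (a delta-iteration version appears in Sect. 3.2). The class and variable names are illustrative.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class BulkIterationSketch {

  /** Naive transitive closure as a bulk iteration: the full result is recomputed each round. */
  public static DataSet<Tuple2<Long, Long>> naiveTc(
      DataSet<Tuple2<Long, Long>> arc, int maxIterations) {

    IterativeDataSet<Tuple2<Long, Long>> paths = arc.iterate(maxIterations);

    // Extend every known path by one arc, add the previous paths, and de-duplicate.
    DataSet<Tuple2<Long, Long>> nextPaths = paths
        .join(arc).where(1).equalTo(0)
        .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
          @Override
          public Tuple2<Long, Long> join(Tuple2<Long, Long> path, Tuple2<Long, Long> a) {
            return new Tuple2<>(path.f0, a.f1);
          }
        })
        .union(paths)
        .distinct();

    return paths.closeWith(nextPaths);
  }
}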

Iteration Execution in Cyclic Dataflow Jobs. Flink executes iterative programs written using the above iteration APIs in a single, cyclic dataflow job, i.e., where an iteration’s result is fed back as the next iteration’s input through a backwards dataflow edge. This is in contrast to many other dataflow systems, such as Apache Spark [21], which execute iterative programs as a series of acyclic dataflow jobs. Flink’s cyclic dataflows are more efficient for several reasons:
  • Having a single dataflow job for all iterations avoids the inherent overhead of launching a dataflow job on a cluster of machines. The main overhead of launching a job is the (centralized) scheduling of the constituent tasks of the job to a large number of machines.

  • Operator lifespans can be extended to all iterations (in Spark, by contrast, new operators are launched for each iteration). This enables Flink to naturally perform two optimizations:

      • In the case of a delta iteration, Flink can keep the solution set in the state of an operator that spans all iterations. Thereby, the solution set does not need to be rebuilt from scratch for each iteration; instead, small changes can be efficiently accommodated by modifying the existing operator state.

      • Loop-invariant datasets, i.e., datasets that are reused without changes in each iteration (e.g., arc in Listing 1), can be handled more efficiently. For example, when one input of an equi-join is a loop-invariant dataset, the join operator can build a hash table of the loop-invariant input only once and just probe that same hash table in every iteration.

3 Cog

In this section, we discuss Cog, our system that executes Datalog programs on Flink. We implemented positive Datalog without aggregation. Cog parses a Datalog program, converts the parsed program to an intermediate representation, creates and optimizes a logical plan, and finally creates a Flink plan for execution. Listing 2 shows an example for writing Datalog programs in Cog.

3.1 Query Representation and Planning

Query Representation. A parsed Datalog program is represented in the form of a predicate connection graph (PCG) [2]. Figure 1 shows the PCG for the TC query as an example. A PCG is an annotated AND/OR tree, i.e., it has alternating levels of AND and OR nodes. The AND nodes represent head predicates of rules, and the OR nodes represent body predicates of rules. The root and the leaves are always OR nodes. The root of the tree represents the query predicate.
Fig. 1. Predicate Connection Graph (PCG) for the Transitive Closure (TC) query.

Logical Plan. We used the algebra module of Apache Calcite [5] to represent logical plans. Calcite provides numerous operators (such as join, project, union) to represent a query algebra. For evaluating recursive Datalog queries, the most important operator is the repeat union operator. The repeat union operator has two child nodes: seed and iterative. The seed node represents facts generated by the non-recursive rule(s), whereas the iterative node represents facts generated by the recursive rule(s). The semantics of the repeat union operator are as follows: it first evaluates the seed node, whose result becomes the input of the first iteration; then, it repeatedly evaluates the iterative node, using the previous iteration’s result as input. The evaluation terminates when the result does not change between two iterations. Figure 2 shows the logical plan created for the TC program given in Listing 1. The Calcite-based logical plans are then transformed into Flink’s own logical plans and then into Flink’s DataSet-based plans. During these transformations, standard relational optimizations are also performed.
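To make the plan structure concrete, the following sketch assembles an equivalent plan with Calcite’s RelBuilder, assuming a Calcite version that provides the transientScan and repeatUnion builder methods and a schema that exposes the arc relation; it is an illustration, not Cog’s actual planner code.

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

public class RepeatUnionPlanSketch {

  /** Builds a logical plan for the TC program, shaped like Fig. 2, from a configured RelBuilder. */
  public static RelNode tcPlan(RelBuilder b) {
    return b
        // Seed (r1): tc(X, Y) :- arc(X, Y).
        .scan("arc")
        // Iterative part (r2): scan the transient "tc" table (tuples produced so far)
        // and join it with arc on tc.Y = arc.X.
        .transientScan("tc")
        .scan("arc")
        .join(JoinRelType.INNER,
            b.call(SqlStdOperatorTable.EQUALS, b.field(2, 0, 1), b.field(2, 1, 0)))
        .project(b.field(0), b.field(3))  // project (tc.X, arc.Y)
        // Repeat union with set semantics: iterate until no new tuples are produced.
        .repeatUnion("tc", false)
        .build();
  }
}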
Fig. 2. Cog logical plan for the Transitive Closure (TC) query.

Flink Plan. The optimized logical plans are translated into Flink’s DataSet-based plans. We utilized existing Flink DataSet operators for scans, joins, unions, filters, and projections. However, we implemented a translation from the repeat union and transient table scan operators to Flink DataSet operators to enable the execution of recursive queries, which we discuss in the next subsection.

3.2 Semi-Naive Evaluation in the Flink DataSet API

Semi-naive evaluation [3] is an efficient way to evaluate Datalog programs. With this technique, each iteration processes only the tuples that were produced by the previous iteration, and thus redundant work is eliminated. The final result is obtained by the union of the results produced by each iteration. Algorithm 1 shows the pseudocode of semi-naive evaluation. In the algorithm, seed represents the non-recursive rule(s) (e.g., \(r_1\) in TC), whereas recursive represents one execution of the recursive rule(s) (e.g., \(r_2\) in TC). W represents the differential that is calculated in each iteration, and S stores the final result at the end.
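Since Algorithm 1 is not reproduced here, the following single-machine sketch captures its structure; seed() and recursive() are placeholders standing for the evaluation of the non-recursive rule(s) and of one application of the recursive rule(s), and Tuple is a placeholder element type.

import java.util.HashSet;
import java.util.Set;

/** Single-machine sketch of semi-naive evaluation (the structure of Algorithm 1). */
public abstract class SemiNaiveSketch<Tuple> {

  // Placeholders for rule evaluation: the non-recursive rule(s), and one application
  // of the recursive rule(s) on a given differential.
  abstract Set<Tuple> seed();
  abstract Set<Tuple> recursive(Set<Tuple> differential);

  public Set<Tuple> evaluate() {
    Set<Tuple> S = new HashSet<>(seed());  // S: the accumulated result
    Set<Tuple> W = new HashSet<>(seed());  // W: the differential from the previous iteration
    while (!W.isEmpty()) {
      Set<Tuple> D = new HashSet<>(recursive(W));  // apply recursive rule(s) to W only
      D.removeAll(S);                              // keep only tuples not derived before
      S.addAll(D);                                 // accumulate the new tuples
      W = D;                                       // only new tuples go to the next iteration
    }
    return S;
  }
}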

Compare Algorithm 1 with Algorithm 2, which shows the general template of a Flink delta iteration. There is an initial solution set (S) and an initial workset (W), and then each iteration first computes a differential (D), which is merged into the solution set (Line 7), and also computes the workset for the next iteration. Note that merging into the solution set is denoted by \(\uplus \): elements that do not yet appear in the solution set are added, and an element that has the same key as an element already in the solution set overrides the old element, i.e., \(S \uplus D = D \cup \{s \in S \mid \forall d \in D: key(d) \ne key(s)\}\). We can see that with the following mapping, a Flink delta iteration performs exactly the semi-naive evaluation of a Datalog query:

\(S=seed\); \(W=seed\); \(u(S,W)=recursive(W)-S\); \(\delta (D,S,W)=D\); \(key(x)=x\). Note that by choosing the key to be the entire tuple, we make \(\uplus \) behave as a standard union.

When translating from Cog logical plans, we implement semi-naive evaluation by translating the repeat union operator to DataSet operators. Listing 3 presents this translation. We use a CoGroup operation to compute which of the tuples created in the current iteration are not yet in the solution set; the same CoGroup operation also eliminates duplicates. The workset propagates the differential to the next iteration, and the solution set accumulates the output of all iterations. Both the workset and the solution set are kept in memory for efficiency. Note that all the created Flink operators are evaluated lazily upon the call of a sink operator. Figure 3 shows the Flink plan for the TC query as an example. The sync task is a special operator inserted by Flink, which waits for all operators in the iteration body to complete one iteration and then signals to the Flink runtime that the next iteration can start.
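Listing 3 is not reproduced here; the following sketch shows how such a translation can look for the TC query when written directly against the public DataSet API (class and variable names are illustrative, not Cog’s generated code).

import java.util.Iterator;

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class RepeatUnionTranslationSketch {

  /** Semi-naive TC evaluation as a Flink delta iteration. */
  public static DataSet<Tuple2<Long, Long>> transitiveClosure(
      DataSet<Tuple2<Long, Long>> arc, int maxIterations) {

    // Seed (r1): tc(X,Y) :- arc(X,Y). It initializes both the solution set and the workset.
    // Keying on both fields (0, 1) makes the solution-set merge behave as a set union.
    DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
        arc.iterateDelta(arc, maxIterations, 0, 1);

    // One application of the recursive rule (r2): tc(X,Y) :- tc(X,Z), arc(Z,Y),
    // applied only to the workset, i.e., to the tuples produced in the previous iteration.
    DataSet<Tuple2<Long, Long>> recursive = iteration.getWorkset()
        .join(arc).where(1).equalTo(0)
        .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
          @Override
          public Tuple2<Long, Long> join(Tuple2<Long, Long> tc, Tuple2<Long, Long> a) {
            return new Tuple2<>(tc.f0, a.f1);
          }
        });

    // CoGroup with the solution set: emit a tuple only if it is not in the solution set yet,
    // and emit it only once (duplicate elimination).
    DataSet<Tuple2<Long, Long>> delta = recursive
        .coGroup(iteration.getSolutionSet()).where(0, 1).equalTo(0, 1)
        .with(new CoGroupFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
          @Override
          public void coGroup(Iterable<Tuple2<Long, Long>> candidates,
                              Iterable<Tuple2<Long, Long>> inSolutionSet,
                              Collector<Tuple2<Long, Long>> out) {
            Iterator<Tuple2<Long, Long>> it = candidates.iterator();
            if (it.hasNext() && !inSolutionSet.iterator().hasNext()) {
              out.collect(it.next());
            }
          }
        });

    // The delta is merged into the solution set and also becomes the next workset.
    // The iteration terminates when the delta (workset) is empty.
    return iteration.closeWith(delta, delta);
  }
}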
Fig. 3. The Flink plan for the Transitive Closure (TC) query. Some operators are omitted or combined for clarity. Note that across all iterations, the Join operator keeps the hash table that it built for the arc dataset.

4 Experiments

4.1 Experimental Setup

Hardware and Software Environment. We performed our experiments on a cluster of 8 nodes connected with Gigabit Ethernet. Each worker node has a 48-core IBM PowerPC CPU. We allocated 48 GB of memory to the Flink and Spark worker processes. We implemented Cog on a current snapshot version of Flink (on top of commit 8f8e358).

Benchmark Programs. Thus far, Cog supports positive Datalog with recursion, but without aggregation. We chose the following benchmark queries:
  • Transitive Closure (TC): Finds all pairs of vertices in a graph that are connected by some path. Listing 1 shows TC in Datalog.

  • Same Generation (SG): Two nodes are in the same generation if and only if they are at the same distance from some common node in the graph. Listing 4 shows the SG program in Datalog. The program finds all pairs of nodes that are in the same generation.
  • Single-Source Reachability: Finds all vertices connected by some path to a given source vertex. Listing 5 shows the Reachability program in Datalog. (A sketch of both programs is given after this list.)
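Since Listings 4 and 5 are not reproduced here, the following sketch gives standard formulations consistent with the descriptions above, again as Java string constants; the paper’s exact listings may differ in predicate names or argument order.

public class BenchmarkProgramsSketch {
  // Same Generation (cf. Listing 4): siblings, and children of same-generation nodes.
  static final String SG_PROGRAM =
      "sg(X, Y) :- arc(P, X), arc(P, Y), X != Y.\n"
    + "sg(X, Y) :- arc(A, X), sg(A, B), arc(B, Y).\n";

  // Single-source Reachability (cf. Listing 5); 977 is the source vertex used in Sect. 4.1.
  static final String REACHABILITY_PROGRAM =
      "reach(Y) :- arc(977, Y).\n"
    + "reach(Y) :- reach(X), arc(X, Y).\n";
}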
Datasets. We used synthetic graph datasets to evaluate and benchmark our system: Tree11, Grid150, and G10K. The same datasets were also used by Shkapsky et al. [18] to compare BigDatalog with the Myria [20] and Distributed SociaLite [17] systems. Table 1 shows the properties of the datasets. These graphs have specific structural properties: Tree11 is a tree with 11 levels, Grid150 is a 151-by-151 grid, and G10K is a 10,000-vertex random graph in which each pair of vertices is connected with probability 0.001. The last three columns of Table 1 show the output sizes produced on these datasets by the benchmark queries. For the Reachability program, we used graph datasets generated with the R-MAT [26] synthetic graph generator with probabilities \(a=0.45, b=0.25, c=0.15, d=0.15\). For all these datasets, we computed Reachability from vertex 977.
Table 1. Input and output sizes, and the number of iterations (in parentheses)

Name        Vertices   Edges     TC                  SG                   Reachability
Tree11      71,391     71,390    805,001 (11)        2,086,271,974 (11)   -
Grid150     22,801     45,300    131,675,775 (299)   2,295,050 (149)      -
G10K        10,000     100,185   100,000,000 (6)     100,000,000 (3)      -
R-MAT-1M    1 M        10 M      -                   -                    523,967 (4)
R-MAT-2M    2 M        20 M      -                   -                    1,047,937 (4)
R-MAT-4M    4 M        40 M      -                   -                    2,095,865 (4)
R-MAT-8M    8 M        80 M      -                   -                    4,191,735 (4)
R-MAT-16M   16 M       160 M     -                   -                    8,383,418 (5)
R-MAT-32M   32 M       320 M     -                   -                    16,767,026 (5)

Fig. 4. Evaluation result comparison using the TC and SG queries.

Fig. 5. Evaluation result comparison using the Reachability query.

4.2 Results

We ran TC, SG, and Reachability in our system and in BigDatalog [18], another state-of-the-art distributed Datalog system. As BigDatalog has already demonstrated its efficiency compared to other distributed Datalog systems (such as Myria [20] and Distributed SociaLite [17]), our purpose here is to show how Cog performs relative to BigDatalog. Figures 4 and 5 show the benchmark comparison of Cog and BigDatalog; we report median values.

TC. We used the query shown in Listing 1 for calculating TC. Cog outperformed BigDatalog for all the graphs. Notably, Cog showed 3x better performance than BigDatalog for Tree11 and Grid150 graphs. BigDatalog suffers from the overhead of scheduling caused by the large number of iterations, whereas no such overhead is present in Cog as it performs iterative programs in a cyclic dataflow job [10]. However, this overhead is negligible when there is only a small number of iterations (see Table 1). Cog can suffer performance loss due to data spilling during the CoGroup operation with the solution set, which is visible in the case of G10K. With default settings, BigDatalog always crashed due to running out of memory as it was caching resilient distributed datasets (RDDs) in memory and clearing lineage in order to avoid stack overflow from long lineages. For TC queries, we disabled such caching of RDDs to avoid crashes.

SG. We used Listing 4 for calculating SG. Cog was 2x faster than BigDatalog for the Grid150 and G10K graphs, even though RDD caching to memory was enabled for BigDatalog. Although the SG program produces only a small number of output rows for Grid150, the result for G10K makes it clear that scheduling overhead is not the only factor behind the slower execution. Cog suffered a performance loss when executing SG on the Tree11 dataset. The reason for this inefficiency is that the CoGroup operation with the solution set becomes slower as the number of records stored in the solution set grows.

Single-Source Reachability. We used Listing 5 for calculating Reachability from a single vertex. Figure 5 shows that Cog outperformed BigDatalog on all the graph instances we used to evaluate Reachability. The difference in performance between Cog and BigDatalog becomes more prominent as the size of the datasets increases. When running Reachability on BigDatalog with the default configuration (e.g., broadcast join), we saw the running time increase by approximately 1.5x for each 2x increase in graph size, even though the scheduling overhead did not increase (the number of iterations for the 1M, 2M, and 4M datasets was 4 in each case). With default settings, BigDatalog crashed for all datasets larger than 4M. We discovered that BigDatalog uses a broadcast join by default, which broadcasts the entire graph to all worker nodes. We believe that this caused the crashes, since a broadcast join works only if one of the inputs is small enough. Therefore, we changed the configuration to use a repartition join instead, which performed slightly faster and was able to process all of our datasets. The running time growth for Cog on all the datasets was small and steady; Cog was 3.4x faster for the largest dataset we tested.

5 Related Work

There is a large body of work discussing efficient Datalog evaluation [4, 9, 19]. In the following discussion we focus on distributed systems.

Distributed Dataflow Systems. Flink [1, 6] is a modern dataflow system for general-purpose data processing that employs the incremental iteration model (specifically, delta iterations) [8]. Spark [21] is a scalable, fault-tolerant, distributed in-memory dataflow engine. In contrast to Flink, it incurs considerable scheduling overhead for iterative jobs, as each iteration is scheduled as a new job. Naiad [15] is a system based on the timely dataflow computational model that supports structured loops for streaming. The iteration mechanism in Naiad is similar to that in Flink; therefore, it would be possible to implement semi-naive Datalog execution also on Naiad, similarly to how we implemented it for Cog. The Differential Datalog [16] system goes in this direction, but it supports only single-machine execution.

Pregel-Like Graph Processing Systems. The think-like-a-vertex paradigm for graph processing was introduced by Pregel [14] and is used in many large-scale graph processing systems, such as GraphX [11], Giraph [23], and Gelly [25]. In contrast to Datalog’s declarative queries, the think-like-a-vertex paradigm provides a stateful computation model. Note that Pregel-like systems usually support deactivating vertices, and thereby support a kind of incrementalization akin to the incremental nature of semi-naive evaluation.

Datalog Evaluation in Distributed Systems. Several systems implemented Datalog to be executed on a cluster of machines. BigDatalog [18] implemented positive Datalog with recursion, non-monotonic aggregations, and aggregation in recursion with monotonic aggregates on Spark. BigDatalog uses a number of clever tricks to overcome some of the limitations of Spark in the area of iterative computations. It added scheduler-aware recursion by adding a specialized Spark stage (FixPointStage) for recursive queries to avoid the job launching overhead. Furthermore, reusing Spark tasks within a FixPointStage eliminates the cost of task scheduling and task creation; however, task reuse can only happen on so-called decomposable Datalog programs, and only if the joins can be implemented by broadcasting instead of repartitioning, which is not the case for large graphs. BigDatalog added specialized SetRDD and AggregateRDD to enable efficient evaluation of recursive queries. BigDatalog also pays special attention to joins with loop-invariant inputs. It avoids repartitioning the static input of the join, as well as rebuilding the join’s hash table at every iteration. However, it does not ensure co-location of the join tasks with the corresponding cached build-side blocks, and thus cannot always avoid a network transfer of the build-side. (RaSQL [12] uses the same techniques plus operator code generation and operator fusion to implement recursive SQL with aggregations on Spark.)

When implementing Cog, we did not need to perform any of the above optimizations, as Flink has built-in support for efficient iterations with cyclic dataflow jobs. Having cyclic dataflow jobs means that all of the issues that BigDatalog’s optimizations are solving either do not even come up (per-iteration job-launching overhead and task-scheduling overhead), or already have simple solutions by keeping operator states across iterations (loop-invariant join inputs, incremental updates to the solution set). Thus, our view is that relying on Flink’s native iterations being implemented as a single, cyclic dataflow job is a more natural way to evaluate Datalog (or recursive SQL) efficiently.

Distributed SociaLite [17] is a system developed for social network analysis. It implements Datalog with recursive monotone aggregate functions using a delta stepping method, and it gives programmers the ability to specify the data distribution. It uses a message-passing mechanism for communication among workers. It shows weaknesses in loading datasets (base relations) and poor shuffling performance on large datasets [18]. Myria [20] is a distributed, shared-nothing execution engine that implements Datalog with recursive monotonic aggregation functions and supports both synchronous and asynchronous iterative models. Myria, however, suffers from shuffling overhead on large datasets and becomes unstable (it often runs out of memory) [18].

GraphRex [22] is a recent distributed graph processing system with a Datalog-like interface. It focuses on making full use of the characteristics of modern data center networks, and thus achieves very high performance in such an environment. In contrast to Cog or BigDatalog, it is a standalone system, not built on an existing dataflow engine, such as Flink or Spark. Note that building on an existing dataflow engine has the advantage that declarative Datalog queries can be seamlessly integrated into larger programs written using the (typically more general) native API of the dataflow engine.

6 Conclusion and Future Work

In this paper, we presented Cog, which is a Datalog language implementation for batch processing tasks on Apache Flink. The main advantages of Cog over other systems from a user perspective are its efficiency and conciseness. Cog executes recursive queries of Datalog as a single, cyclic dataflow job, thus avoiding scheduling overhead that is present in acyclic dataflows. In our experiments, Cog outperformed BigDatalog, a state-of-the-art large-scale Datalog system, in most of the test cases. The code and the latest updates of Cog are available at [24].

Future Work. An implementation of negation, non-monotonic aggregations, and aggregation in recursion for Datalog can be added to the system. Datalog for Flink stream processing tasks can also be implemented to facilitate analytics on real-time datasets. We believe that Datalog’s implementation for Flink stream processing API could surpass Cog’s efficiency because a Flink streaming job would not need a synchronization barrier after each iteration. Another future direction is to add support to Flink for recursive SQL queries, which are similar to recursive Datalog queries. Cog already laid the groundwork for this by translating the recursive logical plans to the Flink DataSet API.


Acknowledgments

This work was funded by the German Ministry for Education and Research as BIFOLD (01IS18025A and 01IS18037A). We would like to thank Jorge-Arnulfo Quiané-Ruiz for helpful comments on a draft of this paper.

References

  1. Alexandrov, A., et al.: The Stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)
  2. Arni, F., Ong, K., Tsur, S., Wang, H., Zaniolo, C.: The deductive database system LDL++. Theory Pract. Log. Program. 3(1), 61–94 (2003)
  3. Bancilhon, F.: Naive evaluation of recursively defined relations. In: Brodie, M.L., Mylopoulos, J. (eds.) On Knowledge Base Management Systems, pp. 165–178. Springer, New York (1986). https://doi.org/10.1007/978-1-4612-4980-1_17
  4. Bancilhon, F., Ramakrishnan, R.: An amateur’s introduction to recursive query processing strategies. In: Readings in Artificial Intelligence and Databases, pp. 376–430. Elsevier (1989)
  5. Begoli, E., Camacho-Rodríguez, J., Hyde, J., Mior, M.J., Lemire, D.: Apache Calcite: a foundational framework for optimized query processing over heterogeneous data sources. In: Proceedings of the 2018 International Conference on Management of Data, pp. 221–230 (2018)
  6. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink: stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 36(4), 28–38 (2015)
  7. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about Datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1), 146–166 (1989)
  8. Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V.: Spinning fast iterative data flows. Proc. VLDB Endow. 5(11), 1268–1279 (2012)
  9. Fan, Z., Zhu, J., Zhang, Z., Albarghouthi, A., Koutris, P., Patel, J.: Scaling-up in-memory Datalog processing: observations and techniques. arXiv preprint arXiv:1812.03975 (2018)
  10. Gévay, G.E., Rabl, T., Breß, S., Madai-Tahy, L., Markl, V.: Labyrinth: compiling imperative control flow to parallel dataflows. arXiv preprint arXiv:1809.06845 (2018)
  11. Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: GraphX: graph processing in a distributed dataflow framework. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 599–613 (2014)
  12. Gu, J., et al.: RaSQL: greater power and performance for big data analytics with recursive-aggregate-SQL on Spark. In: Proceedings of the 2019 International Conference on Management of Data, pp. 467–484 (2019)
  13. Hajiyev, E., Verbaere, M., de Moor, O.: codeQuest: scalable source code queries with Datalog. In: Thomas, D. (ed.) ECOOP 2006. LNCS, vol. 4067, pp. 2–27. Springer, Heidelberg (2006). https://doi.org/10.1007/11785477_2
  14. Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010)
  15. Murray, D.G., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M.: Naiad: a timely dataflow system. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 439–455 (2013)
  16. Ryzhyk, L., Budiu, M.: Differential Datalog. In: Datalog 2.0 - 3rd International Workshop on the Resurgence of Datalog in Academia and Industry. CEUR-WS (2019)
  17. Seo, J., Park, J., Shin, J., Lam, M.S.: Distributed SociaLite: a Datalog-based language for large-scale graph analysis. Proc. VLDB Endow. 6(14), 1906–1917 (2013)
  18. Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T., Zaniolo, C.: Big data analytics with Datalog queries on Spark. In: SIGMOD, pp. 1135–1149 (2016)
  19. Subotić, P., Jordan, H., Chang, L., Fekete, A., Scholz, B.: Automatic index selection for large-scale Datalog computation. Proc. VLDB Endow. 12(2), 141–153 (2018)
  20. Wang, J., Balazinska, M., Halperin, D.: Asynchronous and fault-tolerant recursive Datalog evaluation in shared-nothing engines. Proc. VLDB Endow. 8(12), 1542–1553 (2015)
  21. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
  22. Zhang, Q., et al.: Optimizing declarative graph queries at large scale. In: Proceedings of the 2019 International Conference on Management of Data, pp. 1411–1428 (2019)
  23. Apache Giraph. http://giraph.apache.org/. Accessed 12 Apr 2020
  24. Cog. https://github.com/imran-4/cog. Accessed 12 Apr 2020
  25.
  26.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Technische Universität Berlin, Berlin, Germany
