The previous section described the research questions to be answered, the environment that was set up for the experiment and the queries to be run. This section will present the obtained results, answer the research questions (with the help of additional data in some cases) and discuss potential threats to the validity of the work. The raw data and all related source code supporting these results are available from the Aston Data Explorer repository.
Measurements obtained
The median execution times (in milliseconds) and coefficients of dispersion over 1000 executions of the GraBaTs’09 queries from Sect. 3.3.1 are shown in Table 1. Likewise, the results for the Train Benchmark queries from Sect. 3.3.2 are shown in Tables 2, 3, 4. To save space, Hawk with the Neo4j backend is abbreviated to “Hawk/N” and “H/N”. Likewise, Hawk with the OrientDB backend is shortened to “Hawk/O” and “H/O”. These abbreviations will be used throughout the rest of the paper as well.
Medians were picked as a measure of centrality due to their robustness against the occasional outliers that a heavily stressed system can produce. Coefficients of dispersion are dimensionless measures of dispersion that can be used to compare data sets with different means: they are defined as \(\tau /\eta \), where \(\tau \) is the mean absolute deviation from the median \(\eta \). Coefficients of dispersion are robust to non-normal distributions, unlike the better known coefficients of variation [6]. The tables allow for quick comparison of performance levels across tools, queries and number of client threads. Nevertheless, more specific visualizations and statistical analyses will be derived for some of the following research questions.
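As an illustration, the coefficient of dispersion of a sample can be computed in a few lines. The sketch below uses Python and illustrative values, not data from the tables:

```python
from statistics import median

def coefficient_of_dispersion(times):
    """tau / eta: the mean absolute deviation from the median (tau),
    divided by the median itself (eta)."""
    eta = median(times)
    tau = sum(abs(t - eta) for t in times) / len(times)
    return tau / eta

# Illustrative sample of execution times in milliseconds, with one outlier
sample = [100, 102, 98, 101, 99, 100, 350]
cod = coefficient_of_dispersion(sample)
```

Note how the single outlier barely shifts the median, so the metric stays moderate; a coefficient of variation based on the mean and standard deviation would be inflated much more by the same outlier.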
Table 1 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the GraBaTs’09 queries, by tool, language, protocol and client threads

Table 2 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries ConnectedSegments and PosLength, by tool, language, protocol and client threads

Table 3 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries RouteSet and SwitchMonitored, by tool, language, protocol and client threads

Table 4 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries SemaphoreNeighbor and SwitchSet, by tool, language, protocol and client threads

One important detail is that SemaphoreNeighbor was not fully run through CDO and Mogwaï, as it runs too slowly to allow for stress testing. More specifically, with only 1 client thread over TCP, the median time for the first 10 runs of SN was 88.44 seconds for CDO and 305.19 seconds for Mogwaï. For this reason, only Hawk was fully evaluated on SN.
The next step was to check whether the execution times belonged to a normal distribution, for the sake of analysis. Shapiro–Wilk tests rejected the null hypothesis (“the sample comes from a normal distribution”) with p values \(<0.01\) for almost all combinations of query, tool, language, protocol and thread count: only 11 out of 516 tested combinations reported p values \(\ge 0.01\). To visualize how they deviated from a normal distribution, further manual inspections with quantile–quantile plots were conducted. These confirmed that most distributions tended to be heavy-tailed, bimodal, multimodal or curved.
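For reference, such a normality test can be reproduced with SciPy's implementation of the Shapiro–Wilk test. The sketch below uses synthetic heavy-tailed data standing in for the recorded execution times:

```python
import random

from scipy.stats import shapiro

random.seed(42)
# Synthetic heavy-tailed "execution times" (ms): exponentially distributed,
# standing in for one recorded sample of 1000 query executions
times = [random.expovariate(1 / 200.0) for _ in range(1000)]

stat, p = shapiro(times)
# As in the study, p < 0.01 rejects the hypothesis of normality
normality_rejected = p < 0.01
```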
This is somewhat surprising, as the natural intuition is that execution times should follow a normal distribution: 90% of the Java benchmarks conducted by Georges et al. [16] on single-core processors did follow a Gaussian distribution according to Kolmogorov–Smirnov tests. At the same time, 10% of those benchmarks were not normally distributed (being reportedly skewed or bimodal), and modern machines with multi-core processors have only grown more non-deterministic since then. More recently, Chen et al. [9] concluded that execution times for multithreaded loads on modern multi-core machines follow neither normal nor log-normal distributions, and that more robust nonparametric methods are needed for performance comparison. Our study in particular involves 3 machines communicating over Ethernet and performing heavy disk-based I/O: even with 1 thread per client machine, the server will experience a non-deterministic multithreaded load like the one studied by Chen et al. For these reasons, the rest of this paper will assume that the query execution times are not normally distributed and will use nonparametric tests.
Some of the configurations had intermittent issues when running queries. This was another goal of our stress testing: determining whether the different tools would fail under increased demand and whether they could recover from these errors (which they did, without manual intervention). Table 5 shows the configurations that produced server errors, and Table 6 shows the configurations that reported the wrong number of results. The “correct” number of results was computed by running each query across all tools in local installations, without the risk of the network or the stress test influencing the result, and ensuring they all reported equivalent results. In our previous conference paper [14], incorrect executions only happened for the CDO HTTP API, but a similar issue exists even in the TCP API for some of the Train Benchmark queries. Hawk over Neo4j and Mogwaï were the only combinations of tool and backend that did not report failed or incorrect executions.
Regarding the failed and incorrect executions of CDO, at this early stage of the study we could only treat it as a black box, as we were merely users of this tool and not its developers. However, our analysis of RQ2 in Sect. 4.3 suggests that this is due to the stateful buffer-based design of the CDO API. As for the failed executions of Hawk with OrientDB, we attribute these problems to concurrency issues in the OrientDB backend, since Hawk with Neo4j does not report any issues and otherwise runs the exact same code.
RQ1: impact of protocol
A quick glance at the results in Tables 1, 2, 3 and 4 shows that there are notable differences between HTTP and TCP in some cases, but not always: in fact, sometimes HTTP appears to be faster.
To clarify these differences, pairwise Mann–Whitney U tests [24] were conducted between the HTTP and TCP results of every configuration. p values < 0.01 were required to reject the null hypothesis that, for a particular configuration, HTTP and TCP were equally likely to be the slower of the two. Where the null hypothesis was rejected, Cliff deltas were computed to measure effect size [19]. Cliff delta values range between +1 (for all pairs of execution times, HTTP was always slower) and −1 (HTTP was always faster). Cohen’s d effect sizes [10] were not considered, since execution times were not normally distributed. The results are summarized in Tables 7 and 8. 99% confidence intervals of the difference between HTTP and TCP execution times were also computed during the Mann–Whitney U tests, but due to space constraints they were not included in those tables. Some of those confidence intervals will be mentioned in the following paragraphs.
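Cliff deltas have a direct pairwise definition, which the following Python sketch implements on illustrative values (the real computation paired the 1000 HTTP and 1000 TCP times of each configuration):

```python
def cliff_delta(xs, ys):
    """Cliff's delta over all pairs: P(x > y) - P(x < y).
    +1 means every value in xs exceeds every value in ys; -1 the opposite."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Illustrative times (ms): HTTP consistently slower than TCP gives +1
http_times = [120, 130, 125]
tcp_times = [100, 105, 102]
delta = cliff_delta(http_times, tcp_times)
```

Being rank-based, the metric depends only on which of each pair is larger, so it is unaffected by the heavy tails observed in the samples.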
CDO is the simplest case here: all tested configurations have significant differences and report positive effect sizes, meaning that HTTP was consistently slower than TCP. Cliff deltas become much weaker (closer to 0) with increasing number of threads, except for the SM and SS queries. This is also confirmed through the confidence intervals: for OQ with 1 thread, it is \([+6929\text {ms}, +6944\text {ms}]\), while with 64 threads it is only \([+78\text {ms}, +1406\text {ms}]\). By comparing the medians, it can be seen that HTTP can be over 3000% slower than TCP in extreme cases, such as SS with 4 client threads.
Mogwaï has conflicting results across the two case studies. For the OQ GraBaTs’09 query, HTTP is quite often faster, though HTTP and TCP become rather similar for 32 and 64 threads, with absolute values below 0.35. For the Train Benchmark queries, TCP is more often faster than HTTP, though this difference again drops off as the number of client threads increases. For SM and SS, the two fastest-running queries for Mogwaï in the TB case study, the difference is again very small. This suggests that for Mogwaï, in addition to the protocol used, the way concurrency is handled by the server and how it interacts with the query might have an impact as well. In particular, the Jetty HTTP server and the TCP server use different types of network I/O: non-blocking for Jetty (which decouples network I/O from request processing) and blocking for Thrift (which simplifies the Thrift message format).
The results from Hawk are the most complex to analyse. Regarding the GraBaTs’09 queries, HQ1 is consistently slower on HTTP for Neo4j and OrientDB, with strong effect sizes for all numbers of client threads. HQ2 is only slower on HTTP for OrientDB, especially with few client threads: with Neo4j, HTTP is faster for 1–8 threads. HQ3 and HQ4 sometimes run slower on HTTP, but effect sizes are weaker overall and in most cases there is not a significant difference. It appears that once queries are optimized through derived and indexed attributes, there is not that much difference between HTTP and TCP.
As for the Train Benchmark queries under Hawk, a first step is studying the Cliff deltas for each query:
-
CS shows no consistent pattern by backend or by query language: effect sizes are only moderate with Neo4j (absolute values below 0.30), and with OrientDB effect sizes are positive when using EOL and negative when using EPL.
-
PL and RS are usually faster on HTTP, especially with OrientDB.
-
SM on the other hand is consistently slower on HTTP. This time, the strongest effect sizes are produced when using Neo4j.
-
SN is consistently slower on HTTP with Neo4j, and consistently faster on HTTP with OrientDB.
-
SS effect sizes are usually positive with Neo4j and negative with OrientDB, but they are weak, with absolute values below 0.22.
From these results, it appears that the largest factor in HTTP slowdown patterns for Hawk is the chosen backend, suggesting that the interaction between the concurrency and I/O patterns of the Jetty HTTP server, the Thrift TCP server and the Hawk backend may be relevant. While the Hawk Neo4j backend took advantage of the thread safety built into Neo4j, the OrientDB backend has only recently implemented its own thread pooling to preserve database caches across queries. The query language was only important for CS, showing that there may be an interaction, though it may only be relevant for certain queries.
As for the coefficients of dispersion, the general trend is that they increase as more client threads are used. This is to be expected from the increasingly non-deterministic multithreaded load, but the exact pattern changes depending on the tool and the protocol. For most configurations, Hawk shows very similar CDs with HTTP and TCP, and so does Mogwaï (except for some rare cases such as SM and SS over 32 threads): this is likely due to the fact that the message exchanges are the same across both solutions (a single request/response pair). However, CDO shows consistently different CDs over HTTP and over TCP, suggesting that they may be fundamentally different in design from each other. This will be the focus of the next section.
RQ2: impact of API design
One striking observation from RQ1 was that CDO over HTTP had much higher overhead than Hawk and Mogwaï over HTTP. Comparing the medians of OQ and HQ1 with 1 client thread, CDO+HTTP took \(635.66\%\) longer than CDO+TCP, while Hawk+HTTP only took \(24.16\%\) longer than Hawk+TCP. This contrast showed that CDO used HTTP to implement its API very differently from the other tools.
Table 5 Failed executions (timeout / server error), by query, tool, protocol and client threads (“1t”: 1 thread). Only combinations with failed executions are shown

Table 6 Executions with incorrect number of results, by query, tool, protocol and client threads (“1t”: 1 thread). Only combinations with incorrect executions are shown

To clarify this issue, the Wireshark packet sniffer was used to capture the communications between the server and the client for one invocation of OQ and HQ1 (with Hawk over Neo4j). These captures showed quite different approaches for an HTTP-based API:
-
CDO involved exchanging 58 packets (10203 bytes), performing 11 different HTTP requests. Many of these requests were very small and consisted of exchanges of byte buffers between the server and the client, opaque to the HTTP servlet itself.
Most of these requests were either within the first second of the query execution time or within the last second. There was a gap of approximately 6 seconds between the first group of requests and the last group. Interestingly, the last request before the gap contained the OCL query and the response was an acknowledgement from CDO. On the first request after the gap, the client sent its session ID and received back the results from the query.
Table 7 Cliff deltas for GraBaTs’09 query execution (−1: HTTP is faster for all pairs, 1: HTTP is slower for all pairs), for configurations where the Mann–Whitney U test reports significance (p value < 0.01), by tool, query and number of client threads (“1t”: 1 thread)

Table 8 Cliff deltas for Train Benchmark query execution (−1: HTTP is faster for all pairs, 1: HTTP is slower for all pairs), for configurations where the Mann–Whitney U test reports significance (p value < 0.01), by tool, query and number of client threads (“1t”: 1 thread)

The capture indicates that these CDO queries are asynchronous in nature: the client sends the query and eventually gets back the results. While the default Net4j TCP connector allows the CDO server to talk back to the client directly through the connection, the experimental HTTP connector relies on polling for this task. This introduces unwanted delays into the execution of the queries. The result suggests that an alternative solution for this bidirectional communication, such as WebSockets, would be advisable.
-
Hawk involved exchanging 14 packets (2804 bytes), performing 1 HTTP request and receiving the results of the query in the same response. Since its API is stateless, there was no need to establish a session or keep a bidirectional server–client channel: the results were available as soon as possible.
While this synchronous and stateless approach is much simpler to implement and use, it does have the disadvantage of making the client block until all the results have been produced. Future versions of Hawk could also implement asynchronous querying as suggested for CDO.
One side note is that Hawk required much less bandwidth than CDO: this was due to a combination of using fewer requests, applying gzip compression to the responses and taking advantage of the most efficient binary encoding available in Apache Thrift (Tuple).
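The delay introduced by polling can be illustrated with a small model. The sketch below is a hypothetical simulation, not CDO's actual connector logic: a client polling once per second observes a result that became ready at 6.2s only at the 7s mark.

```python
import math

def polled_completion(result_ready_at, poll_interval, first_poll_at=0.0):
    """Time (in seconds) at which a polling client first observes a result
    the server finished at `result_ready_at`. A full-duplex channel would
    deliver the result at `result_ready_at` itself."""
    if result_ready_at <= first_poll_at:
        return first_poll_at
    polls_needed = math.ceil((result_ready_at - first_poll_at) / poll_interval)
    return first_poll_at + polls_needed * poll_interval

# A query whose results are ready after 6.2s is only seen at the 7s poll,
# adding 0.8s of latency compared to a direct response over the channel
observed = polled_completion(6.2, 1.0)
```

Under this model, each request adds up to one polling interval of latency on top of the query time itself, which compounds with the multiple request/response exchanges observed in the CDO capture.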
In summary, CDO and Hawk use HTTP in very different ways. The CDO API is stateful and consists of exchanging pending buffers between server and client: queries are asynchronous. This is not a problem when using TCP, since messages can be exchanged both ways. However, HTTP by itself does not allow the server to initiate a connection to the client to send back the results when they are available: to emulate this, polling is used. This could be solved with technologies such as WebSockets, which is essentially a negotiation process for upgrading an HTTP connection to a full-duplex connection.
This stateful and buffer-based communication explains some of the intermittent communication issues that were shown for CDO in Tables 5 and 6. In a heavily congested multithreaded environment, concurrency issues (race conditions or thread-unsafe code) may result in buffers being sent out of order or mangled together. If the state of the connection becomes inconsistent, it may fail to produce a result at all, or it may miss some of the results that were sent across the connection.
In comparison, the Hawk API is stateless and synchronous: query results are sent back in a single message. Since there are no multiple messages that need to be correlated to each other, this problem is avoided entirely.
These results suggest that while systems may benefit from supporting both synchronous querying (for small or time-sensitive queries) and asynchronous querying (for large or long-running queries), asynchronous querying can be complex to implement in a robust manner. Proper full-duplex channels are required to avoid delays (either raw TCP or WebSockets over HTTP), and adequate care must be given to thread safety and message ordering.
RQ3: impact of caching and indexing
This section will focus on the results from the TCP variants, since they were faster than or equivalent to the HTTP variants in the previous tests. It will also concentrate on the times in the ideal situation where there is only 1 client thread: later questions will cover the scenarios with higher numbers of client threads.
GraBaTs’09 queries
A Kruskal–Wallis test reported there were significant differences in TCP execution times across tool/query combinations with 1 client thread (p value below 0.01). A post hoc Dunn test [12] was then used to compute p values for pairwise comparisons, using the Bonferroni correction. There was only one pairwise comparison with p value higher than 0.01: HQ3 with Hawk/Neo4j against HQ3 with Hawk/Orient (p value = 0.054). These two configurations will be considered to be similar in performance. All other comparisons will be based on the medians shown in Table 1.
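A sketch of how such an omnibus test can be run with SciPy is shown below, using illustrative samples rather than the recorded times. A significant result would then be followed by a post hoc Dunn test (available, for instance, in the scikit-posthocs package) with the Bonferroni correction:

```python
from scipy.stats import kruskal

# Illustrative execution times (ms) for three configurations; the real
# analysis used the 1000 recorded times of each tool/query combination
cdo = [1088, 1100, 1095, 1090, 1102]
hawk_neo4j = [1631, 1640, 1625, 1633, 1629]
mogwai = [5673, 5680, 5665, 5671, 5677]

h_statistic, p_value = kruskal(cdo, hawk_neo4j, mogwai)
significant = p_value < 0.01
```

Like the Mann–Whitney U test, the Kruskal–Wallis test is rank-based, so it remains valid for the non-normal distributions observed in Sect. 4.1.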
Looking at the OQ and HQ1 times for CDO, Hawk and Mogwaï, CDO is the fastest, with a median of 1088ms compared to 5673ms from Mogwaï, 1631ms from Hawk/Neo4j and 3491ms from Hawk/OrientDB. This is interesting, as normally one would assume that the join-free adjacency of the graph databases used in Hawk and Mogwaï would give them an edge over the default H2 relational backend in CDO.
Enabling the SQL trace log in CDO showed that after the first execution of OQ, later executions only performed one SQL query to verify whether there were any new instances of TypeDeclaration. Previous tests had already rejected the possibility that CDO was caching the query results. Instead, an inspection of the CDO code revealed a collection of generic caches. Among others, CDO keeps a CDOExtentMap from EClasses to all their EObject instances, and also keeps a CDORevisionCache with the various versions of each EObject. CDO keeps a cache of prepared SQL queries as well.
In comparison, Hawk and Mogwaï do not use object-level caching by default, relying mostly on the graph database caches instead. Neo4j caches are shared across all threads, whereas in OrientDB they are specific to each thread, requiring more memory. OrientDB caches can be configured to free up memory as soon as possible (weak Java references) or to use up as much memory as possible (soft Java references): for this study, the second mode was used, but the authors identified issues with this particular mode. The issues were reported and resolved, but the lack of an LRU policy in the OrientDB in-house cache prompted the authors to have Hawk replace it with a standard Guava cache.
Beyond object-level caching, Hawk caches type nodes and Mogwaï caches the compiled version of the ATL script that transforms OCL to Gremlin. The ATL caching in Mogwaï was in fact added during this study, as a result of communication with the Mogwaï developers that produced several iterations tackling limitations in the OCL transformer, reducing query latency and resolving concurrency issues.
The above results indicate that a strong caching layer can have an impact large enough to trump a more efficient persistent representation in some situations. Nevertheless, the results of HQ2, HQ3 and HQ4 confirm the findings of our previous work in scalable querying [3, 4]: adding derived attributes to reduce the levels of iteration required in a query speeds up running times by orders of magnitude, while adding minimal overhead due to the use of incremental updating. These derived attributes can be seen as application-specific caches that precompute parts of a query, unlike the application-agnostic caches present in CDO:
-
HQ2 replaces the innermost loop in HQ1 with the use of precomputed derived attributes (isStatic, isPublic and isSameReturnType) of a generic nature. These derived attributes produce a 2.80x speedup on Hawk/Neo4j and a 2.04x speedup on Hawk/OrientDB. OrientDB receives less of a boost, as following edges generally appears to be less efficient than in Neo4j.
-
HQ3 uses the same attributes but rearranges the query to have them appear in the outermost “select”, so Hawk can transform the iteration transparently into a lookup. Compared to HQ2, HQ3 is 2.71x faster on Neo4j and 9.47x faster on OrientDB. From the previous Dunn post hoc test, it appears that indexing in the Hawk and OrientDB backends is similarly performant in this case.
-
HQ4 uses a much more specific derived attribute (isSingleton) that eliminates one more level of iteration, turning the query into a simple lookup. HQ4 is one order of magnitude faster than HQ3 both on Neo4j and OrientDB, but here Hawk/Neo4j is somewhat faster. This suggests that a single index lookup is faster on Neo4j, whereas multiple index lookups are faster on OrientDB. This may be due to the way OrientDB caches index pages internally, compared to Neo4j.
Train Benchmark
The Train Benchmark results span 6 queries of very different nature: some are very lightweight, while others require a more intensive traversal of the underlying graph. For each query, a Kruskal–Wallis test confirmed that there were significant differences in TCP execution times across configurations (p value < 0.01). A post hoc Dunn test confirmed that most pairwise combinations of configurations had significant differences as well (p value < 0.01 with Bonferroni correction), except for SwitchSet between Hawk/Orient/EOL and Mogwaï. Having established that most differences in times are significant, this section will use the medians in Tables 2, 3 and 4 to compare the tools.
To simplify the comparison, rather than using the tables directly, this section will use the more intuitive radar plots in Fig. 10 to guide the discussion. Comparing the relative area of each different tool gives a general impression of their standing: tools with smaller areas are faster in general. The Hawk side and the CDO/Mogwaï side use the same scales, to allow for comparisons across plots. CDO and Mogwaï do not have any data points for SN, since they were too slow for a full run (Sect. 4.1).
The Hawk side compares the relative performance of the four tested configurations (two backends, two query languages). It can be seen that the OrientDB backend is close to the Neo4j backend in some queries (CS and SM), twice as slow in most queries, and noticeably slower in RS. Examining these results suggests that while derived/indexed attributes are effective on both backends, range queries in OrientDB do not deal well with high-cardinality attributes:
-
The two queries that ran in similar times (CS and SM) use custom Hawk indices: CS performs an indexed range query on a derived attribute (nMonitoredSegments > 5), and SM performs an indexed lookup (isMonitored = false).
-
However, PL is still slow even though it uses an indexed range query (length \(\le 0\)), which apparently contradicts the results obtained with CS. One important difference between the queries is that there are many more distinct values of length (978) than of nMonitoredSegments (2): the indexed range query in PL will need to read many more SB-Tree nodes than in CS.
Looking at the CDO/Mogwaï side, it appears that the generic caching in CDO helped obtain good performance in PL, RS (where it slightly outperformed even Hawk with Neo4j) and CS, but it was not that useful for SM. In SM, Mogwaï can follow the monitoredBy reference faster than CDO, and Hawk can use an indexed lookup to fetch directly the 35 unmonitored Switches instead of going through all 1501 of them. In general, it appears that CDO deals quite well with queries that involve few types, in addition to queries with few nested reference traversals.
While Mogwaï does not support indexed attributes, its use of Neo4j through NeoEMF should have given it similar performance to that of Hawk with Neo4j through the default Neo4j caching. Instead, it is always slower than Hawk with Neo4j and EOL, and it is only faster than Hawk with OrientDB and EOL on RS. After a discussion with the Mogwaï/NeoEMF developers, it seems that this difference may be due to the use of Neo4j 1.9.6 in NeoEMF (Hawk uses 2.0.5, after testing various 2.x releases), and to inefficiencies in the bundled implementation of Gremlin.
RQ4: impact of mapping from query to backend
In a database-backed model querying solution, the query language is the interface shown to the user for accessing the stored models, and a query engine is the component that maps the query into an efficient use of the backend. Good solutions are those whose queries are easy to read and write and are mapped to the best possible use of the backend.
Since the query language, the query engine and the backend are all interrelated, it is hard to separate their individual contributions. CDO and Mogwaï use the same query language, but run it in very different ways. Likewise, Mogwaï and Hawk share a backend (Neo4j), but they store models differently and use different APIs to access it. For this reason, it is not possible to talk about what is the “best” query language in isolation of the other factors, or make other similar general statements. Instead, the answer to RQ4 will start from each source language and draw comparisons on how their queries were mapped to the capabilities of the backends, for the different tools that supported them:
-
OCL is reasonably straightforward to use for queries with simple pattern matching, like OQ/MQ from GraBaTs’09 or the Train Benchmark PL and SM queries. However, it quickly becomes unwieldy with queries that have more complex pattern matching, requiring many nested select/collect invocations in cases such as SN (Fig. 4 on page 17).
CDO and Mogwaï map OCL in very different ways. CDO parses the OCL query into a standard Eclipse OCL abstract syntax tree of Java objects and evaluates the tree, providing a CDO environment that integrates caching and reads from the database as needed with multiple SQL queries. This allows it to start running the query very quickly, but it also implies that OCL queries need to switch back and forth between the H2 database layer and the model query layer, reducing performance. This may have been one of the main reasons for CDO’s inclusion of an object-level cache.
Mogwaï, on the other hand, parses the OCL query as a model, transforms it into Gremlin, compiles the Gremlin script into bytecode, executes the query entirely within Gremlin and deserializes the results back into EMF objects. This process increases query latency over an interpreted approach, but queries could potentially run faster thanks to less back and forth between layers. However, as mentioned for RQ3, the use of an old release of Neo4j (1.9.6) in the current version of Mogwaï has made it run quite slow, negating this advantage over CDO and Hawk.
-
EOL is inspired by OCL, and while the examples show that it is slightly more concise, it still suffers from the same nested collect/select problem when performing complex graph pattern matching. The execution approach is also similar: the EOL query is turned into an abstract syntax tree, which is visited in a post-order manner to produce the final value.
However, the EOL-Hawk bridge [4] takes advantage of several features of the underlying graph database: custom indices (already discussed for RQ3) and the bidirectional navigability of the edges. The latter allows for following references in reverse (from target to source), so certain queries can be written much more efficiently. This was the reason why the median time for SN was 352ms with Hawk/Neo4j/EOL and over 300s with Mogwaï. This is a missed opportunity for Mogwaï, which could have exposed this capability through OCL as well.
-
EPL is a refined version of EOL which is specialized towards pattern matching. Looking at SN again, the EPL version is much easier to understand, with no explicit nesting: these nested loops are implicit in EPL’s execution. Like EOL, EPL is also interpreted instead of compiled, reducing latency for some queries.
As shown in Tables 2, 3, 4 and Fig. 10, EPL appears to be consistently slower than EOL, even though the queries are very similar. The overhead is especially notable for SN, where EPL is twice as slow as EOL. To clarify this issue, a profiler was used to follow 5 executions of the EOL and EPL versions of SN. It revealed that the additional type checking done implicitly by EPL on every match candidate was the main reason for the heavy slowdown. While this check is painless on traditional in-memory models, on the graph databases built by Hawk it requires following one more edge and potentially performing disk I/O. Disabling this type check, by referring instead to the “Any” root supertype in Epsilon, returned execution times to values similar to those of EOL.
Table 9 Bounds of the 99% confidence interval for median execution time ratios between CDO and other tools (GraBaTs’09). Values greater than 1 indicate that CDO is slower, while values less than 1 indicate that the other tool is slower

Table 10 Bounds of the 99% confidence interval for median execution time ratios between CDO and other tools (Train Benchmark)

In closing, these experiences show that while query compilation may have a higher potential for performance, it may be more important to focus on selecting a stronger database technology and fully exposing the strengths of that technology through the query language and the query engine. Developers wishing to repurpose existing “declarative” query languages need to test whether any language features interact negatively with the chosen technology, as the cost of certain common operations may have changed dramatically.
RQ5: scalability with demand
The next question concerned how well relational and graph-based approaches scale as demand increases: one approach could do well with few clients, but then quickly drop in performance with more clients. Ideally, we would simply swap relational backends with graph-based backends in each tool and do separate comparisons. Unfortunately, CDO does not include a graph-based backend, and Hawk and Mogwaï do not support relational backends. Instead, we will make the comparison across tools, assuming that each tool was specially tailored to its backend and that they are therefore good representatives of their type of approach. These results could be revisited if new backends were developed, but they should serve as a good snapshot of the tools’ standing at the time of writing this paper.
In this section, the relational approaches will be represented by CDO (based on the embedded H2 database), and the graph-based approaches will be represented by Hawk (combined with Neo4j 2.0.5 or OrientDB 2.2.8) and Mogwaï (backed by NeoEMF, which uses Neo4j 1.9.6). CDO is one of the most mature model persistence layers and has considerable industrial adoption, so it can be considered a good representative for the relational approaches.
First, Kruskal–Wallis tests confirmed (with p values < 0.01) that for each combination of query and client threads, TCP execution times had significant differences across the tested combinations of tool, backend and query language. Post hoc Dunn tests were used to evaluate the null hypotheses that CDO execution times were similar to each of the non-CDO configurations (p values < 0.01). In most cases, the null hypothesis was rejected, but there were some exceptions (2 out of 63 for the GraBaTs’09 queries, and 4 out of 175 for the Train Benchmark queries).
After confirming significant differences for most CDO vs. non-CDO pairs, the next step was quantifying how those pairs scaled relative to each other. Cliff deltas would have been able to express whether a certain configuration started being faster more often than the other at a certain point, but they could not show whether the gap between CDO and the non-CDO configuration increased, stayed the same or decreased as client threads were added. Instead, it was decided to use the median of the \(t_c/t_o\) ratios between random pairings of the \(t_c\) CDO TCP execution times and the \(t_o\) non-CDO TCP execution times: values larger than 1 imply that CDO was slower, and values smaller than 1 that CDO was faster. To increase the level of confidence in the results, bootstrapping over 10,000 rounds was used to estimate a 99% confidence interval for this “median of ratios” metric. The confidence intervals produced for the GraBaTs’09 and Train Benchmark queries are shown in Tables 9 and 10, respectively. Cells with “1.0–1.0” represent cases where CDO and the tool did not report significantly different times according to the Dunn tests.
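The "median of ratios" bootstrap can be sketched as follows with the standard library only. The round count and percentile bounds mirror the description above, but the helper name, the fixed seed and the toy timings are our own illustrative choices, not taken from the study.

```python
import random
import statistics

def median_ratio_ci(t_cdo, t_other, rounds=10_000, alpha=0.01, seed=1):
    """Bootstrap a (1 - alpha) confidence interval for the median of the
    t_c/t_o ratios between random pairings of CDO and non-CDO times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = min(len(t_cdo), len(t_other))
    medians = sorted(
        statistics.median(rng.choice(t_cdo) / rng.choice(t_other)
                          for _ in range(n))
        for _ in range(rounds)
    )
    lo = medians[int(rounds * alpha / 2)]
    hi = medians[min(rounds - 1, int(rounds * (1 - alpha / 2)))]
    return lo, hi

# Toy data: CDO consistently about 2x slower than the other tool,
# so the whole interval should sit above 1 (CDO slower).
lo, hi = median_ratio_ci([20.0, 21.0, 19.5, 20.5], [10.0, 10.2, 9.8, 10.1])
print(lo > 1.0 and hi > 1.0)  # True
```

The percentile method used here is the simplest bootstrap interval; refinements such as BCa intervals exist but are not needed to convey the idea.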
In absolute terms, in most cases if a query runs faster or slower on a certain tool than on CDO, it will remain that way for all client threads. However, there are some exceptions:
-
Mogwaï becomes slightly faster than CDO for the PL query with 2 or more threads, and slower than CDO for SS with 4+ threads. In fact, all non-CDO solutions experience a noticeable drop in performance for SS with 4 threads: it is just that Mogwaï did not have enough leeway to stay ahead of CDO. It appears that when running queries with no specific optimizations (e.g. indexed attributes), there may be less thread contention on CDO than on the other tools, closing the gap that originally existed in some cases.
-
Hawk with Neo4j/EPL and Hawk with OrientDB/EOL start with better performance than CDO for SS, but quickly drop to similar or slightly inferior performance when using 4 or more client threads. In the first case, the additional type checks performed by EPL weigh Hawk down. In the second case, the lower performance of the OrientDB backend gives Hawk less margin to handle the CPU saturation at 4 threads; with OrientDB/EPL, Hawk is already slightly slower than CDO with 1 thread.
One interesting observation is that depending on the combination of query and tool, some queries maintain a consistent ratio with CDO (e.g. OQ on Mogwaï), others rise and then fall (PL for Mogwaï and Hawk), and others simply fall (RS and SS on all tools). This further supports the idea that thread contention profiles among the different tools vary notably for the same query. While further studies would be necessary to find out the specific reasons for most of these cases, there are some configurations for which it is easier to explain. The reason behind HQ4 having consistently increasing ratios for Hawk/Neo4j and Hawk/OrientDB is that it replaces multiply nested loops with a single lookup, changing the underlying order of the computation: the heavier the load, the larger the contrast created by this change.
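The effect behind HQ4 can be illustrated with a hypothetical in-memory model: the naive query nests a loop over methods inside a loop over types, while a precomputed attribute index (in the spirit of Hawk's indexed and derived attributes) collapses the scan into dictionary lookups. The element structure and predicate below are invented for illustration and are not the actual GraBaTs'09 metamodel or query.

```python
from collections import defaultdict

# Naive form: for every type, scan every method -- O(|types| * |methods|).
def singletons_nested(types):
    return [t["name"] for t in types
            if any(m["public"] and m["static"] and m["returns"] == t["name"]
                   for m in t["methods"])]

# Indexed form: build the index once at update time; each query then
# reduces to a constant-time lookup per type.
def build_index(types):
    index = defaultdict(set)
    for t in types:
        for m in t["methods"]:
            if m["public"] and m["static"]:
                index[m["returns"]].add(t["name"])
    return index

def singletons_indexed(types, index):
    return [t["name"] for t in types if t["name"] in index[t["name"]]]

model = [
    {"name": "A", "methods": [{"public": True, "static": True, "returns": "A"}]},
    {"name": "B", "methods": [{"public": True, "static": False, "returns": "B"}]},
]
print(singletons_nested(model))                     # ['A']
print(singletons_indexed(model, build_index(model)))  # ['A']
```

Under load, the indexed form also touches far fewer model elements per query, which is consistent with the observation above that the ratio against CDO grows with the number of client threads.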
As a general conclusion, graph databases are not by themselves a silver bullet: Mogwaï, for instance, did not outperform CDO in many queries. It is important to use recent releases and take advantage of every feature at one’s disposal in order to achieve a solid advantage over mature relational technologies.
Threats to validity
This section discusses the threats to the internal and external validity of the results, as well as the steps we have taken to mitigate them. Starting with the internal validity of the results, these are the threats we have identified:
-
There is a possibility that CDO, Hawk or Mogwaï could have been configured or used more effectively. Since the authors developed Hawk, this may have allowed them to fine-tune Hawk better than CDO or Mogwaï.
However, the servers did not show any undesirable virtual memory usage, excessive garbage collection or unexpected disk I/O. The H2 backend was chosen for CDO due to its maturity in comparison with the other backends, and the Neo4j backend has consistently produced the best results for Hawk according to previous work. Mogwaï is only available for the Neo4j backend of NeoEMF, so using an alternative configuration was out of the question.
The authors contacted the CDO developers regarding how to compress responses and limit results by resource, to make it more comparable with Hawk, and were informed that these were not supported yet.Footnote 27 The authors also collaborated with the Mogwaï developers to improve performance as much as possible during the writing of the paper, contributing bugfixes and suggesting various improvements that reduced query latency.
-
The queries for CDO/Mogwaï and Hawk were written in different languages, so part of the differences in their performance may be due to the languages and not the systems themselves. The aim in this study was to use the most optimized language for each system, since Hawk does not support OCL and Mogwaï and CDO do not support EOL.
Analytically, we do not expect this to have a strong impact on the obtained results for CDO and Hawk, as both languages are very similar in nature and are executed via mature Java-based interpreters. It may only be an issue with Mogwaï, whose OCL-to-Gremlin transformation is still a work in progress and may change when Mogwaï transitions to Neo4j 2.x.
As for whether the results can be generalized beyond this study, there are a few threats that must be acknowledged:
-
This study has not considered running several different queries concurrently. While multiple configurations for Hawk have been considered (all 4 combinations of Neo4j/Orient and EOL/EPL), only one configuration was studied for CDO and for Mogwaï. The tested configurations would be quite typical in most organizations, but it would be interesting to perform studies that mix different queries running concurrently on different models, and that configure Hawk and CDO with different backends, memory limits and model sizes.
-
The experiment has compared a specific set of tools: one for model repositories (CDO), one for graph-based model indexing (Hawk) and one for querying models persisted as graphs (Mogwaï on top of NeoEMF/Graph). This raises the question of whether the results could be extended to other tools of the same types.
The first part of our answer is that this categorization was not relevant for this study: any tool could have been used as long as it provided a high-level remote querying API and relied on a database for persisting the models. CDO, Mogwaï (in combination with NeoEMF/Graph) and Hawk are three instances of these same requirements, and therefore, any generalizations are backed not by one tool, but by all three.
The next part is that while some of the detailed results are specific to certain tools (e.g. comparisons between Neo4j releases), there are higher-level results which reaffirm knowledge from other areas in software engineering. For instance, RQ1 showed that HTTP’s overhead was roughly constant if the message patterns were similar, and RQ2 confirmed just how much of an impact a different message pattern could have. RQ3 compared generic against application-specific caching, RQ4 discussed readability and query implementation quality, and RQ5 confirmed that using a graph backend may not always bring better performance by itself. The high-level observations collected during these studies can be extended to any database-backed remote model querying solution in the future: indeed, part of our intention with this paper was to make future developers aware of these aspects.
-
The results are based on two specific case studies: it could be argued that different case studies could have yielded different results. To avoid introducing bias, the authors refrained from defining custom benchmarks and instead adopted benchmarks from the existing literature. These benchmarks were picked as they covered different application areas (software engineering versus critical systems modelling), different metamodels (highly hierarchical software metamodels versus “flat” railway metamodels), and different workloads (localized pattern matching in GraBaTs’09 versus a combination of complex pattern matching and simple “all X with attribute Y meeting Z” queries in TB).
For these reasons, we argue that the 7 queries across the 2 case studies are representative of pattern matching queries on models, where we want to find elements whose state and relationships meet certain conditions. We do not expect other model querying case studies to change the results significantly. However, our case studies do not cover other model management tasks, such as code generation or model transformation: those would require their own case studies. Incidentally, Hawk did significantly speed up code generation in our previous work [13].