1 Introduction

Model-driven engineering (MDE) has received considerable attention due to its demonstrated benefits of improving productivity, quality and maintainability. However, industrial adoption has run into various challenges regarding the maturity and scalability of MDE. Mohagheghi et al. [25] interviewed participants from four companies and noted concerns that the tools at the time did not scale to the large projects that would merit the use of MDE. Several ways in which MDE practice could learn from widely used programming environments to handle large models better were pointed out by Kolovos et al. [21], with a strong focus on the need for modularity in modelling languages to improve scalability and simplify collaboration. Three categories of scalability issues in MDE were identified by Barmpis and Kolovos [2]:

  • Model persistence: storage of large models, ability to access and update such models with low memory footprint and fast execution time.

    The simplest solution (using one file per model) has not scaled well as models increase in size. One alternative approach is fragmenting the models into multiple smaller files. Another option is writing a model persistence layer that stores the model in a database of a certain type (relational, graph-oriented, and so on).

  • Model querying and transformation: ability to perform intensive and complex queries and transformations on large models with fast execution time.

    Efficient queries and transformations are closely related to the type of persistence used for the models. Fragmented models can be backed with an incrementally maintained model index (such as Hawk)Footnote 1 that can answer queries faster than going through the fragments. For database-backed models, the query must be transformed to an efficient use of the database, and the database must provide a high level of performance. This is the approach taken by Mogwaï,Footnote 2 a query engine for models stored in the NeoEMFFootnote 3 layer that transforms OCL queries into GremlinFootnote 4 API calls.

  • Collaborative work: multiple developers being able to query, modify and version control large-scale shared models in a non-invasive manner.

    With fragmented models, existing version control systems can be reused. Database-backed systems need to implement their own version control approaches: this is the approach taken in model repositories such as CDO.Footnote 5

Regardless of how models are stored, high-performance querying is crucial when dealing with very large models. For instance, within the building industry it is common to use building information models (BIM) containing millions of elements and covering the logical and physical structure of entire buildings. These models need to be queried, e.g. to compute quantity takeoffs which estimate the materials needed to complete construction [1]. Reverse engineering source code into models [7] also produces very large models, and these need to be queried to find design flaws or elements to be modernized, among other things. Complex graph pattern matching may further complicate things, as when validating railway models [32].

Sharing models by sending files manually is inefficient (in effort and transmission time) and prone to mistakes (e.g. having someone use an outdated version). Instead, it is considered better to use model repositories such as CDO or file repositories such as Git, and to expose the models for querying/modification through networked services. As an example, in previous work [13], we demonstrated how Hawk enabled Constellation model repositories to offer dashboards with model metrics and advanced searching from a web interface. Within the MONDO project, one of the tools for collaborative modelling implemented an “online” approach where multiple concurrent users accessed the model over a web interface [27].

Exposing models through networked services introduces new layers of complexity, such as the design and implementation of the service, or the interactions between the layers as more and more clients try to access a model at the same time. Existing studies have not analysed these new factors, considering only local queries within the same machine or the “best-case” scenario with only one remote user. It is important to stress-test these networked services, as solutions may exhibit various issues in high-load situations.

In this empirical study, we will evaluate the impact of several design decisions in the remote model querying services offered by multiple existing solutions (CDO, Hawk and Mogwaï). While these tools have different goals in mind, they all offer this same functionality, and they all had to choose a particular network protocol, messaging style, caching/indexing style, query language and persistence mechanism. The results of this study aim to inform developers and end users of future remote model querying services on the trade-offs between these choices.

This paper is an extended version of our prior conference work [14], which discussed a smaller study with fewer tools, queries and research questions. The new contributions of this paper are:

  • An updated and extended discussion of the state of the art, with recent works on prefetching, partial loading and non-relational model stores.

  • A largely expanded experimental design, testing four additional tool configurations (Hawk with Neo4j/EPL, Hawk with OrientDB/EOL, Hawk with OrientDB/EPL and Mogwaï), new and revised queries for the GraBaTs’09 case study and a new case study based on the queries from the Train Benchmark by  Szárnyas et al. [32]. The previous research question on the impact of the internals of the tools (RQ3) has been refined into multiple research questions.

  • A revamped and expanded results section, with a stronger focus on statistical tests in order to manage the much larger volume of data in this work. Only the results from RQ2 have remained intact, since the APIs for CDO and Hawk have not changed.

  • A revised set of conclusions, taking into account the more nuanced results produced by the Train Benchmark case study.

The rest of this work is structured as follows: Sect. 2 provides a discussion on existing work on model stores, Sect. 3 introduces the research questions and the design of the experiment, Sect. 4 discusses the obtained results, and Sect. 5 presents the conclusions and future lines of work.

2 Background and related work

Persisting and managing large models has been extensively investigated over the past decade. This section presents the main state-of-the-art tools and technologies, with a focus on the tools used in this empirical study.

2.1 File-based model persistence

One of the most common formats for storing models is files containing a serialized model representation. Tools like the Eclipse Modeling Framework (EMF) [31], ModelCVS [23], ModelioFootnote 6 and MagicDrawFootnote 7 all use XML-based model serialization. StarUMLFootnote 8 stores models in JSON. To improve performance, many tools offer binary formats as well: this is the case for EMF, for instance.

Files are easy to deploy and use, and many tools (e.g. EMF) default to using a one-file-per-model approach. However, storing one model per file impacts scalability negatively as shown in [2, 17]. In this case, even a simple query or a small change requires loading the entire model in memory at once: this is impractical for large models. Recent work by Wei et al. [33] demonstrated a specialization of the EMF XMI parser which can load only the subset required by the query to be run: while this reduced loading times and memory usage, changes in the partially loaded models cannot be saved back without losing information.

These limitations in scalability suggest that it could be beneficial to break up large models into smaller units (or “fragments”) to allow for on-demand partial loading. Modelio does this by default in recent versions: for instance, each UML class is stored in a separate file, and links between files are resolved through a purpose-built index. For EMF-based models, the EMF-Splitter framework by Garmendia et al. [15] can take a metamodel annotated with modularity information and produce editors that produce fragmented XMI-based models natively. Nevertheless, in a worst-case scenario, certain types of queries (e.g. a query that looked for all instances of a type) could still require loading the full set of fragments.
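The trade-off between on-demand loading and the worst case can be illustrated with a minimal Python sketch. The `FragmentStore` class, the fragment names and the dictionary-based elements are hypothetical and do not correspond to the actual Modelio or EMF-Splitter formats: a lookup that targets one fragment loads only that fragment, whereas an all-instances query without an index has to load every fragment.

```python
class FragmentStore:
    """Toy fragmented model: each fragment is loaded only on first access."""

    def __init__(self, fragments):
        # fragments: name -> list of elements, standing in for files on disk
        self._on_disk = fragments
        self._loaded = {}

    def load(self, name):
        # Simulates parsing a single fragment file on demand.
        if name not in self._loaded:
            self._loaded[name] = list(self._on_disk[name])
        return self._loaded[name]

    def loaded_count(self):
        return len(self._loaded)

    def all_instances(self, type_name):
        # Without an index, every fragment has to be loaded and scanned.
        return [e for name in self._on_disk
                for e in self.load(name) if e["type"] == type_name]


store = FragmentStore({
    "Account.frag": [{"type": "Class", "name": "Account"}],
    "Customer.frag": [{"type": "Class", "name": "Customer"}],
    "Billing.frag": [{"type": "Package", "name": "billing"}],
})

store.load("Account.frag")
loaded_after_lookup = store.loaded_count()   # a single fragment is in memory

classes = store.all_instances("Class")
loaded_after_query = store.loaded_count()    # every fragment is in memory
```

In this sketch, the targeted lookup keeps two of the three fragments on disk, while the type-wide query ends up loading all of them.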

2.2 Database-backed model persistence

In light of the scalability limitations resulting from storing models as text files, various database-backed model persistence formats have been proposed. Database persistence allows for partial loading of models as only accessed elements have to be loaded in each case. Furthermore, such technologies can leverage database indices and caches for improving element lookup performance as well as query execution time.

Most of these database-backed solutions store each object as its own database entity (e.g. row, document or graph node). This is the case for Teneo/Hibernate,Footnote 9 one of the first object-relational mappings (ORMs) for EMF models. More recent systems which store models in databases rely on NoSQL technologies to take advantage of their flexible schema-free storage and/or quick reference navigation, such as MongoEMFFootnote 10 (based on the MongoDB document store) or NeoEMF [17]. NeoEMF in particular implements a multi-backend solution: NeoEMF/Graph uses graph-based databases (Neo4jFootnote 11 in particular), NeoEMF/Map uses file-backed maps (as implemented by MapDB)Footnote 12, and NeoEMF/HBase uses HBaseFootnote 13 distributed stores.

However, there are also approaches that operate at the fragment level: this is the case for EMF-Fragments by Scheidgen [29]. In this tool, the model is broken up along the EMF containment references that have been marked to be “fragmenting”, and these fragments are addressable through a key-value store. The EMF-Fragments tool supports both MongoDB and HBase. Users can choose how to represent each inter-object reference in the metamodel: these can be kept as part of the source object (as usual in EMF XMI-based persistence) or separately from it (as usual in database-backed persistence).

For most of these database-backed solutions, querying is an orthogonal concern: existing query languages can be used, but the languages will not be able to leverage the underlying data structures to optimize certain common cases (e.g. OCL’s “Type.allInstances()”) or avoid constructing intermediate objects in memory. Mogwaï is a model query framework that tackles this issue for models stored in NeoEMF/Graph, translating OCL queries to Tinkerpop Gremlin through ATL and reporting reductions in execution time of up to a factor of 20 [11].

2.3 Model repositories

When collaborative modelling is involved, simply storing models in a scalable form such as inside a database stops being sufficient; in this case, issues such as collaborative access and versioning need to also be considered. Examples of model repository tools are Morsa [26], ModelCVS,Footnote 14 Connected Data Objects (CDO), EMFStore [20], Modelio, MagicDraw and MetaEdit\(+\).Footnote 15 Model repositories allow multiple developers to manage models stored in a centralized repository by ensuring that models remain in a consistent state, while persisting them in a scalable form, such as in a database.

CDO in particular is one of the most mature solutions, having been developed since 2009 as an Eclipse project and being currently maintained by Obeo.Footnote 16 It implements a pluggable storage architecture that enables it to use various solutions such as relational databases (H2, MySQL) or document-oriented databases (MongoDB), among others. CDO includes Net4j, a messaging library that provides bidirectional communication over TCP, HTTP and in-memory connections, and uses it to provide an API that exposes remote models as EMF resources. In addition to storing models, CDO includes a CDOQuery API that makes it possible to run queries remotely on the server, reducing the necessary bandwidth.

2.4 Heterogeneous model indexing

An alternative to using model repositories for storing models used in a collaborative environment is to store them as file-based models in a classical version control system, ideally in a fragmented form. As discussed by Barmpis et al. [4], this approach leverages the benefits of widely used file-based version control systems such as SVN and Git, but retains the issues file-based models face (Sect. 2.1). To address this issue, a model indexer can be introduced that monitors the models and mirrors them in a scalable model index. The model index is synchronized with the latest version of the models in the repository and can be used to perform efficient queries on them, without having to check them out locally or load them into memory.

One example of such a technology is Hawk.Footnote 17 Hawk can maintain a graph database which mirrors the contents of the models stored in one or more version control repositories and perform very efficient queries on them. Hawk can be used as a Java library, as a set of plug-ins for the Eclipse IDE or as a network service through an Apache ThriftFootnote 18-based API.

Hawk can be extended to add support for various file formats, storage backends and query languages. As part of the integration efforts with the Softeam Modelio and Constellation products [13], two new components were added: a model parser for Modelio EXML/RAMC files, and a storage backend based on OrientDB. OrientDB is an open-source multi-paradigm database engine which can operate as a key-value store, as a document database or as a graph database. While studies from 2014 showed that OrientDB had lower performance than Neo4j for model querying [2], its relative performance with regard to Neo4j has improved since then, and its more permissive licence makes it more appealing to industrial users (ASL2.0 instead of Neo4j’s GPLv3).

3 Experiment design

As mentioned in the introduction, once we have scalable modelling and scalable querying, the next problem to solve is how to share those huge models across the organization. Exposing them through a model querying service over the network is convenient, as the service can provide answers without waiting for the model itself to be transferred. However, the design and implementation of the service is not trivial, and the underlying implementation may not react well to serving multiple concurrent clients.

This section presents the design of an empirical study that evaluates the impact of several factors in the performance of the remote model querying services of multiple tools: a model repository (CDO), several configurations of a model index (Hawk with Neo4j/OrientDB backends and EOL/EPL queries) and a database-backed model storage layer (NeoEMF). By studying the performance of these queries, we will be evaluating the responsiveness of the underlying tools with increasing levels of demand and how their different layers interact with each other.

3.1 Research questions


RQ1. What is the impact of the network protocol on remote query times and throughputs?

In order to connect to a remote server, two of the most popular options are using raw TCP connections (for the sake of performance and flexibility) or sending HTTP messages (for compatibility with web browsers and interoperability with proxies and firewalls). Both Hawk and CDO support TCP and HTTP. Since NeoEMF did not officially have a remote querying API at the time of writing this paper, it was extended by the authors with TCP and HTTP-based APIs implemented in the same way as Hawk’s.

Properly configured HTTP servers and clients can reuse the underlying TCP connections through HTTP/1.1 persistent connections and avoid repeated handshakes, but the additional overhead imposed by the HTTP fields may still impact the raw performance of the tool.


RQ2. What is the impact of the design of the remote query API on remote query times and throughputs?

Application protocols for network-based services can be stateful or stateless. Stateful protocols require that the server keeps track of part of the state of the client, while stateless protocols do not have this requirement. In addition, the protocol may be used mostly for transporting opaque blocks of bytes between server and client, or it might have a well-defined set of operations and messages.

While a stateful protocol may be able to take advantage of the shared state between the client and server, a stateless protocol is generally simpler to implement and use. Service-oriented protocols need to also take into account the granularity of each operation: “fine” operations that do only one thing may be easier to recombine, but they will require more invocations than “coarse” operations that perform a task from start to finish. One example of a fine operation could be fetching a single model element by ID. A coarse operation would be running an entire query in the server and retrieving the results.
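The difference in the number of invocations can be made concrete with a small Python sketch. The `ToyModelServer` class is a hypothetical in-memory stand-in for a remote service and is not the API of any of the tools under study; in particular, a real service would receive the query as text rather than as a Python function. Each method call counts as one request/response pair.

```python
class ToyModelServer:
    """In-memory stand-in for a remote model server that counts invocations."""

    def __init__(self, elements):
        self._elements = elements  # id -> attribute dictionary
        self.invocations = 0

    def list_ids(self):
        # Fine-grained operation: returns the element identifiers only.
        self.invocations += 1
        return list(self._elements)

    def fetch(self, element_id):
        # Fine-grained operation: returns a single element by ID.
        self.invocations += 1
        return self._elements[element_id]

    def query(self, predicate):
        # Coarse-grained operation: runs the whole query on the server side.
        self.invocations += 1
        return [i for i, e in self._elements.items() if predicate(e)]


def is_long(element):
    return element["length"] > 10


# Five elements with lengths 5, 10, 15, 20 and 25
elements = {i: {"length": i * 5} for i in range(1, 6)}

fine = ToyModelServer(elements)
fine_result = [i for i in fine.list_ids() if is_long(fine.fetch(i))]

coarse = ToyModelServer(elements)
coarse_result = coarse.query(is_long)
```

Both clients obtain the same identifiers, but the fine-grained client needs one invocation to list the identifiers plus one per element, while the coarse-grained client needs a single invocation regardless of model size.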

CDO implements a stateful protocol on top of the Net4j library, which essentially consists of sending and receiving buffers of bytes across the network. On the other hand, Hawk and our extended version of NeoEMF implement a stateless service-oriented API on top of the Apache Thrift library, exposing a set of specific operations (e.g. “query”, “send object” or “register metamodel”). The Hawk API supports both fine- and coarse-grained operations (fetching single elements or running queries), whereas the Mogwaï API only supports running entire queries. Invoking a query for Hawk and Mogwaï only requires one pair of HTTP request/response messages.

While the stateful CDO clients and servers may cooperate better with each other, the simpler and coarser APIs in Hawk and Mogwaï may reduce the total network round-trip time for a query by exchanging fewer messages.


RQ3. What is the impact of the internal caching and indexing mechanisms on remote query times and throughputs?

Database-backed systems generally implement various caching strategies to keep the most frequently accessed data in memory, away from slow disk I/O. At the very least, the DBMS itself will generally keep its own cache, but the system might use additional memory to cache especially important subsets or to keep them in a form closer to how it is consumed.

Another common strategy is to prepare indices in advance, speeding up particular types of queries. DBMSs already provide indices for common concepts such as primary keys and unique values, but these systems may add their own application-specific indices that precompute parts of the queries to be run.
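As an illustration of an application-specific index, the following Python sketch precomputes a lookup table from type names to element identifiers, so that an “all instances of a type” query becomes a dictionary lookup instead of a full scan. The element representation and function names are assumptions made for this example, and the sketch is not how any of the studied tools implement their indices.

```python
from collections import defaultdict

elements = [
    {"id": 1, "type": "TypeDeclaration"},
    {"id": 2, "type": "MethodDeclaration"},
    {"id": 3, "type": "MethodDeclaration"},
    {"id": 4, "type": "Modifier"},
]


def all_instances_scan(elements, type_name):
    # Baseline: visits every element each time the query is run.
    return [e["id"] for e in elements if e["type"] == type_name]


def build_type_index(elements):
    # Built once in advance; it has to be updated whenever the model changes.
    index = defaultdict(list)
    for e in elements:
        index[e["type"]].append(e["id"])
    return index


type_index = build_type_index(elements)
scanned = all_instances_scan(elements, "MethodDeclaration")
looked_up = type_index["MethodDeclaration"]
```

The index trades additional storage and maintenance work on every model change for a cheaper query, which is the kind of trade-off this research question examines.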


RQ4. What is the impact of the mapping from the queries to the backend on remote query times?

Remote query APIs are usually bound to certain model querying languages: CDO embeds an OCL interpreter, Hawk has the Epsilon languages, and NeoEMF translates a subset of OCL to Gremlin through Mogwaï. Once the query is written, it has to be run by a query engine against the chosen backend.

The interactions between the query language, the engine and the underlying backend need to be analysed. Declarative query languages delegate more work to the query engine, whereas imperative query languages rely on the user to fine-tune accesses. Query engines have to map the query into an efficient use of the backend. In some cases, there may be useful features in a backend that are not made available to users, whether due to a limitation in the mapping of the query engine, or to the lack of a matching concept in the query.


RQ5. Do graph-based tools scale better against demand than tools that store models in relational databases?

Various authors (including the authors of this paper) have previously reported considerable performance gains when running single queries on graph-based solutions when compared to solutions backed by relational databases or flat files. It may seem that graph databases are always the better choice, but they have been around for less time than relational approaches and usually require more fine-tuning to achieve the ideal performance. This question will focus on whether this advantage is common across graph-based tools and whether it extends to situations with very high levels of demand.

Fig. 1 Network diagram for the experimental setup

3.2 Experiment setup

In order to provide answers for the above research questions, a networked environment was set up to emulate increasing numbers of clients interacting with a model repository (CDO 4.4.1.v20150914-0747), a model index (Hawk) or a graph-based model persistence layer (NeoEMF on commit 375e077 combined with Mogwaï on commit 543fec9), and to collect query response times. The environment is outlined in Fig. 1 and consists of the following:

  • One “Controller” machine that supervises the other machines through SSH connections managed with the Fabric Python library.Footnote 19 It is responsible for starting and stopping the client and server processes, monitoring their execution and collecting the measured values. It does not run any queries itself, so it has no impact on the obtained results.

  • Two “Client” machines that invoke the queries on the server, fetch the results and measure query response times. The two client machines were running Ubuntu Linux 14.04.3, Linux 3.19.0-47-generic and Oracle Java 8u60 on an Intel Core i5 650 CPU, 8GiB of RAM and a 500 GB SATA3 hard disk.

    The client machines had three client programs installed: one for CDO, one for Hawk and one for Mogwaï/NeoEMF. Only one of these programs ran at a time. Each of these programs received the address of the server to connect to, the size of the Java fixed thread pool to be used, the number of queries to be distributed across these threads and the query to be run. The clients sent their queries to the server and simply waited to receive the response from the server: they did not fetch model elements directly.Footnote 20

  • One “Server” machine that hosts the CDO model repository, the Hawk model index and the NeoEMF model store, and provides TCP and HTTP ports exposing the standard CDO and Hawk APIs for remote querying and a small proof of concept API for NeoEMF/Mogwaï. The server machine had the same configuration as the client machines. The server waits to receive a query and runs it locally through an embedded database and then replies back with the identifiers of the matching model elements.

    The server machine also had three server programs installed: one for CDO, one for Hawk and one for Mogwaï/NeoEMF. Again, only one of these programs ran at a time. All server programs were Eclipse products based on Eclipse Mars and used the same embedded HTTP server (Eclipse Jetty 9.2.13). All systems were configured to use up to 4096MB of memory (-Xmx4096m -Xms2048m).Footnote 21

    In particular, the CDO server was based on the standard CDO server product, with the addition of the experimental HTTP Net4j connector. No other changes were made to the CDO configuration. The CDO DB Store storage component was used in combination with the default H2 database adapter. DB Store was the most mature and feature-complete option at the time of writing.Footnote 22

  • One 100Mbps network switch that connected all machines together in an isolated local area network.

As the study was intended to measure query performance results with increasing numbers of concurrent users, the client programs were designed to first warm up the servers into a steady state. Query time was measured as the time required to connect to the server, run the query on the server and retrieve the model element identifiers of the results over the network. Queries would be run 1000 times in all configurations, to reduce the impact of variations due to external factors (CPU and I/O scheduling, Java just-in-time recompilation, disk caches, virtual memory, and so on).

Fig. 2 Relevant excerpt of the JDTAST metamodel for the GraBaTs’09 queries

Several workloads were defined. The lightest workload used only 1 client machine with 1 thread sending 1000 queries to the server in sequence. The other workloads used 2 client machines generating load at the same time using a producer/consumer design where the producer thread would queue 500 query invocations, and \(t \in \{2,4,8,16,32\}\) consumer threads (client threads) would execute them as quickly as possible. For instance, with 2 client threads in a machine, each thread would be expected to execute approximately 250 invocations: the exact number might slightly vary due to differences in execution time across invocations. These workloads could therefore simulate between 1 (1 machine with 1 client thread) and 64 (2 machines at the same time, with 32 client threads each) concurrent clients.
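The workload design above can be sketched in Python as follows. This is only an analogue of the Java fixed thread pool described earlier, not the actual client code: `run_workload`, `concurrent_clients` and `fake_query` are hypothetical names, and the placeholder query simply sleeps instead of contacting a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_workload(run_query, invocations, threads):
    """Runs `invocations` calls of `run_query` on a fixed pool of `threads`
    workers and returns the measured time of each call in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        run_query()
        return time.perf_counter() - start

    # The executor's internal queue plays the role of the producer:
    # all invocations are queued, and idle workers pick them up.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(timed_call, range(invocations)))


def concurrent_clients(machines, threads_per_machine):
    return machines * threads_per_machine


def fake_query():
    time.sleep(0.001)  # placeholder for connecting and running a remote query


times = run_workload(fake_query, invocations=50, threads=4)

# 1 machine with 1 thread, then 2 machines with 2, 4, 8, 16 and 32 threads each
workloads = [concurrent_clients(1, 1)] + [
    concurrent_clients(2, t) for t in (2, 4, 8, 16, 32)
]
```

With these parameters, `workloads` covers 1, 4, 8, 16, 32 and 64 simulated concurrent clients, matching the range stated above. Because idle workers take the next queued invocation, faster threads naturally execute slightly more invocations than slower ones.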

3.3 Queries under study

After defining the research questions and preparing the environment, the next step was to populate CDO, Hawk and Mogwaï/NeoEMF with the contents to be queried, and to write equivalent queries in their supported query languages. Two use cases were considered, each with their own sets of queries: one related to reverse engineering existing Java code, and one related to pattern matching in railway models.

3.3.1 Singletons in Java models: GraBaTs’2009 queries

The first use case, SharenGo Java Legacy Reverse-Engineering,Footnote 23 was based on MoDisco and was originally presented at the GraBaTs 2009 tool contest [18]. It has been widely used for research in scalable modelling [2, 5, 8, 26], as it provides a set of models reverse-engineered from increasingly large open-source Java codebases. The largest codebase in the case study was selected, covering all the org.eclipse.jdt projects and producing over 4.9 million model elements. CDO required 1.4 GB to store the model, Hawk required 2.0 GB with Neo4j and 3.7 GB with OrientDB, and NeoEMF required 6.0 GB.

These model elements conformed to the Java Development Tools AST (JDTAST) metamodel. Some of the types within the JDTAST metamodel include the TypeDeclarations that represent Java classes and interfaces, the MethodDeclarations that represent Java methods, and the Modifiers that represent Java modifiers on the methods (such as static or public). The relevant excerpt of the metamodel is shown in Fig. 2.

Based on these types, task 1 in the GraBaTs 2009 contest required defining a query (from now on referred to as the GraBaTs query) that would locate all possible applications of the Singleton design pattern in Java [30]. In other words, it would have to find all the TypeDeclarations that had at least one MethodDeclaration with public and static modifiers that returned an instance of the same TypeDeclaration.
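The condition checked by the GraBaTs query can be paraphrased with a short Python sketch over plain dictionaries. The data layout and the example type names are assumptions made for illustration; the actual queries operate on JDTAST model elements through OCL or EOL.

```python
def is_singleton(type_decl):
    """True if some method is public, static and returns the declaring type."""
    return any(
        {"public", "static"} <= set(m["modifiers"])
        and m["return_type"] == type_decl["name"]
        for m in type_decl["methods"]
    )


types = [
    {"name": "Registry", "methods": [
        {"name": "getInstance", "modifiers": ["public", "static"],
         "return_type": "Registry"},
    ]},
    {"name": "Parser", "methods": [
        # Public but not static, and returns a different type
        {"name": "parse", "modifiers": ["public"], "return_type": "Node"},
        # Static and returns its own type, but not public
        {"name": "create", "modifiers": ["private", "static"],
         "return_type": "Parser"},
    ]},
]

singletons = [t["name"] for t in types if is_singleton(t)]
```

Only `Registry` satisfies all three conditions on the same method, so it is the only candidate reported by this sketch.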

To evaluate CDO, the GraBaTs query was written in OCL as shown in Listing 1. The query (named OQ after “OCL query”) filters the TypeDeclarations by iterating through their MethodDeclarations and their respective Modifiers.

To evaluate Hawk, we used the three EOL implementations of the GraBaTs query of our previous work [3]. The first version of the query (“Hawk query 1” or HQ1, shown in Listing 2) is a translation of OQ to EOL and follows the same approach.

The second version (HQ2), shown in Listing 3, makes use of the derived attributes on MethodDeclarations: isStatic (the method has a static modifier), isPublic (the method has a public modifier), and isSameReturnType (the method returns an instance of its TypeDeclaration). A detailed discussion about how derived attributes are declared in Hawk and how they are incrementally re-computed upon model changes is available in our previous works [3, 4].

The third version (HQ3), shown in Listing 4, uses the same derived attributes but starts off from the MethodDeclarations so Hawk can take advantage of the fact that derived attributes can also be indexed, replacing iterations by lookups and noticeably speeding up execution.

The fourth version (HQ4), shown in Listing 5, assumed instead that Hawk extended TypeDeclarations with the isSingleton derived attribute, setting it to true when the TypeDeclaration has a static and public MethodDeclaration returning an instance of itself. This derived attribute eliminates one more level of iteration, so the query only goes through the TypeDeclarations.
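The progression from HQ2 to HQ4 can be loosely mirrored in a Python sketch over a flat list of methods. This is an analogy under simplifying assumptions rather than a translation of the EOL queries: the data layout is hypothetical, the derived attributes are computed eagerly in a loop instead of being maintained incrementally by Hawk, and the index is a plain dictionary.

```python
methods = [
    {"owner": "Registry", "modifiers": {"public", "static"}, "return_type": "Registry"},
    {"owner": "Parser", "modifiers": {"public"}, "return_type": "Node"},
    {"owner": "Cache", "modifiers": {"public", "static"}, "return_type": "Cache"},
]

# Derived attributes are computed when the model is indexed, not per query.
for m in methods:
    m["isStatic"] = "static" in m["modifiers"]
    m["isPublic"] = "public" in m["modifiers"]
    m["isSameReturnType"] = m["return_type"] == m["owner"]

# HQ2-style: still iterates over the methods, but reads precomputed flags.
hq2_like = sorted({m["owner"] for m in methods
                   if m["isStatic"] and m["isPublic"] and m["isSameReturnType"]})

# HQ3-style: an index on a derived attribute replaces iteration by a lookup.
by_same_return_type = {}
for m in methods:
    by_same_return_type.setdefault(m["isSameReturnType"], []).append(m)
hq3_like = sorted({m["owner"] for m in by_same_return_type.get(True, [])
                   if m["isStatic"] and m["isPublic"]})

# HQ4-style: the whole condition is stored as one derived attribute per type.
is_singleton = {m["owner"]: False for m in methods}
for m in methods:
    if m["isStatic"] and m["isPublic"] and m["isSameReturnType"]:
        is_singleton[m["owner"]] = True
hq4_like = sorted(owner for owner, flag in is_singleton.items() if flag)
```

All three variants return the same types; what changes is how much of the work is moved from query time to indexing time.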

Fig. 3 Containment hierarchy and references of the metamodel of the Train Benchmark [32]

The query for Mogwaï (MQ) is shown in Listing 6. Ideally, we would have used the same OCL query in CDO and in Mogwaï, but unfortunately CDO OCLQuery only accepts raw expressions and Mogwaï only accepts constraints within packages and contexts. Additionally, there are limitations in Mogwaï's implementation (particularly, the ATL transformation from OCL to the Gremlin API) that require making small changes in the queries. For instance, the OCL translator in Mogwaï does not support the Eclipse OCL-specific “selectByKind” operation, and additional type conversions are needed.

The GraBaTs query has been translated to one OCL query for CDO (OQ), one OCL query for Mogwaï (MQ) and four possible EOL queries for Hawk (HQ1 to HQ4). It must be noted that since CDO and Mogwaï/NeoEMF do not support derived attributes like Hawk, it was not possible to rewrite OQ or MQ in the same way as HQ2 to HQ4. Since the same query would be repeatedly run in the experiments, the authors inspected the code of CDO, Mogwaï and Hawk to ensure that none of the tools cached the results of the queries themselves: this was verified by re-running the queries while adding unique trivially true conditions, and comparing execution times.

3.3.2 Railway model validation: Train Benchmark queries

To improve the external validity of the answers for the research questions in Sect. 3.1, a second case study with a wider assortment of queries was needed. For this purpose, it was decided to use some of the queries and models from the Train Benchmark by Szárnyas et al. [32].

The Train Benchmark (TB) was originally developed within the MONDO EU FP7 project on scalability in model-driven engineering, in order to compare the querying performance of various technical solutions with regard to model validation. The original benchmark divided execution into four stages (read, check, edit and re-check), and tested two scenarios: a batch scenario with only read and check, and an incremental scenario with all four stages. Since the focus of the present study is on scalability to user demand rather than reacting to changes, only the batch scenario will be adopted.

The queries on the TB operate on domain-specific models of railway systems: the containment hierarchy and references of the underlying metamodel are shown in Fig. 3. The RailwayContainer acts as the root of the model, which contains Routes and the Semaphores between them. A Route is formed of two or more Sensors which monitor Switches and Segments. The Switches have a particular SwitchPosition for each Route.

Based on this metamodel, the TB includes automatic model generators that can produce synthetic models of arbitrary size by producing a random number of small fragments and reconnecting them in a random manner. After this, a small portion of the elements (<1%) is modified to produce the validation errors that will be detected by the later queries. The benchmark includes generators for both the repair case (the edit stage corrects validation errors) and the inject case (edit introduces validation errors).

For the present study, the repair generator was used. Multiple models were generated during initial experiments (with between 1418 and 3,355,152 elements) by varying the size parameter of the generator between 1 and 2048. However, some of the queries were too slow on CDO and Mogwaï for stress testing, and a medium-sized model had to be selected (size = 32, with 49,334 elements). In general, the simplicity of the TB metamodel ensures that queries access larger portions of the model than the GraBaTs queries in Sect. 3.3.1, and some of the queries perform more complex pattern matching as well.

Fig. 4 Train Benchmark ConnectedSegments query. a Original OCL, b OCL query for Mogwaï, c EOL, d EPL

Fig. 5 Train Benchmark PosLength query. a Original OCL, b OCL for Mogwaï, c EOL, indexed “length”, d EPL, indexed “length”

Fig. 6 Train Benchmark RouteSensor query. a Original OCL, b OCL for Mogwaï, c EOL, d EPL

Fig. 7 Train Benchmark SemaphoreNeighbor query. a Original OCL, b OCL for Mogwaï, c EOL, uses reverse reference navigation (“revRefNav_”), d EPL, uses reverse reference navigation (“revRefNav_”)

Fig. 8 Train Benchmark SwitchMonitored query. a Original OCL, b OCL for Mogwaï, c EOL, with derived “isMonitored”, d EPL, with derived “isMonitored” reverse reference navigation (“revRefNav_”)

The present study used the OCL versions of the TB queries for CDO and Mogwaï, with some adjustments in the case of Mogwaï. For Hawk, the OCL queries were translated to the Epsilon Object Language (EOL), optimized for Hawk features and then further translated to the Epsilon Pattern Language (EPL). EPL [22] is a specialization of EOL, providing a more declarative and readable syntax for graph pattern matching in models. The queries look for violations of various well-formedness constraints:

  • ConnectedSegments (CS): each Sensor must have 5 or fewer Segments connected to it. The queries in Fig. 4 find Sensors that are monitoring a sequence of 6 or more Segments.

    The Mogwaï version is similar to the original OCL one, but it can only return the sixth TrackElement that produces the violation. The original OCL query packed all the participants in each match into a list of tuples, but Mogwaï queries can only return a flat list of individual model elements. Matching a sequence of six consecutive elements is rather awkward, requiring many nested repetitions of select and collect. EOL has a similar readability issue, but EPL has a much cleaner syntax for this sort of graph matching problem.

    The EOL and EPL versions filter the Sensors by taking advantage of a derived attribute, “nMonitoredSegments”, defined through the EOL expression “self.monitors.size” (where “self” takes each of the Sensors as a value). This reduces the problem to a lookup of the relevant Sensors and a quick pattern matching to find the offending sixth Segment.

  • PosLength (PL): Segments must have positive length. The queries in Fig. 5 find Segments that have zero or negative length.

    In this case, Hawk can be told to index the “length” attribute of Segment in advance to jump directly to the relevant elements.

  • RouteSensor (RS): Sensors associated with a Switch that belongs to a Route must also be associated with the same Route. The queries in Fig. 6 find Sensors that are connected to a Switch, but the Sensor and the Switch are not connected to the same Route.

  • SemaphoreNeighbor (SN): the exit Semaphore of a Route must be the entry Semaphore of the Route that it connects to. The queries in Fig. 7 find Routes that are reachable from another Route but do not have their Semaphores as entry point.

    There is an important difference between the original OCL and the EOL/EPL versions: Hawk can traverse a reference “x” in reverse by using “revRefNav_x”, since Neo4j and OrientDB edges are navigable in both directions. This allows the query to be written more efficiently, without the nested “Route.allInstances” that was required by the OCL version.

  • SwitchMonitored (SM): every Switch must be monitored by a Sensor. The queries in Fig. 8 find Switches that are not being monitored.

    The EOL/EPL variants use a derived attribute “isMonitored” on every Switch, defined as “not self.monitoredBy.isEmpty()”.

  • SwitchSet (SS): the entry Semaphore of a Route can only show “GO” if all Switches along the Route are in the same position. The queries in Fig. 9 find Switches that do not have the right position.

    In this case, there is only a minor change due to the fact that in Hawk, enumerated values are stored as simple strings.
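
The benefit of reverse reference navigation described for SemaphoreNeighbor can be illustrated with a minimal sketch. It is not Hawk code: the route and sensor names are hypothetical, and a plain dictionary stands in for the bidirectional edges of the graph database. The first function mimics a forward-only query that has to scan every Route, while the second follows the reference from target to source.

```python
from collections import defaultdict

# Hypothetical forward references: each route lists the sensors it contains.
routes = {
    "route1": ["sensorA", "sensorB"],
    "route2": ["sensorC", "sensorD"],
}


def routes_of_by_scan(sensor):
    """Forward-only navigation: check every route, as a nested allInstances would."""
    return [route for route, sensors in routes.items() if sensor in sensors]


# Stand-in for bidirectional edges: sensor -> routes that reference it.
reverse_refs = defaultdict(list)
for route, sensors in routes.items():
    for sensor in sensors:
        reverse_refs[sensor].append(route)


def routes_of_by_reverse(sensor):
    """Reverse navigation: follow the reference from the target back to its sources."""
    return reverse_refs.get(sensor, [])
```

Both functions return the same routes, but the scan visits every Route for each Sensor it is asked about, whereas the reverse lookup touches only the edges of that Sensor.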

Fig. 9 Train Benchmark SwitchSet query. a Original OCL, b OCL for Mogwaï, c EOL, d EPL

4 Results and discussion

The previous section described the research questions to be answered, the environment that was set up for the experiment and the queries to be run. This section will present the obtained results, answer the research questions (with the help of additional data in some cases) and discuss potential threats to the validity of the work. The raw data and all related source code supporting these results are available from the Aston Data Explorer repository.Footnote 24

4.1 Measurements obtained

The median execution times (in milliseconds) and coefficients of dispersion over 1000 executions of the GraBaTs’09 queries from Sect. 3.3.1 are shown in Table 1. Likewise, the results for the Train Benchmark queries from Sect. 3.3.2 are shown in Tables 2, 3 and 4. To save space, Hawk with the Neo4j backend is abbreviated to “Hawk/N” and “H/N”. Likewise, Hawk with the OrientDB backend is shortened to “Hawk/O” and “H/O”. These abbreviations will be used throughout the rest of the paper as well.

Medians were picked as a measure of centrality due to their robustness against the occasional outliers that a heavily stressed system can produce. Coefficients of dispersion are dimensionless measures of dispersion that can be used to compare data sets with different means: they are defined as \(\tau /\eta \), where \(\tau \) is the mean absolute deviation from the median \(\eta \). Coefficients of dispersion are robust to non-normal distributions, unlike the better known coefficients of variation [6]. The tables allow for quick comparison of performance levels across tools, queries and number of client threads. Nevertheless, more specific visualizations and statistical analyses will be derived for some of the following research questions.
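
As an illustration of the definition above, the coefficient of dispersion can be computed as in the following minimal sketch. The function name and the sample values are assumptions made for this example and are not taken from the study's data.

```python
from statistics import median


def coefficient_of_dispersion(times):
    """Return tau / eta: the mean absolute deviation from the median, over the median."""
    eta = median(times)
    if eta == 0:
        raise ValueError("median is zero; the coefficient of dispersion is undefined")
    tau = sum(abs(t - eta) for t in times) / len(times)
    return tau / eta


# Hypothetical execution times in milliseconds, including one outlier.
sample_ms = [100, 102, 98, 101, 99, 500]
cd = coefficient_of_dispersion(sample_ms)
```

The single outlier in the sample raises the dispersion but leaves the median almost unchanged, which is the robustness property that motivated the choice of medians.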

Table 1 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the GraBaTs’09 queries, by tool, language, protocol and client threads
Table 2 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries ConnectedSegments and PosLength, by tool, language, protocol and client threads
Table 3 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries RouteSensor and SwitchMonitored, by tool, language, protocol and client threads
Table 4 Median execution times in milliseconds and coefficients of dispersion over 1000 executions of the Train Benchmark queries SemaphoreNeighbor and SwitchSet, by tool, language, protocol and client threads

One important detail is that SemaphoreNeighbor was not fully run through CDO and Mogwaï, as it runs too slowly to allow for stress testing. More specifically, with only 1 client thread over TCP, the median time for the first 10 runs of SN was 88.44 seconds for CDO and 305.19 seconds for Mogwaï. For this reason, only Hawk was fully evaluated regarding SN.

The next step was to check whether the execution times belonged to a normal distribution for the sake of analysis. Shapiro–Wilk testsFootnote 25 rejected the null hypothesis (“the sample comes from a normal distribution”) with p values \(<0.01\) for almost all of the combinations of query, tool, language, protocol and thread count: only 11 out of 516 tested combinations reported p values \(\ge 0.01\). In order to visualize how they deviated from a normal distribution, further manual inspections with quantile–quantile plots were conducted. These confirmed that most distributions tended to be either heavy-tailed, bimodal, multimodal, or curved.

This is somewhat surprising, as the natural intuition is that execution times should follow a normal distribution: 90% of the Java benchmarks conducted by Georges et al. [16] with single-core processors did follow a Gaussian distribution according to Kolmogorov–Smirnov tests. At the same time, 10% of those benchmarks were not normally distributed (being reportedly skewed or bimodal), and modern machines with multi-core processors have only grown more non-deterministic since then. More recently, Chen et al. [9] concluded that execution times for multithreaded loads in modern multi-core machines follow neither normal nor log-normal distributions and that more robust nonparametric methods are needed for performance comparison. Our study in particular involves 3 machines communicating over Ethernet and doing heavy disk-based I/O: even with 1 thread per client machine, the server will experience a non-deterministic multithreaded load like the one studied by Chen et al. For these reasons, the rest of this paper will assume that the query execution times are not normally distributed and will use nonparametric tests.

Some of the configurations had intermittent issues when running queries. This was another goal of our stress testing: finding if the different tools would fail with increased demand and if they could recover from these errors (which they did by themselves). Table 5 shows the configurations that produced server errors, and Table 6 shows the configurations that reported the wrong number of results. The “correct” number of results is computed by running each query across all tools in local installations, without the risk of the network or the stress test influencing the result, and ensuring they all report equivalent results. In our previous conference paper [14], incorrect executions only happened for the CDO HTTP API, but a similar issue exists even in the TCP API for some of the Train Benchmark queries. Hawk over Neo4j and Mogwaï were the only combinations of tool and backend that did not report failed or incorrect executions.

Regarding the failed and incorrect executions of CDO, at this early stage of the study we could only treat it as a black box, as we were merely users of this tool and not its developers. However, our analysis of RQ2 in Sect. 4.3 suggests that this is due to the stateful buffer-based design of the CDO API. As for the failed executions of Hawk with OrientDB, we attribute these problems to concurrency issues in the OrientDB backend, since Hawk with Neo4j does not report any issues and otherwise has the exact same code.

4.2 RQ1: impact of protocol

A quick glance at the results in Tables 1, 2, 3 and 4 shows that there are notable differences in some cases between HTTP and TCP, but not always: in fact, sometimes HTTP appears to be faster.

To clarify these differences, pairwise Mann–Whitney U tests [24] were conducted between the HTTP and TCP results of every configuration. p values < 0.01 were required to reject the null hypothesis that there was the same chance of HTTP and TCP being slower than the other for that particular configuration. Where the null hypothesis was rejected, Cliff deltas were computed to measure effect size [19]. Cliff delta values range between +1 (for all pairs of execution times, HTTP was always slower) and −1 (HTTP was always faster). Cohen d effect sizes [10] were not considered since execution times were not normally distributed. The results are summarized in Tables 7 and 8. 99% confidence intervals of the difference between HTTP and TCP execution times were also computed during the Mann–Whitney U tests, but due to space constraints they were not included in those tables. Some of those confidence intervals will be mentioned in the following paragraphs.
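
The Cliff delta used as effect size can be sketched as follows. This is a straightforward pairwise implementation written for illustration, not the code used in the study; the function and parameter names are assumptions. The sign convention follows the text: positive values mean HTTP was slower.

```python
def cliffs_delta(http_times, tcp_times):
    """Cliff delta over all (HTTP, TCP) pairs of execution times.

    Returns +1 if HTTP was slower in every pair and -1 if it was faster
    in every pair. Ties count towards neither side.
    """
    slower = sum(1 for h in http_times for t in tcp_times if h > t)
    faster = sum(1 for h in http_times for t in tcp_times if h < t)
    return (slower - faster) / (len(http_times) * len(tcp_times))
```

With 1000 executions per configuration, this compares one million pairs, which is still cheap to compute. Since only the ordering within each pair matters, the measure does not assume normally distributed execution times.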

CDO is the simplest case here: all tested configurations have significant differences and report positive effect sizes, meaning that HTTP was consistently slower than TCP. Cliff deltas become much weaker (closer to 0) with increasing number of threads, except for the SM and SS queries. This is also confirmed through the confidence intervals: for OQ with 1 thread, it is \([+6929\text {ms}, +6944\text {ms}]\), while with 64 threads it is only \([+78\text {ms}, +1406\text {ms}]\). By comparing the medians, it can be seen that HTTP can be over 3000% slower than TCP in extreme cases, such as SS with 4 client threads.

Mogwaï has conflicting results across the two case studies. For the OQ GraBaTs’09 query, HTTP is quite often faster, though HTTP and TCP become rather similar for 32 and 64 threads, with absolute Cliff delta values below 0.35. For the Train Benchmark queries, TCP is more often faster than HTTP, though this difference again drops off as the number of client threads increases. For SM and SS, the two fastest-running queries for Mogwaï in the TB case study, the difference is again very small. This suggests that for Mogwaï, in addition to the protocol used, the way concurrency is handled by the server and how it interacts with the query might have an impact as well. In particular, the Jetty HTTP server and the Thrift TCP server use different types of network I/O: non-blocking for Jetty (which decouples network I/O from request processing) and blocking for Thrift (which simplifies the Thrift message format).

The results from Hawk are the most complex to analyse. Regarding the GraBaTs’09 queries, HQ1 is consistently slower on HTTP for Neo4j and OrientDB, with strong effect sizes for all numbers of client threads. HQ2 is only slower on HTTP for OrientDB, especially with few client threads: with Neo4j, HTTP is faster for 1–8 threads. HQ3 and HQ4 sometimes run slower on HTTP, but effect sizes are weaker overall and in most cases there is not a significant difference. It appears that once queries are optimized through derived and indexed attributes, there is not that much difference between HTTP and TCP.

As for the Train Benchmark queries under Hawk, a first step is studying the Cliff deltas for each query:

  • CS does not show a consistent pattern either by backend or by query language: effect sizes are only moderate with Neo4j (with absolute values below 0.30), and with OrientDB effect sizes are positive when using EOL and negative when using EPL.

  • PL and RS are usually faster on HTTP, especially with OrientDB.

  • SM on the other hand is consistently slower on HTTP. This time, the strongest effect sizes are produced when using Neo4j.

  • SN is consistently slower on HTTP with Neo4j, and consistently faster on HTTP with OrientDB.

  • SS effect sizes are usually positive with Neo4j and negative with OrientDB, but they are weak, with absolute values below 0.22.

From these results, it appears that the largest factor on HTTP slowdown patterns for Hawk is the chosen backend, suggesting that the interaction between the concurrency and I/O patterns of the Jetty HTTP server, the Thrift TCP server and the Hawk backend may be relevant. While the Hawk Neo4j backend took advantage of the thread safety built into Neo4j, the OrientDB backend has only recently implemented its own thread pooling to preserve database caches across queries. The query language was only important for CS, showing there may be an interaction but it could be relevant only for certain queries.

As for the coefficients of dispersion, the general trend is that they increase as more client threads are used. This is to be expected from the increasingly non-deterministic multithreaded load, but the exact pattern changes depending on the tool and the protocol. For most configurations, Hawk shows very similar CDs with HTTP and TCP, and so does Mogwaï (except for some rare cases such as SM and SS over 32 threads): this is likely due to the fact that the message exchanges are the same across both solutions (a single request/response pair). However, CDO shows consistently different CDs over HTTP and over TCP, suggesting that they may be fundamentally different in design from each other. This will be the focus of the next section.

4.3 RQ2: impact of API design

One striking observation from RQ1 was that CDO over HTTP had much higher overhead than Hawk and Mogwaï over HTTP. Comparing the medians of OQ and HQ1 with 1 client thread, CDO+HTTP took \(635.66\%\) longer than CDO+TCP, while Hawk+HTTP only took \(24.16\%\) longer than Hawk+TCP. This contrast showed that CDO used HTTP to implement its API very differently from the other tools.

Table 5 Failed executions (timeout / server error), by query, tool, protocol and client threads (“1t”: 1 thread). Only combinations with failed executions are shown
Table 6 Executions with incorrect number of results, by query, tool, protocol and client threads (“1t”: 1 thread). Only combinations with incorrect executions are shown

To clarify this issue, the Wireshark packet sniffer was used to capture the communications between the server and the client for one invocation of OQ and HQ1 (with Hawk over Neo4j). These captures showed quite different approaches for an HTTP-based API:

  • CDO involved exchanging 58 packets (10203 bytes), performing 11 different HTTP requests. Many of these requests were very small and consisted of exchanges of byte buffers between the server and the client, opaque to the HTTP servlet itself.

    Most of these requests were either within the first second of the query execution time or within the last second. There was a gap of approximately 6 seconds between the first group of requests and the last group. Interestingly, the last request before the gap contained the OCL query and the response was an acknowledgement from CDO. On the first request after the gap, the client sent its session ID and received back the results from the query.

    Table 7 Cliff deltas for GraBaTs’09 query execution (−1: HTTP is faster for all pairs, 1: HTTP is slower for all pairs), for configurations where Mann–Whitney U test reports significance (p value < 0.01), by tool, query and number of client threads (“1t”: 1 thread)
    Table 8 Cliff deltas for Train Benchmark query execution (−1: HTTP is faster for all pairs, 1: HTTP is slower for all pairs), for configurations where Mann–Whitney U test reports significance (p value < 0.01), by tool, query and number of client threads (“1t”: 1 thread)

    The capture indicates that these CDO queries are asynchronous in nature: the client sends the query and eventually gets back the results. While the default Net4j TCP connector allows the CDO server to talk back to the client directly through the connection, the experimental HTTP connector relies on polling for this task. This has introduced unwanted delays in the execution of the queries. The result suggests that an alternative solution for this bidirectional communication would be advisable, such as WebSockets.

  • Hawk involved exchanging 14 packets (2804 bytes), performing 1 HTTP request and receiving the results of the query in the same response. Since its API is stateless, there was no need to establish a session or keep a bidirectional server–client channel: the results were available as soon as possible.

    While this synchronous and stateless approach is much simpler to implement and use, it does have the disadvantage of making the client block until all the results have been produced. Future versions of Hawk could also implement asynchronous querying as suggested for CDO.

    One side note is that Hawk required using much less bandwidth than CDO: this was due to a combination of using fewer requests, using gzip compression on the responses and taking advantage of the most efficient binary encoding available in Apache Thrift (Tuple).

In summary, CDO and Hawk use HTTP in very different ways. The CDO API is stateful and consists of exchanging pending buffers between server and client: queries are asynchronous. This is not a problem when using TCP, since messages can be exchanged both ways. However, HTTP by itself does not allow the server to initiate a connection to the client to send back the results when they are available: to emulate this, polling is used. This could be solved with technologies such as WebSockets, which essentially negotiate the upgrade of an HTTP connection to a full-duplex connection.
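
The delay that polling can add to an asynchronous query can be approximated with a simple model. The sketch below is only an assumption-laden illustration: it supposes a client that polls at a fixed interval starting from the moment the query is sent, ignores network transfer time, and uses a hypothetical polling interval that was not measured from CDO.

```python
import math


def observed_latency_polling(query_ms, poll_interval_ms):
    """Time at which a client polling at fixed intervals first sees the result."""
    # The result only becomes visible at the first poll after it is ready.
    polls_needed = max(1, math.ceil(query_ms / poll_interval_ms))
    return polls_needed * poll_interval_ms


def observed_latency_push(query_ms):
    """With a full-duplex channel, the server can send the result once it is ready."""
    return query_ms
```

Under this model, a query that takes 1000 ms on the server is only seen after 3000 ms when polling every 3000 ms, while a pushed result would arrive after roughly 1000 ms. Shorter polling intervals reduce the delay at the cost of more requests.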

This stateful and buffer-based communication explains some of the intermittent communication issues that were shown for CDO in Tables 5 and 6. In a heavily congested multithreaded environment, concurrency issues (race conditions or thread-unsafe code) may result in buffers being sent out of order or mangled together. If the state of the connection becomes inconsistent, it may either fail to produce a result or fail to collect some of the results that were sent across the connection.

In comparison, the Hawk API is stateless and synchronous: query results are sent back in a single message. Since there are no multiple messages that need to be correlated to each other, this problem is avoided entirely.

These results suggest that while systems may benefit from supporting both synchronous querying (for small or time-sensitive queries) and asynchronous querying (for large or long-running queries), asynchronous querying can be complex to implement in a robust manner. Proper full-duplex channels are required to avoid delays (either raw TCP or WebSockets over HTTP), and adequate care must be given to thread safety and message ordering.

4.4 RQ3: impact of caching and indexing

This section will focus on the results from the TCP variants, since they were faster or equivalent to the HTTP variants in the previous tests. It will also focus on the times in the ideal situation where there is only 1 client thread: later questions will focus on the scenarios with higher numbers of client threads.

4.4.1 GraBaTs’09 queries

A Kruskal–Wallis test reported there were significant differences in TCP execution times across tool/query combinations with 1 client thread (p value below 0.01). A post hoc Dunn test [12] was then used to compute p values for pairwise comparisons, using the Bonferroni correction. There was only one pairwise comparison with p value higher than 0.01: HQ3 with Hawk/Neo4j against HQ3 with Hawk/Orient (p value = 0.054). These two configurations will be considered to be similar in performance. All other comparisons will be based on the medians shown in Table 1.

Looking at the OQ and HQ1 times for CDO, Hawk and Mogwaï, CDO is the fastest, with a median of 1088ms compared to 5673ms from Mogwaï, 1631ms from Hawk/Neo4j and 3491ms from Hawk/OrientDB. This is interesting, as normally one would assume that the join-free adjacency of the graph databases used in Hawk and Mogwaï would give them an edge over the default H2 relational backend in CDO.

Fig. 10 Radar plot for median Train Benchmark TCP query execution times in milliseconds over 1000 executions, with 1 client thread

Enabling the SQL trace log in CDO showed that after the first execution of OQ, later executions only performed one SQL query to verify whether there were any new instances of TypeDeclaration. Previous tests had already rejected the possibility that CDO was caching the query results. Instead, an inspection of the CDO code revealed a collection of generic caches. Among others, CDO keeps a CDOExtentMap from EClasses to all their EObject instances, and also keeps a CDORevisionCache with the various versions of each EObject. CDO keeps a cache of prepared SQL queries as well.

In comparison, Hawk and Mogwaï do not use object-level caching by default, relying mostly on the graph database caches instead. Neo4j caches are shared across all threads, whereas in OrientDB they are specific to each thread, requiring more memory. OrientDB caches can be configured to free up memory as soon as possible (weak Java references) or use up as much memory as possible (soft Java references): for this study, the second mode was used, but the authors identified issues with this particular mode. The issues were initially reported and resolved,Footnote 26 but the lack of an LRU policy in the OrientDB in-house cache prompted the authors to have Hawk replace it with a standard Guava cache.

Beyond object-level caching, Hawk caches type nodes and Mogwaï caches the compiled version of the ATL script that transforms OCL to Gremlin. The ATL caching in Mogwaï was in fact added during this study, as a result of communication with the Mogwaï developers that produced several iterations tackling limitations in the OCL transformer, reducing query latency and resolving concurrency issues.

The above results indicate that a strong caching layer can have an impact large enough to trump a more efficient persistent representation in some situations. Nevertheless, the results of HQ2, HQ3 and HQ4 confirm the findings of our previous work in scalable querying [3, 4]: adding derived attributes to reduce the levels of iteration required in a query speeds up running times by orders of magnitude, while adding minimal overhead due to the use of incremental updating. These derived attributes can be seen as application-specific caches that precompute parts of a query, unlike the application-agnostic caches present in CDO:

  • HQ2 replaces the innermost loop in HQ1 with the use of precomputed derived attributes (isStatic, isPublic and isSameReturnType) of a generic nature. These derived attributes produce a 2.80x speedup on Hawk/Neo4j and 2.04x speedup on Hawk/OrientDB. OrientDB receives less of a boost as following edges in general appears to be less efficient than in Neo4j.

  • HQ3 uses the same attributes but rearranges the query to have them appear in the outermost “select”, so Hawk can transform the iteration transparently into a lookup. Compared to HQ2, HQ3 is 2.71x faster on Neo4j and 9.47x faster on OrientDB. From the previous Dunn post hoc test, it appears that indexing in the Neo4j and OrientDB backends is similarly performant in this case.

  • HQ4 uses a much more specific derived attribute (isSingleton) that eliminates one more level of iteration, turning the query into a simple lookup. HQ4 is one order of magnitude faster than HQ3 both on Neo4j and OrientDB, but here Hawk/Neo4j is somewhat faster. This suggests that a single index lookup is faster on Neo4j, whereas multiple index lookups are faster on OrientDB. This may be due to the way OrientDB caches index pages internally, compared to Neo4j.
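
The idea of a derived attribute acting as an application-specific cache can be sketched as follows. This is not Hawk's implementation: the class, its methods and the sample predicate are hypothetical, and the sketch only shows how re-evaluating a predicate on each write turns a later query from an iteration into a lookup.

```python
class DerivedAttributeIndex:
    """Toy store that keeps the set of elements satisfying a predicate up to date."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.elements = {}     # element id -> attribute values
        self.matching = set()  # ids for which the derived attribute is true

    def put(self, element_id, values):
        # Incremental update: only the changed element is re-evaluated.
        self.elements[element_id] = values
        if self.predicate(values):
            self.matching.add(element_id)
        else:
            self.matching.discard(element_id)

    def scan(self):
        # Without the derived attribute: evaluate the predicate on every element.
        return {e for e, v in self.elements.items() if self.predicate(v)}

    def lookup(self):
        # With the derived attribute: the answer is already materialised.
        return set(self.matching)


# Hypothetical predicate, loosely inspired by the isStatic and isPublic attributes.
index = DerivedAttributeIndex(lambda m: m["static"] and m["public"])
index.put("m1", {"static": True, "public": True})
index.put("m2", {"static": False, "public": True})
```

Both `scan()` and `lookup()` return the same elements, but the cost of `lookup()` does not grow with the total number of stored elements, only with the number of matches.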

4.4.2 Train Benchmark

The Train Benchmark results span 6 queries of very different nature: some are very lightweight, while others require a more intensive traversal of the underlying graph. For each query, a Kruskal–Wallis test confirmed that there were significant differences in TCP execution times across configurations (p value < 0.01). A post hoc Dunn test confirmed that most pairwise combinations of configurations had significant differences as well (p value < 0.01 with Bonferroni correction), except for SwitchSet between Hawk/Orient/EOL and Mogwaï. Having established that most differences in times are significant, this section will use the medians in Tables 2, 3 and 4 to compare the tools.

To simplify the comparison, rather than using the tables directly, this section will use the more intuitive radar plots in Fig. 10 to guide the discussion. Comparing the relative area of each different tool gives a general impression of their standing: tools with smaller areas are faster in general. The Hawk side and the CDO/Mogwaï side use the same scales, to allow for comparisons across plots. CDO and Mogwaï do not have any data points for SN, since they were too slow for a full run (Sect. 4.1).

The Hawk side compares the relative performance of the four tested configurations (two backends, two query languages). It can be seen that the OrientDB backend is close to the Neo4j backend in some queries (CS and SM), twice as slow in most queries, and noticeably slower in RS. Examining these results suggests that while derived/indexed attributes are effective on both backends, range queries in OrientDB do not deal well with high-cardinality attributes:

  • The two queries that ran in similar times (CS and SM) use custom Hawk indices: CS performs an indexed range query on a derived attribute (nMonitoredSegments > 5), and SM performs an indexed lookup (isMonitored = false).

  • However, PL is still slow even though it uses an indexed range query (length \(\le 0\)), which apparently contradicts the results obtained with CS. One important difference between the queries is that there are many more distinct values of length (978) than of nMonitoredSegments (2): the indexed range query in PL will need to read many more SB-Tree nodes than in CS.

Looking at the CDO/Mogwaï side, it appears that the generic caching in CDO helped obtain good performance in PL, RS (where it slightly outperformed even Hawk with Neo4j) and CS, but it was not that useful for SM. In SM, Mogwaï can follow the monitoredBy reference faster than CDO, and Hawk can use an indexed lookup to fetch directly the 35 unmonitored Switches instead of going through all 1501 of them. In general, it appears that CDO deals quite well with queries that involve few types, in addition to queries with few nested reference traversals.

While Mogwaï does not support indexed attributes, its use of Neo4j through NeoEMF should have given it similar performance to that of Hawk with Neo4j through the default Neo4j caching. Instead, it is always slower than Hawk with Neo4j and EOL, and it is only faster than Hawk with OrientDB and EOL on RS. After a discussion with the Mogwaï/NeoEMF developers, it seems that this difference may be due to the use of Neo4j 1.9.6 in NeoEMF (Hawk uses 2.0.5, after testing various 2.x releases), and to inefficiencies in the bundled implementation of Gremlin.

4.5 RQ4: impact of mapping from query to backend

In a database-backed model querying solution, the query language is the interface shown to the user for accessing the stored models, and a query engine is the component that maps the query into an efficient use of the backend. Good solutions are those whose queries are easy to read and write and are mapped to the best possible use of the backend.

Since the query language, the query engine and the backend are all interrelated, it is hard to separate their individual contributions. CDO and Mogwaï use the same query language, but run it in very different ways. Likewise, Mogwaï and Hawk share a backend (Neo4j), but they store models differently and use different APIs to access it. For this reason, it is not possible to talk about what is the “best” query language in isolation of the other factors, or make other similar general statements. Instead, the answer to RQ4 will start from each source language and draw comparisons on how their queries were mapped to the capabilities of the backends, for the different tools that supported them:

  • OCL is reasonably straightforward to use for queries with simple pattern matching, like OQ/MQ from GraBaTs’09 or the Train Benchmark PL and SM queries. However, it quickly becomes unwieldy with queries that have more complex pattern matching, requiring many nested select/collect invocations in cases such as SN (Fig. 4).

    CDO and Mogwaï map OCL in very different ways. CDO parses the OCL query into a standard Eclipse OCL abstract syntax tree of Java objects and evaluates the tree, providing a CDO environment that integrates caching and reads from the database as needed with multiple SQL queries. This allows it to start running the query very quickly, but it also implies that OCL queries need to switch back and forth between the H2 database layer and the model query layer, reducing performance. This may have been one of the main reasons for CDO’s inclusion of an object-level cache.

    Mogwaï, on the other hand, parses the OCL query as a model, transforms it into Gremlin, compiles the Gremlin script into bytecode, executes the query entirely within Gremlin and deserializes the results back into EMF objects. This process increases query latency over an interpreted approach, but queries could potentially run faster thanks to less back and forth between layers. However, as mentioned for RQ3, the use of an old release of Neo4j (1.9.6) in the current version of Mogwaï has made it run quite slow, negating this advantage over CDO and Hawk.

  • EOL is inspired by OCL, and while the examples show that it is slightly more concise, it still suffers from the same nested collect/select problem when performing complex graph pattern matching. The execution approach is also similar: the EOL query is turned into an abstract syntax tree, which is visited in a post-order manner to produce the final value.

    However, the EOL-Hawk bridge [4] takes advantage of several features in the underlying graph database: custom indices (already discussed for RQ3) and the bidirectional navigability of the edges. The latter allows references to be followed in reverse (from target to source), so that certain queries can be written much more efficiently. This was the reason why the median time for SN was 352ms with Hawk/Neo4j/EOL and over 300s with Mogwaï. It is a missed opportunity for Mogwaï, which could have exposed this capability as well through OCL.

  • EPL is a refined version of EOL which is specialized towards pattern matching. Looking at SN again, the EPL version is much easier to understand, with no explicit nesting: these nested loops are implicit in EPL’s execution. Like EOL, EPL is also interpreted instead of compiled, reducing latency for some queries.

    As shown in Table 3 and Fig. 10, EPL appears to be consistently slower than EOL, even though the queries are very similar. The overhead is especially notable for SN, where EPL is twice as slow as EOL. To clarify this issue, a profiler was used to follow 5 executions of the EOL and EPL versions of SN. It revealed that the additional type checking done implicitly by EPL on every match candidate was the main reason for the heavy slowdown. While this check is painless on traditional in-memory models, on the graph databases built by Hawk this check requires following one more edge and potentially performing disk I/O. Disabling this type check by referring to the “Any” root supertype in Epsilon instead returned execution times to values similar to those of EOL.
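The effect of reverse navigation discussed for the EOL-Hawk bridge can be illustrated with a minimal Python sketch. It is a toy in-memory graph and not the API of Hawk, Neo4j or Mogwaï: the `Graph` class, the `monitoredBy` label and the element names are all hypothetical. The sketch only contrasts the number of nodes visited when a reference can be followed forwards only against a single lookup from target to source.

```python
from collections import defaultdict


class Graph:
    """Toy graph (hypothetical): every edge is indexed in both directions."""

    def __init__(self):
        self.nodes = set()
        self.outgoing = defaultdict(list)  # (source, label) -> [targets]
        self.incoming = defaultdict(list)  # (target, label) -> [sources]

    def add_edge(self, source, label, target):
        self.nodes.update((source, target))
        self.outgoing[(source, label)].append(target)
        self.incoming[(target, label)].append(source)


def sources_forward_only(graph, label, target):
    """Forward-only navigation: scan all nodes and test their references."""
    visited = 0
    found = []
    for node in sorted(graph.nodes):
        visited += 1
        if target in graph.outgoing.get((node, label), []):
            found.append(node)
    return found, visited


def sources_reverse(graph, label, target):
    """Reverse navigation: one lookup from the target back to its sources."""
    return list(graph.incoming.get((target, label), [])), 1


# Made-up model: 1000 switches, each monitored by one of 100 sensors.
g = Graph()
for i in range(1000):
    g.add_edge(f"switch{i}", "monitoredBy", f"sensor{i % 100}")

forward, forward_cost = sources_forward_only(g, "monitoredBy", "sensor7")
reverse, reverse_cost = sources_reverse(g, "monitoredBy", "sensor7")
print(len(forward), forward_cost, reverse_cost)
```

Both functions return the same elements, but the forward-only version has to visit every node in the graph. A query language that cannot express the reverse step forces the first shape onto the engine, regardless of what the backend could do.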

Table 9 Bounds of the 99% confidence interval for median execution time ratios between CDO and other tools (GraBaTs’09). Values greater than 1 indicate that CDO is slower, while values less than 1 indicate that the other tool is slower
Table 10 Bounds of the 99% confidence interval for median execution time ratios between CDO and other tools (Train Benchmark)

In closing, these experiences show that while query compilation may have a higher potential for performance, it may be more important to focus on selecting a stronger database technology and fully expose the strengths of this technology through the query language and the query engine. Developers wishing to repurpose existing “declarative” query languages need to test whether any language features interact negatively with the chosen technology, as the cost of certain common operations may have changed dramatically.

4.6 RQ5: scalability with demand

The next question concerned how well relational and graph-based approaches scale as demand increases: one approach could do well with few clients, but then quickly drop in performance with more clients. Ideally, we would simply swap relational backends with graph-based backends in each tool and do separate comparisons. Unfortunately, CDO does not include a graph-based backend, and Hawk and Mogwaï do not support relational backends. Instead, we will make the comparison across tools, assuming that each tool was specially tailored to its backend and is therefore a good representative of its type of approach. These results could be revisited if new backends were developed, but they should serve as a good snapshot of their standing at the time of writing this paper.

In this section, the relational approaches will be represented by CDO (based on the embedded H2 database), and the graph-based approaches will be represented by Hawk (combined with Neo4j 2.0.5 or OrientDB 2.2.8) and Mogwaï (backed by NeoEMF, which uses Neo4j 1.9.6). CDO is one of the most mature model persistence layers and has considerable industrial adoption, so it can be considered a good representative for the relational approaches.

First, Kruskal–Wallis tests confirmed (with p values < 0.01) that for each combination of query and client threads, TCP execution times had significant differences across the tested combinations of tool, backend and query language. Post hoc Dunn tests were used to evaluate the null hypotheses that CDO execution times were similar to each of the non-CDO configurations (p values < 0.01). In most cases, the null hypothesis was rejected, but there were some exceptions (2 out of 63 for the GraBaTs’09 queries, and 4 out of 175 for the Train Benchmark queries).

After confirming significant differences for most CDO vs. non-CDO pairs, the next step was quantifying how those pairs scaled relative to each other. Cliff deltas would have been able to express if a certain configuration started being faster more often than the other at a certain point, but they could not show whether the gap between CDO and the non-CDO configuration increased, stayed the same or decreased together with the client threads. Instead, it was decided to use the median of the \(t_c/t_o\) ratios between random pairings of the \(t_c\) CDO TCP execution times and the \(t_o\) non-CDO TCP execution times: values larger than 1 would imply that CDO was slower, and values smaller than 1 would mean that CDO was faster. To increase the level of confidence of the results, bootstrapping over 10,000 rounds was used to estimate a 99% confidence interval of this “median of ratios” metric. The confidence intervals produced for the GraBaTs’09 and Train Benchmark queries are shown in Tables 9 and 10, respectively. Cells with “1.0–1.0” represent cases where CDO and the tool did not report significantly different times according to the Dunn tests.
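The “median of ratios” metric can be sketched in Python as follows. This is one possible reading of the procedure rather than the exact script used in the study: it assumes a percentile bootstrap in which both samples are resampled with replacement in each round, and the function names and input times are made up for illustration.

```python
import random
import statistics


def median_of_ratios(cdo_times, other_times, rng):
    """Median of t_c / t_o over one random pairing of resampled times."""
    n = min(len(cdo_times), len(other_times))
    t_c = rng.choices(cdo_times, k=n)    # resample with replacement
    t_o = rng.choices(other_times, k=n)  # independent resample: random pairing
    return statistics.median(c / o for c, o in zip(t_c, t_o))


def bootstrap_ci(cdo_times, other_times, rounds=10_000, confidence=0.99, seed=42):
    """Percentile bootstrap interval for the median of ratios (assumed method)."""
    rng = random.Random(seed)
    estimates = sorted(
        median_of_ratios(cdo_times, other_times, rng) for _ in range(rounds)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * rounds)]
    upper = estimates[min(rounds - 1, int((1.0 - tail) * rounds))]
    return lower, upper


# Made-up execution times in milliseconds, not measurements from the study.
cdo = [190 + (i % 21) for i in range(50)]
other = [95 + (i % 11) for i in range(50)]
low, high = bootstrap_ci(cdo, other, rounds=2_000)
print(f"99% CI for the median of t_c/t_o: {low:.2f}-{high:.2f}")
```

An interval that lies entirely above 1 would be read as CDO being slower than the other configuration, and one entirely below 1 as CDO being faster.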

In absolute terms, in most cases if a query runs faster or slower on a certain tool than on CDO, it will remain that way for all client threads. However, there are some exceptions:

  • Mogwaï becomes slightly faster than CDO for the PL query with 2 or more threads, and slower than CDO for SS with 4+ threads. In fact, all non-CDO solutions experience a noticeable drop in performance for SS with 4 threads: it is just that Mogwaï did not have enough leeway to stay ahead of CDO. It appears that when running queries with no specific optimizations (e.g. indexed attributes), there may be less thread contention on CDO than on the other tools, closing the gap that originally existed in some cases.

  • Hawk with Neo4j/EPL and Hawk with OrientDB/EOL start with better performance than CDO for SS, but quickly drop to similar or slightly inferior performance when using 4 or more client threads. In the first case, the additional type checks performed by EPL are weighing Hawk down. In the second case, the lower performance of the OrientDB backend gives Hawk less margin to handle the CPU saturation at 4 threads—with OrientDB/EPL, Hawk is already slightly slower than CDO with 1 thread.

One interesting observation is that depending on the combination between the query and the tool, some queries maintain a consistent ratio with CDO (e.g. OQ on Mogwaï), others rise then fall (PL for Mogwaï and Hawk), and others simply fall (RS and SS on all tools). This further supports the idea that thread contention profiles among the different tools vary notably for the same query. While further studies would be necessary to find out the specific reasons for most of these cases, there are some configurations for which it is easier to explain. The reason behind HQ4 having consistently increasing ratios for Hawk/Neo4j and Hawk/OrientDB is that it replaces multiply nested loops with a single lookup, changing the underlying order of the computation: the heavier the load, the larger the contrast created by this change.

As a general conclusion, graph databases by themselves are not a silver bullet — Mogwaï, for instance, did not outspeed CDO in many queries. It is important to use recent releases and take advantage of every feature at their disposal in order to achieve a solid advantage over mature relational technologies.

4.7 Threats to validity

This section discusses the threats to the internal and external validity of the results, as well as the steps we have taken to mitigate them. Starting with the internal validity of the results, these are the threats we have identified:

  • There is a possibility that CDO, Hawk or Mogwaï could have been configured or used in a more optimal way. Since the authors developed Hawk, this may have allowed them to fine-tune Hawk better than CDO or Mogwaï.

    However, the servers did not show any undesirable virtual memory usage, excessive garbage collection or unexpected disk I/O. The H2 backend was chosen for CDO due to its maturity in comparison with the other backends, and the Neo4j backend has consistently produced the best results for Hawk according to previous work. Mogwaï is only available for the Neo4j backend of NeoEMF, so using an alternative configuration was out of the question.

    The authors contacted the CDO developers regarding how to compress responses and limit results by resource, to make it more comparable with Hawk, and were informed that these were not supported yet.Footnote 27 The authors also collaborated with the Mogwaï developers to improve performance as much as possible during the writing of the paper, contributing bugfixes and suggesting various improvements that reduced query latency.

  • The queries for CDO/Mogwaï and Hawk were written in different languages, so part of the differences in their performance may be due to the languages and not the systems themselves. The aim in this study was to use the most optimized language for each system, since Hawk does not support OCL and Mogwaï and CDO do not support EOL.

    Analytically, we do not anticipate that this is likely to have a strong impact on the obtained results for CDO and Hawk as both languages are very similar in nature and are executed via mature Java-based interpreters. It may only be an issue with Mogwaï, whose OCL-to-Gremlin transformation is still a work in progress and may change when Mogwaï transitions to Neo4j 2.x.

As for whether the results can be generalized beyond this study, there are a few threats that must be acknowledged:

  • This study has not considered running several different queries concurrently. While multiple configurations for Hawk have been considered (all 4 combinations of Neo4j/Orient and EOL/EPL), only one configuration was studied for CDO and for Mogwaï. The tested configurations would be quite typical in most organizations, but it would be interesting to perform studies that mix different queries running in different models concurrently and configure Hawk and CDO with different backends, memory limits and model sizes.

  • The experiment has compared a specific set of tools: one for model repositories (CDO), one for graph-based model indexing (Hawk) and one for querying models persisted as graphs (Mogwaï on top of NeoEMF/Graph). This raises the question of whether the results could be extended to other tools of the same types.

    The first part of our answer is that this categorization was not relevant for this study: any tool could have been used as long as it provided a high-level remote querying API and relied on a database for persisting the models. CDO, Mogwaï (in combination with NeoEMF/Graph) and Hawk are three instances of these same requirements, and therefore, any generalizations are backed not by one, but by all three tools.

    The next part is that while some of the detailed results are specific to certain tools (e.g. comparisons between Neo4j releases), there are higher-level results which reaffirm knowledge from other areas in software engineering. For instance, RQ1 showed that HTTP’s overhead was roughly constant if the message patterns were similar, and RQ2 confirmed just how much of an impact a different message pattern could have. RQ3 compared generic against application-specific caching, RQ4 discussed readability and query implementation quality, and RQ5 confirmed using a graph backend may not always bring better performance by itself. The high-level observations collected during these studies can be extended to any database-backed remote model querying solution in the future: indeed, part of our intention with this paper was to make future developers aware of these aspects.

  • The results are based on two specific case studies: it could be argued that different case studies could have yielded different results. To avoid introducing bias, the authors refrained from defining custom benchmarks and instead adopted benchmarks from the existing literature. These benchmarks were picked as they covered different application areas (software engineering versus critical systems modelling), different metamodels (highly hierarchical software metamodels versus “flat” railway metamodels), and different workloads (localized pattern matching in GraBaTs’09 versus a combination of complex pattern matching and simple “all X with attribute Y meeting Z” queries in TB).

    For these reasons, we argue that the 7 queries across the 2 case studies are representative of pattern matching queries on models, where we want to find elements whose state and relationships meet certain conditions. We do not expect other model querying case studies to change the results significantly. However, our case studies do not cover other model management tasks, such as code generation or model transformation: those would require their own case studies. Incidentally, Hawk did significantly speed up code generation in our previous work [13].

5 Conclusions and further work

This study was a considerably extended version of our prior conference paper, going from 2 configurations to 6 (CDO, Mogwaï, and all 4 combinations of Hawk with Neo4j/Orient and with EOL/EPL), and adding 6 new queries written in 3 languages (OCL, EOL and EPL). This wider study confirmed some prior results, while giving a more nuanced outlook on others.

It was confirmed once more that the impact of the network protocol depended heavily on how it was used: CDO again had dramatic overheads of 600%, while Hawk and our simple HTTP server for Mogwaï had at most a 20% overhead. In fact, statistical tests showed that for the more efficient GraBaTs’09 queries, there was no significant difference beyond a certain number of client threads. For the Train Benchmark queries, some queries even ran faster on HTTP thanks to the more fine-tuned default thread management on the Jetty HTTP server. One worrying result is that for some Train Benchmark queries, CDO showed incorrect and failed queries even over TCP: this could point to underlying thread safety or race condition issues in the framework or the networking library.

Comparing CDO/Hawk packet captures confirmed that the problem with CDO over HTTP was the naïve way in which server-to-client communications had been implemented, which used simple polling instead of state-of-the-art approaches such as WebSockets.

Regarding caching and indexing, CDO’s application-agnostic caching performed quite well in both the GraBaTs’09 and Train Benchmark queries. However, Hawk was able to outspeed CDO easily when derived and indexed attributes (a form of application-specific caching) were used, as it happened for the HQ3, HQ4, CS, PL and SM queries. The Hawk OrientDB backend did show some performance degradation when performing ranged queries on attributes with high cardinalities, however. The current version of Mogwaï did not perform as well in this regard, as it had no support for indexed attributes and does not implement a caching layer of its own: the only caching is for the compiled ATL script that transforms OCL queries into Gremlin programs. We suggest that Mogwaï should adopt one in the future.

As for the impact of the query language, it was found that Mogwaï’s full recompilation of OCL into native Gremlin queries did not give it a definitive advantage over CDO’s on-the-fly SQL query generation: in fact, it seemed to perform the worst among all tools, though this may have been due to the use of an older Neo4j release. The interpreted nature of EOL and EPL did not result in performance issues, but it was found that without taking the appropriate precautions, EPL would perform additional work that would result in a severe drop of performance for queries with many nested loops. Beyond the implementation approach of the language, we found that Mogwaï missed the opportunity to integrate Neo4j’s ability to traverse edges in both directions into its OCL dialect: if it had done so, it would have readily outsped CDO on the SN query, as Hawk did (median was 300ms for Hawk/Neo4j/EOL compared to 100s for CDO).

Finally, 99% confidence intervals for the execution time ratios of CDO against the other configurations were computed. For the most part, tools retained their relative performance as the number of client threads increased. There were some exceptions, however: some configurations that started faster than CDO using the Mogwaï tool, the Hawk OrientDB backend or the EPL query language would become slower than CDO as the number of threads increased—the only configuration that did not show this was Hawk with Neo4j and EOL. However, even this optimal configuration could somewhat lose its performance edge against CDO in some queries: a future study comparing levels of thread contention across tools could be useful to shed light on the reasons.

In closing, this study showed that achieving high-performance and scalable remote model querying is not only a matter of choosing the right backend and using it efficiently: every other part of the system must be carefully engineered. Our ideal system would meet these requirements:

  • The API should support both synchronous and asynchronous querying. Synchronous querying is more robust against high loads (as seen with Hawk and Mogwaï), since it does not require maintaining a correlation between multiple responses. Asynchronous querying, where the results are trickled back to the client, can handle larger result sets but is hard to protect against stressful situations (as seen with CDO).

  • Any server-to-client communication needed for asynchronous querying should be conducted over a real full-duplex channel rather than through polling, to avoid introducing unnecessary delays.

  • To reduce roundtrip times, APIs should support running entire queries in the server rather than simply fetching individual elements to be filtered on the client. In other words, the API should include two levels of granularity: one at the query level, and one at the model element level.

  • The query engine must include a caching layer and ideally should be able to precompute the results of common subqueries.

  • The query language must allow users to take advantage of important features on the backend, while not imposing unexpected work on it.

  • Using a graph database can noticeably improve performance in queries that require following many references, but it is not a silver bullet: graph databases are young in comparison with relational databases, and presently their use requires more fine-tuning and benchmarking.
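The requirement about two levels of API granularity can be made concrete with a small Python sketch that counts round trips. `ModelServer` is a hypothetical in-process stand-in for a remote server and does not reflect the APIs of CDO, Hawk or Mogwaï. In a real deployment the query would travel as text in some query language, not as a Python callable.

```python
class ModelServer:
    """Hypothetical stand-in for a remote model server; counts round trips."""

    def __init__(self, elements):
        self._elements = dict(elements)
        self.round_trips = 0

    # Element-level granularity: one round trip per call.
    def list_ids(self):
        self.round_trips += 1
        return list(self._elements)

    def get_element(self, element_id):
        self.round_trips += 1
        return self._elements[element_id]

    # Query-level granularity: the whole filter runs on the server.
    def run_query(self, predicate):
        self.round_trips += 1
        return [eid for eid, e in self._elements.items() if predicate(e)]


def is_long_segment(element):
    return element["length"] > 400


# Made-up model with 500 elements.
model = {f"segment{i}": {"length": i} for i in range(500)}

# Client-side filtering: fetch every element, then filter locally.
element_server = ModelServer(model)
client_side = [
    eid for eid in element_server.list_ids()
    if is_long_segment(element_server.get_element(eid))
]

# Server-side querying: a single request returns only the matches.
query_server = ModelServer(model)
server_side = query_server.run_query(is_long_segment)

print(element_server.round_trips, query_server.round_trips)
```

Both strategies return the same elements, but the element-level client pays one round trip per element plus one for the listing. Keeping the element-level calls available is still useful once a small result set has been obtained and the client only needs to inspect a few elements further.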

For future work, we would like to examine scalability within a real collaborative modelling environment instead of producing synthetic loads, where a mix of queries is run concurrently according to the needs of the users over time. Another direction for future work is analysing the queries to split the work in a query efficiently between the client and the server, using the server for model retrieval and the client to transform the retrieved values. This will require balancing the reduced workload on the server with the increased network latency and transmission costs.

One more possible line of work is studying how to scale systems such as Hawk and CDO horizontally over multiple servers, either by sharding or splitting the data according to domain-specific criteria (e.g. Java projects in the GraBaTs’09 data set, or subsets of the rail network in the Train Benchmark data set), or by replicating all the data. Sharding could be less expensive per server, but it would require breaking down queries into smaller parts and integrating the results: this could be done in the client, or in an intermediate “broker” node. Effectively, this would increase the number of requests done through the network, and it may not be worth it except for very large queries. Querying with replication would be simpler, only requiring the addition of a load balancer in front of the servers. In fact, this particular approach would be easier to study in the short term, as Hawk already has an experimental integration with the multi-master replication mode of OrientDB. So far it has only been used for increased availability, but increased performance could be achieved as well by developing a load balancer node that exposed the same API as current Hawk servers.
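The difference in request counts between the two scaling strategies can be sketched in Python as follows. All classes here are hypothetical and do not correspond to existing Hawk, CDO or OrientDB components. The broker below also assumes that every match lies entirely within one shard, which would not hold for queries whose patterns cross shard boundaries.

```python
from itertools import cycle


class Server:
    """Hypothetical query server holding some subset of the model."""

    def __init__(self, elements):
        self.elements = list(elements)
        self.requests = 0

    def query(self, predicate):
        self.requests += 1
        return [e for e in self.elements if predicate(e)]


class Broker:
    """Sharding: scatter the query to every shard, then gather the results."""

    def __init__(self, shards):
        self.shards = list(shards)

    def query(self, predicate):
        results = []
        for shard in self.shards:
            results.extend(shard.query(predicate))
        return results


class LoadBalancer:
    """Replication: forward each query to one full replica, round-robin."""

    def __init__(self, replicas):
        self._next_replica = cycle(list(replicas))

    def query(self, predicate):
        return next(self._next_replica).query(predicate)


def is_even(element):
    return element % 2 == 0


data = list(range(12))  # made-up model elements

shards = [Server(data[0:4]), Server(data[4:8]), Server(data[8:12])]
replicas = [Server(data) for _ in range(3)]

sharded_result = Broker(shards).query(is_even)
replicated_result = LoadBalancer(replicas).query(is_even)

shard_requests = sum(s.requests for s in shards)
replica_requests = sum(r.requests for r in replicas)
print(shard_requests, replica_requests)
```

In this toy setup one client query becomes three backend requests under sharding and one under replication, while each shard holds only a third of the data.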