1 Introduction

The decline of the one-size-fits-all paradigm has pushed researchers and practitioners towards the idea of polyglot persistence (Sadalage & Fowler, 2013), where a multitude of databases is employed to support data storage and querying. The motivations are manifold, including the exploitation of the strongest features of each system, the off-loading of historical data to cheaper database systems, and the adoption of different storage solutions by different branches of the same company. This trend has also influenced the discipline of data science, as analysts are being steered away from traditional data warehousing and towards a more flexible and lightweight approach to data analysis. Multistore contexts are characterized by 1) the replication of data across different storage systems (i.e., there is no sharp horizontal partitioning) with possibly conflicting records (e.g., the same customer with a different country of residence in different databases), and 2) a high level of schema heterogeneity: records of the same real-world entity may be represented with different structures, using different naming conventions for the same kind of data. The large volume and the frequent evolution of these data hinder the adoption of a traditional integration approach.

In recent work (Ben Hamadou et al., 2019; Forresi et al., 2021; Forresi et al., 2021) we have proposed a multistore solution that relies on a dataspace to provide the user with an integrated view of the data. A dataspace is a lightweight integration approach providing basic query expressiveness on a variety of data sources, bypassing the complexity of traditional integration approaches and possibly returning best-effort or approximate answers (Franklin et al., 2005). The dataspace is built in accordance with a pay-as-you-go philosophy, i.e., by applying simple matching rules to recognize relationships between data structures and by letting the users progressively refine the dataspace as new relationships are discovered (Jeffery et al., 2008). Users exploit the dataspace to formulate GPSJ (generalized projection, selection and join) queries, i.e., the most common class of queries in OLAP applications (Golfarelli et al., 1998). Queries are translated into execution plans that consist of many local computations (carried out by the single databases) and a global computation (carried out by the middleware layer).

In this paper, we propose a technique to optimize the execution of GPSJ queries by finding the most efficient execution plan on the multistore. The main challenge lies in devising a cross-database execution plan that couples data fusion operations with the resolution of schema heterogeneity and efficiently provides a correct result; remarkably, data fusion consists in merging duplicated records that refer to the same real-world entity into a single representation by resolving possible conflicts (Mandreoli & Montangero, 2019). In particular, the paper provides the following contributions. (1) The extension of the multistore scenario presented by Forresi et al. (2021) to consider data that is replicated across different databases. (2) The introduction of an advanced query planner that exploits different characteristics of the multistore and its data. (3) The presentation of a multi-database and self-learning cost model to compare the complexity of execution plans and choose the most efficient one. The cost model takes into account both the execution plan features and the database resources. (4) The evaluation of the proposed approach on multiple multistores with different characteristics, based on a realistic real-world scenario. An exploratory study in this direction (limited to a simple example) had been done by Forresi et al. (2021); in this paper, the work is completed with the full formalization of the approach, the introduction of a realistic real-world scenario, the detailed discussion of the generated query plans, the improvement of the cost model, and the substantial extension of the experimental evaluation.

The paper is structured as follows. Section 2 gives an overview of the multistore scenario and Section 3 discusses related work. In Sections 4 and 5 we introduce the background knowledge on our multistore and the multistore algebra that is extended from Nested Relational Algebra (NRA), while Section 6 presents the approach to query planning. In Section 7 we discuss the cost model and Section 8 shows the experimental evaluation. The conclusions are drawn in Section 9.

2 Multistore Overview

The multistore operates under the following assumptions.

  1. Being a multistore, there exist multiple databases, supporting a variety of data models: relational, document-based, wide-column, and key-value.

  2. Records may be replicated in collections of different databases, possibly with conflicting values.

  3. Schema heterogeneity is present at different levels: records representing the same kind of entity (e.g., a customer) may have missing (or additional) attributes or adopt different naming conventions; this issue may occur between different collections, but also within the same collection (due to the schemaless nature of NoSQL databases).

An exemplification of these problems is given in Fig. 1, where overlapping records of customers and orders from two DBMSs (relational and document-based) need to be reconciled in order to obtain a clean representation that can be used for analysis purposes. Notice the overlap of customer 123 and order O1 in different schema representations; the orders have different attributes, while the customers follow different naming conventions and show conflicting values for name and age.

Fig. 1

An exemplification of data model heterogeneity, schema heterogeneity, and record overlapping in a multistore
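To make the kind of overlap in Fig. 1 concrete, the following sketch shows how the same customer could appear in the two stores; attribute names and values are illustrative and not taken from the actual case study.

```python
# Hypothetical records for the same real-world customer (id 123) as they might
# appear in the relational database and in the document-based one.
relational_customer = {
    "CustomerId": 123,         # key attribute
    "Name": "Mary Smith",      # conflicts with the value in the document store
    "Age": 34,
}

document_customer = {
    "id": 123,                 # same key, different naming convention
    "firstName": "M. Smith",   # conflicting value for the customer's name
    "age": 35,                 # conflicting value for the customer's age
    "orders": [                # nested records, absent from the relational schema
        {"orderId": "O1", "totalPrice": 49.9},
    ],
}

# Data fusion must (i) recognize that CustomerId, Name, and Age correspond to
# id, firstName, and age, and (ii) resolve the value conflicts on name and age.
```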

As discussed by Forresi et al. (2021), the above conditions are met in many real-world applications, which, in practical contexts, often rely on data virtualization systems for data analysis. Below we describe two specific contexts that emerged during our interaction with Denodo,Footnote 1 one of the market-leading tools on this subject.

  • Analytical data offloading: to reduce costs and optimize performance, the historical depth of databases is kept limited; typically, it is 1-2 years for operational systems, and 3-5 for analytical ones (Golfarelli and Rizzi, 2009). After these periods, data are offloaded to cheaper and larger storage systems, such as cloud storage or data lakes. Offloading implies a change of data model, a change of schema, and obviously an overlap of instances with the original data. For example, offloading a relational data warehouse could imply turning instances stored in a star schema into a single JSON document including both measures and dimensional attributes; alternatively, a relational flat schema could be adopted. Similarly, invoices stored in an ERP can be offloaded to a key-value repository, where the value stores an object including only the attributes relevant for fiscal purposes. Meanwhile, the in-place data may evolve in terms of structure or values. In this context, unforeseen analyses are often needed, such as data enthusiasts asking to compare the offloaded data with the in-place data.

  • Multi-cloud architecture: this context combines different storage technologies and resources from multiple cloud platforms (Mazumdar et al., 2019). It allows application providers to manage the risks associated with technology, vendor lock-in, provider reliability, data security, and privacy; thus, it is an increasingly popular tactic for designing the storage tier of cloud-based applications (Rafique et al., 2017). The multi-cloud architecture and related frameworks (e.g., data fabric) accelerate digital transformation since they enable the exploitation of data spread across different providers and architectures, all the while overcoming data silos through data virtualization. Multi-cloud architectures are a panacea in the presence of many company branches. For example, consider a holding or a federation of companies (e.g., hospitals in the health sector). In this case, a lot of data is shared between the branches, but each branch is free to choose its own storage provider (either on cloud or on-premise), data model, and schema. To keep it simple, let us consider the case of ICD-9-CM,Footnote 2 which is often used in OLAP analysis in the healthcare domain. ICD-9-CM changes some of its attributes and values across the years; thus, depending on the ICD-9-CM version adopted by each branch, data overlapping and schema heterogeneity must be resolved when cross-queries are issued over the branches’ databases. Furthermore, every hospital or local health unit can store such data in different data models and schemas, depending on the adopted software.

In this paper, we consider a multi-cloud architecture case study, where different branches of the same holding store the same data but rely on different storage systems. Figure 2 shows a conceptual view through an ER diagram. Figure 3 shows the physical implementation. C1 to C7 represent the collections of data, while the “:” notation is used to indicate the entities appearing in the collection (notice that the document-based database contains a single collection which uses nested structures to embed orders within customers, and order lines within orders). While Cloud 1 fully relies on a relational database, Cloud 2 satisfies the need for data variety support by relying on NoSQL systems; also notice that Cloud 2 additionally stores orders’ invoices. As the two branches belong to the same holding, both customers and products are partially overlapped in the two cloud environments.

Fig. 2

The ER diagram of the case study

Fig. 3

A graphical representation of the physical implementation of the case study. Different colors represent different databases with different data models

Figure 4 provides a functional overview of the multistore system and the supported user interactions. For starters, the user interacts with the system to create and continuously refine the dataspace, i.e., an abstract global representation of the data in the multistore. The dataspace is composed of two main concepts: entities, corresponding to the real-world entities in the multistore (e.g., customers, products), and features, corresponding to unique representations of the attributes describing entities (e.g., the name of customers, the brand of products). As discussed by Forresi et al. (2021), the dataspace is built and maintained through an incremental and semi-automatic approach that embraces the pay-as-you-go philosophy (Jeffery et al., 2008).

Fig. 4

Overview of our multistore

The dataspace is used to formulate GPSJ queries, which are well-suited for data analysis; a typical OLAP query consists of a group-by set (i.e., the features used to carry out an aggregation), one or more numerical features to be aggregated by some function (e.g., sum, average), and (possibly) some selection predicates. A GPSJ query formulated on the case study would aggregate events (e.g., Orderline and Order) to calculate KPIs or measures (e.g., the total Quantity, the average Price) by some grouping criterion (e.g., by Product, by Customer). Based on the user’s query, the system’s Optimizer defines the query plan to be executed on the multistore in two steps: first, the Query planner generates multiple query plans, then a Cost model is used to choose the most convenient one. Query plans are decomposed into subplans, each identified by macro operators that embed a tree of operations. Local subplans are computed directly on the local databases; global subplans are computed on the middleware’s execution framework to combine the partial results from local subplans and obtain the final result to return to the user.

3 Related Work

3.1 Multistore Systems

The variety in terms of data models responds to different requirements of modern data-intensive applications, but providing transparent mechanisms to query large-scale collections on heterogeneous data stores is an active research area (Tan et al., 2017). A naive approach to solve the problem of querying several data models is to transform all datasets into a reference data model — usually the relational one (DiScala and Abadi, 2016). This kind of solution leads to the loss of schemaless flexibility and requires continuous maintenance to support schema evolution. A different approach is proposed by multimodel systems, which directly support several data models within the same platform; examples of multimodel databases are OrientDBFootnote 3 and ArangoDB.Footnote 4 Inter-data model querying is enabled by a custom query language to support nested structures and graph queries. Besides being limited to a predefined set of data models, these systems do not directly support data fusion. Indeed, a data warehousing approach relying on a multimodel system is presented by Bimonte et al. (2021), but neither schema heterogeneity nor data fusion is tackled.

In recent years, multistore and polystore systems have emerged to provide integrated access and querying to several heterogeneous stores through a mediator layer (middleware) (Tan et al., 2017). The difference between multistores and polystores lies in whether they offer a single or multiple querying interfaces, respectively. Among the most notable are BIGDAWG (Gadepally et al., 2016), TATOOINE (Bonaque et al., 2016), and CloudMDsQL (Kolev et al., 2016). These systems vary in the functionalities they support (e.g., the available data models and storage systems, the support to the ingestion process, the expressiveness of the querying language, the possibility to move data from one database to another). However, none of them supports data fusion. Remarkably, adopting these approaches in a scenario with duplicated records would require a pre-processing activity to clean duplicates and resolve conflicts, which may be hindered (i) by the inability to overwrite data in the original databases and/or (ii) by the frequent evolution of data and schemas. Otherwise, failing to deal with duplicated records may lead to wrong results and, ultimately, to inaccurate decision-making.

Effectively supporting querying on a heterogeneous system with overlapping records requires the adoption of data fusion techniques (Bleiholder & Naumann, 2008). The literature on this subject is very wide, thus we refer the reader to a recent survey (Mandreoli & Montangero, 2019). Remarkably, related work in this area does not apply directly to a polyglot system. To the best of our knowledge, the only proposal that considers a scenario requiring data fusion in a polyglot system is QUEPA (Maccioni & Torlone, 2018), where the authors present a polystore-based approach to support query augmentation. The approach is meant to complement the other polystore systems that actually support cross-database querying, and record linkage techniques are only used to find related instances in different databases, not to resolve conflicts. Another work that proposes on-the-fly integration and schema heterogeneity resolution in an analytical context is the one by Gallinucci et al. (2019); however, the proposed approach is limited to document-oriented databases and does not consider data fusion.

3.2 Multistore Optimization

Besides the support to querying over heterogeneous databases, multistore systems must adopt optimization strategies to define efficient execution plans. In the literature, these optimization strategies are usually implemented through rule-based mechanisms and cost models; the main differences lie in the abstraction of the execution plan (i.e., whether it is considered as a whole or it is subdivided into its local and global parts) and the level at which they operate (i.e., at the logical or physical level).

Rule-based optimization consists in defining a set of rules to reorder the operations within logical plans (e.g., by bringing selections closer to the start of the plan) (Gog et al., 2015) or to choose the most efficient algorithm in the translation of a logical plan to a physical one (Wang et al., 2017). Solely relying on rule-based optimization is considered impractical and ineffective: these rules typically make very simplistic decisions based on the different cardinality and complexity of each operator (Agrawal et al., 2018), whereas the cost actually depends on many input parameters (e.g., selectivity, disk I/O, CPU cycles). Moreover, as new platforms and applications emerge, maintaining a rule-based optimizer becomes cumbersome as the number of rules grows rapidly.

Cost-based optimization relies on cost models to estimate the cost of query plans (usually in terms of time or I/O operations) and to choose the one that minimizes such cost. Some approaches are limited to the optimization of local database computations — using either a single (Agrawal et al., 2018) or multiple cost models (Deshpande & Hellerstein, 2002) —, while others adopt a global optimization technique to decide which computations should be pushed down to local databases (Sellami & Defude, 2018). In general, using distinct cost models for each engine increases the accuracy of the optimization and enables a finer characterization (Duggan et al., 2015) (e.g., different databases may be more or less efficient in carrying out the same operations (Forresi et al., 2021)). The accuracy of cost models usually depends on the level of detail that they capture; however, the risk in fine-grained cost modeling is to make the training phase too complex and expensive (Singhal et al., 2019). To this end, an active learning technique (Golfarelli et al., 2019) can be used, as it allows for a simpler initial training and continuously improves the accuracy of the model by automatically updating it as queries are issued by the users. In the literature, cost models are often distinguished between white-box and black-box models.

  • White-box cost models associate theoretical formulas to the different query operators and build up the cost of a query by summing the cost of each operation. In multistore systems, white-box approaches usually require a separate cost model for each execution engine, as the latter are too different to be represented by a unique cost model (Bondiombouy & Valduriez, 2016). Some works break down the problem and focus only on either the local databases (Deshpande & Hellerstein, 2002) or the middleware (Subramanian & Subramanian, 1998). Complete works are presented by Sellami and Defude (2018) and Agrawal et al. (2018), where queries are broken down into sets of logical operators, the cost of which is determined in terms of disk I/O, CPU, and network; however, the full details of the adopted formulas are not disclosed. A similar approach was adopted by Forresi et al. (2021), where the cost of query plan operations is estimated in terms of disk pages read and written on each execution engine.

  • Black-box cost models hide the behavior of an execution engine within a black-box, where the known information is mostly limited to the issued queries and the given response times. The estimates for query execution times are then obtained by comparing the characteristics of the current query with those of the previously executed ones. Singhal et al. (2019) suggests using black-box modeling in a multistore/polystore system; this kind of approach is also used in non-strictly related contexts, such as Kaitoua et al. (2019) (where the goal is to optimize data migration between different databases) and Golfarelli et al. (2019) (where the goal is to estimate the cost of queries run on external web services, whose internal characteristics are not known).

Neither approach is exempt from risks, as devising an effective cost model requires a careful evaluation of cost factors. On the one hand, white-box approaches require a deep understanding of each database’s internal mechanisms - as well as inter-database communication patterns - in order to algorithmically reproduce the same behavior. On the other hand, black-box approaches require the identification of all features (of queries and databases) that influence the cost of queries.

Among the systems mentioned in Section 3.1: BIGDAWG (Gadepally et al., 2016; Duggan et al., 2015) implements black-box models for each execution engine, even though their usage is limited to the optimization of query portions within the same engine; TATOOINE (Bonaque et al., 2016) makes no mention of cost optimization; and CloudMDsQL (Kolev et al., 2016) blends rule-based optimization with a combination of both white- and black-box cost modeling, but discloses no implementation details. The multistore proposed in this paper relies on rule-based optimization for a logical reordering of operations at both local and global levels, and uses a black-box cost model with active learning to identify the most efficient execution plan. The choice of a black-box cost model allows us to overcome the many challenges behind white-box ones, which either require an enormous effort to effectively model the many factors that contribute to query costs in a complex and heterogeneous environment like a multistore, or suffer the limitations due to the assumptions that must be made to keep the model simple. Indeed, black-box cost models automatically learn and continuously fine-tune a model of the systems’ behavior, thus relieving the user of this burden.

4 Multistore Formalization

4.1 Basic Concepts

The multistore is described by a dataspace, i.e., an abstract global representation of the data scattered across different databases. It is composed of two main concepts: entities, corresponding to the real-world entities in the multistore (e.g., customers, products), and features, corresponding to the attributes that describe entities (e.g., the name of customers, the brand of products). These concepts are built in a pay-as-you-go fashion by analyzing the schemas in the data and detecting relationships between attributes.

The coexistence of different databases and the absence of a unique reference data model call for a common terminology. Consistently with the terminology proposed by Atzeni et al. (2014), we refer to collections as the data structures that contain sets of records associated with a schema, and to records as the instances in a collection. Our notion of record corresponds exactly to the tuples of a relational database, but the rows and documents of wide-column and document-based databases potentially correspond to multiple records. In fact, non-relational data models comply with the aggregate data modeling property, which enables the nesting of records within other records through the array data type. We do not consider documents and rows as a whole, but we separately model the records available at each nesting level. Differently from Atzeni et al. (2014), we use the term attribute (instead of field) to refer to the single properties of each record.

Definition 1 (Collection, Record, Attribute)

A collection C is a set of records; a record \(r = \{(a_{1},v_{1}), \dots , (a_{n},v_{n}) \}\) is a set of key-value pairs, where each value vi is associated with an attribute ai. Values are either of primitive type (e.g., number or string) or arrays of records.

From this point forward, we refer to primitive attributes or array attributes based on the type of associated values. If an attribute is nested within one or more array attributes, its name includes the dot-concatenation of the names of those array attributes.

Example 1

Figure 5 shows a sample document of a document-based database, corresponding to a customer, her orders, and the respective order details; the two order line records (in orange) are nested within the order record (in green), which is nested itself within the customer record (in blue).

Fig. 5

A sample document corresponding to a customer, her orders, and the respective order details; four records are shown in the boxes, and each color (blue, green, and orange) corresponds to a different schema

Our notion of schema applies to the records rather than to the entire collections. Thus, several schemas may be found for a collection, due to the possible presence of both schema variability and nested records.

Definition 2 (Schema)

A schema \(S = \{ a_{1}, \dots , a_{\mid S \mid } \}\) is a set of attributes that applies to one or more records in a collection C. Each schema has a key attribute key(S), whose values uniquely identify the records with schema S within C.

A record r with schema S in a collection C may be nested within another record \(r^{\prime }\); in such a case, arr(S,C) denotes the array attributes of \(r^{\prime }\) that contain r. The set of all schemas in the dataspace is indicated with \(\mathcal {S}\). For the sake of simplicity, we assume all keys to be simple. Given a record r, its schema (denoted with Sr) is the set of attributes directly available in r. If r is contained within an array attribute a of a record t, then (i) in Sr we also include key(St) (this is necessary to maintain the relationship between the schema of a nested record and the one of the parent recordFootnote 5), and (ii) arr(Sr,C) extends arr(St,C) with a, so that arr(Sr,C) provides the full list of parent array attributes. Concerning the schemaless property of non-relational databases, we take into account every schema variation in a collection (i.e., if two records differ even for a single attribute, we model two separate schemas, each with its own set of attributes).

Example 2

The sample document of a collection C shown in Fig. 5 contains four schemas:

  • Sbl = {id,firstName,gender,orders}

  • Sgr = {id,orders.orderId,orders.orderDate,orders.totalPrice,orders.orderLines}

  • \(S_{or_{1}}\) = {orders.orderId, orders.orderLines.quantity, orders.orderLines.asin, orders.orderLines.price}

  • \(S_{or_{2}}\) = {orders.orderId, orders.orderLines.qty, orders.orderLines.asin, orders.orderLines.price}

It is arr(Sbl,C) = [ ], arr(Sgr,C) = [orders], and \(arr(S_{or_{1}},C)\) = \(arr(S_{or_{2}},C)\) = [orders,orders.orderLines].
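A minimal sketch of how these per-level schemas could be extracted from a document, assuming the key attribute of each nesting level is known; the values and the KEYS map are invented for illustration, but the output reproduces the four schemas above.

```python
# Key attribute of the records at each nesting level (an assumption for this sketch).
KEYS = {"": "id", "orders": "orderId", "orders.orderLines": "asin"}

def schemas(record, path="", parent_key=None):
    """Extract the schema of a record and, recursively, of the records nested in its
    array attributes; attribute names are dot-concatenated with the enclosing array
    attributes, and the key of the parent record is propagated to nested schemas."""
    schema = {parent_key} if parent_key else set()
    own_key = (path + "." if path else "") + KEYS[path]
    nested = []
    for attr, value in record.items():
        name = (path + "." if path else "") + attr
        schema.add(name)
        if isinstance(value, list):                      # array attribute: recurse
            for child in value:
                nested.extend(schemas(child, name, own_key))
    return [frozenset(schema)] + nested

doc = {"id": "C1", "firstName": "Anna", "gender": "F",
       "orders": [{"orderId": "O1", "orderDate": "2020-03-02", "totalPrice": 49.9,
                   "orderLines": [{"quantity": 2, "asin": "A1", "price": 9.9},
                                  {"qty": 1, "asin": "A2", "price": 19.9}]}]}

for s in schemas(doc):
    print(sorted(s))   # S_bl, S_gr, S_or1, and S_or2, one per record
```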

Due to both schema variability and schema denormalization, several attributes may be found in different schemas to represent the same property. For example, in Fig. 5 different order line records use attributes with different names to indicate the quantity of product bought (i.e., quantity and qty, respectively). To resolve the different classes of heterogeneity and model the equivalence between different attributes of the dataspace we exploit mappings.

Definition 3 (Mapping)

A mapping m is a triple \(m = (a_{i},a_{j},\varphi _{(a_{i},a_{j})})\) that expresses a relationship between two primitive attributes ai and aj belonging to different schemas; \(\varphi _{(a_{i},a_{j})}\) is a bijective transcoding function to express the values of aj in the format of ai (if no transcoding is necessary, \(\varphi _{(a_{i},a_{j})} = I()\) where I() is the identity function). The existence of a mapping between ai and aj is indicated with \(a_{i} \equiv a_{j}\).

Mappings are considered between single attributes; given two schemas Si, Sj and an attribute \(a_{i} \in S_{i}\), we assume there exists at most one attribute \(a_{j} \in S_{j}\) such that \(a_{i} \equiv a_{j}\). Mappings recognize that there is a semantic equivalence between two attributes in different schemas, thus we introduce features to provide a unique reference for equivalent attributes in several different schemas.

Definition 4 (Feature)

A feature represents a set of attributes mapped to each other. We define a feature f as a triple \(f = (a, M, \oplus )\), where:

  • a is the representative attribute;

  • M is the set of mappings that link all the feature’s attributes to the representative a. The set of attributes represented by f is indicated with attr(f), and \(\forall ~m_{j} \in M\) it is \(m_{j} = (a, a_{j}, \varphi _{(a,a_{j})})\).

  • ⊕ is an associative and commutative function that resolves the possible conflicts between the values vx and vy of any two attributes ax and ay represented by f, returning a (possibly new) value.

Given a record r, the conflict resolution function ⊕ can be applied to r[ai] and r[aj] if \(\{ a_{i}, a_{j} \} \subseteq attr(f)\); we refer the reader to Bleiholder and Naumann (2005) for different methods to define conflict resolution functions. Notice that an attribute is always represented by one and only one feature; thus, for any two features fi and fj, it is \(attr(f_{i}) \cap attr(f_{j}) =~\varnothing \) (otherwise, it would mean that \(\exists ~a : feat(a) \subseteq \{ f_{i},f_{j} \}\), but |feat(a)| = 1 by definition). We use feat(a) to refer to the feature of an attribute a, and rep(f) to refer to the representative attribute of f. Features are defined even for attributes that are not associated with any mapping; in such a case, it is ∣attr(f)∣ = 1 and \(M = \varnothing \). A feature is a key feature if its attributes act as a key in at least one schema (and possibly play the role of foreign keys in other schemas).
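A minimal sketch of how a feature, its mappings, and its conflict-resolution function ⊕ could be represented; the attribute names and the resolution policy (prefer the longest value) are purely illustrative.

```python
# Feature for the customers' name: 'Name' is the representative attribute and
# 'firstName' is mapped to it with an identity transcoding function.
identity = lambda v: v

feature_name = {
    "representative": "Name",
    "mappings": [("Name", "firstName", identity)],   # (a, a_j, transcoding)
    # Illustrative conflict-resolution function (Definition 4 requires it to be
    # associative and commutative): prefer the longest, i.e., most informative, value.
    "resolve": lambda v1, v2: v1 if len(str(v1)) >= len(str(v2)) else v2,
}

def resolve_conflict(feature, v1, v2):
    """Apply the feature's conflict-resolution function to two attribute values."""
    return feature["resolve"](v1, v2)

print(resolve_conflict(feature_name, "Mary Smith", "M. Smith"))   # -> 'Mary Smith'
```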

Features describe properties of real-world concepts (e.g., customers, products) that we refer to as entities.

Definition 5 (Entity)

An entity E is an abstract representation of a real-world entity. It is associated with a set of features \(F_{E} \subseteq F\) and it is identified by a key feature key(E) ∈ FE; \(\mathcal {S}_{E}\) denotes the set of schemas such that \(\forall ~ S \in \mathcal {S}_{E}\) it is feat(key(S)) = key(E).

Remarkably, key features may be associated with multiple entities, while non-key features are associated with a single entity.Footnote 6

The schemaless nature of NoSQL databases allows records in the same collection to have different key features. We say that a collection is well-formed if it maintains internal consistency, i.e., if there exists a single key feature. From this point forward, we assume all collections to be well-formed.

Definition 6 (Collection well-formedness)

A collection C is well-formed if there exists a feature f = key(E) such that \(\forall ~r \in C\) it is key(Sr) = f.

Relationships between entities are identified as follows.

  • A many-to-one relationship from Ei to Ej on a feature f is indicated with \(E_{i} \overset {f}{\rightarrow } E_{j}\); it is inferred if \(\exists {f}: \forall S_{i} \in \mathcal {S}_{E_{i}}, S_{j} \in \mathcal {S}_{E_{j}}\) it is \(attr(f) \cap \{ S_{i} \setminus key(S_{i}) \} \neq \varnothing ~\wedge ~ key(S_{j}) \in attr(f)\). In other words, the many-to-one relationship is inferred if the attributes of f in the schemas of Ei are non-key attributes mapped to the keys of the schemas of Ej.

  • A one-to-one relationship between Ei and Ej on f is indicated with \(E_{i} \overset {f}{\leftrightarrow } E_{j}\); it is inferred if \(\exists f : \forall ~ S_{i} \in \mathcal {S}_{E_{i}}, S_{j} \in \mathcal {S}_{E_{j}}\) it is key(Si) ∈ attr(f) ∧ key(Sj) ∈ attr(f).

In a many-to-one relationship \(E_{i} \overset {f}{\rightarrow } E_{j}\), Ej is coarser than Ei and Ei is finer than Ej. It is \(E_{i} \Rightarrow E_{k}\) if there exists a path of (many/one)-to-one relationships from Ei to Ek.

Example 3

Table 1 presents our motivating example in terms of schemas, attributes, features and entities. On the columns, the schemas are organized by entities; on the rows, attributes are organized by features, and the mappings are implicit between attributes of the same feature. For instance, it is \(a_{15} \equiv a_{16}\) since feat(a15) = feat(a16) = f6. Mappings reveal the relationships between the schemas. It is \(\mathcal {S}_{E_{pr}} = \{ S_{1}, S_{2}, S_{3} \}\) since key(S1) ≡ key(S2) ≡ key(S3); similarly, it is \(\mathcal {S}_{E_{ol}} = \{ S_{4}, S_{5} \}\). From f1, a many-to-one relationship \(E_{ol} \overset {f_{1}}{\rightarrow } E_{pr}\) can be inferred. Notice that (i) each attribute is contained only in one schema, (ii) each schema contains one key attribute, (iii) each schema contains at most one attribute per feature, and (iv) there exist features (e.g., f1) that are shared by multiple entities.

Table 1 An extract of the correspondences between attributes and schemas in our case study from Fig. 3; cell [i,j] has a checkmark if \(a_{i} \in S_{j}\), or the letter “K” if ai = key(Sj). Attributes are organized by features fk and indicate the collection Cl they belong to, while schemas are organized by entity Em

As discussed by Forresi et al. (2021), features and entities are obtained incrementally in a pay-as-you-go fashion (Jeffery et al., 2008). An automatic procedure is first run to discover mappings and infer the relationships to define features and entities; then, users can refine mappings and update the knowledge on features and entities, as long as the specified constraints are not violated.

4.2 Dataspace and Supporting Structures

The dataspace is the data structure that provides the integrated, high-level view of the data in the multistore.

Definition 7 (Dataspace)

The dataspace \(\mathcal {D}\) is a directed graph \(\mathcal {D}=(\mathcal {E},L)\) where \(\mathcal {E}\) is the set of entities in the dataspace and L is the set of links (i.e., many-to-one or one-to-one relationships) between the entities. A fact of the dataspace is an entity \(E^{*}\) for which there is no finer entity in \(\mathcal {D}\), i.e., \(\nexists ~E \in \mathcal {E} : E \rightarrow E^{*}\); the set of facts in \(\mathcal {D}\) is \(fact(\mathcal {D}) = \mathcal {E}^{*} \subseteq \mathcal {E}\).

Example 4

Figure 6 shows the dataspace of the case study. Since customer and product records are overlapped (i.e., replicated across different databases), it is ϕ(Epr) = ϕ(Ecu) = true, while ϕ(Eol) = ϕ(Eor) = ϕ(Ein) = false, where ϕ(E) indicates whether the records of E are overlapped. The only fact entity is Eol (shown with a bolder border).

Fig. 6

The dataspace \(\mathcal {D}\) of the case study

To denote the relationships between collections and entities, we introduce the concepts of granularity, entity representation, and entity description.

Definition 8 (Granularity, entity representation, and entity description)

Let C be a well-formed collection whose key feature is f; its granularity gran(C) is the entity E that corresponds to f, i.e., key(E) = f. If gran(C) = E, then we say that C is represented by E. C is described by E if there exists an attribute \(a \in C\) such that feat(a) ∈ FE, feat(a)≠key(E). A collection can be described by multiple entities, and an entity can describe multiple collections. The entity representing a collection (i.e., its granularity) also describes it.

A collection contains a portion of the multistore data that refers to some of the dataspace’s entities. A collection graph is the portion of the dataspace indicating the entities describing a collection.

Definition 9 (Collection graph)

Given a collection C, its collection graph \(CG_{C} = (\mathcal {E}_{C},L_{C})\) is a subgraph of \(\mathcal {D}\) limited to the entities of \(\mathcal {D}\) that describe C.

When a collection is described by a single entity, its records contain only primitive attributes (i.e., first normal form) and the values of all non-key attributes only depend on the key attribute (i.e., third normal form). In the presence of two (or more) entities, the collection makes use of either nesting or flattening: we refer to nesting when non-primitive attributes are used and to flattening when values of non-key attributes depend on other non-key attributes.

Example 5

Figure 7 shows examples of different collection graphs representing the same set of entities, where Eol is the only fact entity for each collection. Ci is a fully nested collection, where the granularity is Ecu; Ck is a fully flat collection, where the granularity is Eol (indeed, there are two records, one for each order line, and order and customer attributes are duplicated); Cj is an intermediate solution, where the granularity is Eor: customer attributes are flattened, whereas order line attributes are nested.

Fig. 7

Three examples of collection graphs representing the same data in different ways (i.e., \(CG_{C_{i}},CG_{C_{j}},CG_{C_{k}}\)), alongside the corresponding records in a document-based representation (i.e., Ci,Cj,Ck). The fact entity is in bold, while the granularity is in grey

Collection graphs may be of arbitrary complexity. We distinguish three patterns of collection graphs, which are used by the multistore algebra. The key aspect in common is the presence of a single fact entity. Let \(CG_{C} = (\mathcal {E}_{C}, L_{C})\).

  • A normal graph (NoR) is composed of a single entity. In this case, it is \(L_{C} = \varnothing \) and \(\mathcal {E}_{C} = \{ E \}\), where E = gran(C) = fact(CGC).

  • A nested graph (NeR) is composed of at least two entities and there is a single path of edges that are directed towards gran(C), i.e., there is only one fact entity fact(CGC), which is the entity farthest away from gran(C). In this case, \(\forall ~E \in \{\mathcal {E}_{C} \setminus gran(C) \}\) it is E ⇒ gran(C).

  • A flat graph (FlR) is composed of at least two entities and all edges are directed away from gran(C) = fact(CGC). In this case, \(\forall ~E \in \{\mathcal {E}_{C} \setminus gran(C) \}\) it is gran(C) ⇒ E.

If a collection graph conforms to one of these patterns, we indicate it with rep(CGC) ∈{NoR,NeR,FlR}. Notice that, given our internal consistency assumption on the granularity of collections, each collection is associated to a single collection graph.
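A minimal sketch of how these three patterns could be checked on a collection graph, given its many-to-one edges (directed from the finer to the coarser entity) and the collection's granularity; entity names follow the case study, and the single-path requirement of NeR is omitted for brevity.

```python
def reachable(start, edges):
    """Entities reachable from start by following (finer -> coarser) edges."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for finer, coarser in edges:
            if finer == node and coarser not in seen:
                seen.add(coarser)
                frontier.append(coarser)
    return seen

def rep_pattern(entities, edges, granularity):
    """Classify a collection graph as 'NoR', 'NeR', 'FlR', or None (no pattern)."""
    others = entities - {granularity}
    if not others:
        return "NoR"                                  # a single entity, no edges
    if all(granularity in reachable(e, edges) for e in others):
        return "NeR"                                  # every entity points (transitively) to gran(C)
    if reachable(granularity, edges) >= others:
        return "FlR"                                  # every entity is reachable from gran(C)
    return None

edges = {("Eol", "Eor"), ("Eor", "Ecu")}              # order line -> order -> customer
print(rep_pattern({"Eol", "Eor", "Ecu"}, edges, "Ecu"))   # NeR (C_i in Fig. 7)
print(rep_pattern({"Eol", "Eor", "Ecu"}, edges, "Eor"))   # None (C_j mixes nesting and flattening)
print(rep_pattern({"Eol", "Eor", "Ecu"}, edges, "Eol"))   # FlR (C_k)
```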

Example 6

In Fig. 7, it is \(rep(CG_{C_{i}})=\text {NeR}\) and \(rep(CG_{C_{k}})=\text {FlR}\), while \(rep(CG_{C_{j}})=\varnothing \).

5 Multistore Algebra

In this section, we present the algebra we adopt to formulate query plans. Section 5.1 presents an extension of traditional Nested Relational Algebra (NRA) with data fusion operations; Section 5.2 introduces entity views as a high-level abstraction for query plans and the high-level operators to formulate query plans with entity views.

5.1 NRA and Data Fusion Operations

Query execution plans are formulated in Nested Relational Algebra (NRA). A summary of the supported NRA operators is shown in Table 2. In this paper, NRA is slightly extended to support the data fusion operations required by our multistore scenario.

Table 2 NRA operators

The most important addition to NRA is the extension of the join operator’s semantics to handle data fusion. We do this by relying on the merge operator (⊔) (Forresi et al., 2021), which addresses the extensional and intensional overlap between collections. In particular, given two collections represented by the same entity (e.g., customers), the merge operator exploits mappings to resolve schema heterogeneity (e.g., to recognize that the customers’ names are reported in both collections with different naming conventions) and record overlapping (e.g., to produce a single record for customers that are replicated in both collections). Its goal is to keep as much information as possible, both from the extensional and the intensional points of view. The merge operator (⊔) answers this need by (i) avoiding any loss of records, (ii) resolving mappings by providing output in terms of features instead of attributes, and (iii) resolving conflicts whenever necessary.

First, let us formalize the merge operation between two generic recordsets.

Definition 10 (Merge operation)

Let Ri and Rj be the recordsets of two schemas Si and Sj, with \(a_{k} \in S_{i}\) and \(a_{l} \in S_{j}\) such that \(a_{k} \equiv a_{l}\), i.e., \(\exists f : \{ a_{k}, a_{l} \} \subseteq attr(f)\). The merge of the two recordsets \(R_{i} ~\sqcup _{f}~ R_{j}\) produces a recordset Rij with schema \(S_{ij} = S^{*}_{i} \cup S^{*}_{j} \cup S^{\cap }_{ij}\) such that:

  • \(S^{*}_{i} = \{ a \in S_{i} : \nexists ~ a^{\prime } \in S_{j}~s.t.~a \equiv a^{\prime } \}\)

  • \(S^{*}_{j} = \{ a^{\prime } \in S_{j} : \nexists ~ a \in S_{i}~s.t.~ a \equiv a^{\prime } \}\)

  • \(S^{\cap }_{ij} = \{ rep(a) : a \in S_{i}, \exists ~ a^{\prime } \in S_{j}~s.t.~a \equiv a^{\prime } \}\)

Rij results in a full-outer join between Ri and Rj where the values of attributes linked by a mapping are merged through function ⊕. In particular, given a record \(r \in R_{ij}\) obtained by joining \(s \in R_{i}\) and \(t \in R_{j}\) (i.e., s[ak] = t[al]), it is \(r[rep(a)] = s[a] \oplus \varphi _{(a,a^{\prime })}(t[a^{\prime }])\) for each \(a \in S_{i}, a^{\prime } \in S_{j}\) such that \(a \equiv a^{\prime }\).

Example 7

With reference to Table 1, consider the merge between two product schemas, i.e., \(S_{1} ~\sqcup _{f_{1}}~ S_{3}\), and consider two records s and t, \(s \in C_{4}\) with schema S1, \(t \in C_{6}\) with schema S3. Let the values of ProductName be s[a6] = “Blueseventy Vision Goggles” and t[a8] = “B70 VG”. The merge of s and t produces a record r where \(r[rep(a_{6})] = s[a_{6}] \oplus t[a_{8}]\), and ⊕ is a conflict-resolution function that decides between “Blueseventy Vision Goggles” and “B70 VG”.
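A minimal sketch of the merge at a single nesting level, with full-outer-join semantics on the (mapped) key and conflict resolution on the shared attributes; the collection contents, the mapping dictionary, and the "longest value" resolution policy are illustrative. The recursive treatment of nested arrays (NeR) discussed below is omitted.

```python
def merge(recordset_i, recordset_j, key, mappings, resolve):
    """Merge two recordsets with full-outer-join semantics on the (mapped) key attribute.
    mappings: {attr_in_j: (representative_attr_in_i, transcoding)}; it must cover the key
    of recordset_j if it is named differently. resolve is the conflict-resolution function."""
    identity = lambda v: v

    def translate(record):
        # Express the attributes of recordset_j in the representative format of recordset_i.
        return {mappings.get(a, (a, identity))[0]: mappings.get(a, (a, identity))[1](v)
                for a, v in record.items()}

    merged = {r[key]: dict(r) for r in recordset_i}
    for t in map(translate, recordset_j):
        k = t[key]
        if k not in merged:
            merged[k] = t                    # record only in recordset_j: no loss of records
        else:
            for a, v in t.items():           # overlapping record: resolve conflicts attribute by attribute
                merged[k][a] = resolve(merged[k][a], v) if a in merged[k] else v
    return list(merged.values())

products_rel = [{"asin": "A1", "ProductName": "Blueseventy Vision Goggles"}]
products_wc  = [{"asin": "A1", "prodName": "B70 VG"}, {"asin": "A2", "prodName": "Swim Cap"}]
longest = lambda v1, v2: max(v1, v2, key=lambda v: len(str(v)))    # illustrative resolution policy
print(merge(products_rel, products_wc, "asin",
            {"prodName": ("ProductName", lambda v: v)}, resolve=longest))
# -> [{'asin': 'A1', 'ProductName': 'Blueseventy Vision Goggles'},
#     {'asin': 'A2', 'ProductName': 'Swim Cap'}]
```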

Ultimately, the merge operator is applicable to two collections Ci and Cj represented by the same entity (i.e., E = gran(Ci) = gran(Cj)) and it is declared as \(C_{i} ~\sqcup _{key(E)}~ C_{j}\). We further impose that the two collections share the same collection graph (i.e., \(CG_{C_{i}} = CG_{C_{j}}\)); this limitation is necessary (i) to enable the application of the merge operator to all entities in the two collections and (ii) to return a collection Cij with the same representation of the two input collections. In particular, depending on the representation of the collection graphs, we distinguish the following three situations.

  • If NoR, the result of the merge operation is trivial, as it simply produces a consistent recordset of E.

  • If NeR (i.e., it is \(E^{\prime } \rightarrow E\), where \(E^{\prime }\) is nested within E), the merge operation is recursively applied to the nested levels of the two collections. As per its definition, the merge on key(E) produces a single attribute for the primitive attributes in common between the two collections, while the array attributes ai and aj, which respectively contain the records of \(E^{\prime }\) in Ci and Cj, are kept distinct. Thus, the merge operation is also applied between the recordsets within ai and aj to produce a consistent recordset of \(E^{\prime }\) within a unique array aij. This is recursively applied if there is another entity \(E^{\prime \prime } \rightarrow E^{\prime }\), where \(E^{\prime \prime }\) is nested within \(E^{\prime }\).

  • If FlR (i.e., it is \(E \rightarrow E^{\prime }\) and the records contain attributes of both E and \(E^{\prime }\)), the merge cannot be applied, because each collection contains multiple records for the instances of \(E^{\prime }\) (with potentially conflicting values). Indeed, this operation would need to be expanded to a complex set of operations, where the instances of \(E^{\prime }\) are extrapolated from the respective collections, merged to resolve potential conflicts, and then joined back with the respective instances of E (see Forresi et al., 2021 for further details).Footnote 7

Example 8

Figure 8 shows the effects of the merge operation on two records in NeR with \(E_{or} \rightarrow E_{cu}\). First, the merge operation produces a consistent representation of Ecu; since two attributes with the same name cannot coexist, the orders arrays are renamed to orders1 and orders2. Then, the merge is applied on the two arrays to produce a consistent representation of Eor as well. Figure 9 shows an example of two records in FlR with \(E_{or} \rightarrow E_{cu}\) that cannot be merged as resolving conflicts of Ecu instances in the same collection would require a complex set of operations.

Fig. 8

Examples of merge operation on two records in NeR

Fig. 9

Examples of two records in FlR that cannot be merged

5.2 Entity Views

To simplify the discussion on query plans, we introduce the notion of entity views. An entity view (EV) is a runtime-computed collection that provides a standard representation for the records modeling a set of entities. It is called a view because it is not persisted; differently from a typical view, however, it is not exposed to the user, but it is used internally to generate query plans.

Being a collection, an EV is associated with a collection graph as well, but only in one of the three special patterns (i.e., EVs can be only in either normal, nested or flat form).

Definition 11 (Entity view)

An entity view is a collection χ whose records represent the features of a given set of entities in accordance with a schema representation. Its collection graph CGχ is such that rep(CGχ) ∈{NoR,NeR,FlR}.

An EV is either local or global. A local entity view (LEV) is obtained from collections belonging to the same database, thus it provides a partial representation of a set of entities. A global entity view (GEV) provides a complete and cleansed representation of a set of entities in the multistore; it can be obtained in two ways: either by merging LEVs represented by the same set of entities across all databases or through a join of GEVs.

Limiting EVs to the three schema representations imposes certain restrictions; for instance, collection Cj in Fig. 7 mixes different representations, thus it cannot be considered an EV without first applying some schema transformations. However, enforcing these schema representations enables the definition of an algebra of operations on the EVs. In particular, operations on entity views are defined through EV operators, i.e., macro-NRA operators (distinguished from simple ones by the hat \(\hat {}\) symbol) that embed a tree of NRA operations. Macro-NRA operators are high-level operators that aid the creation and discussion of execution plans.

5.2.1 LEV Creation

A LEV creation is defined as \(\hat {\pi }(\mathcal {C}_{\chi }, CG_{\chi }, F_{\chi }, p_{\chi })\). This operation creates a LEV χ defined by CGχ from a set of collections \(\mathcal {C}_{\chi }\) belonging to the same database. Notice that \(\hat {\pi }\) also embeds the optional application of selection predicates pχ; each \(p \in p_{\chi }\) is in the form (f,ω,v), where f is a feature, ω ∈{=,>,<,≠,≥,≤} and v is a value. The underlying tree of NRA operations mainly depends on how the collection graphs of the collections in \(\mathcal {C}_{\chi }\) differ from rep(CGχ). The complexity of the creation operation is non-trivial and it is discussed in full detail in the Appendix. Figure 10 shows some examples.

  • On the left are the operations to obtain \(\chi ^{\prime }\), i.e., a NeR view of Ecu and Eor from the relational collections C1 and C2. This requires joining the two collections (which produces a flat representation of Eor and Ecu) and then changing the collection’s granularity to Ecu by nesting the order records within an array attribute (i.e., orders) created for each customer. Conversely, no operation would be necessary to obtain such a view from C5, which is natively in NeR.

  • On the right are the operations to obtain \(\chi ^{\prime \prime }\), i.e., a NoR view of Ecu from the document-based collection C5. This requires unnesting the orders array attribute and projecting only the features corresponding to Ecu. Conversely, no operation would be necessary to obtain such a view from C2, which is natively in NoR.

Fig. 10

Examples of NRA operations corresponding to LEV creations
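A minimal sketch of the nest and unnest steps mentioned in the examples above, on plain Python records; the equi-join and the projection steps are standard and omitted.

```python
from collections import defaultdict

def nest(records, group_key, array_attr):
    """Nest records sharing the same group_key value into an array attribute
    (e.g., nest the order records of each customer into an 'orders' array)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append({a: v for a, v in r.items() if a != group_key})
    return [{group_key: k, array_attr: nested} for k, nested in groups.items()]

def unnest(records, array_attr):
    """Unnest an array attribute, producing one record per nested record
    (the inverse step, used e.g. to obtain a NoR view of customers from C5)."""
    out = []
    for r in records:
        for nested in r.get(array_attr, [{}]):
            flat = {a: v for a, v in r.items() if a != array_attr}
            flat.update({f"{array_attr}.{a}": v for a, v in nested.items()})
            out.append(flat)
    return out

flat_orders = [{"custId": 123, "orderId": "O1", "totalPrice": 49.9},
               {"custId": 123, "orderId": "O2", "totalPrice": 10.0}]
print(nest(flat_orders, "custId", "orders"))
# [{'custId': 123, 'orders': [{'orderId': 'O1', 'totalPrice': 49.9},
#                             {'orderId': 'O2', 'totalPrice': 10.0}]}]
```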

5.2.2 GEV Creation

A GEV creation is defined as \(\hat {\sqcup }~(X,p_{X})\), where X is a set of LEVs, |X|≥ 2, and pX is an optional conjunction of selection predicates defined as in Section 5.2.1. This operation creates a GEV \(\chi ^{\prime }\) by resolving conflicts between duplicated records from two or more LEVs sharing the same collection graph \(CG^{\prime }\), i.e., \(\forall ~\chi \in X\) it is \(CG_{\chi }=CG^{\prime }\). The details of the NRA plan produced by the GEV creation are discussed in the Appendix. Essentially, this macro-operator produces a left-deep tree of binary merge operations between LEVs. Once all LEVs have been merged, the optional selection predicates are applied.

5.2.3 Join of GEVs

A join of GEVs is defined as \(\hat {\bowtie }~(X)\), where X is a set of GEVs, |X|≥ 2. The obtained GEV \(\chi ^{\prime }\) is the result of join operations between the GEVs in X, all of which represent connected but non-overlapping sets of entities: on the one hand, given any two entity views χi ∈ X and χj ∈ X, it is \(\mathcal {E}_{\chi _{i}} \cap \mathcal {E}_{\chi _{j}} = \varnothing \); on the other hand, the sets of entities must be connected, i.e., \(\mathcal {E}_{\chi } = \bigcup _{\chi _{i} \in X}{\mathcal {E}_{\chi _{i}}}\) is a connected set in \(\mathcal {D}_{q}\). The details of the NRA plan produced by the entity view join are discussed in the Appendix. Similar to the merge, this macro-operator produces a left-deep tree of binary join operations between entity views. The result is a GEV that provides a cleansed representation of all the records in the multistore that are required to answer the query.

5.2.4 Optimization of Entity View Operations

The three entity view operations defined above implement some logical rules to produce an optimized NRA tree.

  • Despite the possible presence of multiple schemas in the same collection, the plans avoid accessing the same collection multiple times (unlike Forresi et al., 2021). In particular, mappings are exploited to produce predicates that take schema heterogeneity into consideration; for instance, a selection predicate on a feature f is translated to a disjunction of selection predicates on the attributes of attr(f) that appear in the schemas of the collection.

  • Predicate push-down is employed to apply selection predicates as close to the source as possible.

  • Column pruning is employed to project from each collection the attributes corresponding to features that are relevant for the query.

  • Whenever there is a need for joining (or merging) more than two collections (or entity views), a minimum selectivity heuristic (Steinbrunn et al., 1997) is adopted to determine the order of these join/merge operations; a sketch is given below. The basic idea is to start from the collection with the lowest cardinality and progressively merge it with collections of increasing cardinality. To estimate cardinalities we take into consideration the selection predicates and the statistics collected from the databases. The literature on such topics is very broad. The accuracy of the estimate strictly depends on the collected information and the assumptions made on data distribution. Following several query cost models, in this paper we assume uniformity of attribute values, attribute value independence, and join containment.
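A minimal sketch of the ordering heuristic under the stated assumptions; the statistics, the collection names, and the restriction to equality predicates are illustrative.

```python
def estimated_cardinality(c):
    """Estimate the output cardinality of a collection after its selection predicates,
    assuming uniform attribute values and attribute-value independence."""
    card = c["cardinality"]
    for _, distinct_values in c["eq_predicates"]:
        card *= 1.0 / distinct_values      # selectivity of an equality predicate under uniformity
    return card

def merge_order(inputs):
    """Minimum-selectivity heuristic: feed a left-deep tree of merge/join operations
    with its inputs ordered by increasing estimated cardinality."""
    return sorted(inputs, key=estimated_cardinality)

# Illustrative statistics for three inputs.
inputs = [
    {"name": "products_rel",   "cardinality": 10_000,  "eq_predicates": [("brand", 50)]},  # ~200 records
    {"name": "products_wc",    "cardinality": 8_000,   "eq_predicates": []},
    {"name": "orderlines_rel", "cardinality": 500_000, "eq_predicates": []},
]
print([c["name"] for c in merge_order(inputs)])   # ['products_rel', 'products_wc', 'orderlines_rel']
```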

6 Query Planning

The multistore supports the formulation of GPSJ queries, consisting of a group-by set (i.e., the features used to carry out an aggregation), one or more numerical features to be aggregated by some function (e.g., sum, average), and (possibly) some selection predicates.Footnote 8

Definition 12 (Query)

A GPSJ query q is a triple as q = (qπ,qγ,qσ), where: qπ specifies the optional set of features to be projected; qγ specifies optional aggregations as a set of pairs (f,op), where f is a numerical feature and op is an aggregation function (e.g., max()); qσ is an optional set of conjunctive (∧) selection predicates in the form of triplets (f,ω,v), where f is a feature, ω ∈{=,>,<,≠,≥,≤} and v is a value. At least one amongst qπ and qγ must be defined.

It is not mandatory that all three sets qπ, qγ and qσ are present, thus our definition also covers simple selection queries and join queries.

Example 9

Let q be the query that computes, for each customer, the average price of 2020 orders. The group-by set of q is qπ = {f11}; the aggregation set is qγ = {(f9,avg)} and the selection predicate set is qσ = {(f10,≥,“2020/01/01”)}.
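A minimal sketch of how the query of Example 9 could be encoded as a triple, using the feature identifiers from Table 1; the data structure is illustrative and not the system's actual API.

```python
# q = (q_pi, q_gamma, q_sigma): average price of 2020 orders, for each customer.
query = {
    "q_pi":    {"f11"},                              # group-by set (the customer feature)
    "q_gamma": {("f9", "avg")},                      # (numerical feature, aggregation function)
    "q_sigma": {("f10", ">=", "2020/01/01")},        # conjunctive selection predicates (f, omega, v)
}

# At least one among q_pi and q_gamma must be defined (Definition 12).
assert query["q_pi"] or query["q_gamma"]
```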

Building the plan of a query first requires identifying the entities that need to be accessed, which are not limited to those containing the features selected in the query. For instance, computing the average price for each customer requires accessing not only Ecu and Eol but also Eor, even if no feature belonging to Eor is mentioned in the query. Thus, we define the query graph as a conceptual view of the query on the dataspace.

Definition 13 (Query graph)

The query graph QGq is an acyclic subgraph of \(\mathcal {D}\) (i.e., \(QG_{q} = (\mathcal {E}_{q} \subseteq \mathcal {E}, L_{q} \subseteq L)\)) such that:

  • (i) QGq is minimally connected;

  • (ii) \(\mathcal {E}_{q} \supseteq attr(q)\);

  • (iii) \(\exists ~ E^{*} = fact(QG_{q}) : feat(E^{*}) \supseteq feat(q_{\gamma })\).

Condition (i) ensures that no unnecessary entity is accessed. Condition (ii) ensures that all attributes belonging to the features involved in the query are covered by the entities in \(\mathcal {E}_{q}\). Condition (iii) entails the compliance of query q with the GPSJ semantics, that is, there exists an entity representing the events at the finest level of granularity (i.e., there is a single fact E and the features in qγ belong to feat(E)). Many subgraphs could exist for a given query since multiple (many/one)-to-one paths could exist, each associated with different semantics (e.g., an entity of sales could be associated with an entity of dates through the mappings on both date of sale and date of shipping); in this case, we rely on user interaction to identify the adequate subgraph. If no subgraph exists, the query is not answerable.

6.1 Enumeration of Query Plans

QGq is the starting point to build the query plan, which is defined in terms of entity view operations.

Definition 14 (Query plan)

A query plan P is a rooted tree of entity view operations, where (i) the root is a GEV join operation (\(\hat {\bowtie }\)), (ii) the root is preceded by one or more GEV creation operations (\(\hat {\sqcup }\)), and (iii) each of the latter is preceded by one or more (parallel) LEV creation operations (\(\hat {\pi }\)). The root is possibly extended with an NRA aggregation operation (γ).

Example 10

Figure 11 shows a sample plan for a query that computes, for each gender, the average quantities bought for products of brand “BrandABC”; operators’ predicates are omitted for space reasons. In the upper part, two LEV creation operations compute an entity view in NeR with customers, orders, and order lines records from the collection in the document-based database (i.e., C5) and the tables in the relational one (i.e., C1 to C3), respectively; in particular, the latter is the one hiding the most complexity, as multiple join and nest operations are required to compute the NeR representation. The two LEVs are then merged in a GEV creation operation, which returns a cleansed FlR representation of the same data and projects only the features required by subsequent operations. Similarly, in the lower part, two other LEV creation operations compute an entity view in NoR with product records from C6 and C4, respectively. The subsequent GEV creation operation merges the products and applies the filter on the reconciled records. Ultimately, the GEV join operation combines the produced GEVs, while the aggregation operation computes the final result.

Fig. 11

Example of a full query plan. Data flows are represented by full arcs while dotted lines link collections and EV to sample data

Several query plans can be devised for the same query. For readability purposes, the full enumeration process is detailed in the Appendix. The factors that determine the number of alternative query plans and that drive the enumeration process are summarized below.

  #1 LEV creation. As discussed in Section 5.2.2, a GEV creation consists of left-deep trees of merge operations involving all LEVs with the same collection graph. Thus, different plans can be defined by choosing different schema representations to create the LEVs of a given GEV. Let χ be a GEV with \(CG_{\chi } = (\mathcal {E}_{\chi },L_{\chi })\).

    • If \(\vert \mathcal {E}_{\chi } \vert = 1\), then LEVs are created in NoR (by definition, only NoR supports a single entity in \(\mathcal {E}_{\chi }\)).

    • If \(\vert \mathcal {E}_{\chi } \vert > 1\), then LEVs can be created either in FlR or NeR, depending on the assumptions from Section 5.2: FlR is possible only if \(\not \exists E \in \mathcal {E}_{\chi } : \phi (E)=true\) (because the merge operation requires EVs in NoR or NeR); NeR is possible only if there is a single path of (many/one)-to-one relationships in \(CG_{\chi }\) (which is a structural requirement for the NeR pattern).

  #2 GEV creation. A query plan may include several alternative combinations of GEVs. These combinations can be found by partitioning \(QG_{q} = (\mathcal {E}_{q}, L_{q})\) into one or more subgraphs and creating a GEV for each subgraph (a sketch of this enumeration is given after this list). The number of possible combinations is \(2^{\vert L_{q} \vert }\). Fortunately, not all combinations are feasible: by definition, a GEV creation requires all corresponding LEVs to share the same collection graph, which is not always possible (see Example 11).

  #3 LEV allocation. Each LEV creation operation can be executed either directly by the middleware or pushed down to the database storing the respective data; Section 7 will show that both options can be efficient, depending on several factors. As the computation of each LEV is independent of the others, the number of possibilities is 2y, where y is the number of LEVs to be created.
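A minimal sketch of factor #2, i.e., enumerating the \(2^{\vert L_{q} \vert }\) ways of partitioning a query graph into GEVs: every subset of kept edges induces one GEV per connected component. The entities and edges are those we assume for the query graph of Fig. 12; the feasibility check and factors #1 and #3 are omitted.

```python
from itertools import combinations

def connected_components(entities, edges):
    """Connected components of the (sub)graph, ignoring edge direction."""
    adj = {e: set() for e in entities}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    comps, remaining = [], set(entities)
    while remaining:
        comp, frontier = set(), [remaining.pop()]
        while frontier:
            n = frontier.pop()
            comp.add(n)
            frontier.extend(adj[n] - comp)
        remaining -= comp
        comps.append(comp)
    return comps

def gev_combinations(entities, edges):
    """Yield, for every subset of kept edges, the corresponding partition into GEVs."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for kept in combinations(edges, k):
            yield connected_components(entities, kept)

query_graph_edges = {("Eol", "Eor"), ("Eor", "Ecu"), ("Eol", "Epr")}
for partition in gev_combinations({"Eol", "Eor", "Ecu", "Epr"}, query_graph_edges):
    print(partition)      # 2^3 = 8 combinations of GEVs, before the feasibility check
```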

Example 11

The boxes in Fig. 12 correspond to all possible query plans for the query graph on the left. For a given box, each subgraph corresponds to a GEV, where the coloring scheme defines its collection graph. Horizontally, the plans differ in the way the query graph is partitioned into GEVs (#2); vertically, the plans differ in the schema representation chosen for the given GEVs (#1); factor (#3) is not shown in the figure. Plans with an orange background are unfeasible: as shown in Fig. 3, there is no collection in the document-based database described by Epr and there is no collection in the wide-column database described by Eol. This makes it impossible to create a LEV χ such that \(\mathcal {E}_{\chi } \subseteq \{ E_{pr}, E_{ol} \}\) on either database, which is a requirement to create the corresponding GEV.

Fig. 12 A query graph (left) and all the possible plans, shown with the corresponding combinations of GEVs (right). Unfeasible plans are in orange; recall that bold circles correspond to fact entities

7 Cost Model

Enumerating all the feasible query plans is useless unless we can identify the one with the lowest execution time. Identifying the most efficient among the large number of plans obtainable for a single query q is particularly challenging in a multistore due to:

  • DBMS heterogeneity. In a multistore, several DBMSs coexist, each with its own data model and capabilities. A multistore cost model must cover all of them.

  • DBMS variability. In a big data and pay-as-you-go context, variability is the norm. New databases may be added, and DBMS resources (e.g., number of CPU cores, amount of RAM) may change over time.

In Forresi et al. (2021), we relied on existing literature to model the cost of each NRA operation on each engine in terms of read and written disk pages. While this worked well on the simple example considered by Forresi et al. (2021), (i) it did not consider the amount of resources assigned to each engine, (ii) it made simplistic assumptions about the parallelization of the computation in a distributed engine, (iii) it considered execution costs related to disk I/O only, and (iv) it required advanced knowledge of the internal details of each engine and the related algorithms, which reduces its extensibility.

In this paper, we overcome these limitations by proposing a self-learning cost model, which implicitly captures the aforementioned aspects without requiring explicit and complex modeling of execution costs. Inspired by Golfarelli et al. (2019), the cost model is composed of a set of multi-regression models H = {h0(),...,hn()}, one for each of the n execution engines composing the multistore plus one for the middleware, denoted by h0(). The query plan P is partitioned into a set of subplans SP, each corresponding to the execution of an entity view operation on an engine (including the middleware). A multi-regression model \(h_{eng(P^{\prime })}(P^{\prime })\) estimates the execution time for the subplan \(P^{\prime }\) on the corresponding engine \(eng(P^{\prime })\) based on a plan profile. The list of the features composing the profiles is reported in Table 3; for each feature, the domain and the supported engines are indicated (e.g., the number of join operations is not captured on wide-column engines, since they do not support this operation). Some of these features are directly obtained from the plan (e.g., the number of unnest operations embedded in a \(\hat {\pi }\) or \(\hat {\sqcup }\) operation), while others also require basic statistics on the local databases (e.g., the collections’ cardinalities and attributes’ histograms to compute selectivity and aggregation rate, or the presence of indexes to infer whether the engine will be able to exploit them). The execution time for P is estimated by composing the execution times of its subplans SP as follows:

$$ \begin{array}{@{}rcl@{}} Time(P) &=& {\sum}_{P^{\prime} \in SP \mid eng(P^{\prime})=0} h_{0}(P^{\prime})\\ &&+ \max_{i \in [1,n]} {\sum}_{P^{\prime} \in SP \mid eng(P^{\prime})=i} h_{i}(P^{\prime}) \end{array} $$
Table 3 The features of the regression models; the engine’s legend is R=Relational, D=Document-based, W=Wide-column, K=Key-value, M=Middleware

The assumption behind taking the maximum execution time over the single engines is that computations on the local databases are carried out concurrently (indeed, all subplans are launched at the same time from the middleware and are assumed to be executed sequentially within each engine). Thus, the cost model explicitly encodes inter-engine parallelism by considering only the slowest engine as the one contributing to the plan duration. Conversely, we assume that the computation on the middleware requires all local computations to be completed first; this is a worst-case simplification, as, depending on the actual plan, the global computation could start processing the results of the first local computation to complete. Notably, intra-engine parallelism generated by multi-core execution is transparently captured by the regression models.
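A direct transcription of this composition, assuming each subplan has already received an engine identifier and a time estimate from the corresponding regression model, might look as follows; this is a minimal sketch, and the Subplan structure is a placeholder.

```scala
// Minimal sketch of the Time(P) composition above; Subplan and its fields
// are hypothetical stand-ins for the planner's internal structures.
object CostComposition {
  case class Subplan(engine: Int, estimatedTime: Double) // engine 0 = middleware

  def planTime(subplans: Seq[Subplan], nEngines: Int): Double = {
    // Middleware work starts only after all local computations have finished
    val middleware = subplans.filter(_.engine == 0).map(_.estimatedTime).sum
    // Local engines run concurrently; the slowest engine bounds the local phase,
    // while subplans within a single engine are assumed to run sequentially
    val slowestLocal = (1 to nEngines)
      .map(i => subplans.filter(_.engine == i).map(_.estimatedTime).sum)
      .foldLeft(0.0)(math.max)
    middleware + slowestLocal
  }
}
```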

Figure 13 reports, for the multistore used in our tests, an excerpt of the regression tree for the middleware (i.e., Spark) showing the most important features determining the execution time. The full regression tree includes 98 leaves arranged in 5 levels. Each leaf estimates the execution time for the portion of the feature space defined by the split nodes. Recall that the advantage of local regression approaches (Baldacci et al., 2016; Loader, 2006) is that they enable the fitting of complex feature spaces without resorting to global regression functions, which are inherently difficult to tune. To reach this goal, several simple regression functions, each modeling a small part of the data space, are combined to create a complex model.

Fig. 13 An excerpt of the regression tree defining the cost model for the middleware engine
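As a toy illustration of this idea, a regression tree can be seen as a set of simple local estimates selected by a sequence of splits on the profile features; the structure, feature names, and values below are made up for illustration only.

```scala
// Toy regression-tree sketch: internal nodes split the feature space on a profile
// feature, leaves return a local time estimate. All names and values are made up.
object LocalRegressionSketch {
  sealed trait RegTree
  case class Split(feature: String, threshold: Double,
                   left: RegTree, right: RegTree) extends RegTree
  case class Leaf(estimateSeconds: Double) extends RegTree

  def estimate(tree: RegTree, profile: Map[String, Double]): Double = tree match {
    case Leaf(seconds)       => seconds
    case Split(f, t, lo, hi) =>
      if (profile.getOrElse(f, 0.0) <= t) estimate(lo, profile) else estimate(hi, profile)
  }

  // Example: a two-level tree splitting on the number of records and the selectivity
  val tree: RegTree = Split("numberOfRecords", 1e6,
    Split("selectivity", 0.1, Leaf(2.0), Leaf(6.0)),
    Leaf(15.0))
}
```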

Model training consists in capturing, for each execution engine, the execution times of each entity view operation from a wide set of queries, manually defined by varying (i) the number of involved entities, (ii) the selectivity of the selection predicates, and (iii) the strength of the final aggregation (further details in Section 8.3). Entity view operations are executed in random order and repeated 5 times to smooth out occasional outliers. Query profiles are then created and pre-processed: binning is applied to features with a wide range of values (i.e., “Selectivity”, “Aggregation rate”, and “Number of records”); one-hot encoding is used to convert categorical features to numbers; features with a narrow range of values are left unmodified (i.e., all “Number of operations” features). Finally, the regression induction algorithm is fed with the training sets to create the regression models.
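A sketch of this pre-processing step, assuming equal-width binning and a fixed list of categories; the bin count, feature ranges, and encoding below are illustrative choices rather than the exact pipeline.

```scala
// Hypothetical pre-processing of a query profile before regression-tree induction;
// bin edges, bin counts, and category lists are illustrative assumptions.
object ProfilePreprocessing {
  // Equal-width binning for wide-range features such as selectivity or aggregation rate
  def bin(value: Double, min: Double, max: Double, nBins: Int): Int = {
    val clamped = math.min(math.max(value, min), max)
    math.min(((clamped - min) / (max - min) * nBins).toInt, nBins - 1)
  }

  // One-hot encoding for categorical features (e.g., the schema representation)
  def oneHot(value: String, categories: Seq[String]): Seq[Int] =
    categories.map(c => if (c == value) 1 else 0)

  // Example profile encoding: selectivity binned into 10 buckets, the representation
  // one-hot encoded, and operation counts left as they are
  def encode(selectivity: Double, repr: String, nJoins: Int): Seq[Double] =
    (Seq(bin(selectivity, 0.0, 1.0, 10).toDouble) ++
      oneHot(repr, Seq("NoR", "NeR", "FlR")).map(_.toDouble)) :+ nJoins.toDouble
}
```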

Model drift (typically due to a variation in the amount of engine resources) is detected when the absolute relative error of the estimate becomes higher than a given threshold. The training queries for the drifted engine are then re-executed and the corresponding regression tree is rebuilt. Similarly, if a new database is added to the multistore, a new set of queries must be devised and an additional regression tree must be trained.
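A minimal sketch of this drift check, assuming a configurable error threshold; the default value below is an arbitrary example, not one reported in the paper.

```scala
// Illustrative drift check for one engine's regression model; the default
// threshold is an assumed example value.
object DriftDetection {
  def relativeError(estimated: Double, actual: Double): Double =
    math.abs(estimated - actual) / actual

  // When drift is detected, the training queries for that engine are re-executed
  // and its regression tree is rebuilt.
  def isDrifted(estimated: Double, actual: Double, threshold: Double = 0.25): Boolean =
    relativeError(estimated, actual) > threshold
}
```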

8 Experiments

8.1 Prototype

The reference architecture is a two-rack big data cluster of 18 Ubuntu machines with a minimum configuration of i7 8-core CPU @3.2GHz, 32GB RAM, and 6TB hard disk drives. Each machine runs the Cloudera Distribution for Apache Hadoop (CDH) 6.2.0. The multistore implementation relies on PostgreSQL, MongoDB, Cassandra, and Redis as relational, document-based, wide-column, and key-value databases, respectively. PostgreSQL is installed on a single machine, while NoSQL stores are distributed across 5 machines. The middleware (including the query planner, the cost model, and the execution framework) is implemented as an Apache Spark application; Spark is one of the most used open-source execution frameworks for Apache Hadoop clusters, providing connectors to most databases (including those in our multistore). The Spark application runs with 4 executors, each with 4 CPU cores and 8GB RAM.

The user interacts with the multistore through a set of REST APIs to visualize and maintain the dataspace and to submit GPSJ queries. Plain Scala is used to generate the regression models for each execution engine (relying on the Weka libraries), as well as to build the query graph (relying on the Graph for Scala library) and to generate the execution plans in sub-second time (Forresi et al., 2021). Queries are formulated through the SQL APIs exposed by Spark’s DataFrame abstraction in two steps: (i) a DataFrame is initialized for each subplan assigned to a local database by making a custom call to the database system’s APIs, thus overcoming the current limitations of Spark that prevent the push-down of a custom set of operations (Delaney & Li, 2017); (ii) the subplans assigned to the middleware are formulated through the SQL APIs, where User Defined Functions (UDFs) have been defined to implement the NRA extensions.
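As an illustration of steps (i) and (ii), the following spark-shell-style sketch shows how a relational subplan could be pushed down through a JDBC subquery and how a UDF could stand in for an NRA extension; the connection details, SQL text, column names, and UDF body are placeholders and not the actual implementation.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch of the two-step query formulation; URLs, credentials, SQL text,
// and the UDF body are placeholders, not the paper's actual code.
val spark = SparkSession.builder().appName("multistore-sketch").getOrCreate()

// (i) Push a subplan down to PostgreSQL by wrapping it in a JDBC subquery,
// so that the joins and filters of the LEV creation run inside the database.
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
val levRelational = spark.read.jdbc(
  "jdbc:postgresql://host:5432/db",
  "(SELECT c.id, c.gender, o.order_id FROM customer c JOIN orders o ON o.cust_id = c.id) AS lev",
  props)
levRelational.createOrReplaceTempView("lev_relational")
// The document-side LEV would be registered analogously, e.g. as "lev_document".

// (ii) Middleware subplans are expressed through the SQL APIs; a UDF implements
// an NRA extension (here, a placeholder merge that resolves conflicting values).
spark.udf.register("resolveConflict", (a: String, b: String) => if (a != null) a else b)
val merged = spark.sql(
  """SELECT resolveConflict(l.gender, r.gender) AS gender, l.order_id
     FROM lev_relational l JOIN lev_document r ON l.id = r.id""")
```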

8.2 Multistore Benchmarks

The multistore in Fig. 3 is a variation of the multi-model benchmark Unibench (Zhang et al., 2018), which has been extended to inject schema heterogeneity and to introduce overlapping records in different databases. Since the choice of the best query plan is determined by several factors (see Section 7), different benchmarks of the same multistore have been generated. In particular, the benchmarks vary in the following parameters.

  • Overlap: the presence/absence of overlap for different entities determines different workloads with different execution plans; as the preliminary results by Forresi et al. (2021) showed, FlR is favored in the absence of overlap, while NeR is favored in its presence. The overlap parameter defines, for each benchmark, the set of entities \(\mathcal {E}_{\phi }\) for which data fusion activities must be carried out (i.e., \(\forall ~E \in \mathcal {E}_{\phi }\) it is ϕ(E) = true). The overlap rate is set to 60% for Epr, and to 20% for both Eor and Ecu.

  • Data skewness: the convenience of pushing down computations to local databases is partly determined by the amount of data that must be processed by each database. Thus, different benchmarks are created with an unbalanced distribution of records, so as to put more pressure (i.e., more data) on different databases. Given the query expressiveness limitations of Cassandra and Redis (where no significant operations can be pushed down), this variation is applied to MongoDB and PostgreSQL. Data skewness is determined by features f10 and f14, so as to enable the formulation of query predicates that select different amounts of data from either database.

The summary of the generated versions is shown in Table 4. Remarkably, the main goal of the paper is not to prove the scalability of the multistore (which is already addressed by Forresi et al. (2021)) but rather to evaluate the performance of the different execution plans in reasonable times. Nonetheless, all multistores are deployed with scale factors 1, 10, and 100 to consider scalability as well. Scale factor 1 determines a total of 300K order lines relative to 10K customers and 10K products, which offers a good trade-off between a sufficient cardinality and non-excessive execution times for the experimental evaluation. The data used to populate the collections in the different databases is available at https://big.csr.unibo.it/multistore.

Table 4 Generated multistore benchmarks with different entity overlapping settings and different data skewness

8.3 Cost Model Evaluation

The tests in this section are aimed at assessing the quality of the cost model in choosing the query plan for a given query. Given the expressiveness of GPSJ queries (Golfarelli et al., 1998), the workload has been devised by varying the queries in terms of selectivity, group-by strength, and number of entities involved in the query.

  • The group-by set is either absent (i.e., only a simple projection is carried out, without aggregation), weak (i.e., it involves features with high cardinality, resulting in several groups), or strong (i.e., it involves features with low cardinality, resulting in few groups).

  • The selection predicate is either absent, weak (i.e., its selectivity is low; for instance, a filter on a year of data), or strong (i.e., its selectivity is high; for instance, a filter on a given date).

  • The number of entities involved in the query varies from 1 to all 5 of them.

By varying these components, we obtain 54 queries, 48 of which are used to train the cost model and the remaining ones for testing. Details about all queries are available at https://big.csr.unibo.it/multistore. Overall, the estimated execution times differ from the actual ones by 9.4% on average; this is sufficient to correctly estimate the best execution plan for a query.

Figure 14 shows the average number of plans generated by the cost model with an increasing number of entities involved in the query. As discussed in Section 6.1, the enumeration of query plans is subject to exponential factors that translate into an increasing (but non-exponential) number of plans generated as more entities are queried. Nonetheless, their enumeration does not have a great impact on performance: besides the time required for query graph generation and partitioning (75ms), the generation and cost estimation of a single plan accounts for approximately 2ms. Overall, the cost model always runs in sub-second times, with an average of 210ms.

Fig. 14 Average number of enumerated plans with increasing number of queried entities

The effectiveness of the cost model is measured against 5 selected baseline strategies:

  • RCL corresponds to the hypothetical oracle that always chooses the optimal plan;

  • PRV corresponds to our previous multistore implementation (Forresi et al., 2021), where LEVs were always created in NoR and the computation was not pushed down to the databases;

  • NOB, NEB, and FLB adopt a simple strategy to choose the plan that maximizes both computation push-down and the creation of LEVs in a given schema representation (respectively NoR, NeR, and FlR).

OPT refers to the optimized multistore discussed in this paper. Table 5 indicates the average time increase between the execution time of the plan chosen by each strategy and that of the optimal plan (RCL is not included here, as the time increase would obviously be 0). The data shows that OPT consistently outperforms every other baseline throughout all benchmarks. In more detail, Fig. 15 separately compares OPT with every baseline to determine how many times OPT chooses a plan that is better than the one chosen by the compared baseline. Again, OPT emerges as a clear winner against all baselines, and in particular it proves to be a big step forward with respect to Forresi et al. (2021); the bars of FLB and NEB do not reach 100% due to the impossibility of creating LEVs in FlR or NeR in some queries (e.g., those involving a single entity) or benchmarks (as FlR is incompatible with the merge operation).

Table 5 Average overhead of the chosen plan with respect to the optimal one
Fig. 15 Pair-wise comparison between plan selection strategies

The percentage of ties between OPT and RCL (i.e., 25.9%) indicates that there is room for improvement. However, further investigation reveals that, when the optimal plan is missed, OPT mostly returns a near-optimal one. This is confirmed by Fig. 16, which shows, for each strategy, the percentage of times that the chosen plan is among the top-K most efficient ones; it is observable here that OPT reaches a high accuracy more quickly than the other baselines.

Fig. 16 Percentage of plans chosen among the top-K for every plan selection strategy

Figure 17 shows how the overhead of OPT and the baselines increases with the scale factor. Noticeably, overheads tend to be higher at higher scale factors; this is mainly due to a performance degradation on the RDBMS. Remarkably, although OPT is affected by this increase, it improves its advantage over all baselines.

Fig. 17 Overhead increase with increasing scale factors

Finally, Fig. 18 reports the average overhead of OPT with respect to different levels of training completeness, i.e., obtained by progressively depriving the cost model of the collected training data. Essentially, this corresponds to testing the accuracy of the cost model with queries that are increasingly distant from the training queries. The reported values are averaged over 5 experiments where different random samples of data are removed. The experiment shows that OPT rapidly converges to a low overhead and provides good effectiveness even when the training data is very limited.

Fig. 18 Average overhead with different levels of training completeness

8.4 Query Plans Evaluation

The tests in this section analyze the factors that impact query performance. For this purpose, we collected the execution times of every possible execution plan generated by the query planner for the same set of queries from Section 8.3.

Figure 19a shows the average execution time decrease (in %) obtained by varying the percentage of operations pushed down to MongoDB (D) and PostgreSQL (R). The decrease is measured against the execution time in the absence of push-down. Given a plan P, let SPD and SPR be the sets of subplans corresponding to LEV creation operations where the data is taken from the document-based and the relational database, respectively; also, let nra() be a function that returns the number of NRA operations in a given subplan. Then, the x and y coordinates of P in Fig. 19a respectively correspond to \(\frac {nra(\{ P^{\prime } \in SP_{\texttt {R}} ~:~ eng(P^{\prime }) = \texttt {R} \})}{nra(SP_{\texttt {R}})}\) and \(\frac {nra(\{ P^{\prime } \in SP_{\texttt {D}} ~:~ eng(P^{\prime }) = \texttt {D} \})}{nra(SP_{\texttt {D}})}\). Figure 19b shows the average execution time (in seconds) at different amounts of data (in MB) transferred from the local database (MongoDB above, PostgreSQL below) to Spark. A thorough evaluation of the single execution plans led to the following findings.

  • Push-down on MongoDB emerges as a generally convenient solution. The main reasons are (i) the reduced amount of data moved from MongoDB to Spark, and (ii) Spark operating less efficiently on nested data. First, applying filters at the middleware level implies transferring a (potentially) large amount of data that will be discarded later; this is particularly true for the application of filters to nested attributes, as pushing down this operation accounts for an average time decrease of 33%. Figure 19b shows that there is a correlation between the amount of moved data and the execution time. Additionally, we observe that MongoDB is much faster than Spark in handling unnest operations (required to compute LEVs in NoR or FlR from the nested collection C5) and in projecting features out of nested attributes. The latter is especially observable in plans where the computation pushed down to MongoDB is minimal (because computing the LEV in NeR from C5 for Ecu, Eor, and Eol requires no extra computation besides the projection of features from attributes), yet there is an average 24% decrease with respect to carrying out the same operations on Spark.

  • No clear trend emerges on PostgreSQL. On the one hand, the push-down of computation implies less data movement and a better exploitation of indexes and statistics; on the other, Spark is a more powerful engine and is able to exploit operation pipelining to minimize the amount of data shuffling. Unlike MongoDB, Fig. 19b shows that execution times are not directly correlated with the amount of moved data. Interestingly, we have observed a performance decrease in query plans with LEVs on relational data where a large number of operations is split between the database and the middleware (however, this is not reflected in the figure due to the averaging with other query plans).

Fig. 19 On the left, the average execution time decrease (in %) at different percentages of operations pushed down to PostgreSQL (R) and MongoDB (D); the decrease is measured against the execution time in the absence of push-down (i.e., the bottom-left corner, measuring 0%). On the right, the average execution time (in seconds) at different amounts of data (in MB) transferred from the local database (MongoDB above, PostgreSQL below) to Spark

The results in Fig. 19a depend on the resources available to each execution engine. In Fig. 20 we measure this dependency by reducing the amount of resources assigned to the local databases: in particular, we move PostgreSQL to a single-core (versus the original 8-core) machine and we consider a non-sharded implementation of MongoDB’s collections (versus the sharding on 5 machines). The results show, for each database, the average time increase (in %) when running the same sets of plans on the less powerful version of that database. As expected, pushing computation down becomes a less favorable choice. Notice that execution times increase even at 0% of computation push-down; this is because data still needs to be read from the database (which is obviously slower in the less powerful version).

Fig. 20 Average execution time increase with fewer resources assigned to PostgreSQL or MongoDB

A final consideration concerns the usage of different schema representations in LEV creation operations. Interestingly, in 86% of the queries executed on benchmarks with most data on the document-based database (i.e., MS4 to MS6), the best query plan mostly relies on NeR to create LEVs. In the other benchmarks, the schema representation most adopted by the best query plan also depends on the level of overlap: in case of medium/high overlap (i.e., MS2 and MS3), NeR is chosen for 87% of the queries; in case of low overlap (i.e., MS1), NoR and FlR are chosen for 75% of the queries. These results are consistent with previous findings (Forresi et al., 2021) stating that (i) the need to resolve record overlapping influences the choice of the schema representation, and (ii) preserving the original modeling of the data usually translates into faster executions. Ultimately, the absence of a clear winner further validates the need for a cost model to choose an efficient execution plan.

9 Conclusions

In this work, we have proposed a cost-based optimization of execution plans in a multistore by devising and evaluating different strategies to carry out joins and data fusion in the presence of data replication. The execution plans are generated in terms of a multistore algebra extended from NRA and are based on different schema representations, so as to possibly take advantage of the original modeling of the data in the local databases. Experiments have been carried out on different multistore benchmarks to investigate the factors that drive the performance of different execution plans, demonstrating that the variety of these factors motivates the evaluation of alternative plans. The plan to execute for a given query is chosen by a black-box self-learning cost model; its effectiveness is shown by the experiments, which demonstrate its success in optimizing query execution with respect to Forresi et al. (2021).

Future work is planned to take several paths. First, we plan to evolve the multistore into a more complex data platform, extending its functionalities from simple querying to data search and profiling, provenance investigation, and orchestration of application pipelines (Francia et al., 2021). Second, the efficiency of our multistore system can be further improved by considering the push-down of data aggregation to the local databases; this translates into the creation of new execution plans that need to be evaluated by the cost model as well, and the findings of this paper concerning the factors that drive execution plans’ performance can serve as a basis to define rules or heuristics that reduce the number of generated execution plans. Finally, we plan to complete the multistore’s functionalities by adding support for the graph data model.