Theory of Computing Systems

Volume 61, Issue 1, pp 233–260

Optimal Broadcasting Strategies for Conjunctive Queries over Distributed Data


Abstract

In a distributed context where data is dispersed over many computing nodes, monotone queries can be evaluated in an eventually consistent and coordination-free manner through a simple but naive broadcasting strategy which makes all data available on every computing node. In this paper, we investigate more economical broadcasting strategies for full conjunctive queries without self-joins that only transmit a part of the local data necessary to evaluate the query at hand. We consider oblivious broadcasting strategies which determine which local facts to broadcast independent of the data at other computing nodes. We introduce the notion of broadcast dependency set (BDS) as a sound and complete formalism to represent locally optimal oblivious broadcasting functions. We provide algorithms to construct a BDS for a given conjunctive query and study the complexity of various decision problems related to these algorithms.

Keywords

Coordination-free evaluation · Conjunctive queries · Broadcasting

1 Introduction

We assume the setting introduced in the context of declarative networking [6, 14], where queries are specified on a logical level over a global schema and are evaluated by multiple computing nodes over which the input database is distributed. These nodes can perform local computations and communicate asynchronously with each other via messages. The model then operates under the assumption that messages can never be lost but can be arbitrarily delayed.

It is known that every monotone query can be evaluated in an eventually consistent and coordination-free manner through a naive broadcasting strategy that makes all data available to all nodes [14]. Indeed, every computing node sends all its local data to every other node and reevaluates the query every time new data arrives. This evaluation is eventually consistent as, because of monotonicity, no facts will be derived which later have to be retracted and, furthermore, when all transmitted data has arrived, the output of every node will correspond to the result of the query. In addition, the computation requires no coordination between the nodes.

Obviously, the above strategy leads to a very wasteful evaluation, as the whole database is sent to every node and every node independently computes the complete answer for the targeted query. In the present paper, we are interested in more economical broadcasting strategies where only a subset of the local data is transmitted and where each computing node contributes to the answer of the query by outputting only a subset of the answer tuples. The result of the query then is the union of the tuples output by the computing nodes. In particular, we focus on full conjunctive queries without self-joins and we consider oblivious broadcasting strategies where every computing node determines which facts will be broadcast solely on the content of its own local database (so, oblivious of the data at other nodes). By the latter we particularly mean the initial local database. Our strategies are thus independent of incoming messages and can be thought of as ‘single-shot’ broadcasting strategies.

The sent facts are referred to as broadcast facts. Facts that are not initially broadcast are called static. We illustrate the ideas behind such strategies by means of an example.

Example 1

Let Q1 be the query Q1(x,y,z)←A(x,y),B(y,x),C(x,z) and let I={A(1,2),A(2,2),B(2,1),B(2,2),B(4,4),C(1,3)} be a database instance. Consider a network of two computing nodes c and c′ containing the facts I(c)={A(2,2),B(2,1),B(2,2)} and I(c′)={A(1,2),B(4,4),C(1,3)}, respectively.

Naive broadcasting strategy

The naive broadcasting algorithm outlined above sends all facts in I(c) to c′ and all facts in I(c′) to c. Eventually, both c and c′ receive all data and both of them compute the result of the query, that is, Q1(I)={(1,2,3)}.

Improved oblivious broadcasting strategy

The strategy just described is clearly oblivious but also rather wasteful. Therefore, consider the following strategy, which broadcasts all of the C-facts but none of the A-facts. Furthermore, a B-fact B(i,j) is broadcast only when A(j,i) does not occur in the local database. Executing this strategy at every computing node in our example results in c broadcasting the set {B(2,1)} while c′ broadcasts {B(4,4),C(1,3)}. So, eventually, I′(c)={A(2,2),B(2,1),B(2,2),B(4,4),C(1,3)} and I′(c′)={A(1,2),B(2,1),B(4,4),C(1,3)}. Here, we denote by I′(d) the instance at node d when all transmitted messages have arrived. Therefore, Q1(I′(c))=∅ and Q1(I′(c′))={(1,2,3)}, and Q1(I) equals Q1(I′(c))∪Q1(I′(c′)). Intuitively, this strategy is correct in general as the following invariant holds for every computing node d: when a fact B(i,j) is not broadcast at d, then every satisfying valuation V for Q1 on I that maps (y,x) to (i,j) can be realized locally in I′(d). Notice that a similar strategy reversing the roles of A- and B-facts would work as well.
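The improved strategy can be replayed on the instance of this example with a few lines of Python. This is only an illustrative sketch: the encoding of facts as (relation, tuple) pairs, the node labels `c` and `c_prime`, and the function names are assumptions made here and are not part of the formal model.

```python
# Facts are (relation, (v1, v2)) pairs; every relation in Example 1 is binary.
def broadcast(local):
    """Improved strategy: all C-facts, no A-facts, and B(i, j) only
    when A(j, i) is absent from the local instance."""
    sent = set()
    for rel, (u, v) in local:
        if rel == "C":
            sent.add((rel, (u, v)))
        elif rel == "B" and ("A", (v, u)) not in local:
            sent.add((rel, (u, v)))
    return sent

def q1(inst):
    """Q1(x,y,z) <- A(x,y), B(y,x), C(x,z), evaluated by nested loops."""
    return {
        (x, y, z)
        for rel_a, (x, y) in inst if rel_a == "A"
        for rel_c, (x2, z) in inst if rel_c == "C" and x2 == x
        if ("B", (y, x)) in inst
    }

# Initial distribution over the two nodes of the example.
H = {
    "c": {("A", (2, 2)), ("B", (2, 1)), ("B", (2, 2))},
    "c_prime": {("A", (1, 2)), ("B", (4, 4)), ("C", (1, 3))},
}

sent = {node: broadcast(local) for node, local in H.items()}
all_sent = set().union(*sent.values())
# Instance at each node once every message has arrived.
final = {node: local | all_sent for node, local in H.items()}
results = {node: q1(inst) for node, inst in final.items()}

print(sent["c"])            # {('B', (2, 1))}
print(results["c"])         # set()
print(results["c_prime"])   # {(1, 2, 3)}
```

Only three facts are transmitted here, whereas the naive strategy would transmit all six.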

We will formalize oblivious broadcasting functions as generic mappings. This means that decisions on whether to broadcast facts do not depend only on the name of the predicate but can also depend on the equality type of the fact under consideration. Therefore, the following strategy would be valid as well: always broadcast facts of the form C(i,j) with i≠j and keep all facts of the form C(i,i) static; broadcast all B-facts; broadcast a fact A(i,j) only when the fact C(i,i) is not present in the local database. While not immediately obvious, this strategy correctly computes Q1 on every distributed database.

Both strategies are presented more formally in Section 5 in terms of broadcast dependency sets; see Examples 6(1) and 6(2).

In this paper, we make the following contributions:
  • (i) We provide a semantical characterization of when an oblivious broadcasting function (OBF) correctly evaluates a given conjunctive query. While it is desirable to construct OBFs that minimize the overall amount of transmitted facts over all distributed databases, we show that there is no optimal OBF for any conjunctive query with at least two distinct atoms in its body. Therefore, we turn to a slightly weaker notion of optimality, called locally optimal, which requires that an OBF is optimal w.r.t. the local instance at every computing node. Intuitively, this means that no broadcast fact can be made static without sacrificing correctness. We provide a semantical characterization for when an OBF is locally optimal for a given conjunctive query.

  • (ii) We introduce the notion of a broadcast dependency set (BDS) as a formalism to specify OBFs. In brief, a BDS \(\mathcal {S}\) is a set of pairs (τ,T) where τ is a partial equality type w.r.t. a relation and T is a set of partial equality types. Every such pair encodes a rule that can be interpreted roughly as follows: when a fact f matches type τ, it will be broadcast at a computing node c when the set of facts induced by T is not present at c. We present necessary and sufficient syntactic conditions for when a BDS is correct for a given query and also for when it is locally optimal w.r.t. that query. Furthermore, we study the complexity of deciding whether a BDS is correct for a query and whether it is locally optimal. Finally, and most importantly, we show that the formalism of BDS is expressively complete w.r.t. locally optimal OBFs by obtaining that every locally optimal OBF can be represented by a BDS. In fact, every locally optimal OBF can already be represented by a BDS that only uses complete types, that is, types where the equalities between all variables are fully specified.

  • (iii) Based on the syntactic criteria of when a BDS is correct for Q and when it is locally optimal, we obtain an algorithm BDS-BUILD(Q) that computes a locally optimal OBF (represented as a BDS) for a given conjunctive query Q. When restricting to open types (these are types without restrictions on the equalities between variables), BDS-BUILD(Q) computes a locally optimal OBF in time polynomial in the size of Q. When considering complete types, BDS-BUILD(Q) computes a locally optimal OBF in time exponential in the size of Q simply because there are exponentially many complete types.

Outline

We discuss related work in Section 2 and introduce the necessary definitions and concepts in Section 3. In Section 4, we discuss OBFs and local optimality. In Section 5, we discuss broadcast dependency sets and study their properties. In Section 6, we provide an algorithm to construct a locally optimal OBF for a given conjunctive query. We conclude in Section 7.

The present paper is the full version of the extended abstract [15] and provides the missing proofs.

2 Related Work

CALM

The approach in this paper is motivated by the work on the CALM-conjecture. Hellerstein [14] formulated the CALM-principle which suggests a link between logical monotonicity and distributed consistency without the need for coordination. The latter principle is, for instance, embedded in BLOOM, a declarative language for distributed programming, for which practical program analysis techniques have been developed detecting potential consistency anomalies [3, 4, 11]. Ameloot et al. [6] formalized (and proved) the CALM-conjecture in terms of relational transducer networks. Zinn et al. [20] showed that the generalization of the conjecture to stronger variants of relational transducer networks fails. Ameloot et al. [5] then subsequently provided a more fine-grained answer to the CALM-conjecture by relating these stronger variants of relational transducer networks to weaker notions of monotonicity. All of these works considered naive evaluation strategies that broadcast all of the local data. In particular, none of these works considered more economic broadcasting evaluation of conjunctive queries.

Massively Parallel Model

The networked relational transducer model is just one paradigm for studying distributed query evaluation. In the massively parallel (MP) model, introduced by Koutris and Suciu [16], computation proceeds in a sequence of parallel steps, each followed by global synchronization of all servers. In this model, evaluation of conjunctive queries [7, 16] as well as skyline queries [1] have been considered. Recently, Beame et al. [8] proved a matching upper and lower bound for the amount of communication needed to compute a full conjunctive query without self-joins in one communication round. Notice that this is the same subclass of CQs as we consider in this work. The upper bound is provided by a randomized algorithm called Hypercube which dates back to Ganguli et al. [13] and was described by Afrati and Ullman [2] in the context of MapReduce algorithms. Hypercube is motivated by modern massively distributed systems like, for instance, Spark [19], where entire computations reside in main memory, replay is used to recover, and the dominant cost is that of communication. We note that one-round Hypercube is coordination-free and can be easily employed within the framework of relational transducer networks as well. A characteristic of Hypercube-style algorithms is that the space of computing nodes (over which the input data will be distributed) needs to be known in advance. The broadcasting strategies considered in this paper are motivated by a cloud computing setting where data is already initially distributed and the complete space of computing nodes is not necessarily known in advance. In this respect, Hypercube-style and broadcasting algorithms are orthogonal.

Relevance

One approach to minimize data transfer for a query Q is to find the smallest subset J of a distributed instance I for which Q(I)=Q(J) and then only broadcast the relevant subset J. Determining which part of a database is relevant for answering a query is a problem that arises in different contexts. For instance, causality in databases aims to determine which tuples in the database instance caused the output to a query [17, 18]. Then, the contingency set asks for the smallest set K such that Q(I)≠Q(I∖K). So, any set I∖K extended with one element is relevant. Similarly, “where” and “why” provenance refer to the location(s) in the source databases from which the output was extracted or by which the output was influenced [9, 10]. Fan et al. [12] study the problem of scale independence where, through access patterns, the result of a query depends only on a bounded part of the database. It would be interesting to investigate how these different approaches translate to a distributed setting. Most surely, any lower bounds for the sequential setting imply lower bounds for the distributed setting, but upper bounds need to take into account that the initial database instance I is distributed as well.

The oblivious broadcasting strategies that we introduce operate locally on nodes and are unaware of the data residing on other nodes. In fact, our strategies are also independent of the network configuration itself (i.e., the set of computing nodes). Therefore, these strategies apply for example to (fast) evolving networks, where the exact state of the network at a given time may be unknown, as long as no adjustments in the network configuration happen during the query evaluation.

3 Preliminaries

Instances and Queries

For a finite set S, we denote by |S| its cardinality and by \(2^{S}\) its powerset. We denote {1,…,n} by [n], for \(n\in \mathbb {N}\). We assume an infinite set dom of data values. A database schema σ is a collection of relation names R where every R has arity ar(R)>0. We call \(R(\bar d)\) a fact when R is a relation name and \(\bar d\) is a tuple over dom. We say that a fact R(d1,…,dk) is over a database schema σ if R∈σ and ar(R)=k. A (database) instance I over σ is simply a finite set of facts over σ. We denote by Adom(I) the set of all values that occur in facts of I. When I={f}, we simply write Adom(f) rather than Adom({f}). A query over a schema σ to a schema σ′ is a generic mapping Q from instances over σ to instances over σ′. Genericity means that for every permutation π of dom and every instance I, Q(π(I))=π(Q(I)). For the remainder of the paper, we assume that the database schema σ over which queries are defined is clear from the context, and do not refer to it anymore. A query Q is monotone if Q(I)⊆Q(J) for all instances I,J with I⊆J. We only consider monotone queries in the sequel.

Conjunctive Queries

Let var be the universe of variables, disjoint from dom. An atom A is of the form R(u1,…,uk) where R is a relation name and each ui∈var. We call R the predicate and denote it by pred(A). We denote the variables occurring in A by Vars(A)={u1,…,uk}. We say that A is an atom over the database schema σ if pred(A)∈σ and k=ar(pred(A)). A conjunctive query Q (CQ) is an expression of the form A0←A1,…,An, where for every i∈[n], Ai is an atom over the schema and A0 is an atom not over the schema. In particular, A0 is the head of Q, denoted headQ, and A1,…,An is the body of Q, denoted bodyQ. By Vars(Q) we denote all the variables occurring in Q. A valuation for Q on an instance I is a function V:Vars(Q)→dom. The application of V to an atom A=R(u1,…,uk), denoted V(A), results in the fact R(a1,…,ak) where ai=V(ui) for each i∈[k]. The valuation V is said to be satisfying for Q on I if V(A)∈I for all atoms A in the body of Q. In that case, V derives the fact V(A0). The result of Q on I, denoted Q(I), is defined as the set of facts that can be derived by satisfying valuations.

In what follows, we assume that every CQ is full and does not contain self-joins. Formally, we require that pred(Ai)≠pred(Aj) for i≠j and \(Vars(A_{0}) = \bigcup _{i\in [n]} Vars(A_{i})\). That is, every atom has a unique relation symbol and all variables occurring in the body occur in the head as well. For instance, Q1(x,y,z)←A(x,y),B(x,z),C(y,y) is full and does not contain self-joins, while Q2(x,y)←A(x,y),B(x,z),C(y,y) is not full and Q3(x,y,z)←A(x,y),A(x,z),C(y,y) contains a self-join.
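The definitions of valuation, result, fullness, and self-join can be made concrete with a small Python sketch. The encoding of an atom as a (predicate, variables) pair and all function names are assumptions made for illustration. The evaluator simply enumerates all valuations over the active domain, so it is meant for tiny examples only.

```python
from itertools import product

# An atom is (predicate, tuple_of_variables); a CQ is a head atom plus a list of body atoms.
def is_full(head, body):
    """All body variables occur in the head, and vice versa."""
    return set(head[1]) == {v for _, args in body for v in args}

def has_self_join(body):
    """Some relation name occurs in more than one body atom."""
    preds = [p for p, _ in body]
    return len(preds) != len(set(preds))

def evaluate(head, body, instance):
    """Q(I): the facts derived by satisfying valuations over Adom(I)."""
    variables = sorted({v for _, args in body for v in args} | set(head[1]))
    adom = sorted({d for _, args in instance for d in args})
    result = set()
    for values in product(adom, repeat=len(variables)):
        val = dict(zip(variables, values))
        if all((p, tuple(val[v] for v in args)) in instance for p, args in body):
            result.add((head[0], tuple(val[v] for v in head[1])))
    return result

# The three example queries from the text.
q1 = (("Q1", ("x", "y", "z")), [("A", ("x", "y")), ("B", ("x", "z")), ("C", ("y", "y"))])
q2 = (("Q2", ("x", "y")), [("A", ("x", "y")), ("B", ("x", "z")), ("C", ("y", "y"))])
q3 = (("Q3", ("x", "y", "z")), [("A", ("x", "y")), ("A", ("x", "z")), ("C", ("y", "y"))])

instance = {("A", (1, 2)), ("B", (1, 3)), ("C", (2, 2))}
print(evaluate(*q1, instance))  # {('Q1', (1, 2, 3))}
```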

Distributed Database

A network \(\mathcal {N}\) is a nonempty finite set of values from dom, which we call nodes. A distribution of an instance I over \(\mathcal {N}\) is a function H that maps each \(c\in \mathcal {N}\) to an instance such that \(I = \bigcup _{c\in \mathcal {N}}H(c)\). Notice that facts can be replicated. We also refer to each of the H(c) as the local instances. We consider a model where nodes have unlimited computational power and can send messages to all other nodes. These messages can never be lost but can be arbitrarily delayed.

The latter is formalised in [6] in terms of a local buffer for each computing node that is used to store incoming messages. Computation of the network is then defined as a transition system where in every transition a node becomes active and non-deterministically picks a message from its input buffer. A fairness condition is imposed to ensure that all messages are eventually read.

4 Oblivious Broadcasting

We refrain from introducing the formalism of relational transducer networks from [6], but present a simpler setting more suitable for our needs. In particular, the relational transducer networks needed in this paper only perform two actions: decide which facts to broadcast (and transmit those) and evaluate the query under consideration whenever new data arrives. The only parameter is the broadcasting strategy that is used, which therefore forms the focus of our formalization. In brief, we consider broadcasting strategies where computing nodes partition their local database into static and broadcast facts. Static facts are kept local while broadcast facts, as the name already indicates, are sent to all other nodes in the network. As we only consider conjunctive queries, which are monotone, the target query can be recomputed whenever new data arrives.

4.1 Oblivious Broadcasting Functions

We now formally define oblivious broadcasting functions.

Definition 1

An oblivious broadcasting function (OBF) f is a generic mapping that maps instances to instances such that f(J)⊆J for all instances J.

An OBF specifies which local facts are broadcast. Specifically, f(J) are the broadcast facts while J∖f(J) are the static facts. We use the term oblivious as broadcast facts only depend on the local database instance and their choice is independent of the facts at other computing nodes. An OBF f is naive when there are no static facts, that is, f(J)=J for all instances J.

Given a CQ Q, an instance I, a network \(\mathcal {N}\), and a distribution H of I over \(\mathcal {N}\), an OBF f implies a broadcasting algorithm in the following way. Let \(B(f,H)=\bigcup _{c\in \mathcal {N}} f(H(c))\) be the set of broadcast facts. Then, define \(eval(Q,f,H)= \bigcup _{c\in \mathcal {N}} Q(H(c)\cup B(f,H))\) as the union of the query result at every computing node over the local instance extended with all broadcast facts.
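The two definitions translate almost literally into code. The sketch below assumes that a distribution H is a Python dict from node names to sets of facts, and that the query and the OBF are passed as ordinary functions on sets of facts; the two-atom query and the three OBFs are hypothetical examples chosen for illustration.

```python
def broadcast_facts(obf, H):
    """B(f, H): the union of f(H(c)) over all nodes c."""
    return set().union(*(obf(local) for local in H.values()))

def eval_distributed(query, obf, H):
    """eval(Q, f, H): the union over all nodes c of Q(H(c) ∪ B(f, H))."""
    b = broadcast_facts(obf, H)
    return set().union(*(query(local | b) for local in H.values()))

def query(inst):
    # Hypothetical full CQ without self-joins: Q(x, y) <- R(x, y), S(y).
    return {args for rel, args in inst if rel == "R" and ("S", (args[1],)) in inst}

H = {1: {("R", (1, 2))}, 2: {("S", (2,))}}
global_instance = set().union(*H.values())

naive = lambda J: set(J)                                # no static facts
only_s = lambda J: {fact for fact in J if fact[0] == "S"}  # R-facts stay static
nothing = lambda J: set()                               # everything stays static

print(eval_distributed(query, naive, H) == query(global_instance))    # True
print(eval_distributed(query, only_s, H) == query(global_instance))   # True
print(eval_distributed(query, nothing, H) == query(global_instance))  # False
```

The last line shows why some facts must be broadcast: with everything static, no node ever sees both R(1,2) and S(2).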

Remark 1

We note that the function eval(Q,f,H) implies an evaluation that can be executed by a transducer program \(\pi _{f,Q}\) at every node c as follows: (1) R=∅, output Q(H(c)), broadcast f(H(c)); (2) whenever a fact f arrives, R=R∪{f}, output Q(H(c)∪R). Correctness then follows from the genericity and monotonicity of f. We refer to the execution strategy induced by eval(Q,f,H) as a broadcasting algorithm. Coordination-freeness intuitively follows as \(\pi _{f,Q}\) never waits. Formally, a transducer is coordination-free [6] if there is a so-called ideal distribution, on which the query is already computed by a prefix of a run that does not process any of the incoming facts. For \(\pi _{f,Q}\) this is the distribution that puts the complete instance at every node. We refer to [6] for a more formal treatment of coordination-freeness.
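Steps (1) and (2) of the remark can be mimicked by a small simulation. This is a sketch under simplifying assumptions, not the transducer formalism of [6]: the class `Node`, the function `run`, and the random shuffling used to imitate arbitrary message delays are all hypothetical choices, as are the example query and OBF.

```python
import random

class Node:
    """Per-node program following steps (1) and (2) of Remark 1."""
    def __init__(self, local, query, obf):
        self.local, self.query = set(local), query
        self.received = set()                   # step (1): R = ∅
        self.output = set(query(self.local))    # step (1): output Q(H(c))
        self.outbox = set(obf(self.local))      # step (1): broadcast f(H(c))

    def receive(self, fact):
        self.received.add(fact)                 # step (2): R = R ∪ {fact}
        self.output |= self.query(self.local | self.received)

def run(H, query, obf, seed=0):
    """Deliver every broadcast fact to every other node in a shuffled order."""
    nodes = {c: Node(local, query, obf) for c, local in H.items()}
    messages = [(d, fact) for c, n in nodes.items()
                for fact in n.outbox for d in nodes if d != c]
    random.Random(seed).shuffle(messages)       # messages may be arbitrarily delayed
    for d, fact in messages:
        nodes[d].receive(fact)
    return set().union(*(n.output for n in nodes.values()))

def query(inst):
    # Hypothetical CQ: Q(x, y) <- R(x, y), S(y).
    return {args for rel, args in inst if rel == "R" and ("S", (args[1],)) in inst}

only_s = lambda J: {fact for fact in J if fact[0] == "S"}
H = {1: {("R", (1, 2)), ("R", (3, 4))}, 2: {("S", (2,))}, 3: {("S", (4,))}}

print(run(H, query, only_s))  # {(1, 2), (3, 4)} (set order may vary)
```

Because the query is monotone, outputs are only ever added, so the final result does not depend on the delivery order.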

Definition 2

An OBF f is correct for a CQ Q when Q(I)=eval(Q,f,H) for all instances I and all distributions H of I.

When f is correct for Q, we also say that f is an OBF for Q. The following lemma characterizes correctness in that two compatible facts residing at different computing nodes can never be both static. Indeed, if they are, then the valuation witnessing compatibility is never realized at any computing node and consequently f cannot be correct for Q.

We say that two distinct facts f and g are compatible w.r.t. Q, denoted f∼Qg, when, in some model, they are assigned to two atoms from the body of Q under one valuation, i.e., there is a valuation V for Q and atoms A,B∈bodyQ such that V(A)=f and V(B)=g.

Example 2

For an example recall query Q1 from Example 1: Q1(x,y,z)←A(x,y),B(y,x),C(x,z). For Q1, facts A(1,2) and B(2,1) are compatible, because they are in the image of valuation V:{x↦1,y↦2,z↦3} over query Q1. This same valuation also witnesses compatibility of A(1,2) and C(1,3), and B(2,1) and C(1,3).

For an example of facts not compatible for Q1, take A(1,2) and B(2,2), for which it is easy to see that no valuation can assign variable x to both values 1 (for A) and 2 (for B).
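Compatibility of two facts can be tested by unifying each fact with a body atom and checking that the two partial valuations agree. The following sketch does this for the facts of this example; the encoding of atoms and facts as (name, tuple) pairs and the function names are illustrative assumptions.

```python
def match(atom, fact, valuation):
    """Extend `valuation` so that it maps `atom` to `fact`; return None on a clash."""
    (pred, variables), (rel, values) = atom, fact
    if pred != rel or len(variables) != len(values):
        return None
    extended = dict(valuation)
    for var, val in zip(variables, values):
        if extended.setdefault(var, val) != val:
            return None
    return extended

def compatible(body, f, g):
    """True iff distinct facts f and g are images of body atoms under one valuation."""
    if f == g:
        return False
    for a in body:
        v = match(a, f, {})
        if v is None:
            continue
        if any(match(b, g, v) is not None for b in body):
            return True
    return False

# Body of Q1(x,y,z) <- A(x,y), B(y,x), C(x,z).
body_q1 = [("A", ("x", "y")), ("B", ("y", "x")), ("C", ("x", "z"))]

print(compatible(body_q1, ("A", (1, 2)), ("B", (2, 1))))  # True
print(compatible(body_q1, ("A", (1, 2)), ("B", (2, 2))))  # False
```

Variables not fixed by the two facts can be assigned arbitrarily, which is why the check only needs to look at the two matched atoms.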

Lemma 1

Let Q be a CQ and f be an OBF. Then, the following are equivalent:
  1. f is correct for Q; and
  2. there are no instances I,J and facts f,g, with f∼Qg, g∉I, f∉J, such that f∉f(I∪{f}) and g∉f(J∪{g}).

Proof 1

(1) ⇒(2) We start by showing that every OBF for Q satisfies the above condition. The proof is by contraposition, so we assume that there are instances I and J and compatible facts f and g w.r.t. Q, where g∉I and f∉J, but f∉f(I∪{f}) and g∉f(J∪{g}). Let K be an instance and let V be a satisfying valuation for Q on K witnessing compatibility of f and g. Then consider a network \(\mathcal {N} = \{1,2,3\}\) and an instance L=I∪J∪V(bodyQ) with the following distribution H: H(1)=I∪{f}, H(2)=J∪{g}, and H(3)=V(bodyQ)∖{f,g}. Clearly, V(headQ)∈Q(L). As Q is full, \(V(head_{Q}) \not \in \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f,H))\) because none of the computing nodes contain both f and g, and f and g are not broadcast. Thus, \(Q(L) \ne \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f,H))=eval(Q, f, H)\) and f is not an OBF for Q.

(2) ⇒(1) It remains to show that if the above condition is satisfied, then f is an OBF for Q. For this, let I be an instance, \(\mathcal {N}\) a network, and H a distribution of I over \(\mathcal {N}\). We prove that \(Q(I) = \textit {eval}(Q,f,H)=\bigcup _{c\in \mathcal {N}}Q(H(c) \cup B(f,H))\). As Q is monotone, Q(H(c)∪B(f,H))⊆Q(I) for every \(c\in \mathcal {N}\). Hence, it suffices to show that \(Q(I) \subseteq \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f, H))\). Thereto, let f∈Q(I) and let V be a satisfying valuation for Q over I for which V(headQ)=f. Let J=V(bodyQ)∖B(f,H), and let c be a node for which |H(c)∩J| is maximal. We claim that J⊆H(c), obviously implying that f will be derived at node c. Towards a contradiction, assume there is an fi∈J∖H(c). As fi∈I, there is a \(d \in \mathcal {N}\), c≠d, such that fi∈H(d). Moreover, by choice of c, |H(d)∩J|≤|H(c)∩J| and thus there must be a fact fj∈H(c)∩J that is not in H(d). However, as fi∼Qfj, fi∉H(c), and fj∉H(d), and as neither fi nor fj is broadcast, the instances H(d), H(c), and the facts fi,fj contradict condition (2). □

4.2 Local Optimality

We are interested in OBFs that transmit as little data as possible. Thereto, we investigate sensible notions of optimality. We fix a query Q, an instance I, a distribution H of I, and a network \(\mathcal {N}\). The total number of transmitted facts equals \(||B(f,H)||={\sum }_{c\in \mathcal {N}} | f(H(c))|\). Of course, ||B(f,H)||≥|B(f,H)|.
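The difference between the two quantities shows up as soon as a fact is replicated. The sketch below counts both for the naive OBF on a hypothetical two-node distribution; the function names and the dict-based encoding of H are assumptions made for illustration.

```python
def transmitted(obf, H):
    """||B(f, H)||: facts sent, counted once per sending node."""
    return sum(len(obf(local)) for local in H.values())

def distinct_broadcast(obf, H):
    """|B(f, H)|: number of distinct broadcast facts."""
    return len(set().union(*(obf(local) for local in H.values())))

naive = lambda J: set(J)

# R(1) is replicated on both nodes, so it is sent twice by the naive OBF.
H = {1: {("R", (1,))}, 2: {("R", (1,)), ("S", (1,))}}

print(transmitted(naive, H))         # 3
print(distinct_broadcast(naive, H))  # 2
```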

Definition 3

An OBF f for a CQ Q is optimal iff ||B(f,H)||≤||B(g,H)|| for every other OBF g for Q and for every instance I and distribution H.

Intuitively, an OBF is optimal when it transmits the least amount of data over all instances and all distributions. The next result, however, shows that this notion of optimality, although desirable, is unattainable.

Lemma 2

There is no optimal OBF for any conjunctive query with at least two distinct atoms in its body.

Proof 2

Let Q be the conjunctive query A0←A1,…,An with n≥2. Towards a contradiction, assume there is an optimal OBF f for Q. Let I be the canonical instance for Q where for every i∈[n], the relation pred(Ai) is interpreted by the fact Ai. Now, consider a network \(\mathcal {N} = [n]\) and a distribution H that places every fact in I on a distinct node. As all of the n facts in I need to be gathered at one node, at least n−1 facts must be broadcast. As the OBF that broadcasts all Ai-facts for i<n and keeps all An-facts static is correct for Q and only transmits n−1 facts on I, by the assumed optimality of f, ||B(f,H)||=n−1. Let g be the fact in I that is not broadcast by f and assume w.l.o.g. that pred(g)=pred(An). Now, consider I′=I∖{g}, and let H′ be the distribution of I′ over \(\mathcal {N}\) obtained by restricting H to the facts in I′. Then, as g is not broadcast in H, ||B(f,H′)||=||B(f,H)||=n−1. However, the OBF that broadcasts all Ai-facts for i>1 and keeps all A1-facts static is correct for Q and only broadcasts n−2 facts on I′, contradicting the optimality of f. □
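The tension exploited in the proof is already visible for n=2. The sketch below takes the hypothetical query Q(x)←R(x),S(x) and compares the two predicate-based OBFs that keep one relation static; both are correct for Q, yet each is beaten by the other on some distribution. It illustrates only these two OBFs and is not a replacement for the general argument.

```python
def keep_static(pred):
    """OBF that broadcasts every fact except those over relation `pred`."""
    return lambda local: {fact for fact in local if fact[0] != pred}

def transmitted(obf, H):
    """||B(f, H)||: total number of facts sent over all nodes."""
    return sum(len(obf(local)) for local in H.values())

f_keep_s = keep_static("S")  # broadcasts the R-facts
f_keep_r = keep_static("R")  # broadcasts the S-facts

# Canonical instance of Q(x) <- R(x), S(x), one fact per node, and its two restrictions.
H = {1: {("R", ("x",))}, 2: {("S", ("x",))}}
H_without_s = {1: {("R", ("x",))}, 2: set()}
H_without_r = {1: set(), 2: {("S", ("x",))}}

for name, dist in [("H", H), ("H without S", H_without_s), ("H without R", H_without_r)]:
    print(name, transmitted(f_keep_s, dist), transmitted(f_keep_r, dist))
# H 1 1
# H without S 1 0
# H without R 0 1
```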

We next turn to a different form of optimality. For two OBFs f and g, we say that f is included in g, denoted f⊆g, iff f(I)⊆g(I) for every instance I.

Definition 4

An OBF f for a CQ Q is locally optimal iff for every other broadcasting function g for Q, g⊆f implies f=g.

Intuitively, when f is locally optimal, there is no OBF for Q that broadcasts only a strict subset of the facts broadcast by f.

The next lemma gives a sufficient criterion for when an OBF cannot be locally optimal. Specifically, a condition is given for when a broadcast fact f can be kept static and a more economical OBF f′ can be derived.

Lemma 3

Let Q be a CQ and let f be an OBF for Q. If there is an instance I and a fact f for which f∈f(I∪{f}), but there is no instance J and no fact g for which f∼Qg, g∉I, f∉J, and g∉f(J∪{g}), then there is an OBF f′ for Q for which \(f^{\prime } \subsetneq f\).

Proof 3

Assume f, I, and \(\mathbf {f}\) as given by the statement of the lemma. The proof is by construction. For an instance J, let \(I_{\mathbf {f},J}\) be the set of facts that (by genericity) relate to J in the same way as \(\mathbf {f}\) relates to I. That is, \(I_{\mathbf {f},J}=\{\pi (\mathbf {f})\mid \pi \text { a permutation s.t. } \pi (I)=J\}\). Then, define \(f^{\prime }\) as the mapping where for every instance J, \(f^{\prime }(J)=f(J)\setminus I_{\mathbf {f},J}\). Notice that \(f^{\prime } \subsetneq f\) by construction of \(f^{\prime }\). Furthermore, \(f^{\prime }\) is generic and is an OBF. It remains to show that \(f^{\prime }\) is an OBF for Q. Towards a contradiction, assume that \(f^{\prime }\) is not an OBF for Q. Then, by Lemma 1, there are instances \(J_{1}\) and \(J_{2}\) and facts \(\mathbf {g}_{1}\) and \(\mathbf {g}_{2}\), for which \(\mathbf {g}_{1}\sim _{Q} \mathbf {g}_{2}\), \(\mathbf {g}_{2}\not \in J_{1}\), \(\mathbf {g}_{1}\not \in J_{2}\), and \(\mathbf {g}_{1} \not \in f^{\prime }(J_{1}\cup \{\mathbf {g}_{1}\})\) and \(\mathbf {g}_{2} \not \in f^{\prime }(J_{2}\cup \{\mathbf {g}_{2}\})\). As f is an oblivious broadcasting function for Q, it holds that
$$\mathbf{g}_{1} \in f(J_{1}\cup\{\mathbf{g}_{1}\}) \text{ or } \mathbf{g}_{2} \in f(J_{2}\cup\{\mathbf{g}_{2}\}). $$
Say that \(\mathbf {g}_{1} \in f(J_{1}\cup \{\mathbf {g}_{1}\})\). Then, \(\mathbf {g}_{1}\in I_{\mathbf {f},J_{1}}\), implying \(J_{1}=\pi (I)\) and \(\mathbf {g}_{1}=\pi (\mathbf {f})\) for some permutation π. As Q does not contain self-joins and \(\mathbf {g}_{1}\sim _{Q} \mathbf {g}_{2}\), this means that \(\mathbf {g}_{2}\not \in I_{\mathbf {f}, J_{2}}\). Therefore, \(\mathbf {g}_{2} \not \in f(J_{2}\cup \{\mathbf {g}_{2}\})\), which contradicts the condition of the lemma (taking \(\pi ^{-1}(\mathbf {g}_{2})\) and \(\pi ^{-1}(J_{2})\) as \(\mathbf {g}\) and J, respectively). □

The following lemma now characterizes when an OBF for a query is locally optimal.

Lemma 4

Let Q be a CQ and let f be an OBF for Q. The following are equivalent:
  1. f is locally optimal; and
  2. for every instance I and fact f for which f∈f(I∪{f}), there is an instance J and a fact g such that f∼Qg, g∉I, f∉J, and g∉f(J∪{g}).

Proof 4

We can assume that Q contains at least two atoms. Indeed, when Q contains one atom, the only locally optimal OBF is the one that broadcasts no facts and the lemma trivially holds. The direction from (1) to (2) follows from Lemma 3.

(2) ⇒(1) Let f be an OBF for Q. Towards a contradiction, assume that f is not locally optimal. That is, there exists another OBF f′ for Q such that \(f^{\prime } \subsetneq f\). In particular, there is an instance I and a fact f such that f∈f(I∪{f}), while f∉f′(I∪{f}). By Lemma 1, for every fact g with f∼Qg where g∉I, and for every instance J, where f∉J, it must be that g∈f′(J∪{g}). The latter then implies that for every such g and J, g∈f(J∪{g}), which contradicts condition (2) of the present lemma. □

5 Broadcasting Functions Based on Dependency Sets

In this section, we introduce the notion of a broadcast dependency set (BDS) as a formalism to specify OBFs. We present necessary and sufficient conditions for when a BDS induces an OBF which is correct for a given query and also for when it is locally optimal. Furthermore, we study the complexity of the corresponding decision problems. Finally, we show that every locally optimal OBF can be represented by a BDS thereby obtaining that BDS is complete as a representation formalism for locally optimal OBFs.

5.1 Broadcast Dependency Sets

In a nutshell, a broadcast dependency set is a set of key-dependency set pairs, where each pair consists of an equality type (the key), and a set of dependencies (to be formalised later) associated to this key. Intuitively, a BDS gives rise to the following broadcasting function semantics: a fact is broadcast only if it satisfies one of the key equality-types, and at least one of the associated dependencies fails.

We proceed with the formal definition. Let Q be the CQ A0←A1,…,An. We assume Q is full and does not contain self-joins. Therefore an atom Ai in bodyQ is uniquely identified by its predicate pred(Ai). For a predicate R, we denote by atom(R) the unique atom A∈bodyQ for which pred(A)=R.

For a finite set of variables X, a partial equality type over X is a pair of binary relations φ=(Eφ,Iφ) representing equalities and inequalities among elements in X. Formally, we require that Eφ∪Iφ⊆X×X, Eφ is an equivalence relation, and Iφ is irreflexive and symmetric. We abuse notation and also use φ to denote the formula \(\bigwedge \{x = y \mid (x,y)\in E_{\varphi } \}\land \bigwedge \{x \ne y \mid (x,y)\in I_{\varphi } \}.\) We tacitly assume that partial equality types are always consistent. That is, we always assume that there is a tuple \(\bar a\) such that the formula \(\varphi (\bar a)\) evaluates to true. When for all (x,y)∈X×X, either (x,y)∈Eφ or (x,y)∈Iφ, then φ completely specifies all relations between variables in X and we call φ an equality type. For emphasis, we sometimes say complete equality type rather than just equality type even though equality type always means complete equality type.

A partial atomic type (over Q) is a pair τ=(Rτ,φτ), where Rτ is a database predicate and φτ is a partial equality type over Vars(atom(Rτ)), that is, the variables occurring in the unique atom A∈bodyQ for which pred(A)=Rτ. By Vars(τ) we denote the variables over which τ is defined, i.e., Vars(τ)=Vars(atom(Rτ)). Sometimes we write atom(τ) to abbreviate atom(Rτ). We say that τ is an atomic type when φτ is an equality type. To improve readability, we denote partial atomic types with τ and (complete) atomic types with ω. We denote by PTypes(Q) and Types(Q) the set of all partial atomic types and atomic types over Q, respectively.

Example 3

For examples of the above notions, consider the (partial) equality types φ1,φ2,φ3 over variables X={x,y,z}, expressed through the conditions φ1:=x≠y∧y≠z∧x=z, φ2:=x=y∧y=z∧x=z, and φ3:=x=z. Here, φ1 and φ2 are complete over X, and φ3 is a partial equality type over X.

Examples of atomic types over query Q(x,y,z)←A(x,x),B(x,y,z) are complete atomic types ω1=(B,φ1) and ω2=(B,φ2), and partial atomic type τ=(B,φ3).

A fact f is of type τ or satisfies τ, denoted f⊨τ, when there is a valuation h from the variables in atom(Rτ) onto Adom(f) such that h(atom(Rτ))=f and the formula φτ evaluates to true where each xi is substituted by h(xi). Notice that h is unique for f. Hereafter we will refer to h as Vf. By type(f), we denote the unique atomic type satisfied by f when it exists. As atomic types are defined w.r.t. Q, type(f) is not always defined. Indeed, when f=R(a,b) (with a≠b) and atom(R)=R(x,x), then there is no τ with f⊨τ. Two partial atomic types τ,τ′ are compatible w.r.t. Q, denoted τ∼Qτ′, when there are facts f and g with f⊨τ and g⊨τ′ such that f∼Qg. We say that τ implies τ′, denoted τ⊨τ′, if for all facts f, f⊨τ implies f⊨τ′. We can think of a partial atomic type as a disjunction of types for a shared predicate symbol. Define Types(τ)={ω∈Types(Q)∣ω⊨τ} as the set of all atomic types ω which imply τ. Notice that ω⊨τ iff ω∈Types(τ) for any atomic type ω. For a set of partial atomic types T, we use Types(T) as an abbreviation for \(\bigcup _{\tau \in T}\mathit {Types}(\tau )\).

Example 4

For examples recall query Q and partial atomic types ω1,ω2,τ from Example 3. Fact B(a,b,a) satisfies ω1 and τ, but not ω2. The former particularly holds because ω1⊧τ.

Let ω3=(A,x=x), then it is easy to see that ω3∼Qω1 due to the satisfying facts A(1,1) and B(1,2,1), respectively, and valuation V:{x↦1,y↦2,z↦1} for Q.
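The notions of type(f), f⊧τ and Types(τ) can be made concrete with a small sketch. It assumes (this representation is not from the paper) that a complete equality type is encoded as a set partition of the atom's variables and a partial type as two lists of variable pairs, one for equalities and one for inequalities. All function names are hypothetical.

```python
def partitions(items):
    """Yield every set partition of `items` as a list of blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        # put `first` into an existing block ...
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + p


def implies(partition, eqs, neqs):
    """True when the complete type `partition` satisfies all (in)equalities."""
    block = {v: i for i, b in enumerate(partition) for v in b}
    return (all(block[a] == block[b] for a, b in eqs)
            and all(block[a] != block[b] for a, b in neqs))


def complete_types(variables, eqs=(), neqs=()):
    """Types(tau): all complete types over `variables` that imply (eqs, neqs)."""
    return [p for p in partitions(variables) if implies(p, eqs, neqs)]


def fact_partition(atom_vars, values):
    """Partition induced by a fact, or None if the fact does not match the atom."""
    v = {}
    for var, val in zip(atom_vars, values):
        if v.setdefault(var, val) != val:   # repeated variable, different values
            return None
    groups = {}
    for var, val in v.items():
        groups.setdefault(val, []).append(var)
    return list(groups.values())


# phi1, phi2, phi3 from Example 3, over the variables of atom B(x, y, z)
PHI1 = ([("x", "z")], [("x", "y"), ("y", "z")])
PHI2 = ([("x", "y"), ("y", "z"), ("x", "z")], [])
PHI3 = ([("x", "z")], [])

type_of_fact = fact_partition(("x", "y", "z"), ("a", "b", "a"))  # B(a, b, a)
```

Under this encoding, `implies(type_of_fact, *PHI1)` and `implies(type_of_fact, *PHI3)` hold while `implies(type_of_fact, *PHI2)` does not, which mirrors Example 4. The enumeration is exponential in the number of variables and is meant only for small illustrations.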

For sets of variables X and Y, and a partial atomic type τ, X⊆τY if for all x∈X either x∈Y or there is a y∈Y such that \((x,y)\in E_{\varphi _{\tau }}\). That is, X is a subset of Y when taking the equalities in \(E_{\varphi _{\tau }}\) into account. For instance, let τ be a type such that \((y,z)\in E_{\varphi _{\tau }}\), then {x,y,z}⊆τ{x,y}.
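The relation ⊆τ is straightforward to test. The following is a minimal sketch, assuming the equality set E is given as a collection of variable pairs and is meant to be read symmetrically (the example above needs (z,y) to follow from (y,z)). The function name is hypothetical.

```python
def subset_modulo(X, Y, E):
    """X is contained in Y up to the equalities E (read as symmetric pairs)."""
    sym = set(E) | {(b, a) for a, b in E}
    return all(x in Y or any((x, y) in sym for y in Y) for x in X)


# The example from the text: (y, z) in E gives {x, y, z} contained in {x, y}.
with_equality = subset_modulo({"x", "y", "z"}, {"x", "y"}, {("y", "z")})
without_equality = subset_modulo({"x", "y", "z"}, {"x", "y"}, set())
```

If E is not already transitively closed, the closure would have to be computed first; the sketch does not do that.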

For a set of pairs \(\mathcal {S}\), we define \(\mathit {Keys}(\mathcal {S})=\{a\mid (a,b)\in \mathcal {S}\}\) and \(\mathit {Values}(\mathcal {S})=\{b\mid (a,b)\in \mathcal {S}\}\).

Definition 5

A broadcast dependency set (BDS) for a CQ Q is a set \(\mathcal {S}\) of pairs (τ,T), where τ∈PTypes(Q) is a key, and \(T\in 2^{\mathit {PTypes}(Q)}\) is a dependency set, such that the following holds:
  1. \((\tau ,T)\in \mathcal {S}\) and \((\tau ,T^{\prime })\in \mathcal {S}\) implies T=T′;
  2. \(\tau ,\tau ^{\prime }\in \mathit {Keys}(\mathcal {S})\) implies Types(τ)∩Types(τ′)=∅; and,
  3. \((\tau , T) \in \mathcal {S}\) implies \(\mathit {Vars}(\tau ^{\prime }) \subseteq _{\tau ^{\prime }} \mathit {Vars}(\tau )\) for every τ′∈T.

We call the elements of \(\mathcal {S}\) dependencies.

The above definition states that (1) every key can have at most one value in \(\mathcal {S}\); (2) every complete type implies at most one partial type \(\tau \in \mathit {Keys}(\mathcal {S})\); and, (3) the set of variables of atom(τ′) is included in the set of variables of atom(τ) taking into account the equalities in \(E_{\tau ^{\prime }}\). We first explain informally how a BDS represents an OBF. Let f be a fact in the local instance at a computing node. When type(f) is undefined, then f is static as f can never participate in any satisfying valuation. For instance, this happens when f=R(a,b) with a≠b and Q contains the atom R(x,x). Every pair \((\tau ,T)\in \mathcal {S}\) now specifies a condition on facts: when f⊧τ then f is broadcast only if a set of facts implied by T (to be formalized below) is not present at the local instance. Furthermore, when there is no \(\tau \in \mathit {Keys}(\mathcal {S})\) for which f⊧τ, f is broadcast as well. In this light, conditions (1) and (2) ensure that every local fact f is matched by at most one partial type \(\tau \in \mathit {Keys}(\mathcal {S})\); and, condition (3) ensures that when f⊧τ then Vf can be extended in a unique way to a valuation for every τ′∈T that is consistent with f, that is, for which type(f)∼Qτ′.

Next, we formally define how every BDS \(\mathcal {S}\) implies an OBF \(f_{\mathcal {S}}\). Given a fact f, if there is no \(\tau \in \mathit {Keys}(\mathcal {S})\) for which f⊧τ then f is always broadcast. Otherwise, by conditions (1) and (2) of Definition 5, there is exactly one \(\tau \in \mathit {Keys}(\mathcal {S})\) such that f⊧τ. Recall that Vf is the valuation (defined above) such that Vf(atom(τ))=f. Then, by condition (3) of Definition 5, Vf can also be interpreted as a valuation for atom(τ′) for every τ′∈T for which type(f)∼Qτ′. Indeed, for every y∈Vars(τ′)∖Vars(τ) there is a variable x∈Vars(τ) for which \((x,y)\in E_{\tau ^{\prime }}\). Therefore, define for every y∈Vars(τ′),
$$V_{\mathbf{f},\tau^{\prime}}(y)= \left\{ \begin{array}{ll} V_{\mathbf{f}}(y) & \text{if }y\in\mathit{Vars}(\tau);\text{ and,}\\ V_{\mathbf{f}}(x) & \text{if }y\not\in\mathit{Vars}(\tau)\text{ and }(x,y)\in E_{\tau^{\prime}}.\\ \end{array} \right.$$
As we only consider \(V_{\mathbf {f},\tau ^{\prime }}\) for which type(f)∼Qτ′, the above is well-defined.

Now, f is broadcast when the local instance does not contain all the facts \(V_{\textbf {f},\tau ^{\prime }}({atom}(\tau ^{\prime }))\) for which τ′∈T and type(f)∼Qτ′. We refer to these facts as the dependency fact set. To formally define \(f_{\mathcal {S}}\), we set \(Dep(\mathbf {f},T)=\{V_{\mathbf {f},\tau ^{\prime }}(\mathit {atom}(\tau ^{\prime })) \mid \tau ^{\prime } \in T \text{ and } {type}(\mathbf {f}) \sim _{Q} \tau ^{\prime } \}\). Notice that T≠∅ does not necessarily imply Dep(f,T)≠∅, because \({type}(\mathbf {f}) \sim _{Q} \tau ^{\prime }\) may fail for τ′∈T. Further notice that Dep(f,T)=∅ means that the fact f is static. Then, define \(Dep(\mathbf {f},\mathcal {S})\) as Dep(f,T) when there is a \((\tau ,T)\in \mathcal {S}\) for which f⊧τ. Otherwise, \(Dep(\mathbf {f},\mathcal {S})\) is undefined.

Example 5

For an example, consider the query
$$Q_{2}(x,y,z) \leftarrow A(x,y,z),B(x,y,z),C(z,z). $$
For simplicity, we define partial types through formulas. Then, define
$$\begin{array}{@{}rcl@{}} \tau_{B} = (B, \text{true}),\\ \tau_{A}^{x=y} = (A, x=y),\\ \tau_{A}^{y=z} = (A, y=z), \\ \tau_{A}^{\neq} = (A, x\neq y\land y\neq z), \\ \tau_{B}^{\neq} = (B, x\neq y\land y\neq z). \end{array} $$
Then, \(\mathcal {S} = \{(\tau _{B}, \{\tau _{A}^{x=y}, \tau _{A}^{y=z}\}), (\tau _{A}^{\neq }, \{\tau _{B}^{\neq }\})\}\) is a BDS for Q2. To illustrate how OBF \(f_{\mathcal {S}}\) works, let
$$\begin{array}{@{}rcl@{}} I = \{A(1,2,3),B(1,2,3),A(1,1,2), B(1,1,2),\\ A(1,2,2), B(1,2,2),C(3,4),C(3,3)\} \end{array} $$
be a database instance. Then, \(f_{\mathcal {S}}(I)=\{A(1,1,2), A(1,2,2),C(3,3)\}\). Indeed, the facts A(1,1,2), A(1,2,2), C(3,3) do not match a key in \(\mathcal {S}\) and their type occurs in Types(Q). So they are broadcast. The fact C(3,4) is not broadcast as its type does not occur in Types(Q) (C(3,4) does not match C(z,z)). The fact f1=B(1,1,2) matches τB and \(\mathit {Dep}(\mathbf {f}_{1},\{ \tau _{A}^{x=y}, \tau _{A}^{y=z}\})=\{A(1,1,2)\}\subseteq I\). Therefore, B(1,1,2) is static. Similarly, the fact f2=B(1,2,2) matches τB and \(\mathit {Dep}(\mathbf {f}_{2},\{\tau _{A}^{x=y}, \tau _{A}^{y=z}\})=\{A(1,2,2)\}\subseteq I\). Therefore, B(1,2,2) is static as well. The fact f3=A(1,2,3) is static as it matches \(\tau _{A}^{\neq }\) and \(\mathit {Dep}(\mathbf {f}_{3},\{\tau _{B}^{\neq }\})=\{B(1,2,3)\} \subseteq I\). The fact f4=B(1,2,3) is static as it matches τB and \(\mathit {Dep}(\mathbf {f}_{4},\{\tau _{A}^{x=y}, \tau _{A}^{y=z}\})= \emptyset \).

Definition 6

For a CQ Q and a BDS \(\mathcal {S}\) for Q, define \(f_{\mathcal {S}}\) as the function that maps every instance J to the set \(f_{\mathcal {S}}(J)\) of those facts f∈J for which (1) type(f)∈Types(Q); and, (2) \(Dep(\mathbf {f},\mathcal {S})\) is undefined or \(Dep(\mathbf {f},\mathcal {S})\not \subseteq J\).

Intuitively, f is static only when type(f)∉Types(Q) (f can not participate in any satisfying valuation) or the dependency fact set \(Dep(\mathbf {f},\mathcal {S})\) is present at the local instance. Notice that a fact f is thus broadcast when it does not imply a key in \(\mathcal {S}\). This is because then \(Dep(\mathbf {f},\mathcal {S})\) is undefined.
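Definition 6 can be traced on the data of Example 5 with a short simulation. This is a sketch under stated assumptions, not the authors' implementation: facts are encoded as `(predicate, values)` tuples, a partial atomic type as `(predicate, equalities, inequalities)`, and the compatibility test type(f)∼Qτ′ is carried out by extending Vf along the equalities of τ′ and checking the conditions of τ′ on the result. That shortcut relies on condition (3) of Definition 5 and on the equalities being listed explicitly. All identifiers are hypothetical.

```python
# Query Q2(x,y,z) <- A(x,y,z), B(x,y,z), C(z,z): predicate -> atom variables.
ATOMS = {"A": ("x", "y", "z"), "B": ("x", "y", "z"), "C": ("z", "z")}


def valuation(fact):
    """Return V_f as a dict, or None when f does not match its atom."""
    pred, values = fact
    v = {}
    for var, val in zip(ATOMS[pred], values):
        if v.setdefault(var, val) != val:
            return None
    return v


def holds(ptype, v):
    """Check the (in)equalities of a partial atomic type under valuation v."""
    _, eqs, neqs = ptype
    return (all(v[a] == v[b] for a, b in eqs)
            and all(v[a] != v[b] for a, b in neqs))


def satisfies(fact, ptype):
    v = valuation(fact)
    return v is not None and fact[0] == ptype[0] and holds(ptype, v)


def dep_fact(fact, dep):
    """V_{f,tau'}(atom(tau')) when type(f) is compatible with tau', else None."""
    v = dict(valuation(fact))
    pred, eqs, _ = dep
    # bind variables of atom(tau') that f does not mention, via equalities
    for a, b in list(eqs) + [(b, a) for a, b in eqs]:
        if a not in v and b in v:
            v[a] = v[b]
    if any(y not in v for y in ATOMS[pred]) or not holds(dep, v):
        return None
    return (pred, tuple(v[y] for y in ATOMS[pred]))


def broadcast(S, J):
    """f_S(J) as in Definition 6; S is a list of (key, dependency list) pairs."""
    out = set()
    for f in J:
        if valuation(f) is None:        # type(f) undefined: never broadcast
            continue
        T = next((deps for key, deps in S if satisfies(f, key)), None)
        if T is None:                   # Dep(f, S) undefined: always broadcast
            out.add(f)
            continue
        dep_set = {g for g in (dep_fact(f, t) for t in T) if g is not None}
        if not dep_set <= J:            # some dependency fact is missing locally
            out.add(f)
    return out


tau_B = ("B", (), ())
tau_A_xy = ("A", (("x", "y"),), ())
tau_A_yz = ("A", (("y", "z"),), ())
tau_A_neq = ("A", (), (("x", "y"), ("y", "z")))
tau_B_neq = ("B", (), (("x", "y"), ("y", "z")))
S = [(tau_B, [tau_A_xy, tau_A_yz]), (tau_A_neq, [tau_B_neq])]

I = {("A", (1, 2, 3)), ("B", (1, 2, 3)), ("A", (1, 1, 2)), ("B", (1, 1, 2)),
     ("A", (1, 2, 2)), ("B", (1, 2, 2)), ("C", (3, 4)), ("C", (3, 3))}

result = broadcast(S, I)
```

On this input `result` consists of A(1,1,2), A(1,2,2) and C(3,3), in line with Example 5. Running `broadcast` on the singleton instance {A(1,2,3)} returns that fact, because its dependency B(1,2,3) is then missing locally.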

Example 6

  (1) For a simple example of a BDS \(\mathcal {S}\) and OBF \(f_{\mathcal {S}}\), recall query Q1 from Example 1, being Q1(x,y,z)←A(x,y),B(y,x),C(x,z). Let φ=(∅,∅), that is, φ imposes no restrictions. Let τA=(A,φ) and τB=(B,φ). Then, \(\mathcal {S} = \{(\tau _{B}, \{\tau _{A}\}), (\tau _{A}, \emptyset )\}\) is a BDS for Q1. Indeed, every partial atomic type occurs at most once as a key. There is no (complete) atomic type that implies both τA and τB. Furthermore, the variable containment condition between τA and τB is satisfied. Notice that \(f_{\mathcal {S}}\) simulates exactly the broadcast dependency function which is described in Example 1.

  (2) For an example where condition (3) of Definition 5 does not reduce to ordinary variable containment, consider again query Q1 from Example 1. Let τC=(C,x=z), and τA=(A,true). Then, \(\mathcal {S} = \{(\tau _{A}, \{\tau _{C}\}), (\tau _{C}, \emptyset )\}\) is a BDS for Q1. Notice that Vars(τC)⊈Vars(τA) but \(\mathit {Vars}(\tau _{C})\subseteq _{\tau _{C}} \mathit {Vars}(\tau _{A})\).

  (3) Our final example shows that dependencies can be circular. Let
    $${Q_{3}(x,y,z)}\leftarrow{A(x,y), B(y,z), C(z,x)}. $$
    Let τA=(A,x=y), τB=(B,y=z), and τC=(C,z=x), so that each type equates the two variables of its own atom. Then, \(\mathcal {S} = \{(\tau _{A}, \{\tau _{B}\}), (\tau _{B}, \{\tau _{C}\}), (\tau _{C}, \{\tau _{A}\})\}\) is a BDS for Q3. Though correctness of \(f_{\mathcal {S}}\) for Q3 follows from Lemma 5, we provide some intuition. Let I={A(1,1),B(1,1),C(1,1)} be a database instance. Consider a network containing the nodes c1, c2, and c3. When I(c1)={A(1,1)}, I(c2)={B(1,1)}, and I(c3)={C(1,1)}, then all three facts will be broadcast. Now, assume one of the nodes contains two of the facts in I, w.l.o.g., say I(c1)={A(1,1),B(1,1)}. Then, exactly one of the facts in I(c1) is broadcast, namely B(1,1). Now, suppose that C(1,1) is mapped on some node, say c2, but that C(1,1) is not broadcast. Then it must be that A(1,1) is mapped on c2 as well. So, broadcasting B(1,1) indeed suffices to guarantee correctness.

Note that not every BDS for Q induces an OBF which is correct for Q. Indeed, the following lemma provides equivalent semantic and syntactic conditions for an OBF \(f_{\mathcal {S}}\) to be correct for a query.

Lemma 5

Let Q be a CQ and let \(\mathcal {S}\) be a BDS for Q. Then the following are equivalent:
  1. \(f_{\mathcal {S}}\) is an OBF for Q;
  2. there are no instances I,J, and facts f,g, with f∼Qg, g∉I, f∉J such that \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\); and
  3. there are no (complete) atomic types ω1,ω2, and pairs \((\tau _{1}, T_{1}),(\tau _{2}, T_{2}) \in \mathcal {S}\), with ω1∼Qω2, ω1⊧τ1, ω2⊧τ2 such that ω1∉Types(T2) and ω2∉Types(T1).

Proof 5

(1) ⇔(2) Because \(f_{\mathcal {S}}\) is an OBF, the equivalence follows immediately from Lemma 1.

(2) ⇒(3) The proof is by contraposition. So, assume that there are two (complete) atomic types ω1,ω2, and pairs \((\tau _{1}, T_{1}),(\tau _{2}, T_{2}) \in \mathcal {S}\), with ω1∼Qω2, ω1∈Types(τ1), ω2∈Types(τ2) such that ω1∉Types(T2) and ω2∉Types(T1). Now, because ω1∼Qω2, there are facts f and g, with f∼Qg, type(f)=ω1, and type(g)=ω2. Define \(I = Dep(\mathbf {f},\mathcal {S})\) and \(J = Dep(\mathbf {g},\mathcal {S})\). Observe that by definition of Dep, ω1∉Types(T2) implies \(\mathbf {f} \not \in Dep(\mathbf {g},\mathcal {S})\) and ω2∉Types(T1) implies \(\mathbf {g} \not \in Dep(\mathbf {f},\mathcal {S})\). Hence, f∉J and g∉I. Moreover, by definition of \(f_{\mathcal {S}}\), it is always the case that \(\mathbf {f} \not \in f_{\mathcal {S}}(Dep(\mathbf {f},\mathcal {S})\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(Dep(\mathbf {g},\mathcal {S})\cup \{\mathbf {g}\})\). Therefore, \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\), which contradicts condition (2).

(3) ⇒(2) Again, the proof is by contraposition. So, assume that there are instances I and J and facts f and g where f∼Qg, g∉I and f∉J, but \(\mathbf {f}\not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\). As f∼Qg, we have ω1∼Qω2 for ω1=type(f) and ω2=type(g). Then, by construction of \(f_{\mathcal {S}}\) there are \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\) with type(f)∈Types(τ1) and type(g)∈Types(τ2). Now, \(\mathbf {f}\not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\) implies \(Dep(\mathbf {f},\mathcal {S}) \subseteq I\) and \( Dep(\mathbf {g},\mathcal {S}) \subseteq J\). If we assume that type(g)∈Types(T1) then \(\mathbf {g}\in Dep(\mathbf {f},\mathcal {S})\) (as \(\mathbf {g}=V_{\mathbf {f},\mathit {type}(\mathbf {g})}(\mathit {atom}(\mathit {type}(\mathbf {g})))\)), and therefore g∈I, which leads to a contradiction. Hence, type(g)∉Types(T1). A similar argument shows that type(f)∉Types(T2). So, we have found ω1, ω2, (τ1,T1), and (τ2,T2) contradicting condition (3). □

Notice that the OBFs of Example 6 are all correct for the given query.
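For small queries, condition (3) of Lemma 5 can be checked by brute force. The sketch below enumerates all joint equality patterns (set partitions) over the variables of two atoms, which yields exactly the pairs of compatible complete types, and looks for a violating pair. It assumes that compatibility concerns two distinct facts, which in a query without self-joins belong to different atoms, so only keys over different predicates are paired. It also writes every type over the variables of its own atom, so for Q3 the types are (A,x=y), (B,y=z) and (C,z=x). The encoding and all names are hypothetical, and the running time is exponential.

```python
from itertools import permutations


def partitions(items):
    """Yield every set partition of `items` as a list of blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p


def holds(ptype, block):
    """Does the equality pattern `block` satisfy the partial type?"""
    _, eqs, neqs = ptype
    return (all(block[a] == block[b] for a, b in eqs)
            and all(block[a] != block[b] for a, b in neqs))


def covered(pred, block, T):
    """Is the complete type of predicate `pred` under `block` in Types(T)?"""
    return any(t[0] == pred and holds(t, block) for t in T)


def is_correct(atoms, S):
    """Brute-force test of condition (3) of Lemma 5 (assumption: distinct atoms)."""
    for (k1, T1), (k2, T2) in permutations(S, 2):
        if k1[0] == k2[0]:
            continue
        variables = sorted(set(atoms[k1[0]]) | set(atoms[k2[0]]))
        for p in partitions(variables):
            block = {v: i for i, b in enumerate(p) for v in b}
            if (holds(k1, block) and holds(k2, block)
                    and not covered(k1[0], block, T2)
                    and not covered(k2[0], block, T1)):
                return False            # both facts could stay static
    return True


# Example 6(1): Q1(x,y,z) <- A(x,y), B(y,x), C(x,z)
Q1 = {"A": ("x", "y"), "B": ("y", "x"), "C": ("x", "z")}
tA, tB = ("A", (), ()), ("B", (), ())
S_ex1 = [(tB, [tA]), (tA, [])]
S_bad = [(tB, []), (tA, [])]            # A and B facts would both be static

# Example 6(3): Q3(x,y,z) <- A(x,y), B(y,z), C(z,x)
Q3 = {"A": ("x", "y"), "B": ("y", "z"), "C": ("z", "x")}
cA, cB, cC = ("A", (("x", "y"),), ()), ("B", (("y", "z"),), ()), ("C", (("z", "x"),), ())
S_ex3 = [(cA, [cB]), (cB, [cC]), (cC, [cA])]
```

With these inputs, `is_correct(Q1, S_ex1)` and `is_correct(Q3, S_ex3)` succeed, while `is_correct(Q1, S_bad)` fails because an A-fact and a matching B-fact on different nodes would both be withheld.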

Two partial atomic types τ1,τ2 are said to be equal, denoted τ1=τ2, when Types(τ1)=Types(τ2). We say that a BDS \(\mathcal {S}\) is harmonious when every two partial types in \(\mathcal {S}\) are either disjoint or equal. That is, when for every two partial atomic types \(\tau _{1}, \tau _{2} \in \mathit {Keys}(\mathcal {S}) \cup \{\tau ^{\prime } \in T \mid T \in \mathit {Values}(\mathcal {S})\}\), either τ1=τ2 or Types(τ1)∩Types(τ2)=∅.

Theorem 1

Let Q be a CQ and let \(\mathcal {S}\) be a BDS for Q. Deciding whether \(f_{\mathcal {S}}\) is correct for Q is coNP-complete and in PTIME when \(\mathcal {S}\) is harmonious.

Proof 6

(coNP-completeness) When \(f_{\mathcal {S}}\) is not an OBF for Q, Lemma 5 guarantees there exists a polynomial-size certificate, consisting of two compatible (complete) atomic types ω1,ω2, two partial atomic types τ1,τ2, and two sets T1,T2, witnessing that \(f_{\mathcal {S}}\) is not an OBF for Q, where \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), ω1∈Types(τ1), ω2∈Types(τ2), ω1∉Types(T2), and ω2∉Types(T1). As the foregoing test can be done in polynomial time, deciding whether \(f_{\mathcal {S}}\) is correct for Q is in coNP. Particularly notice that τ∼Qτ′ is polynomial time verifiable, for arbitrary (partial) atomic types τ,τ′, by taking the union of conditions implied by τ and τ′, computing the closure over variable equalities, and then checking for explicit contradictions.

For the hardness proof, we rely on a reduction from the well-known NP-complete problem 3-colorability, which asks, given a graph G, whether there is a color assignment for the nodes in G such that only three colors are used and no two adjacent nodes are assigned the same color.

Let G=(NG,EG) be an input for the problem, and m=|NG|.

In what follows we will represent G by a partial-atomic type τP, which takes a variable for each node in the graph and an inequality between every pair of variables corresponding to adjacent nodes in the graph. Particularly observe that every (valid) coloring for G yields a (complete) atomic type implying τP, and vice versa, every atomic type implying τP implies a valid coloring for G.

More formally, we consider relation schema σ={P(m),A(m)} and conjunctive query Q,
$$Q(x_{1},\ldots, x_{m})\leftarrow P(x_{1}, \ldots, x_{m}), A(x_{1}, \ldots, x_{m}), $$
over σ. Let β be a bijection from the nodes in NG onto the set of variables {x1,…,xm}. Then, τP takes the form (P,(E,I)) for Q, where E=∅, and I={(β(n1),β(n2))∣(n1,n2)∈EG}.

Now consider partial type τA=(A,(∅,∅)), and for every i,j,k,l, where 1≤i<j<k<l≤m, partial type τi,j,k,l=(P,(E,I)), where E=∅, and I={(xs,xt)∣s,t∈{i,j,k,l},s≠t}. Intuitively, τi,j,k,l represents all color assignments where four specified nodes (those related to xi,xj,xk,xl) are assigned distinct colors. Let T={τi,j,k,l∣1≤i<j<k<l≤m}. Notice that \(|T| \in \mathcal {O}(m^{4})\), and that these types can be constructed one by one by simply enumerating the possible values for i,j,k, and l. Now, let \(\mathcal {S} = \{(\tau _{P}, \emptyset ), (\tau _{A}, T)\}\).

We claim that \(\mathcal {S}\) is a BDS for Q. Indeed, every pair in \(\mathcal {S}\) is a (consistent) partial atomic-type for Q, every (complete) atomic type in \(\mathcal {S}\) implies at most one of the partial atomic types in \(\mathit {Keys}(\mathcal {S})\), and \( Vars(atom(\tau _{i,j,k,l})) \subseteq _{\tau _{i,j,k,l}} Vars(atom(\tau _{A}))\) for all i,j,k,l.

To show that the reduction works, we need to argue that for every graph G there is a mapping assigning to every node one out of three colors in such a way that every adjacent node is labeled a different color, if and only if, \(f_{\mathcal {S}}\) is not an OBF for Q.

(⇒) Let α be an assignment mapping the nodes in G onto a set of colors {a,b,c}, such that the above mentioned conditions are satisfied. Notice that there is a (complete) atomic type encoding exactly this solution, namely, atomic type ω=(P,(E,I)), where \(E=\{(x_{i},x_{j})\mid i,j\in [m], \alpha (\beta ^{-1}(x_{i}))=\alpha (\beta ^{-1}(x_{j}))\}\), and I=(X×X)∖E. In particular, ω implies τP, ω does not imply any of the partial types in T, and ω is compatible with τA. Then, indeed, by Lemma 5 it immediately follows that \(f_{\mathcal {S}}\) is not an OBF for Q.

(⇐) If no such assignment exists, we have to show that \(f_{\mathcal {S}}\) is an OBF for Q. For this, we make use of the fact that every (complete) atomic type ω, where ω⊧τP, encodes a color assignment for G. Because there is no three-color assignment, it must be that all of these assignments use at least four different colors. In particular, then every ω has at least four variables that are pairwise unequal, say xi,xj,xk,xl, where 1≤i<j<k<l≤m. Thus, ω implies τi,j,k,l. Therefore, condition (3) of Lemma 5 is satisfied, implying \(f_{\mathcal {S}}\) to be an OBF for Q.
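The construction used in the reduction is easy to write down explicitly. The sketch below builds τP, τA and the set T from a graph, assuming a partial atomic type is encoded as `(predicate, equalities, inequalities)` and that nodes are numbered in the order given. The function name and encoding are hypothetical.

```python
from itertools import combinations


def reduction(nodes, edges):
    """Build the BDS {(tau_P, {}), (tau_A, T)} of the hardness proof from a graph."""
    # beta: bijection from nodes onto variables x1, ..., xm
    beta = {n: f"x{i}" for i, n in enumerate(nodes, start=1)}
    # tau_P: one inequality per edge, no equalities
    tau_P = ("P", frozenset(),
             frozenset((beta[a], beta[b]) for a, b in edges))
    # tau_A: the open type on A
    tau_A = ("A", frozenset(), frozenset())
    # T: one type per choice of four variables, forcing them pairwise distinct
    T = [("P", frozenset(), frozenset(combinations(quad, 2)))
         for quad in combinations(list(beta.values()), 4)]
    return [(tau_P, []), (tau_A, T)]


# K4 is not 3-colorable; the construction yields a single type in T.
k4_nodes = [1, 2, 3, 4]
k4_edges = list(combinations(k4_nodes, 2))
S_k4 = reduction(k4_nodes, k4_edges)
```

The size of T is the binomial coefficient "m choose 4", which matches the \(\mathcal {O}(m^{4})\) bound stated in the proof.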

(Harmonious case is in PTIME) First of all, observe that condition (3) of Lemma 5 for harmonious BDS simplifies to: \(f_{\mathcal {S}}\) is correct iff
  • (‡) there are no partial types τ1,τ2 and pairs \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\) with τ1∼Qτ2 such that none of the types in T2 equals τ1 and none of the types in T1 equals τ2.

To verify whether \(f_{\mathcal {S}}\) is correct for harmonious BDS \(\mathcal {S}\), we thus have to verify condition (‡). For this, consider every pair of compatible partial atomic types \(\tau _{1},\tau _{2} \in \mathit {Keys}(\mathcal {S})\). Compatibility is polynomial time verifiable by taking the union of the conditions in both types and verifying if the resulting partial type is still consistent. Then, for every \(\tau _{1}^{\prime } \in T_{1}\) verify if \(\tau _{2} = \tau _{1}^{\prime }\), and for every \(\tau _{2}^{\prime } \in T_{2}\) if \(\tau ^{\prime }_{2} = \tau _{1}\). If none of these tests succeed, then τ1,τ2,T1,T2 form a proof that condition (‡) fails. Notice that equality of partial types can be checked in polynomial time in the size of Q by making both the implicit and explicit conditions of the type visible (by means of E and I) and by comparing these conditions. Eventually, if no proof against (‡) is found, (‡) is satisfied and thus \(f_{\mathcal {S}}\) is an OBF for Q. □
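The polynomial-time compatibility test mentioned in the proof (take the union of the conditions, close the equalities, look for a contradicted inequality) can be sketched with a union-find structure. The encoding of a partial atomic type as `(predicate, equalities, inequalities)` and the function names are assumptions made for illustration. Both types are expected to use the variable names of the query, so that shared variables coincide.

```python
def find(parent, v):
    """Return the representative of v's equivalence class."""
    parent.setdefault(v, v)
    while parent[v] != v:
        v = parent[v]
    return v


def consistent(eqs, neqs):
    """True when no inequality is contradicted by the closure of the equalities."""
    parent = {}
    for a, b in eqs:
        parent[find(parent, a)] = find(parent, b)   # merge the two classes
    return all(find(parent, a) != find(parent, b) for a, b in neqs)


def compatible(t1, t2):
    """Union the conditions of both types and test consistency."""
    (_, e1, n1), (_, e2, n2) = t1, t2
    return consistent(list(e1) + list(e2), list(n1) + list(n2))


# x=y from one type and y=z from the other contradict x!=z by transitivity.
clash = compatible(("A", [("x", "y")], []), ("B", [("y", "z")], [("x", "z")]))
fine = compatible(("B", [], []), ("A", [("x", "y")], []))
```

The sketch omits union by rank and path compression for brevity; the test is polynomial in the number of conditions either way.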

5.2 Local Optimality

Next, we turn to locally optimal OBFs. The following lemma provides equivalent semantic and syntactic conditions for an OBF to be locally optimal. Regarding condition (3), the intuition is as follows. While condition (3c) is the syntactic counterpart of condition (2), conditions (3a) and (3b) specify optimality requirements which are inherent to the formalism of BDS. More specifically, condition (3a) specifies that every atomic type implying a partial type in a dependency set in \(\mathcal {S}\) must also imply a key in \(\mathcal {S}\). Indeed, when an atomic type does not imply a key, every local fact of this type is always broadcast and therefore present at every computing node. The atomic type can therefore be removed from every dependency set it occurs in. When Condition (3b) fails for an atomic type ω, \(\mathcal {S}\) can be adapted to broadcast less while preserving correctness for Q by adding the pair \((\omega , \{\tau \mid \tau \sim _{Q} \omega , \tau \in Types(\mathit {Keys}(\mathcal {S}))\})\).

Lemma 6

Let Q be a CQ, \(\mathcal {S}\) a BDS for Q, and \(f_{\mathcal {S}}\) an OBF for Q. The following are equivalent:
  1. \(f_{\mathcal {S}}\) is locally optimal;
  2. for every instance I and fact f for which \(\textbf {f} \in f_{\mathcal {S}}(I\cup \{\textbf {f}\})\), there is an instance J and a fact g such that f∼Qg, g∉I, f∉J, and \(\textbf {g} \not \in f_{\mathcal {S}}(J\cup \{\textbf {g}\})\); and,
  3. \(\mathcal {S}\) satisfies the following conditions:
    (a) for \((\tau , T) \in \mathcal {S}\) and ω∈Types(T), ω∼Qτ implies ω⊧τ′ for some \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\);
    (b) for every \(\omega \in Types(Q)\setminus Types(\mathit {Keys}(\mathcal {S}))\), there is a partial atomic type \(\tau _{1}\in \mathit {Keys}(\mathcal {S})\) and an ω1∈Types(τ1) such that ω∼Qω1 and \( Vars(\omega _{1}) \not \subseteq _{\omega _{1}} Vars(\omega )\); and
    (c) for \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), where ω1∈Types(τ1), ω2∈Types(τ2), and ω1∼Qω2: ω1∈Types(T2) implies ω2∉Types(T1).

Proof 7

The equivalence between (1) and (2) follows from Lemma 4.

We show that (2) implies all three conditions in (3) separately.

(2) ⇒(3a) Let \((\tau , T) \in \mathcal {S}\) and ω∈Types(T). Choose f with type(f)∈Types(τ) and set I=Dep(f,T), \(\mathbf {g}=V_{\mathbf {f},\omega }(\mathit {atom}(\omega ))\), so that f and g witness τ∼Qω. Further, let I′=I∖{g}. By definition of \(f_{\mathcal {S}}\), \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\), and \(\mathbf {f} \in f_{\mathcal {S}}(I^{\prime }\cup \{\mathbf {f}\})\). By condition (2), the latter implies that there is an instance J and a fact h, such that h∉I′, h∼Qf, f∉J, and \(\mathbf {h} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {h}\})\). Therefore, there must be a key \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\) with type(h)∈Types(τ′). However, as \(f_{\mathcal {S}}\) is an OBF for Q, Lemma 1 implies that h∈I. So, it must be that h=g. Hence, type(g)=ω∈Types(τ′).

(2) ⇒(3b) Let \(\omega \in Types(Q)\setminus Types(\mathit {Keys}(\mathcal {S}))\) and let f be a fact of type ω. By definition of \(f_{\mathcal {S}}\), \(\mathbf {f}\in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) for every instance I. Let I1 be such an instance. By condition (2) there is a compatible fact g1 and instance J1, where g1∉I1, f∉J1, and \(\mathbf {g}_{1} \not \in f_{\mathcal {S}}(J_{1}\cup \{\mathbf {g}_{1}\})\). Now, consider Ii=Ii−1∪{gi−1}, for i≥2. Then, \(\mathbf {f}\in f_{\mathcal {S}}(I_{i}\cup \{\mathbf {f}\})\) for i≥2. Again, by condition (2) there is a fact gi and instance Ji, where gi∼Qf, gi∉Ii, f∉Ji, and \(\mathbf {g}_{i} \not \in f_{\mathcal {S}}(J_{i}\cup \{\mathbf {g}_{i}\})\) for i≥2. In particular, gi∉{g1,…,gi−1}. As there are infinitely many such gi, but only finitely many atomic types in Types(Q), there is a type ω1 such that type(gi)=ω1 for infinitely many i. Let G={gi∣i≥1,type(gi)=ω1}. As \(\mathbf {g}_{i} \not \in f_{\mathcal {S}}(J_{i}\cup \{\mathbf {g}_{i}\})\) for every gi∈G, by definition of \(f_{\mathcal {S}}\), there is a \(\tau _{1}\in Keys(\mathcal {S})\) with ω1∈Types(τ1). Notice that ω∼Qω1 as f∼Qg for all g∈G. Towards a contradiction, assume \( Vars(\omega _{1}) \subseteq _{\omega _{1}} Vars(\omega )\). But then, Adom(g)⊆Adom(f) for every g∈G, which cannot be as the size of G is infinite. Therefore, \( Vars(\omega _{1}) \not \subseteq _{\omega _{1}} Vars(\omega )\).

(2) ⇒(3c) Let \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), with ω1∈Types(τ1), ω2∈Types(τ2), ω1∼Qω2, and ω1∈Types(T2). As ω1∼Qω2 there are facts g and f, with type(g)=ω1, type(f)=ω2, and g∼Qf. Then, g∈Dep(f,T2) as ω1∈Types(T2). Towards a contradiction, assume ω2∈Types(T1), which implies f∈Dep(g,T1). Let I=Dep(g,T1) and I′=I∖{f}. Then, \(\mathbf {g} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {g}\})\) and \(\mathbf {g} \in f_{\mathcal {S}}(I^{\prime }\cup \{\mathbf {g}\})\). By condition (2), there is a fact h and instance J, where h∉I′, \(\mathbf {h} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {h}\})\), and g∉J. By Lemma 1, however, it must be that h∈I. So, h=f, which implies that \(\mathbf {f} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {f}\})\). But then, by definition of \(f_{\mathcal {S}}\), \(Dep(\mathbf {f},\mathcal {S}) \subseteq J\) and thus \(\mathbf {g} \not \in Dep(\mathbf {f},\mathcal {S})\), which is a contradiction.

(3) ⇒(2) Let f be a fact and I an instance, with \(\mathbf {f} \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\). This means that f is broadcast. We make a distinction between two cases: (1) the type of f is in \(\mathcal {S}\) but not all the necessary facts in \(Dep(\mathbf {f},\mathcal {S})\) are present, and (2) the type of f is not in \(\mathcal {S}\).

Case 1:

Suppose there is a pair \((\tau , T) \in \mathcal {S}\) with type(f)∈Types(τ). Then, by definition of \(f_{\mathcal {S}}\) it must be that \(Dep(\mathbf {f},\mathcal {S}) \not \subseteq I\). In particular, there is a fact \(\mathbf {g} \in Dep(\mathbf {f},\mathcal {S}) \setminus I\), where type(g)∈Types(T). Notice that f∼Qg because of the definition of \(Dep(\mathbf {f},\mathcal {S})\) and \(\mathcal {S}\). By condition (3a) there is a pair \((\tau _{2}, T_{2}) \in \mathcal {S}\) such that type(g)∈Types(τ2). Because type(g)∈Types(T), and τ2∼Qτ (by g∼Qf), condition (3c) implies that type(f)∉Types(T2). Now, let \(J = Dep(\mathbf {g},\mathcal {S})\). Then, f∉J and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\). So, facts f, g and instances I and J are as required by condition (2).

Case 2:

Suppose that \(\mathit {type}(\mathbf {f})\not \in Types(\mathit {Keys}(\mathcal {S}))\). Then, condition (3b) implies that there is a pair \((\tau _{1}, T_{1}) \in \mathcal {S}\) and atomic type ω1∈Types(τ1), where ω1∼Qtype(f) and \( Vars(\omega _{1}) \not \subseteq _{\omega _{1}} Vars(\mathit {type}(\mathbf {f}))\). As ω1∼Qtype(f), there is a fact g such that g∼Qf and type(g)=ω1. Because \( Vars(\omega _{1}) \not \subseteq _{\omega _{1}} Vars(\mathit {type}(\mathbf {f}))\), there must be a variable, say z, in Vars(ω1) that does not equal any of the variables in Vars(type(f)) according to the conditions in atomic type ω1. That is, for no variable x∈Vars(type(f)), ω1⊧x=z. Define Z={y∣ω1⊧y=z} as the set of variables equal to z according to ω1. Let for every u∈dom∖(Adom(f)∪Adom(g)), Vu be the mapping where Vu(x)=Vf(x) for every x∈Vars(atom(f)), \(V_{u}(x) = V_{\mathbf {g}}(x)\) for every x∈Vars(atom(g))∖Z, and Vu(x)=u for every x∈Z. Notice that the above is well defined, because compatibility between f and g ensures that \(V_{\mathbf {g}}(x) = V_{\mathbf {f}}(x)\) for every shared variable. Now, every Vu induces a fact gu=Vu(atom(ω1)) which has atomic type ω1 and is compatible with f. Further, \(\mathbf {g}_{u} \ne \mathbf {g}_{u^{\prime }}\) for distinct u and u′. By the presence of (τ1,T1) in \(\mathcal {S}\), and the definition of \(f_{\mathcal {S}}\), \(\mathbf {g}_{u} \not \in f_{\mathcal {S}}(Dep(\mathbf {g}_{u},T_{1})\cup \{\mathbf {g}_{u}\})\). In particular, condition (3a) implies type(f)∉Types(T1) (because otherwise type(f) must be in \(Types(\mathit {Keys}(\mathcal {S}))\), which is a contradiction). Thus, f∉Dep(gu,T1). As there are infinitely many such values u, for every finite instance I there should be a u for which gu∉I.
Hence, for every I where \(\mathbf {f} \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\), there is indeed a fact gu and instance J=Dep(gu,T1), where gu∉I, f∉J, and \(\mathbf {g}_{u} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}_{u}\})\) as requested by condition (2).

Deciding whether \(f_{\mathcal {S}}\) is locally optimal for an arbitrarily given BDS \(\mathcal {S}\) turns out to be hard (cf. Theorem 2). Therefore, we also consider the special case of open BDSs. We say that a partial type φ=(E,I) is open when it enforces no restrictions. That is, when E=I=∅. A partial atomic type (R,φ) is open when φ is. We say that a BDS \(\mathcal {S}\) is open when it only contains open partial atomic types. Notice that a BDS that is open is also harmonious (but not vice versa).

Similarly to Theorem 1, we have the following decidability result for locally optimal OBFs.

Theorem 2

Let Q be a CQ and let \(\mathcal {S}\) be a BDS for Q for which \(f_{\mathcal {S}}\) is correct for Q. Deciding whether \(f_{\mathcal {S}}\) is locally optimal is in coNP and in PTIME when \(\mathcal {S}\) is open.

Proof 8

Verifying whether a given BDS \(\mathcal {S}\) for a query Q is not locally optimal, where \(f_{\mathcal {S}}\) is correct for Q, is easy when given the right gadgets. For these gadgets we rely on Lemma 6 which states that either,
  • there is an atomic type ω, partial atomic type τ, and set of partial atomic types T, where \((\tau , T) \in \mathcal {S}\), ω∈Types(T), ω∼Qτ, and for none of the keys \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\), ω⊧τ′;

  • there is an atomic type ω, where \(\omega \in Types(Q)\setminus Types(\mathit {Keys}(\mathcal {S}))\), where for every ω1⊧τ1, where \(\tau _{1} \in \mathit {Keys}(\mathcal {S})\), and ω∼Qω1, \( Vars(\omega _{1}) \subseteq _{\omega _{1}} Vars(\omega )\); or,

  • there are atomic types ω1,ω2, partial atomic types τ1,τ2, and sets of partial atomic types T1,T2, where \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), ω1⊧τ1, ω2⊧τ2, ω1∼Qω2, and both ω1∈Types(T2), and ω2∈Types(T1).

All three cases yield a straightforward certificate (of polynomial size) that can be verified in polynomial time. Therefore, indeed, local optimality is in coNP.
To show that deciding local optimality is in PTIME when \(\mathcal {S}\) is an open BDS, observe that condition (3) of Lemma 6 simplifies for open BDS to:
  1. \((\tau , T) \in \mathcal {S}\) and τ′∈T, where τ′∼Qτ, implies \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\);
  2. for every open partial type τ not in \(\mathit {Keys}(\mathcal {S})\), there is a \(\tau _{1} \in \mathit {Keys}(\mathcal {S})\), where τ∼Qτ1 and \(\mathit {Vars}(\tau _{1})\not \subseteq _{\tau _{1}} \mathit {Vars}(\tau )\); and
  3. \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), τ1∼Qτ2, τ1∈T2 implies τ2∉T1.

Particularly notice that condition (2) now considers only open partial types, of which there are only polynomially many. Therefore, all three conditions can be verified straightforwardly in polynomial time in the size of Q and \(\mathcal {S}\). □
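For open BDSs the three simplified conditions reduce to checks on predicates and variable sets. The sketch below assumes that an open BDS is encoded as a dictionary from key predicates to sets of dependency predicates, that \(f_{\mathcal {S}}\) is already known to be correct for Q, that open types over different atoms are always compatible, and that the second condition is read as non-containment, in line with condition (3b) of Lemma 6. The function name and encoding are hypothetical.

```python
def open_bds_locally_optimal(atoms, S):
    """Check the three simplified conditions for an open BDS.

    atoms: predicate -> tuple of atom variables (the query body)
    S:     key predicate -> set of dependency predicates
    """
    keys = set(S)
    # (1) every dependency must itself be a key
    if any(dep not in keys for T in S.values() for dep in T):
        return False
    # (2) every non-key atom needs a key whose variables it does not contain
    for pred, variables in atoms.items():
        if pred in keys:
            continue
        if not any(set(atoms[k]) - set(variables) for k in keys):
            return False
    # (3) no two keys may depend on each other
    return not any(k1 in S[k2] and k2 in S[k1]
                   for k1 in keys for k2 in keys if k1 != k2)


# Example 6(1): Q1(x,y,z) <- A(x,y), B(y,x), C(x,z) with S = {(B,{A}), (A,{})}
Q1 = {"A": ("x", "y"), "B": ("y", "x"), "C": ("x", "z")}
ok = open_bds_locally_optimal(Q1, {"B": {"A"}, "A": set()})
dangling = open_bds_locally_optimal(Q1, {"B": {"A"}})            # A is not a key
mutual = open_bds_locally_optimal(Q1, {"A": {"B"}, "B": {"A"}})  # circular pair
```

Each condition is a loop over the keys and atoms of Q, so the check is polynomial in the size of Q and \(\mathcal {S}\).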

It remains open though whether deciding local optimality is coNP-complete or in PTIME (even for harmonious BDS). For harmonious BDS, conditions 3(a) and 3(c) of Lemma 6 are verifiable in polynomial time.

Next, we show that every locally optimal OBF can be represented by a BDS, thereby obtaining that BDSs (satisfying the conditions in Lemma 6) are a complete representation of locally optimal OBFs. Let Q be a CQ and let f be an OBF for Q. We call a fact \(\mathbf {f}\) semi-static for f when there is an atomic type ω and an instance I such that \(\mathbf {f}\not \in f(I\cup \{\mathbf {f}\})\) and \(\mathit {type}(\mathbf {f})=\omega \). That is, \(\mathbf {f}\) has an atomic type and there is an instance for which \(\mathbf {f}\) is not broadcast. We say that a semi-static fact \(\mathbf {f}\) (for f) depends on a fact \(\mathbf {g}\), when, for every instance I, \(\mathbf {f}\not \in f(I\cup \{\mathbf {f}\})\) implies \(\mathbf {g}\in I\). With every semi-static fact \(\mathbf {f}\), we associate the set \(D_{\mathbf {f}}\) containing exactly all facts on which \(\mathbf {f}\) depends. Thus, \(\mathbf {f}\not \in f(I\cup \{\mathbf {f}\})\) implies \(D_{\mathbf {f}}\subseteq I\).

We make use of the following lemma in the proof of Theorem 3.

Lemma 7

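Let Q be a CQ, and f be a locally optimal OBF for Q. Let \(\mathbf {f}\) be semi-static for f. Then, \(\mathbf {f}\not \in f(D_{\mathbf {f}}\cup \{\mathbf {f}\})\). Furthermore, \(\mathbf {g}\in D_{\mathbf {f}}\) implies
  1. \(\mathbf {g}\) is semi-static and \(\mathbf {g}\sim _{Q}\mathbf {f}\);
  2. \(\mathit {Adom}(\mathbf {g})\subseteq \mathit {Adom}(\mathbf {f})\);
  3. \(\mathit {Vars}(\mathit {atom}(\mathbf {g}))\subseteq _{\mathit {type}(\mathbf {g})}\mathit {Vars}(\mathit {atom}(\mathbf {f}))\); and
  4. \(\mathbf {g}=V_{\mathbf {f},\mathit {type}(\mathbf {g})}(\mathit {atom}(\mathbf {g}))\).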

Proof 9

Before going to the actual proof, we first show the following auxiliary result:

Lemma 8

If \(\mathbf {f}\sim _{Q}\mathbf {g}\) and both are semi-static for f then \(\mathbf {f}\) depends on \(\mathbf {g}\) or \(\mathbf {g}\) depends on \(\mathbf {f}\).

Proof 10

Assume towards a contradiction that both dependencies fail. Then, as f and g are semi-static, there is an instance I such that ff(I∪{f}) and gI, and instance J such that gf(I∪{g}) and fJ. But then, by Lemma 1, f,g,I, and J contradict with f being an OBF for Q. □

Next, we argue ff(Df∪{f}). Towards a contradiction suppose ff(Df∪{f}). Then, by Lemma 4 there must be some fact h and instance H, where hQf, hDf, fH, and hf(H∪{h}). Because f is semi-static and hDf, there must be some instance J, where hJ and ff(J∪{f}). So, by Lemma 1, we have found h,f,J,H contradicting f being an OBF for Q.

For (1) let I=Df. Because f is semi-static, by the above, ff(I∪{f}). Further, gDf implies ff(I∪{f}∖{g}). Then, by locally optimality of f and Lemma 4, there is an instance H and a fact h, such that hf(H∪{h}), hQf, hI∖{g}, and fH. However, by Lemma 1, hI, implying h=g. So, indeed, g is compatible with f, and there is an instance for which g is not broadcast.

For (2), towards a contradiction suppose Adom(g)⫅̸Adom(f), implying a value aAdom(g) which is not in Adom(f). Because f is semi-static, there must be an instance J, where ff(J∪{f}). Now, let π be the permutation over dom that maps a onto u (where udomsetminusAdomJ∪{f}), u onto a, and is the identity for every other value. Notice that by construction, π(f)=f, and gπ(J∪{f}). Then, by genericity of f, ff(π(J)∪{f}), implying Dfπ(J), which is a contradiction with the assumption that gDf. Thus, indeed Adom(g)⊆Adom(f).

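For (3), again towards a contradiction, suppose that \(\mathit{Vars}(\mathit{atom}(\mathbf{g})) \not\subseteq_{\mathit{type}(\mathbf{g})} \mathit{Vars}(\mathit{atom}(\mathbf{f}))\). So, there is a variable \(z \in \mathit{Vars}(\mathit{atom}(\mathbf{g})) \setminus \mathit{Vars}(\mathit{atom}(\mathbf{f}))\), and no variable \(y \in \mathit{Vars}(\mathit{atom}(\mathbf{g})) \cap \mathit{Vars}(\mathit{atom}(\mathbf{f}))\) exists, for which \(V_{\mathbf{g}}(z) = V_{\mathbf{g}}(y)\). Recall that \(V_{\mathbf{g}}\) denotes the partial valuation implied by \(\mathbf{g}\) for \(\mathit{atom}(\mathbf{g})\). Let Z be the set of variables \(z^{\prime}\) in \(\mathit{Vars}(\mathit{atom}(\mathbf{g}))\), where \(V_{\mathbf{g}}(z^{\prime}) = V_{\mathbf{g}}(z)\). Notice \(Z \cap \mathit{Vars}(\mathit{atom}(\mathbf{f})) = \emptyset\). Now, let \(u \in \mathbf{dom} \setminus \mathit{Adom}(\{\mathbf{f}\}\cup\{\mathbf{g}\})\). Consider the mapping V, where \(V(x) = V_{\mathbf{f}}(x)\) for every \(x \in \mathit{Vars}(\mathit{atom}(\mathbf{g})) \cap \mathit{Vars}(\mathit{atom}(\mathbf{f}))\). Notice that by compatibility and (1): \(V(x) = V_{\mathbf{g}}(x)\) as well. Further, \(V(x) = V_{\mathbf{g}}(x)\) for every \(x \in \mathit{Vars}(\mathit{atom}(\mathbf{g})) \setminus (\mathit{Vars}(\mathit{atom}(\mathbf{f})) \cup Z)\), and \(V(z^{\prime}) = u\) for every \(z^{\prime} \in Z\). Notice that \(\mathbf{g}^{\prime} = V(\mathit{atom}(\mathbf{g}))\) is compatible with \(\mathbf{f}\). So, because \(\mathbf{g} \in D_{\mathbf{f}}\), implying that \(\mathbf{g}\) is semi-static by (1), by genericity \(\mathbf{g}^{\prime}\) is also semi-static for f. By construction, \(\mathit{Adom}(\mathbf{g}^{\prime}) \not\subseteq \mathit{Adom}(\mathbf{f})\), implying \(\mathbf{g}^{\prime} \notin D_{\mathbf{f}}\) by (2). So, by Lemma 8 it must be that \(\mathbf{f} \in D_{\mathbf{g}^{\prime}}\). The latter implies \(\mathit{Adom}(\mathbf{f}) \subseteq \mathit{Adom}(\mathbf{g}^{\prime})\). In particular, because \(u \notin \mathit{Adom}(\mathbf{f})\), we actually have \(\mathit{Adom}(\mathbf{f}) \subsetneq \mathit{Adom}(\mathbf{g}^{\prime})\). However, \(\mathbf{g} \in D_{\mathbf{f}}\) implies \(\mathit{Adom}(\mathbf{g}) \subseteq \mathit{Adom}(\mathbf{f})\), and because \(\mathbf{g}\) and \(\mathbf{g}^{\prime}\) have the same type, \(|\mathit{Adom}(\mathbf{g})| = |\mathit{Adom}(\mathbf{g}^{\prime})|\), which is a contradiction.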

Item (4) follows immediately from (1), (3) and the definition of \(V_{\mathbf{f},\mathit{type}(\mathbf{g})}\). □

We are now ready to prove completeness. The proof of the following theorem shows that the formalism of BDS that only uses complete atomic types can already represent every locally optimal OBF.

Theorem 3 (Completeness)

Let Q be a CQ and f a locally optimal OBF for Q. Then, there is a BDS \(\mathcal{S}\) for Q such that \(f = f_{\mathcal{S}}\).

Proof 11

We start by noting that if \(\mathbf{f}\) is semi-static for f, then every \(\mathbf{g}\) with \(\mathit{type}(\mathbf{f}) = \mathit{type}(\mathbf{g})\) is semi-static for f as well. Therefore, we say that an atomic type τ is semi-static for f when there is a semi-static fact \(\mathbf{f}\) with \(\mathit{type}(\mathbf{f}) = \tau\). The proof is by construction. Let \(\mathcal{S}\) be the set of pairs \((\tau, D_{\tau})\) where τ is semi-static for f and \(D_{\tau} = \mathit{Types}(D_{\mathbf{f}})\), where \(\mathbf{f}\) is a fact with atomic type τ.

We first show that \(\mathcal{S}\) is a BDS and then that \(f = f_{\mathcal{S}}\). Notice that \(\mathcal{S}\) has only finitely many pairs, because there are only finitely many distinct atomic types, and every set in \(\mathit{Values}(\mathcal{S})\) is finite by construction. Let \((\tau, T) \in \mathcal{S}\), and \(\tau^{\prime} \in T\). By construction of \(\mathcal{S}\), τ is a semi-static atomic type for f and for every atomic type τ there is at most one pair \((\tau, T) \in \mathcal{S}\). Furthermore, \(T = D_{\tau}\). Let \(\mathbf{f}\) be a fact of type τ. Then, \(\mathbf{f}\) is a semi-static fact for f and there is a \(\mathbf{g} \in D_{\mathbf{f}}\), such that \(\mathit{type}(\mathbf{g}) = \tau^{\prime}\). By Lemma 7(3), \(\mathit{Vars}(\mathit{atom}(\tau^{\prime})) = \mathit{Vars}(\mathit{atom}(\mathbf{g})) \subseteq_{\mathit{type}(\mathbf{g})} \mathit{Vars}(\mathit{atom}(\mathbf{f})) = \mathit{Vars}(\mathit{atom}(\tau))\). So, \(\mathcal{S}\) is a broadcast dependency set for the query Q.

Next, we show that \(f = f_{\mathcal{S}}\). For this, we assume \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\) (which is argued below) and show that \(\mathbf{f} \notin f(I\cup\{\mathbf{f}\})\) iff \(\mathbf{f} \notin f_{\mathcal{S}}(I\cup\{\mathbf{f}\})\).

Let \(\mathbf{f}\) be a fact and I an instance, such that \(\mathbf{f} \notin f(I\cup\{\mathbf{f}\})\). If \(\mathbf{f}\) has no atomic type, then it is never broadcast by \(f_{\mathcal{S}}\). So, assume \(\mathbf{f}\) has an atomic type. Then it must be that \(D_{\mathbf{f}} \subseteq I\). Moreover, because \((\mathit{type}(\mathbf{f}), D_{\mathit{type}(\mathbf{f})}) \in \mathcal{S}\) and \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), \(\mathit{Dep}(\mathbf{f},\mathcal{S}) \subseteq I\). Hence, by definition of \(f_{\mathcal{S}}\), \(\mathbf{f} \notin f_{\mathcal{S}}(I\cup\{\mathbf{f}\})\).

For a fact \(\mathbf{f}\) and instance I, where \(\mathbf{f} \in f(I\cup\{\mathbf{f}\})\), Lemma 4 implies that \(\mathbf{f}\) has an atomic type. Either \(\mathbf{f}\) is always broadcast by f, or it is semi-static for f. The former implies that there is no pair in \(\mathcal{S}\) of the form \((\mathit{type}(\mathbf{f}), T)\). So, \(\mathbf{f}\) is broadcast by \(f_{\mathcal{S}}\) as well. The latter implies by Lemma 7 that \(D_{\mathbf{f}} \not\subseteq I\) and there is a pair \((\mathit{type}(\mathbf{f}), D_{\mathit{type}(\mathbf{f})}) \in \mathcal{S}\). In particular, because \(\mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})}) = D_{\mathbf{f}}\), \(\mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})}) \not\subseteq I\), which implies that \(\mathbf{f} \in f_{\mathcal{S}}(I\cup\{\mathbf{f}\})\).

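It remains to show that \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\). For every \(\mathbf{g} \in D_{\mathbf{f}}\), which implies \(\mathit{type}(\mathbf{g}) \in D_{\mathit{type}(\mathbf{f})}\), it follows by Lemma 7(4) that \(\mathbf{g} \in \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\). For the reverse direction, let \(\mathbf{g} \in \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), which implies \(\mathit{type}(\mathbf{g}) \in D_{\mathit{type}(\mathbf{f})}\). So, there must be some fact \(\mathbf{g}^{\prime}\), which is of the same type as \(\mathbf{g}\), in \(D_{\mathbf{f}}\). In particular, because \(D_{\mathbf{f}} \subseteq \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), \(\mathbf{g}^{\prime} = V_{\mathbf{f},\mathit{type}(\mathbf{g}^{\prime})}(\mathit{atom}(\mathbf{g}^{\prime}))\). However, because \(\mathbf{g} = V_{\mathbf{f},\mathit{type}(\mathbf{g})}(\mathit{atom}(\mathbf{g}))\), \(\mathit{atom}(\mathbf{g}) = \mathit{atom}(\mathbf{g}^{\prime})\), and \(\mathit{type}(\mathbf{g}) = \mathit{type}(\mathbf{g}^{\prime})\), it must be that \(\mathbf{g} = \mathbf{g}^{\prime}\). So, indeed \(\mathbf{g} \in D_{\mathbf{f}}\). □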

Remark 2

The reader may wonder if a similar result exists for OBSs that are not necessarily locally optimal. Then, however, the behaviour of OBS is much less predictable and the BDS formalism falls short. For an example, recall query Q1 and OBF \(f_{\mathcal {S}}\) from Example 6(1). Now let f be the OBF defined by \(f(I) = f_{\mathcal {S}}(I)\) if |I| is even, and f(I)=I if |I| is odd. OBF f is clearly correct for Q1 (because \(f_{\mathcal {S}}(I) \subseteq I\)), but cannot be simulated through a BDS.

6 Algorithms for Constructing a BDS

Lemma 5 and Lemma 6 yield a natural algorithm for constructing an OBF for a given conjunctive query Q by simply starting from \(S = \emptyset\) and adding new pairs one by one until no more pairs can be added. More formally, we introduce the algorithm BDS-BUILD, given in Algorithm 1. As there are exponentially many (in the size of Q) partial atomic types, we parameterize BDS-BUILD by a sequence \(\mathcal{R}\) of partial atomic types.4 The algorithm then produces a set of pairs \((\tau, T) \in \mathit{PTypes}(Q) \times 2^{\mathit{PTypes}(Q)}\).
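Algorithm 1 itself is not reproduced in this text. The following Python sketch reconstructs its loop structure from the proof of the theorem below, restricted to open types. It rests on several assumptions that are not taken from the paper: a query is modelled as a mapping from predicate names to atom variables, an open type is identified with its predicate, open types of distinct predicates are treated as always compatible, and variable containment reduces to plain set inclusion. The name `bds_build_open` is hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a self-join-free query is modelled as predicate -> atom variables,
# and an open type (pred, true) is identified with its predicate name.
Query = Dict[str, Tuple[str, ...]]


def bds_build_open(query: Query, order: List[str]) -> Dict[str, FrozenSet[str]]:
    """Hypothetical reconstruction of BDS-BUILD, restricted to open types."""
    bds: Dict[str, FrozenSet[str]] = {}
    for pred in order:  # outer loop: the sequence R
        # Assumption: open types of distinct predicates are always compatible.
        compatible_keys = [key for key in bds if key != pred]
        # Inner loop: variable containment must hold for every compatible key.
        if all(set(query[key]) <= set(query[pred]) for key in compatible_keys):
            bds[pred] = frozenset(compatible_keys)
    return bds


QUERY: Query = {"A": ("x", "y", "z"), "B": ("x", "y", "z"), "C": ("z", "w")}

first = bds_build_open(QUERY, ["A", "B", "C"])
second = bds_build_open(QUERY, ["C", "A", "B"])
print(first)
print(second)
```

For the query of Example 7, the order (A, B, C) yields the pairs (τA, ∅) and (τB, {τA}), matching Example 7(1). Feeding C first keeps only (τC, ∅), which illustrates how the order of the sequence shapes the resulting set.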

The following theorem obtains the correctness of BDS-BUILD. The complexity follows directly from the size of \(\mathcal {R}\) which is polynomial in the size of Q for open types and exponential for complete types.

Theorem 4

For a conjunctive query Q and a sequence \(\mathcal{R}\) consisting of exactly the complete (respectively, open) types, BDS-BUILD(Q) computes a BDS S for Q in time exponential (respectively, polynomial) in the size of Q such that \(f_{S}\) is correct for Q and locally optimal.

Proof 12

We show that (1) BDS-BUILD(Q) runs in time polynomial in the size of \(\mathcal{R}\) and Q, (2) BDS-BUILD(Q) computes a BDS S for Q, (3) S is correct for Q, and (4) \(f_{S}\) is locally optimal. Because there are exponentially many complete types (in the size of Q), and only polynomially many open types (in the size of Q), (1) implies the complexity claims of the theorem.

For (1), as every partial atomic type has a size that is polynomial in the size of Q, verifying variable containment and adding a pair to S can be done in polynomial time in Q.

These actions are repeated for every iteration of the inner and outer loop, which iterate over every key in the partially constructed set S, and over every element of \(\mathcal{R}\), respectively. By construction, S can have at most \(|\mathcal{R}|\) keys, implying that both loops together perform at most \(|\mathcal{R}|^2\) iterations, which confirms that BDS-BUILD(Q) runs in time polynomial in the size of \(\mathcal{R}\) and Q.
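The quadratic bound can be spelled out a little further. As a sketch, assuming that the i-th iteration of the outer loop starts with at most i−1 keys in S (at most one per type processed so far), the total number of inner iterations is bounded by

$$\sum_{i=1}^{|\mathcal{R}|} (i-1) \;=\; \frac{|\mathcal{R}|\,(|\mathcal{R}|-1)}{2} \;\le\; |\mathcal{R}|^{2}.$$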

For (2) and (3), observe that both conditions are satisfied when \(S = \emptyset\). Indeed, because there are no pairs in S, S is a BDS for Q, and every fact that can contribute to a satisfying valuation for Q is broadcast by \(f_{S}\).

Next, we argue that (2) and (3) remain satisfied during each step of the outer loop. Because \(\mathcal{R}\) contains exactly the complete (respectively, open) types, every partial type that is considered in the outer loop is disjoint with every partial type considered before, implying that conditions (1) and (2) of Definition 5 remain satisfied during each iteration. Further, as only pairs (τ,T) are added to S, where \(\mathit{Vars}(\tau^{\prime}) \subseteq_{\tau^{\prime}} \mathit{Vars}(\tau)\) is satisfied for every \(\tau^{\prime} \in T\), condition (3) of Definition 5 remains valid as well. For correctness, observe that every \(\tau^{\prime} \in \mathit{Keys}(S)\), where \(\tau^{\prime} \thicksim_{Q} \tau\), is added to T, implying that condition (3) of Lemma 5 remains satisfied.

It remains to argue (4). We distinguish between the case of complete and open types.

For \(\mathcal{R}\) consisting of complete types, condition (3a) of Lemma 6 is satisfied, because only atomic types that are already a key are considered as a value, and because keys are never removed during the construction. Condition (3b) is satisfied because every atomic type ω for Q is in \(\mathcal{R}\), and either ω is added to S as a key, or it is not added because there is already a compatible atomic type \(\omega_1\) in S, for which \(\mathit{Vars}(\omega_1) \not\subseteq_{\omega_1} \mathit{Vars}(\omega)\). So, again because keys are never removed during the construction, condition (3b) is satisfied. For condition (3c) it suffices to observe that value sets do not change during the construction of S. Therefore, for \((\tau_1, T_1) \in S\), \(\tau_2 \in T_1\) implies that \(\tau_2\) was already a key when \((\tau_1, T_1)\) was added, and thus \(\tau_1\) was not a key when \((\tau_2, T_2)\) was added, implying \(\tau_1 \notin T_2\).

For \(\mathcal {R}\) consisting of open types the proof is analogous. Every complete atomic type that implies an open type in S is added as a key during the construction, implying that condition (3a) holds.

For every complete atomic type ω, if ω implies no key in S, then the open type τ for \(\mathit{pred}(\omega)\) must have been excluded from S, implying that there is a key \(\tau^{\prime} \in \mathit{Keys}(S)\), where \(\mathit{Vars}(\tau^{\prime}) \not\subseteq_{\tau^{\prime}} \mathit{Vars}(\tau)\). Because \(\tau^{\prime}\) itself must be open, and Q is a CQ, there must be some atomic type \(\omega^{\prime} \models \tau^{\prime}\) such that \(\omega^{\prime} \thicksim_{Q} \omega\). The latter then implies \(\mathit{Vars}(\omega^{\prime}) \not\subseteq_{\omega^{\prime}} \mathit{Vars}(\omega)\).

Condition (3c) is satisfied, because for every \((\tau_1, T_1), (\tau_2, T_2) \in S\), where \(\omega_1 \models \tau_1\), \(\omega_2 \models \tau_2\), and \(\omega_1 \thicksim_{Q} \omega_2\), \(\omega_1 \in \mathit{Types}(T_2)\) implies that \(\tau_1 \in T_2\). So, \(\tau_1\) was already a key in S before \(\tau_2\) was added. Thus, \(\tau_2 \notin T_1\). The result then follows because for two distinct open types \(\tau, \tau^{\prime}\), \(\mathit{Types}(\tau)\) and \(\mathit{Types}(\tau^{\prime})\) are always disjoint. □

Notice that, on arbitrary (not necessarily complete) sequences of partial atomic types, the above algorithm outputs BDSs that are correct but not necessarily locally optimal for the given query. Further notice that the correctness and local-optimality of the BDS returned by BDS-BUILD is independent of the order in which types are fed to the algorithm, but that the order can influence its structure and thus the behaviour of the OBF that it describes.

Example 7

We illustrate BDS-BUILD by means of an example.

Consider the conjunctive query Q(x,y,z,w)←A(x,y,z),B(x,y,z),C(z,w).
  (1) Open types. Observe that query Q has three open types, being τA=(A,true), τB=(B,true), and τC=(C,true). Let \(\mathcal{R} = (\tau_A, \tau_B, \tau_C)\). Then, BDS-BUILD computes a BDS by starting from \(S = \emptyset\), expanding S to \(\{(\tau_A, \emptyset)\}\) in the first iteration and to \(\{(\tau_A, \emptyset), (\tau_B, \{\tau_A\})\}\) in the second iteration. During the last iteration, S is not changed anymore, because \(\mathit{Vars}(\tau_A) \not\subseteq_{\tau_A} \mathit{Vars}(\tau_C)\).

  (2) Complete types. The (complete) atomic types for Q are
    $$\begin{array}{@{}rcl@{}} &&\tau_X^{\ne} = (X, x\ne y\land y\ne z \land x \ne z), ~~~~ ~\tau_X^{x=z} = (X, x=z\land z\ne y\land y\ne z), \\ &&\tau_X^{x=y} = (X, x=y\land x\ne z\land y\ne z), ~\tau_X^{y=z} = (X, x\ne y\land y=z\land z\ne x),\! \\ &&\tau_X^{=} = (X, x=y\land x= z\land y= z), ~~~{\kern2pt}\tau_C^{=}\! =\! (C, z\,=\,w), \!\text{ and } \tau_C^{\ne} \,=\, (C, z\!\ne\! w), \end{array} $$
    where X∈{A,B}.5 Let
    $$\begin{array}{@{}rcl@{}} {\mathcal{R}} = (\tau_B^{\ne}, \tau_C^{=}, \tau_C^{\ne}, \tau_B^{x=z}, \tau_A^{x=y}, \tau_A^{\ne}, \tau_A^{x=z}, \tau_A^=, \tau_B^=, \tau_A^{y=z}, \tau_B^{x=y}, \tau_B^{y=z}). \end{array} $$
    Then, the output of algorithm BDS-BUILDQ is the BDS
    $$\begin{array}{@{}rcl@{}} \mathit{S} = \{{}& (\tau_B^{\ne}, \emptyset), (\tau_B^{x=z}, \emptyset), (\tau_A^{x=y}, \emptyset), (\tau_A^{\ne}, \{\tau_B^{\ne}\}), (\tau_A^{x=z}, \{\tau_B^{x=z}\}),~\\ &{\kern18pt}~(\tau_A^=, \emptyset),\allowbreak (\tau_B^{=}, \{\tau_A^{=}\}),\allowbreak (\tau_A^{y=z}, \emptyset),\allowbreak (\tau_B^{x=y}, \{\tau_A^{x=y}\}),\allowbreak (\tau_B^{y=z}, \{\tau_A^{y=z}\})\}. \end{array} $$

    Observe that the atomic types \(\tau _C^{=}\) and \(\tau _C^{\ne }\) are not part of S because the variable containment condition is not satisfied by the earlier included atomic type \(\tau _B^{\ne }\).

    Observe that the constructed BDS S can be simplified by merging multiple atomic types into partial atomic types; e.g., for
    $$\begin{array}{@{}rcl@{}} \mathit{S}^{\prime} = \{(\tau_A, \{\tau_B^{\ne}, \tau_B^{x=z}\}), (\tau_B, \{\tau_A^{x=y}, \tau_A^{{=}}, \tau_A^{y=z}\})\}, \end{array} $$
    we have \(f_{\mathit {S}} = f_{\mathit {S}^{\prime }}\).
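The run on complete types in part (2) can be replayed mechanically. The Python sketch below does so under assumptions that are my reading of the notions used above, not definitions quoted from the paper: a complete type is a predicate plus a partition of its atom variables into equality classes, two types are compatible when they have different predicates and agree on all (in)equalities between shared variables, and variable containment holds when every equality class of the key contains a variable of the other atom. All function names are hypothetical, and variable names are assumed to be single characters.

```python
from typing import Dict, FrozenSet, List, Tuple

Partition = FrozenSet[FrozenSet[str]]
CType = Tuple[str, Partition]  # (predicate, equality classes over its variables)

# Atoms of the query from Example 7.
ATOMS = {"A": ("x", "y", "z"), "B": ("x", "y", "z"), "C": ("z", "w")}


def ctype(pred: str, *equal: str) -> CType:
    """Complete type: each string in `equal` lists variables forced equal."""
    classes = [frozenset(group) for group in equal]
    covered = set().union(*classes)
    classes += [frozenset(v) for v in ATOMS[pred] if v not in covered]
    return pred, frozenset(classes)


def same(t: CType, u: str, v: str) -> bool:
    return any(u in c and v in c for c in t[1])


def compatible(t1: CType, t2: CType) -> bool:
    # Assumed reading: different predicates, same (in)equalities on shared variables.
    shared = set(ATOMS[t1[0]]) & set(ATOMS[t2[0]])
    return t1[0] != t2[0] and all(
        same(t1, u, v) == same(t2, u, v) for u in shared for v in shared
    )


def contained(key: CType, t: CType) -> bool:
    # Assumed reading of containment: each class of the key meets Vars of t's atom.
    return all(c & set(ATOMS[t[0]]) for c in key[1])


def bds_build(order: List[CType]) -> Dict[CType, FrozenSet[CType]]:
    bds: Dict[CType, FrozenSet[CType]] = {}
    for t in order:
        deps = [k for k in bds if compatible(k, t)]
        if all(contained(k, t) for k in deps):
            bds[t] = frozenset(deps)
    return bds


A_ne, A_xy, A_xz, A_yz, A_eq = (ctype("A"), ctype("A", "xy"), ctype("A", "xz"),
                                ctype("A", "yz"), ctype("A", "xyz"))
B_ne, B_xy, B_xz, B_yz, B_eq = (ctype("B"), ctype("B", "xy"), ctype("B", "xz"),
                                ctype("B", "yz"), ctype("B", "xyz"))
C_eq, C_ne = ctype("C", "zw"), ctype("C")

R = [B_ne, C_eq, C_ne, B_xz, A_xy, A_ne, A_xz, A_eq, B_eq, A_yz, B_xy, B_yz]
result = bds_build(R)
for key, deps in result.items():
    print(key, "->", set(deps))
```

Under these assumptions the sketch returns the ten pairs listed for S above, and both C-types are rejected because the key for \(\tau_B^{\ne}\) fails the containment test.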
     

Notice that when \(\mathcal{R}\) consists of the complete or open atomic types, adding pairs to a given BDS S as is done by BDS-BUILD(Q) results in a BDS \(S^{\prime}\) that describes an OBF that broadcasts strictly fewer facts, i.e., \(f_{\mathit{S}^{\prime}} \subsetneq f_{\mathit{S}}\). That is, adding pairs optimizes the OBF.
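The effect of adding pairs can be observed on a small instance. The sketch below evaluates the OBF described by an open-type BDS, assuming the semantics used in the completeness proof: a fact without an atomic type is never broadcast, a fact whose type is not a key is always broadcast, and a fact whose type is a key is withheld exactly when all of its dependency facts are present locally. The instance and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Fact = Tuple[str, Tuple[int, ...]]
QUERY = {"A": ("x", "y", "z"), "B": ("x", "y", "z"), "C": ("z", "w")}


def valuation(fact: Fact) -> Optional[Dict[str, int]]:
    """Partial valuation implied by a fact, or None if it cannot match its atom."""
    pred, values = fact
    if pred not in QUERY or len(values) != len(QUERY[pred]):
        return None
    val: Dict[str, int] = {}
    for var, value in zip(QUERY[pred], values):
        if val.setdefault(var, value) != value:
            return None
    return val


def broadcast(bds: Dict[str, FrozenSet[str]], instance: Set[Fact]) -> Set[Fact]:
    """Facts of `instance` that the OBF described by `bds` would broadcast."""
    out: Set[Fact] = set()
    for fact in instance:
        val = valuation(fact)
        if val is None:
            continue  # no atomic type: never broadcast
        pred = fact[0]
        if pred not in bds:
            out.add(fact)  # type is not a key: always broadcast
            continue
        # Dependency facts induced by the fact's valuation on each value type.
        deps = {(d, tuple(val[v] for v in QUERY[d])) for d in bds[pred]}
        if not deps <= instance:
            out.add(fact)  # some dependency is missing locally
    return out


I = {("A", (1, 2, 3)), ("B", (1, 2, 3)), ("B", (4, 5, 6)), ("C", (3, 7))}
s0: Dict[str, FrozenSet[str]] = {}
s1 = {"A": frozenset()}
s2 = {"A": frozenset(), "B": frozenset({"A"})}
for bds in (s0, s1, s2):
    print(sorted(broadcast(bds, I)))
```

On this instance the three sets shrink strictly: the empty BDS broadcasts everything, adding (τA, ∅) withholds the A-fact, and adding (τB, {τA}) additionally withholds B(1,2,3) because A(1,2,3) is local.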

Remark 5

By construction, BDS-BUILD(Q) prevents any circular dependencies by stratifying the construction of S so that partial atomic types can only depend on partial atomic types that were added before. As illustrated in Example 6(4), dependencies in a BDS can also be circular. To allow for these, BDS-BUILD can be modified as follows: as an alternative to adding pairs (ω,T) where every existing key that is compatible with ω is included in T, we can allow adding pairs where some keys that are compatible with ω are in T, and for every other compatible key, the respective value set is expanded to contain ω; i.e., allowing pairs of the form (ω,D), where D is a subset of \(C = \{\omega^{\prime} \in \mathit{Keys}(\mathit{S}) \mid \omega^{\prime} \thicksim_{Q} \omega\}\) satisfying \(\mathit{Vars}(\omega^{\prime}) \subseteq_{\omega^{\prime}} \mathit{Vars}(\omega)\) for every \(\omega^{\prime} \in D\), and where every existing pair \((\omega^{\prime}, T)\), where \(\omega^{\prime} \in C \setminus D\), is expanded to \((\omega^{\prime}, T \cup \{\omega\})\). Particularly notice that when a given BDS S is changed to \(S^{\prime}\) by adding a pair and expanding at least one of the existing pairs as described above, the inherent nature of the described OBF changes, so that not necessarily \(f_{\mathit{S}^{\prime}} \subsetneq f_{\mathit{S}}\).

Remark 6

Although the machinery developed throughout this paper is motivated by gaining a better understanding of the spectrum of locally optimal OBFs, the reader may notice that when no (statistical) information on the actual distribution of the data is available, there is no basis to favor one locally optimal OBF over another.

In fact, there is already a very simple algorithm to find an arbitrary locally optimal OBF for a given CQ Q which is as good as any locally optimal one (when no additional information on the distribution of the data is available). Indeed, consider an arbitrary order on the predicates of Q:

for every local fact \(\mathbf{f}\), with predicate R, if there is an earlier predicate S such that some variable in Vars(S) is not in Vars(R), fact \(\mathbf{f}\) is broadcast; otherwise, fact \(\mathbf{f}\) is broadcast only if some fact induced by \(V_{\mathbf{f}}\) on query Q is missing from the local instance.

Of course, not every locally optimal OBF can take this form.

7 Discussion

We investigated locally optimal oblivious broadcasting functions represented by the formalism of broadcast dependency sets. We obtained semantical and syntactical characterizations, showed completeness of BDSs for representing locally optimal OBFs, and gave an algorithm for constructing locally optimal OBFs for a given conjunctive query. We present several directions for future work: more expressive query languages, incorporating background knowledge, and non-oblivious broadcast functions.

An obvious question is how to generalize our results to the class of all conjunctive queries (possibly extended with negation) or even to (subsets of) Datalog. A first step would be to get rid of the fullness restriction and to allow self-joins. When removing these restrictions, output facts may have non-unique valuations, which makes reasoning about local optimality much more complex. Of course, to evaluate non-monotonic queries in a coordination-free manner, computing nodes need more information on how data is distributed (cf. [6]).

We only discussed how to build a BDS when no information about the way data is distributed is available. Indeed, the best one can do is to let a BDS cover as many types as possible, but at the same time introduce as few dependencies as possible, as these are likely to fail when data is arbitrarily distributed. It would be interesting to devise optimal broadcasting algorithms taking more background knowledge into account, like information about clustering of attributes, foreign keys, or cardinality of relations.

Another interesting direction for future work is to investigate non-oblivious broadcasting functions where over time, when new messages arrive, static facts can become broadcast facts (but not vice versa). Such functions are initially more conservative, keeping more facts static and only broadcasting facts when there is some evidence that they can be used at another computing node. For instance, consider the setting of Example 1. Rather than immediately sending B(i,j) whenever A(j,i) is locally absent, broadcasting is suspended until a C-fact of the form C(j,k) is received. The rationale is that a B-fact that cannot contribute to a locally satisfying valuation should only be broadcast when some evidence is received that it could potentially contribute to a satisfying valuation on a remote node. For our example this means that c waits to send B(2,1) until C(1,3) arrives. Moreover, B(4,4) is never sent. While non-oblivious strategies might seem more attractive as they transmit fewer tuples, such strategies, while remaining coordination-free, can increase the overall evaluation time.
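The deferred behaviour described for node c can be sketched as a small event handler. This is only an illustration under stated assumptions: facts are triples, the local instance of c is made up (the full instance of Example 1 is not repeated here), and the class `Node` with its `receive` method is hypothetical rather than part of any proposed algorithm.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, int, int]  # (predicate, first value, second value)


class Node:
    """Hypothetical node that defers B-facts until a matching C-fact arrives."""

    def __init__(self, own: Set[Fact]) -> None:
        self.own = set(own)          # facts initially stored at this node
        self.sent: List[Fact] = []   # B-facts released so far, in order

    def receive(self, fact: Fact) -> None:
        pred, j, _k = fact
        if pred != "C":
            return
        # Release every own B(i, j) whose partner A(j, i) is locally absent.
        for candidate in sorted(self.own):
            cpred, i, cj = candidate
            if (cpred == "B" and cj == j
                    and ("A", j, i) not in self.own
                    and candidate not in self.sent):
                self.sent.append(candidate)


c = Node({("B", 2, 1), ("B", 4, 4)})
before = list(c.sent)
c.receive(("C", 1, 3))
print(before, c.sent)
```

In this run nothing is sent before C(1,3) arrives, B(2,1) is released afterwards, and B(4,4) stays local because no C-fact of the form C(4,k) is ever received.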

Footnotes

  1. Actually, this observation is the straightforward part of the CALM-conjecture [14]. It is the converse direction which is more surprising: that every query which can be evaluated in an eventually consistent and coordination-free manner has to be monotone [6].

  2. To simplify notation, in the definition of B and eval, we do not mention I and \(\mathcal{N}\) as they are implied by H.

  3. Notice that we abuse the notation and interpret variables as values.

  4. We use a sequence rather than a set \(\mathcal{R}\) to keep BDS-BUILD deterministic.

  5. For convenience we represent atomic types here by partial atomic types with sufficient (but not complete) conditions; e.g., we write (C, x=y) to denote (C, x=y ∧ y=x). Nevertheless, all of the listed pairs indeed correspond to a single (complete) atomic type.

Notes

Acknowledgment

We thank Phokion Kolaitis for raising the question whether it is always necessary to broadcast all the data in the context of the work in [5]. We thank the reviewers for their in-depth comments and numerous suggestions for improving the presentation of the results.

References

  1. Afrati, F.N., Koutris, P., Suciu, D., Ullman, J.D.: Parallel skyline queries. In: International Conference on Database Theory (ICDT), pp 274–284 (2012)
  2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: International Conference on Extending Database Technology (EDBT), pp 99–110 (2010)
  3. Alvaro, P., Conway, N., Hellerstein, J., Marczak, W.R.: Consistency analysis in bloom: a CALM and collected approach. In: Conference on Innovative Data Systems Research (CIDR), pp 249–260 (2011)
  4. Alvaro, P., Conway, N., Hellerstein, J.M., Maier, D.: Blazes: Coordination analysis for distributed programs. In: International Conference on Data Engineering (ICDE), pp 52–63. IEEE (2014)
  5. Ameloot, T.J., Ketsman, B., Neven, F., Zinn, D.: Weaker forms of monotonicity for declarative networking: a more fine-grained answer to the CALM-conjecture. In: Symposium on Principles of Database Systems (PODS), pp 64–75. ACM (2014)
  6. Ameloot, T.J., Neven, F., Bussche, J.V.d.: Relational transducers for declarative networking. J. ACM 60(2), 15 (2013)
  7. Beame, P., Koutris, P., Suciu, D.: Communication steps for parallel query processing. In: Symposium on Principles of Database Systems (PODS), pp 273–284 (2013)
  8. Beame, P., Koutris, P., Suciu, D.: Skew in parallel query processing. In: Symposium on Principles of Database Systems (PODS), pp 212–223 (2014)
  9. Buneman, P., Cheney, J., Tan, W.C., Vansummeren, S.: Curated databases. In: Symposium on Principles of Database Systems (PODS), pp 1–12. ACM (2008)
  10. Buneman, P., Khanna, S., Tan, W.C.: Why and where: A characterization of data provenance. In: International Conference on Database Theory (ICDT), volume 1973 of Lecture Notes in Computer Science, pp 316–330. Springer (2001)
  11. Conway, N., Marczak, W.R., Alvaro, P., Hellerstein, J.M., Maier, D.: Logic and lattices for distributed programming. In: Symposium on Cloud Computing (SoCC), p 1. ACM (2012)
  12. Fan, W., Geerts, F., Libkin, L.: On scale independence for querying big data. In: Symposium on Principles of Database Systems (PODS), pp 51–62. ACM (2014)
  13. Ganguly, S., Silberschatz, A., Tsur, S.: Parallel bottom-up processing of datalog queries. J. Log. Program. 14(1&2), 101–126 (1992)
  14. Hellerstein, J.M.: The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Rec. 39(1), 5–19 (2010)
  15. Ketsman, B., Neven, F.: Optimal broadcasting strategies for conjunctive queries over distributed data. In: International Conference on Database Theory (ICDT), pp 291–307 (2015)
  16. Koutris, P., Suciu, D.: Parallel evaluation of conjunctive queries. In: Symposium on Principles of Database Systems (PODS), pp 223–234 (2011)
  17. Meliou, A., Gatterbauer, W., Halpern, J.Y., Koch, C., Moore, K.F., Suciu, D.: Causality in databases. IEEE Data Engineering Bulletin 33(3), 59–67 (2010)
  18. Meliou, A., Gatterbauer, W., Moore, K.F., Suciu, D.: The complexity of causality and responsibility for query answers and non-answers. Proceedings of the VLDB Endowment (PVLDB) 4(1), 34–45 (2010)
  19. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp 15–28. USENIX Association (2012)
  20. Zinn, D., Green, T.J., Ludäscher, B.: Win-move is coordination-free (sometimes). In: International Conference on Database Theory (ICDT), pp 99–113 (2012)

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Hasselt University and transnational University of Limburg, Hasselt, Belgium
