# Optimal Broadcasting Strategies for Conjunctive Queries over Distributed Data


## Abstract

In a distributed context where data is dispersed over many computing nodes, monotone queries can be evaluated in an eventually consistent and coordination-free manner through a simple but naive broadcasting strategy which makes all data available on every computing node. In this paper, we investigate more economical broadcasting strategies for full conjunctive queries without self-joins that only transmit a part of the local data necessary to evaluate the query at hand. We consider oblivious broadcasting strategies which determine which local facts to broadcast independent of the data at other computing nodes. We introduce the notion of broadcast dependency set (BDS) as a sound and complete formalism to represent locally optimal oblivious broadcasting functions. We provide algorithms to construct a BDS for a given conjunctive query and study the complexity of various decision problems related to these algorithms.

### Keywords

Coordination-free evaluation, Conjunctive queries, Broadcasting

## 1 Introduction

We assume the setting introduced in the context of declarative networking [6, 14], where queries are specified on a logical level over a global schema and are evaluated by multiple computing nodes over which the input database is distributed. These nodes can perform local computations and communicate asynchronously with each other via messages. The model then operates under the assumption that messages can never be lost but can be arbitrarily delayed.

It is known that every monotone query can be evaluated in an eventually consistent and coordination-free manner through a naive broadcasting strategy that makes all data available to all nodes [14].^{1} Indeed, every computing node sends all its local data to every other node and reevaluates the query every time new data arrives. This evaluation is eventually consistent as, because of monotonicity, no facts will be derived which later have to be retracted and, furthermore, when all transmitted data has arrived, the output of every node will correspond to the result of the query. In addition, the computation requires no coordination between the nodes.

Obviously, the above strategy leads to a very wasteful evaluation, as the whole database is sent to every node and every node independently computes the complete answer to the targeted query. In the present paper, we are interested in more economical broadcasting strategies where only a subset of the local data is transmitted and where each computing node contributes to the answer of the query by outputting only a subset of the answer tuples. The result of the query then is the union of the tuples output by the computing nodes. In particular, we focus on full conjunctive queries without self-joins and we consider *oblivious broadcasting strategies* where every computing node determines which facts will be broadcast based solely on the content of its own local database (so, oblivious of the data at other nodes). By the latter we particularly mean the initial local database. Our strategies are thus independent of incoming messages and can be thought of as ‘single-shot’ broadcasting strategies.

The sent facts are referred to as *broadcast facts*. Facts that are not initially broadcast are called *static*. We illustrate the ideas behind such strategies by means of an example.

*Example 1*

Let *Q*_{1} be the query *Q*_{1}(*x*,*y*,*z*)←*A*(*x*,*y*),*B*(*y*,*x*),*C*(*x*,*z*) and let *I*={*A*(1,2),*A*(2,2),*B*(2,1),*B*(2,2),*B*(4,4),*C*(1,3)} be a database instance. Consider a network of two computing nodes *c* and *c*^{′} containing the facts *I*(*c*)={*A*(2,2),*B*(2,1),*B*(2,2)} and *I*(*c*^{′})={*A*(1,2),*B*(4,4),*C*(1,3)}, respectively.

### Naive broadcasting strategy

The naive broadcasting algorithm outlined above sends all facts in *I*(*c*) to *c*^{′} and all facts in *I*(*c*^{′}) to *c*. Eventually, both *c* and *c*^{′} receive all data and both of them compute the result of the query, that is, *Q*_{1}(*I*)={(1,2,3)}.

### Improved oblivious broadcasting strategy

The strategy just described is clearly oblivious but also rather wasteful. Therefore, consider the following strategy, which broadcasts all of the *C*-facts but none of the *A*-facts. Furthermore, a *B*-fact *B*(*i*,*j*) is broadcast only when *A*(*j*,*i*) does not occur in the local database. Executing this strategy for every computing node in our example results in *c* broadcasting the set {*B*(2,1)} while *c*^{′} broadcasts {*B*(4,4),*C*(1,3)}. So, eventually, *I*^{∗}(*c*)={*A*(2,2),*B*(2,1),*B*(2,2),*B*(4,4),*C*(1,3)} and *I*^{∗}(*c*^{′})={*A*(1,2),*B*(2,1),*B*(4,4),*C*(1,3)}. Here, we denote by *I*^{∗}(*d*) the instance at node *d* when all transmitted messages have arrived. Therefore, *Q*_{1}(*I*^{∗}(*c*))=*∅* and *Q*_{1}(*I*^{∗}(*c*^{′}))={(1,2,3)}, and *Q*_{1}(*I*) equals *Q*_{1}(*I*^{∗}(*c*))∪*Q*_{1}(*I*^{∗}(*c*^{′})). Intuitively, this strategy is correct in general as the following invariant holds for every computing node *d*: when a fact *B*(*i*,*j*) is *not* broadcast at *d*, then every satisfying valuation *V* for *Q*_{1} on *I* that maps (*y*,*x*) to (*i*,*j*) can be realized locally in *I*^{∗}(*d*). Notice that a similar strategy reversing the roles of *A*- and *B*-facts would work as well.
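The numbers in Example 1 can be replayed mechanically. The following is a minimal Python sketch, not part of the formal development: facts are encoded as (predicate, tuple) pairs, and the names `evaluate`, `broadcast`, `H`, `sent`, and `results` are illustrative assumptions. The query is evaluated by brute force over the active domain, which is adequate only for toy instances.

```python
from itertools import product

# Q1(x,y,z) <- A(x,y), B(y,x), C(x,z); a fact is a pair (predicate, values).
BODY = [("A", ("x", "y")), ("B", ("y", "x")), ("C", ("x", "z"))]
HEAD = ("x", "y", "z")


def evaluate(instance):
    """Return Q1(instance) by trying every valuation over the active domain."""
    adom = {v for _, args in instance for v in args}
    answers = set()
    for values in product(adom, repeat=len(HEAD)):
        valuation = dict(zip(HEAD, values))
        if all((p, tuple(valuation[u] for u in vs)) in instance for p, vs in BODY):
            answers.add(values)
    return answers


def broadcast(local):
    """Improved strategy of Example 1: all C-facts, no A-facts, and
    B(i,j) only when A(j,i) is absent from the local instance."""
    out = set()
    for pred, args in local:
        if pred == "C":
            out.add((pred, args))
        elif pred == "B" and ("A", (args[1], args[0])) not in local:
            out.add((pred, args))
    return out


# The distribution of Example 1 over the nodes c and c'.
H = {
    "c": {("A", (2, 2)), ("B", (2, 1)), ("B", (2, 2))},
    "c'": {("A", (1, 2)), ("B", (4, 4)), ("C", (1, 3))},
}
sent = set().union(*(broadcast(local) for local in H.values()))
results = {node: evaluate(local | sent) for node, local in H.items()}
print(results)  # node c derives nothing, node c' derives (1, 2, 3)
```

Under this encoding, the union of the per-node results coincides with the result of *Q*_{1} on the full instance *I*, as claimed in the example.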

We will formalize oblivious broadcasting functions as generic mappings. This means that decisions on whether to broadcast facts do not depend only on the name of the predicate but can also depend on the equality type of the fact under consideration. Therefore, the following strategy would be valid as well: always broadcast facts of the form *C*(*i*,*j*) with *i*≠*j* and keep all facts of the form *C*(*i*,*i*) static; broadcast all *B*-facts; broadcast a fact *A*(*i*,*j*) only when the fact *C*(*i*,*i*) is not present in the local database. While not immediately obvious, this strategy correctly computes *Q*_{1} on every distributed database.

Both strategies will be presented more formally in Section 5 in terms of *broadcast dependency sets* and are formalized further in Example 6(1) and 6(2).

(*i*) We provide a semantical characterization of when an oblivious broadcasting function (OBF) correctly evaluates a given conjunctive query. While it is desirable to construct OBFs that minimize the overall amount of transmitted facts over all distributed databases, we show that there is no optimal OBF for any conjunctive query with at least two distinct atoms in its body. Therefore, we turn to a slightly weaker notion of optimality, called locally optimal, which requires that an OBF is optimal w.r.t. the local instance at every computing node. Intuitively, this means that no broadcast fact can be made static without sacrificing correctness. We provide a semantical characterization for when an OBF is locally optimal for a given conjunctive query.

(*ii*) We introduce the notion of a broadcast dependency set (BDS) as a formalism to specify OBFs. In brief, a BDS \(\mathcal {S}\) is a set of pairs (*τ*,*T*) where *τ* is a partial equality type w.r.t. a relation and *T* is a set of partial equality types. Every such pair encodes a rule that can be interpreted roughly as follows: when a fact **f** matches type *τ*, it will be broadcast at a computing node *c* when the set of facts induced by *T* is not present at *c*. We present necessary and sufficient syntactic conditions for when a BDS is correct for a given query and also for when it is locally optimal w.r.t. that query. Furthermore, we study the complexity of deciding whether a BDS is correct for a query and whether it is locally optimal. Finally, and most importantly, we show that the formalism of BDS is expressively complete w.r.t. locally optimal OBFs by obtaining that every locally optimal OBF can be represented by a BDS. In fact, every locally optimal OBF can already be represented by a BDS that only uses complete types, that is, types where the equalities between all variables are fully specified.

(*iii*) Based on the syntactic criteria of when a BDS is correct for *Q* and when it is locally optimal, we obtain an algorithm BDS-BUILD(*Q*) that computes a locally optimal OBF (represented as a BDS) for a given conjunctive query *Q*. When restricting to open types (these are types without restrictions on the equalities between variables), BDS-BUILD(*Q*) computes a locally optimal OBF in time polynomial in the size of *Q*. When considering complete types, BDS-BUILD(*Q*) computes a locally optimal OBF in time exponential in the size of *Q* simply because there are exponentially many complete types.

### Outline

We discuss related work in Section 2 and introduce the necessary definitions and concepts in Section 3. In Section 4, we discuss OBFs and local optimality. In Section 5, we discuss broadcast dependency sets and study their properties. In Section 6, we provide an algorithm to construct a locally optimal OBF for a given conjunctive query. We conclude in Section 7.

The present paper is the full version of the extended abstract [15] and provides the missing proofs.

## 2 Related Work

### CALM

The approach in this paper is motivated by the work on the CALM-conjecture. Hellerstein [14] formulated the CALM-principle which suggests a link between logical monotonicity and distributed consistency without the need for coordination. The latter principle is, for instance, embedded in BLOOM, a declarative language for distributed programming, for which practical program analysis techniques have been developed detecting potential consistency anomalies [3, 4, 11]. Ameloot et al. [6] formalized (and proved) the CALM-conjecture in terms of relational transducer networks. Zinn et al. [20] showed that the generalization of the conjecture to stronger variants of relational transducer networks fails. Ameloot et al. [5] then subsequently provided a more fine-grained answer to the CALM-conjecture by relating these stronger variants of relational transducer networks to weaker notions of monotonicity. All of these works considered naive evaluation strategies that broadcast *all* of the local data. In particular, none of these works considered more economic broadcasting evaluation of conjunctive queries.

### Massively Parallel Model

The networked relational transducer model is just one paradigm for studying distributed query evaluation. In the massively parallel (MP) model, introduced by Koutris and Suciu [16], computation proceeds in a sequence of parallel steps, each followed by global synchronization of all servers. In this model, evaluation of conjunctive queries [7, 16] as well as skyline queries [1] have been considered. Recently, Beame et al. [8] proved a matching upper and lower bound for the amount of communication needed to compute a full conjunctive query without self-joins in one communication round. Notice that this is the same subclass of CQs as we consider in this work. The upper bound is provided by a randomized algorithm called Hypercube which dates back to Ganguly et al. [13] and was described by Afrati and Ullman [2] in the context of MapReduce algorithms. Hypercube is motivated by modern massively distributed systems like, for instance, Spark [19], where entire computations reside in main memory, replay is used to recover, and the dominant cost is that of communication. We note that one-round Hypercube is coordination-free and can be easily employed within the framework of relational transducer networks as well. A characteristic of Hypercube-style algorithms is that the space of computing nodes (over which the input data will be distributed) needs to be known in advance. The broadcasting strategies considered in this paper are motivated by a cloud computing setting where data is already initially distributed and the complete space of computing nodes is not necessarily known in advance. In this respect, Hypercube-style and broadcasting algorithms are orthogonal.

### Relevance

One approach to minimizing data transfer for a query *Q* is to find the smallest subset *J* of a distributed instance *I* for which *Q*(*I*)=*Q*(*J*) and then only broadcast the relevant subset *J*. Determining which part of a database is relevant for answering a query is a problem that arises in different contexts. For instance, causality in databases aims to determine which tuples in the database instance caused the output to a query [17, 18]. Then, the contingency set asks for the smallest set *K* such that *Q*(*I*)≠*Q*(*I*−*K*). So, any set *I*−*K* extended with one element is relevant. Similarly, “where” and “why” provenance refer to the location(s) in the source databases from which the output was extracted or by which the output was influenced [9, 10]. Fan et al. [12] study the problem of scale independence where, through access patterns, the result of a query depends only on a bounded part of the database. It would be interesting to investigate how these different approaches translate to a distributed setting. Certainly, any lower bounds for the sequential setting imply lower bounds for the distributed setting, but upper bounds need to take into account that the initial database instance *I* is distributed as well.

The oblivious broadcasting strategies that we introduce operate locally on nodes and are unaware of the data residing on other nodes. In fact, our strategies are also independent of the network configuration itself (i.e., the set of computing nodes). Therefore, these strategies apply for example to (fast) evolving networks, where the exact state of the network at a given time may be unknown, as long as no adjustments in the network configuration happen during the query evaluation.

## 3 Preliminaries

### Instances and Queries

For a finite set *S*, we denote by |*S*| its cardinality and by 2^{S} its powerset. We denote {1,…,*n*} by [*n*], for \(n\in \mathbb {N}\). We assume an infinite set **dom** of data values. A *database schema* *σ* is a collection of relation names *R* where every *R* has arity *ar*(*R*)>0. We call \(R(\bar d)\) a *fact* when *R* is a relation name and \(\bar d\) is a tuple over **dom**. We say that a fact *R*(*d*_{1},…,*d*_{k}) is *over* a database schema *σ* if *R*∈*σ* and *ar*(*R*)=*k*. A *(database) instance* *I* over *σ* is simply a finite set of facts over *σ*. We denote by *Adom*(*I*) the set of all values that occur in facts of *I*. When *I*={**f**}, we simply write *Adom*(**f**) rather than *Adom*({**f**}). A *query* over a schema *σ* to a schema *σ*^{′} is a generic mapping *Q* from instances over *σ* to instances over *σ*^{′}. Genericity means that for every permutation *π* of **dom** and every instance *I*, *Q*(*π*(*I*))=*π*(*Q*(*I*)). For the remainder of the paper, we assume that the database schema *σ* over which queries are defined is clear from the context, and do not refer to it anymore. A query *Q* is *monotone* if *Q*(*I*)⊆*Q*(*J*) for all instances *I*,*J* with *I*⊆*J*. We only consider monotone queries in the sequel.

### Conjunctive Queries

Let **var** be the universe of variables, disjoint from **dom**. An *atom* *A* is of the form *R*(*u*_{1},…,*u*_{k}) where *R* is a relation name and each *u*_{i}∈**var**. We call *R* the *predicate* and denote it by *pred*(*A*). We denote the variables occurring in *A* by *Vars*(*A*)={*u*_{1},…,*u*_{k}}. We say that *A* is an atom *over* the database schema *σ* if *pred*(*A*)∈*σ* and *k*=*ar*(*pred*(*A*)). A *conjunctive query* *Q* (CQ) is an expression of the form *A*_{0}←*A*_{1},…,*A*_{n}, where for every *i*∈[*n*], *A*_{i} is an atom over the schema and *A*_{0} is an atom not over the schema. In particular, *A*_{0} is the head of *Q*, denoted *head*_{Q}, and *A*_{1},…,*A*_{n} is the body of *Q*, denoted *body*_{Q}. By *Vars*(*Q*) we denote all the variables occurring in *Q*. A *valuation* for *Q* on an instance *I* is a function *V*:*Vars*(*Q*)→**dom**. The *application* of *V* to an atom *A*=*R*(*u*_{1},…,*u*_{k}), denoted *V*(*A*), results in the fact *R*(*a*_{1},…,*a*_{k}) where *a*_{i}=*V*(*u*_{i}) for each *i*∈[*k*]. The valuation *V* is said to be *satisfying* for *Q* on *I* if *V*(*A*)∈*I* for all atoms *A* in the body of *Q*. In that case, *V* derives the fact *V*(*A*_{0}). The result of *Q* on *I*, denoted *Q*(*I*), is defined as the set of facts that can be derived by satisfying valuations.

In what follows, we assume that every CQ is full and does not contain self-joins. Formally, we require that *p**r**e**d*(*A*_{i})≠*p**r**e**d*(*A*_{j}) for *i*≠*j* and \(Vars(A_{0}) = \bigcup _{i\in [n]} Vars(A_{i})\). That is, every atom has a unique relation symbol and all variables occurring in the body occur in the head as well. For instance, *Q*_{1}(*x*,*y*,*z*)←*A*(*x*,*y*),*B*(*x*,*z*),*C*(*y*,*y*) is full and does not contain self-joins, while *Q*_{2}(*x*,*y*)←*A*(*x*,*y*),*B*(*x*,*z*),*C*(*y*,*y*) is not full and *Q*_{3}(*x*,*y*,*z*)←*A*(*x*,*y*),*A*(*x*,*z*),*C*(*y*,*y*) contains a self-join.

### Distributed Database

A *network* \(\mathcal {N}\) is a nonempty finite set of values from **dom**, which we call *nodes*. A *distribution* of an instance *I* over \(\mathcal {N}\) is a function *H* that maps each \(c\in \mathcal {N}\) to an instance such that \(I = \bigcup _{c\in \mathcal {N}}H(c)\). Notice that facts can be replicated. We also refer to each of the *H*(*c*) as the *local instances*. We consider a model where nodes have unlimited computational power and can send messages to all other nodes. These messages can never be lost but can be arbitrarily delayed.

The latter is formalised in [6] in terms of a local buffer for each computing node that is used to store incoming messages. Computation of the network is then defined as a transition system where in every transition a node becomes active and non-deterministically picks a message from its input buffer. A fairness condition is imposed to ensure that all messages are eventually read.

## 4 Oblivious Broadcasting

We refrain from introducing the formalism of relational transducer networks from [6], but present a simpler setting more suitable for our needs. In particular, the relational transducer networks needed in this paper only perform two actions: decide which facts to broadcast (and transmit those) and evaluate the query under consideration whenever new data arrives. The only parameter is the broadcasting strategy used, which therefore forms the focus of our formalization. In brief, we consider broadcasting strategies where computing nodes partition their local database into *static* and *broadcast* facts. Static facts are kept local while broadcast facts, as the name already indicates, are sent to all other nodes in the network. As we only consider conjunctive queries, which are monotone, the target query can be recomputed whenever new data arrives.

### 4.1 Oblivious Broadcasting Functions

We now formally define oblivious broadcasting functions.

**Definition 1**

An *oblivious broadcasting function (OBF)* *f* is a generic mapping that maps instances to instances such that *f*(*J*)⊆*J* for all instances *J*.

An OBF specifies which local facts are broadcast. Specifically, *f*(*J*) are the broadcast facts while *J*∖*f*(*J*) are the static facts. We use the term oblivious as broadcast facts only depend on the local database instance and their choice is independent of the facts at other computing nodes. An OBF *f* is *naive* when there are no static facts, that is, *f*(*J*)=*J* for all instances *J*.

Given a CQ *Q*, an instance *I*, a distribution *H* of *I*, and a network \(\mathcal {N}\), an OBF *f* implies a broadcasting algorithm in the following way. Let \(B(f,H)=\bigcup _{c\in \mathcal {N}} f(H(c))\) be the set of broadcast facts. Then, define \(eval(Q,f,H)= \bigcup _{c\in \mathcal {N}} Q(H(c)\cup B(f,H))\) as the union of the query result at every computing node over the local instance extended with all broadcast facts.^{2}
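The two definitions above translate directly into code. The sketch below is an illustration under simplifying assumptions: a full CQ is given as a list of (predicate, variables) atoms plus its head variables, a distribution is a dictionary from nodes to sets of facts, and all function names (`evaluate`, `broadcast_facts`, `eval_distributed`, `naive`, `silent`) as well as the two-atom query are hypothetical.

```python
from itertools import product


def evaluate(body, head, instance):
    """Q(instance) for a full CQ, by brute force over the active domain."""
    adom = {v for _, args in instance for v in args}
    answers = set()
    for values in product(adom, repeat=len(head)):
        val = dict(zip(head, values))
        if all((p, tuple(val[u] for u in vs)) in instance for p, vs in body):
            answers.add(values)
    return answers


def broadcast_facts(f, H):
    """B(f,H): the union of f(H(c)) over all nodes c."""
    return set().union(*(f(local) for local in H.values()))


def eval_distributed(body, head, f, H):
    """eval(Q,f,H): the union over all nodes c of Q(H(c) ∪ B(f,H))."""
    sent = broadcast_facts(f, H)
    return set().union(*(evaluate(body, head, local | sent) for local in H.values()))


def naive(local):
    """The naive OBF: f(J) = J."""
    return set(local)


def silent(local):
    """An OBF that keeps every fact static; in general not correct."""
    return set()


# Hypothetical query Q(x,y,z) <- A(x,y), B(y,z) over a two-node network.
body = [("A", ("x", "y")), ("B", ("y", "z"))]
head = ("x", "y", "z")
H = {1: {("A", (1, 2))}, 2: {("B", (2, 3))}}
print(eval_distributed(body, head, naive, H))   # {(1, 2, 3)}
print(eval_distributed(body, head, silent, H))  # set()
```

The second call shows why keeping everything static fails here: the two facts needed for the only satisfying valuation never meet at a single node.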

*Remark 1*

We note that the function *eval*(*Q*,*f*,*H*) implies an evaluation that can be executed by a transducer program *π*_{f,Q} at every node *c* as follows: (1) *R*=*∅*, output *Q*(*H*(*c*)), broadcast *f*(*H*(*c*)); (2) whenever a fact **f** arrives, *R*=*R*∪{**f**}, output *Q*(*H*(*c*)∪*R*). Correctness then follows from the genericity and monotonicity of *Q*. We refer to the execution strategy induced by *eval*(*Q*,*f*,*H*) as a *broadcasting algorithm*. Coordination-freeness intuitively follows as *π*_{f,Q} never waits. Formally, a transducer is coordination-free [6] if there is a so-called *ideal* distribution, on which the query is already computed by a prefix of a run that does not process any of the incoming facts. For *π*_{f,Q} this is the distribution that puts the complete instance at every node. We refer to [6] for a more formal treatment of coordination-freeness.
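Steps (1) and (2) of the program *π*_{f,Q} can be mimicked by a small simulation. The following Python sketch is only an approximation of the transducer model of [6]: arbitrary message delay is modelled as a random delivery order, and the class `Node`, the function `run`, and the query used are hypothetical names chosen for illustration.

```python
import random
from itertools import product

# Hypothetical query Q(x,y,z) <- A(x,y), B(y,z).
BODY = [("A", ("x", "y")), ("B", ("y", "z"))]
HEAD = ("x", "y", "z")


def evaluate(instance):
    """Q(instance), by brute force over the active domain."""
    adom = {v for _, args in instance for v in args}
    out = set()
    for values in product(adom, repeat=len(HEAD)):
        val = dict(zip(HEAD, values))
        if all((p, tuple(val[u] for u in vs)) in instance for p, vs in BODY):
            out.add(values)
    return out


class Node:
    def __init__(self, local, f):
        self.local = set(local)
        self.received = set()                # R in Remark 1, initially empty
        self.output = evaluate(self.local)   # step (1): output Q(H(c))
        self.outbox = f(self.local)          # step (1): broadcast f(H(c))

    def receive(self, fact):
        # Step (2): R = R ∪ {fact}, then output Q(H(c) ∪ R).
        self.received.add(fact)
        self.output |= evaluate(self.local | self.received)


def run(H, f, seed=0):
    """Deliver every broadcast fact to every other node in a shuffled order."""
    nodes = {c: Node(local, f) for c, local in H.items()}
    messages = [(d, fact) for c, n in nodes.items()
                for fact in n.outbox for d in nodes if d != c]
    random.Random(seed).shuffle(messages)
    for dest, fact in messages:
        nodes[dest].receive(fact)
    return set().union(*(n.output for n in nodes.values()))


H = {1: {("A", (1, 2))}, 2: {("B", (2, 3))}, 3: {("A", (5, 2))}}
print(run(H, lambda J: set(J)))  # naive OBF
```

Because outputs only grow and never have to be retracted, the final union of outputs is the same for every delivery order, which is the eventual consistency property discussed above.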

**Definition 2**

An OBF *f* is *correct* for a CQ *Q* when *Q*(*I*)=*e**v**a**l*(*Q*,*f*,*H*) for all instances *I* and all distributions *H* of *I*.

When *f* is correct for *Q*, we also say that *f* is an OBF *for* *Q*. The following lemma characterizes correctness in that two compatible facts residing at different computing nodes can never be both static. Indeed, if they are, then the valuation witnessing compatibility is never realized at any computing node and consequently *f* cannot be correct for *Q*.

We say that two distinct facts **f** and **g** are *compatible w.r.t.* *Q*, denoted **f**∼_{Q}**g**, when, in some model, they are assigned to two atoms from the body of *Q* under one valuation, i.e., there is a valuation *V* for *Q* and atoms *A*,*B*∈*body*_{Q}, such that *V*(*A*)=**f** and *V*(*B*)=**g**.

*Example 2*

For an example, recall query *Q*_{1} from Example 1: *Q*_{1}(*x*,*y*,*z*)←*A*(*x*,*y*),*B*(*y*,*x*),*C*(*x*,*z*). For *Q*_{1}, facts *A*(1,2) and *B*(2,1) are compatible, because they are in the image of the valuation *V*={*x*↦1,*y*↦2,*z*↦3} for *Q*_{1}. This same valuation also witnesses compatibility of *A*(1,2) and *C*(1,3), and of *B*(2,1) and *C*(1,3).

For an example of facts not compatible for *Q*_{1}, take *A*(1,2) and *B*(2,2), for which it is easy to see that no valuation can assign variable *x* to both values 1 (for A) and 2 (for B).
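For queries without self-joins, compatibility can be tested by unifying each fact with the unique body atom of its predicate and checking that shared variables receive the same value. The Python sketch below illustrates this on *Q*_{1}; the encoding of the body as a dictionary and the names `bind` and `compatible` are assumptions made for the example.

```python
def bind(atom_vars, values, valuation):
    """Extend `valuation` so that it maps atom_vars to values.
    Returns the extended valuation, or None when a variable would
    have to take two different values (or the arity does not fit)."""
    if len(atom_vars) != len(values):
        return None
    extended = dict(valuation)
    for var, val in zip(atom_vars, values):
        if extended.setdefault(var, val) != val:
            return None
    return extended


def compatible(body, f, g):
    """True iff the distinct facts f and g are images of body atoms
    under a single valuation. `body` maps each predicate to the
    variable tuple of its unique atom (no self-joins)."""
    if f == g or f[0] not in body or g[0] not in body:
        return False
    partial = bind(body[f[0]], f[1], {})
    return partial is not None and bind(body[g[0]], g[1], partial) is not None


# Q1(x,y,z) <- A(x,y), B(y,x), C(x,z)
Q1 = {"A": ("x", "y"), "B": ("y", "x"), "C": ("x", "z")}
print(compatible(Q1, ("A", (1, 2)), ("B", (2, 1))))  # True
print(compatible(Q1, ("A", (1, 2)), ("B", (2, 2))))  # False
```

Variables that occur in neither atom are unconstrained, so a partial valuation that is consistent on the two atoms can always be extended to all of *Vars*(*Q*).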

**Lemma 1**

*Let Q be a CQ and f be an OBF. Then, the following are equivalent*:

1. *f is correct for Q; and*
2. *there are no instances I, J, and facts* **f**, **g**, *with* **f**∼_{Q}**g**, **g**∉*I*, **f**∉*J* *such that* **f**∉*f*(*I*∪{**f**}) *and* **g**∉*f*(*J*∪{**g**}).

*Proof 1*

(1) ⇒(2) We start by showing that every OBF for *Q* satisfies the above condition. The proof is by contraposition, so we assume that there are instances *I* and *J* and compatible facts **f** and **g** w.r.t. *Q*, where **g**∉*I* and **f**∉*J*, but **f**∉*f*(*I*∪{**f**}) and **g**∉*f*(*J*∪{**g**}). Let *K* be an instance and let *V* be a satisfying valuation for *Q* on *K* witnessing compatibility of **f** and **g**. Then consider a network \(\mathcal {N} = \{1,2,3\}\) and an instance *L*=*I*∪*J*∪*V*(*b**o**d**y*_{Q}) with the following distribution *H*: *H*(1)=*I*∪{**f**}, *H*(2)=*J*∪{**g**}, and *H*(3)=*V*(*b**o**d**y*_{Q})∖{**f**,**g**}. Clearly, *V*(*h**e**a**d*_{Q})∈*Q*(*L*). As *Q* is full, \(V(head_{Q}) \not \in \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f,H))\) because none of the computing nodes contain both **f** and **g**, and **f** and **g** are not broadcast. Thus, \(Q(L) \ne \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f,H))=eval(Q, f, H)\) and *f* is not an OBF for *Q*.

(2) ⇒(1) It remains to show that if the above condition is satisfied, then *f* is an OBF for *Q*. For this, let *I* be an instance, \(\mathcal {N}\) a network, and *H* a distribution of *I* over \(\mathcal {N}\). We prove that \(Q(I) = \textit {eval}(Q,f,H)=\bigcup _{c\in \mathcal {N}}Q(H(c) \cup B(f,H))\). As *Q* is monotone, *Q*(*H*(*c*)∪*B*(*f*,*H*))⊆*Q*(*I*) for every \(c\in \mathcal {N}\). Hence, it suffices to show that \(Q(I) \subseteq \bigcup _{c\in \mathcal {N}}Q(H(c)\cup B(f, H))\). Thereto, let **f**∈*Q*(*I*), let *V* be a satisfying valuation for *Q* over *I* for which *V*(*h**e**a**d*_{Q})=**f**. Let *J*=*V*(*b**o**d**y*_{Q})∖*B*(*f*,*H*), and *c* a node for which |*H*(*c*)∩*J*| is maximal. We claim that *J*⊆*H*(*c*), obviously implying that **f** will be derived at node *c*. Towards a contradiction, assume there is an **f**_{i}∈*J*∖*H*(*c*). As **f**_{i}∈*I* there is a \(d \in \mathcal {N}\), *c*≠*d*, such that **f**_{i}∈*H*(*d*). Moreover, by choice of *c*, |*H*(*d*)∩*J*|≤|*H*(*c*)∩*J*| and thus there must be a fact **f**_{j}∈*H*(*c*)∩*J* that is not in *H*(*d*). However, as **f**_{i}∼_{Q}**f**_{j}, **f**_{i}∉*H*(*c*), and **f**_{j}∉*H*(*d*), the instances *H*(*d*), *H*(*c*), and the facts **f**_{i},**f**_{j} contradict condition (2). □

### 4.2 Local Optimality

We are interested in OBFs that transmit as little data as possible. Thereto, we investigate sensible notions of optimality. We fix a query *Q*, an instance *I*, a distribution *H* of *I*, and a network \(\mathcal {N}\). The total number of transmitted facts equals \(||B(f,H)||={\sum }_{c\in \mathcal {N}} | f(H(c))|\). Of course, ||*B*(*f*,*H*)||≥|*B*(*f*,*H*)|.

**Definition 3**

An OBF *f* for a CQ *Q* is *optimal* iff ||*B*(*f*,*H*)||≤||*B*(*g*,*H*)|| for every other OBF *g* for *Q* and for every instance *I* and distribution *H*.

Intuitively, an OBF is optimal when it transmits the least amount of data over all instances and all distributions. The next result, however, shows that this notion of optimality, although desirable, is unattainable.

**Lemma 2**

*There is no optimal OBF for any conjunctive query with at least two distinct atoms in its body.*

*Proof 2*

Let *Q* be the conjunctive query *A*_{0}←*A*_{1},…,*A*_{n} with *n*≥2. Towards a contradiction assume there is an optimal OBF *f* for *Q*. Let *I* be the canonical instance for *Q* where for every *i*∈[*n*], the relation *p**r**e**d*(*A*_{i}) is interpreted by the fact *A*_{i}.^{3} Now, consider a network \(\mathcal {N} = [n]\) and a distribution *H* that places every fact in *I* on a distinct node. As all of the *n* facts in *I* need to be gathered at one node, at least *n*−1 facts must be broadcast. As the OBF that broadcasts all *A*_{i}-facts for *i*<*n* and keeps all *A*_{n}-facts static is correct for *Q* and only transmits *n*−1 facts on *I*, by assumption on the minimality of *f*, ||*B*(*f*,*H*)||=*n*−1. Let **g** be the fact in *I* that is not broadcast by *f* and assume w.l.o.g. that *p**r**e**d*(**g**)=*A*_{n}. Now, consider *I*^{′}=*I*∖{**g**}. And let *H*^{′} equal *H* restricted to only facts in *I*^{′} over \(\mathcal {N}\). Then, as **g** is not broadcast in *H*, ||*B*(*f*,*H*)||=||*B*(*f*,*H*^{′})||. However, the OBF that broadcasts all *A*_{i}-facts for *i*>1 and keeps all *A*_{1}-facts static is correct for *Q* and only broadcasts *n*−2 facts on *I*^{′} contradicting the optimality of *f*. □

We next turn to a different form of optimality. For two OBFs *f* and *g*, we say that *f* is *included* in *g*, denoted *f*⊆*g*, iff *f*(*I*)⊆*g*(*I*) for every instance *I*.

**Definition 4**

An OBF *f* for a CQ *Q* is *locally optimal* iff for every other OBF *g* for *Q*, *g*⊆*f* implies *f*=*g*.

Intuitively, when *f* is locally optimal there is no OBF for *Q* that, on every instance, broadcasts only a subset of the facts broadcast by *f* and, on some instance, a strict subset.

The next lemma gives a sufficient criterion for when an OBF cannot be locally optimal. Specifically, a condition is given for when a broadcast fact **f** can be kept static and a more economical OBF *f*^{′} can be derived.

**Lemma 3**

*Let Q be a CQ and let f be an OBF for Q. If there is an instance I and a fact* **f** *for which* **f**∈*f*(*I*∪{**f**}), *but there is no instance J and no fact* **g** *for which* **f**∼_{Q}**g**, **g**∉*I*, **f**∉*J*, *and* **g**∉*f*(*J*∪{**g**}), *then there is an OBF f*^{′} *for Q for which* \(f^{\prime } \subsetneq f\).

*Proof 3*

Let *f*, *I*, and **f** be as given by the statement of the lemma. The proof is by construction. For an instance *J*, let *I*_{**f**,*J*} be the set of facts that (by genericity) relate to *J* in the same way as **f** relates to *I*. That is, *I*_{**f**,*J*}={*π*(**f**)∣*π* a permutation s.t. *π*(*I*)=*J*}. Then, define *f*^{′} as the mapping where for every instance *J*, *f*^{′}(*J*)=*f*(*J*)∖*I*_{**f**,*J*}. Notice that \(f^{\prime } \subsetneq f\) by construction of *f*^{′}. Furthermore, *f*^{′} is generic and is an OBF. It remains to show that *f*^{′} is an OBF for *Q*. Towards a contradiction, assume that *f*^{′} is not an OBF for *Q*. Then, by Lemma 1, there are instances *J*_{1} and *J*_{2} and facts **g**_{1} and **g**_{2}, for which **g**_{1}∼_{Q}**g**_{2}, **g**_{2}∉*J*_{1}, **g**_{1}∉*J*_{2}, and \(\mathbf {g}_{1} \not \in f^{\prime }(J_{1}\cup \{\mathbf {g}_{1}\})\) and \(\mathbf {g}_{2} \not \in f^{\prime }(J_{2}\cup \{\mathbf {g}_{2}\})\). As *f* is an oblivious broadcasting function for *Q*, it holds that *J*_{1}=*π*(*I*) and **g**_{1}=*π*(**f**) for some permutation *π*. As *Q* does not contain self-joins and \(\mathbf {g}_{1}\sim _{Q} \mathbf {g}_{2}\), this means that \(\mathbf {g}_{2}\not \in I_{\mathbf {f}, J_{2}}\). Therefore, \(\mathbf {g}_{2} \not \in f(J_{2}\cup \{\mathbf {g}_{2}\})\) which contradicts the condition of the lemma (taking *π*^{−1}(**g**_{2}) and *π*^{−1}(*J*_{2}) as **g** and *J*, respectively). □

The following lemma now characterizes when an OBF for a query is locally optimal.

**Lemma 4**

*Let Q be a CQ and let f be an OBF for Q. The following are equivalent*:

1. *f is locally optimal; and*
2. *for every instance I and fact* **f** *for which* **f**∈*f*(*I*∪{**f**}), *there is an instance J and a fact* **g** *such that* **f**∼_{Q}**g**, **g**∉*I*, **f**∉*J*, *and* **g**∉*f*(*J*∪{**g**}).

*Proof 4*

We can assume that *Q* contains at least two atoms. Indeed, when *Q* contains one atom, the only locally optimal OBF is the one that broadcasts no facts and the lemma trivially holds. The direction from (1) to (2) follows from Lemma 3.

(2) ⇒(1) Let *f* be an OBF for *Q*. Towards a contradiction assume that *f* is not locally optimal. That is, there exists another OBF *f*^{′} for *Q* such that \(f^{\prime } \subsetneq f\). In particular, there is an instance *I* and a fact **f** such that **f**∉*f*^{′}(*I*∪{**f**}), while **f**∈*f*(*I*∪{**f**}). By Lemma 1, for every fact **g** with **f**∼_{Q}**g** where **g**∉*I*, and for every instance *J*, where **f**∉*J*, it must be that **g**∈*f*^{′}(*J*∪{**g**}). The latter then implies that for every such **g** and *J*, **g**∈*f*(*J*∪{**g**}) which contradicts condition (2) of the present lemma. □

## 5 Broadcasting Functions Based on Dependency Sets

In this section, we introduce the notion of a broadcast dependency set (BDS) as a formalism to specify OBFs. We present necessary and sufficient conditions for when a BDS induces an OBF which is correct for a given query and also for when it is locally optimal. Furthermore, we study the complexity of the corresponding decision problems. Finally, we show that every locally optimal OBF can be represented by a BDS thereby obtaining that BDS is complete as a representation formalism for locally optimal OBFs.

### 5.1 Broadcast Dependency Sets

In a nutshell, a broadcast dependency set is a set of key-dependency set pairs, where each pair consists of an equality type (the key), and a set of dependencies (to be formalised later) associated to this key. Intuitively, a BDS gives rise to the following broadcasting function semantics: a fact is broadcast only if it satisfies one of the key equality-types, and at least one of the associated dependencies fails.

We proceed with the formal definition. Let *Q* be the CQ *A*_{0}←*A*_{1},…,*A*_{n}. We assume *Q* is full and does not contain self-joins. Therefore an atom *A*_{i} in *b**o**d**y*_{Q} is uniquely identified by its predicate *p**r**e**d*(*A*_{i}). For a predicate *R*, we denote by *a**t**o**m*(*R*) the unique atom *A*∈*b**o**d**y*_{Q} for which *p**r**e**d*(*A*)=*R*.

For a finite set of variables *X*, a *partial equality type over X* is a pair of binary relations *φ*=(*E*_{φ},*I*_{φ}) representing equalities and inequalities among elements in *X*. Formally, we require that *E*_{φ}∪*I*_{φ}⊆*X*×*X*, *E*_{φ} is an equivalence relation, and *I*_{φ} is irreflexive and symmetric. We abuse notation and also use *φ* to denote the formula \(\bigwedge \{x = y \mid (x,y)\in E_{\varphi } \}\land \bigwedge \{x \ne y \mid (x,y)\in I_{\varphi } \}.\) We tacitly assume that partial equality types are always consistent. That is, we always assume that there is a tuple \(\bar a\) such that the formula \(\varphi (\bar a)\) evaluates to true. When for all (*x*,*y*)∈*X*×*X*, either (*x*,*y*)∈*E*_{φ} or (*x*,*y*)∈*I*_{φ}, then *φ* completely specifies all relations between variables in *X* and we call *φ* an *equality type*. For emphasis, we sometimes say *complete equality type* rather than just *equality type* even though equality type always means complete equality type.

A *partial atomic type* (over *Q*) is a pair *τ*=(*R*_{τ},*φ*_{τ}), where *R*_{τ} is a database predicate and *φ*_{τ} is a partial type over *V**a**r**s*(*a**t**o**m*(*R*_{τ})), that is, the variables occurring in the unique atom *A*∈*b**o**d**y*_{Q} for which *p**r**e**d*(*A*)=*R*_{τ}. By *V**a**r**s*(*τ*) we denote the variables over which *τ* is defined, i.e., *V**a**r**s*(*τ*)=*V**a**r**s*(*a**t**o**m*(*R*_{τ})). Sometimes we write *a**t**o**m*(*τ*) to abbreviate *a**t**o**m*(*R*_{τ}). We say that *τ* is an *atomic type* when *φ*_{τ} is an equality type. To improve readability, we denote partial atomic types with *τ* and (complete) atomic types with *ω*. We denote by *P**T**y**p**e**s*(*Q*) and *T**y**p**e**s*(*Q*) the set of all partial atomic types and atomic types over *Q*, respectively.

*Example 3*

Consider the equality types *φ*_{1}, *φ*_{2}, *φ*_{3} over variables *X*={*x*,*y*,*z*}:

Alternatively, we can express these equality types through conditions *φ*_{1}:=*x*≠*y*∧*y*≠*z*∧*x*=*z*, *φ*_{2}:=*x*=*y*∧*y*=*z*∧*x*=*z*, and *φ*_{3}:=*x*=*z*. Here, *φ*_{1} and *φ*_{2} are complete over *X*, and *φ*_{3} is a partial equality type over *X*.

Examples of atomic types over query *Q*(*x*,*y*,*z*)←*A*(*x*,*x*),*B*(*x*,*y*,*z*) are complete atomic types *ω*_{1}=(*B*,*φ*_{1}) and *ω*_{2}=(*B*,*φ*_{2}), and partial atomic type *τ*=(*B*,*φ*_{3}).
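The consistency and completeness requirements on partial equality types can be checked mechanically. The following is a minimal sketch, assuming a type is given by lists of generating equality and inequality pairs rather than by the fully closed relations *E*_{φ} and *I*_{φ}; the function names are illustrative and not part of the paper.

```python
from itertools import combinations


def closure_classes(variables, equalities):
    """Union-find over the variables; maps each variable to its class representative."""
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y in equalities:
        parent[find(x)] = find(y)
    return {v: find(v) for v in variables}


def is_consistent(variables, equalities, inequalities):
    """A type is consistent when no inequality relates two variables forced equal."""
    rep = closure_classes(variables, equalities)
    return all(rep[x] != rep[y] for x, y in inequalities)


def is_complete(variables, equalities, inequalities):
    """A type is complete when every pair of variables is either equated or stated unequal."""
    rep = closure_classes(variables, equalities)
    stated = {frozenset(pair) for pair in inequalities}
    return all(rep[x] == rep[y] or frozenset((x, y)) in stated
               for x, y in combinations(list(variables), 2))
```

Under this encoding, *φ*_{1} and *φ*_{2} from Example 3 are reported as complete, whereas *φ*_{3} (only *x*=*z*) is consistent but not complete.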

A fact **f** *is of type* *τ* or *satisfies* *τ*, denoted **f**⊧*τ*, when there is a valuation *h* from the variables in *atom*(*R*_{τ}) onto *Adom*(**f**) such that *h*(*atom*(*R*_{τ}))=**f** and the formula *φ*_{τ} evaluates to true where each *x*_{i} is substituted by *h*(*x*_{i}). Notice that *h* is unique for **f**. Hereafter we will refer to *h* as *V*_{f}. By *type*(**f**), we denote the unique atomic type satisfied by **f** when it exists. As atomic types are defined w.r.t. *Q*, *type*(**f**) is not always defined. Indeed, when **f**=*R*(*a*,*b*) (with *a*≠*b*) and *atom*(*R*)=*R*(*x*,*x*), then there is no *τ* with **f**⊧*τ*. Two partial atomic types *τ*,*τ*^{′} are *compatible w.r.t. Q*, denoted *τ*∼_{Q}*τ*^{′}, when there are facts **f** and **g** with **f**⊧*τ* and **g**⊧*τ*^{′} such that **f**∼_{Q}**g**. We say that *τ* *implies* *τ*^{′}, denoted *τ*⊧*τ*^{′}, if for all facts **f**, **f**⊧*τ* implies **f**⊧*τ*^{′}. We can think of a partial atomic type as a disjunction of types for a shared predicate symbol. Define *Types*(*τ*)={*ω*∈*Types*(*Q*)∣*ω*⊧*τ*} as the set of all atomic types *ω* which imply *τ*. Notice that *ω*⊧*τ* iff *ω*∈*Types*(*τ*) for any atomic type *ω*. For a set of partial atomic types *T*, we use *Types*(*T*) as an abbreviation for \(\bigcup _{\tau \in T}\mathit {Types}(\tau )\).

*Example 4*

For an example, recall query *Q* and the atomic types *ω*_{1}, *ω*_{2} and partial atomic type *τ* from Example 3. Fact *B*(*a*,*b*,*a*) satisfies *ω*_{1} and *τ*, but not *ω*_{2}. That it satisfies *τ* also follows from *ω*_{1}⊧*τ*.

Let *ω*_{3}=(*A*,*x*=*x*). Then it is easy to see that *ω*_{3}∼_{Q}*ω*_{1} due to the satisfying facts *A*(1,1) and *B*(1,2,1), respectively, and valuation *V*:{*x*↦1,*y*↦2,*z*↦1} for *Q*.
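The satisfaction test **f**⊧*τ* amounts to computing the valuation *V*_{f} and evaluating the conditions of the type. A minimal sketch follows, assuming that the atom is given as a tuple of variable names, the fact as a tuple of values for the same predicate, and the type as lists of equality and inequality pairs; the helper names are hypothetical.

```python
def valuation(atom_vars, fact_values):
    """Return V_f as a dict, or None when the fact does not match the atom
    (for example R(a, b) with a != b against the atom R(x, x))."""
    if len(atom_vars) != len(fact_values):
        return None
    v = {}
    for var, val in zip(atom_vars, fact_values):
        # setdefault keeps the first value seen; a repeated variable must agree with it
        if v.setdefault(var, val) != val:
            return None
    return v


def satisfies(atom_vars, fact_values, equalities, inequalities):
    """Check f |= tau for a fact and a type over the same predicate."""
    v = valuation(atom_vars, fact_values)
    if v is None:
        return False
    return (all(v[x] == v[y] for x, y in equalities)
            and all(v[x] != v[y] for x, y in inequalities))
```

With the atom *B*(*x*,*y*,*z*), the fact *B*(*a*,*b*,*a*) of Example 4 passes the test for *ω*_{1} and *τ* and fails it for *ω*_{2}.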

For sets of variables *X* and *Y*, and a partial atomic type *τ*, we write *X*⊆_{τ}*Y* if for all *x*∈*X* either *x*∈*Y* or there is a *y*∈*Y* such that \((x,y)\in E_{\varphi _{\tau }}\). That is, *X* is a subset of *Y* when taking the equalities in \(E_{\varphi _{\tau }}\) into account. For instance, let *τ* be a type such that \((y,z)\in E_{\varphi _{\tau }}\), then {*x*,*y*,*z*}⊆_{τ}{*x*,*y*}.
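The relation ⊆_{τ} is ordinary containment up to the equalities of *τ*. A small sketch, assuming the equalities are supplied as generating pairs whose closure is computed on the fly (the function names are illustrative only):

```python
def equivalence_classes(variables, equalities):
    """Map every variable to the set of variables equated with it (closure of the pairs)."""
    classes = {v: {v} for v in variables}
    for x, y in equalities:
        merged = classes[x] | classes[y]
        for v in merged:
            classes[v] = merged
    return classes


def subset_modulo(xs, ys, equalities):
    """X is contained in Y modulo the equalities: each x is in Y or equated with some y in Y."""
    xs, ys = set(xs), set(ys)
    mentioned = {v for pair in equalities for v in pair}
    classes = equivalence_classes(xs | ys | mentioned, equalities)
    return all(classes[x] & ys for x in xs)
```

For the instance given in the text, `subset_modulo({"x", "y", "z"}, {"x", "y"}, [("y", "z")])` holds, while the same call with no equalities does not.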

For a set of pairs \(\mathcal {S}\), we define \(\mathit {Keys}(\mathcal {S})=\{a\mid (a,b)\in \mathcal {S}\}\) and \(\mathit {Values}(\mathcal {S})=\{b\mid (a,b)\in \mathcal {S}\}\).

**Definition 5**

A *broadcast dependency set (BDS)* for a CQ *Q* is a set \(\mathcal {S}\) of pairs (*τ*,*T*), where *τ*∈*PTypes*(*Q*) is a *key*, and *T*∈2^{PTypes(Q)} is a *dependency set*, such that the following holds:

1. \((\tau ,T)\in \mathcal {S}\) and \((\tau ,T^{\prime })\in \mathcal {S}\) implies *T*=*T*^{′};
2. for distinct \(\tau ,\tau ^{\prime }\in \mathit {Keys}(\mathcal {S})\), *Types*(*τ*)∩*Types*(*τ*^{′})=*∅*; and,
3. \((\tau , T) \in \mathcal {S}\) implies \(\mathit {Vars}(\tau ^{\prime }) \subseteq _{\tau ^{\prime }} \mathit {Vars}(\tau )\) for every *τ*^{′}∈*T*.

We refer to the elements of a dependency set as *dependencies*.

The above definition states that (1) every key can have at most one value in \(\mathcal {S}\); (2) every complete type implies at most one partial type \(\tau \in \mathit {Keys}(\mathcal {S})\); and, (3) the set of variables of *a**t**o**m*(*τ*^{′}) is included in the set of variables of *a**t**o**m*(*τ*) taking into account the equalities in \(E_{\tau ^{\prime }}\). We first explain informally how a BDS represents an OBF. Let **f** be a fact in the local instance at a computing node. When *t**y**p**e*(**f**) is undefined, then **f** is static as **f** can never participate in any satisfying valuation. For instance this happens when **f**=*R*(*a*,*b*) with *a*≠*b* and *Q* contains the atom *R*(*x*,*x*). Every pair \((\tau ,T)\in \mathcal {S}\) now specifies a condition on facts: when **f**⊧*τ* then **f** is broadcast only if a set of facts implied by *T* (to be formalized below) is not present at the local instance. Furthermore, when there is no \(\tau \in \mathit {Keys}(\mathcal {S})\) for which **f**⊧*τ*, **f** is broadcast as well. In this light, conditions (1) and (2) ensure that every local fact **f** is matched by at most one partial type \(\tau \in \mathit {Keys}(\mathcal {S})\); and, condition (3) ensures that when **f**⊧*τ* then *V*_{f} can be extended in a unique way to a valuation for every *τ*^{′}∈*T* that is consistent with **f**, that is, for which *t**y**p**e*(**f**)∼_{Q}*τ*^{′}.

For a local fact **f**, if there is no \(\tau \in \mathit {Keys}(\mathcal {S})\) for which **f**⊧*τ* then **f** is always broadcast. Otherwise, by conditions (1) and (2) of Definition 5, there is exactly one \(\tau \in \mathit {Keys}(\mathcal {S})\) such that **f**⊧*τ*. Recall that *V*_{f} is the valuation (defined above) such that *V*_{f}(*atom*(*τ*))=**f**. Then, by condition (3) of Definition 5, *V*_{f} can also be interpreted as a valuation for *atom*(*τ*^{′}) for every *τ*^{′}∈*T* for which *type*(**f**)∼_{Q}*τ*^{′}. Indeed, for every *y*∈*Vars*(*τ*^{′})∖*Vars*(*τ*) there is a variable *x*∈*Vars*(*τ*) for which \((x,y)\in E_{\tau ^{\prime }}\). Therefore, define for every *y*∈*Vars*(*τ*^{′}),

$$V_{\mathbf {f},\tau ^{\prime }}(y) = \begin{cases} V_{\mathbf {f}}(y) & \text{if } y \in \mathit {Vars}(\tau ),\\ V_{\mathbf {f}}(x) & \text{otherwise, where } x \in \mathit {Vars}(\tau ) \text{ and } (x,y)\in E_{\tau ^{\prime }}. \end{cases}$$

As *type*(**f**)∼_{Q}*τ*^{′}, the above is well-defined.

Now, **f** is broadcast when the local instance does not contain all the facts \(V_{\mathbf {f},\tau ^{\prime }}(\mathit {atom}(\tau ^{\prime }))\) for which *τ*^{′}∈*T* and *type*(**f**)∼_{Q}*τ*^{′}. We refer to these facts as the *dependency fact set*. To formally define \(f_{\mathcal {S}}\), we set \(Dep(\mathbf {f},T)=\{V_{\mathbf {f},\tau ^{\prime }}(\mathit {atom}(\tau ^{\prime })) \mid \tau ^{\prime } \in T \text{ and } \mathit {type}(\mathbf {f}) \sim _{Q} \tau ^{\prime } \}.\) Notice that *T*≠*∅* does not necessarily imply *Dep*(**f**,*T*)≠*∅*, because \(\mathit {type}(\mathbf {f}) \sim _{Q} \tau ^{\prime }\) may fail for *τ*^{′}∈*T*. Further notice that *Dep*(**f**,*T*)=*∅* means that the fact **f** is static. Then, define \(Dep(\mathbf {f},\mathcal {S})\) as *Dep*(**f**,*T*) when there is a \((\tau ,T)\in \mathcal {S}\) for which **f**⊧*τ*. Otherwise, \(Dep(\mathbf {f},\mathcal {S})\) is undefined.

*Example 5*

*Q*_{2}. To illustrate how OBF \(f_{\mathcal {S}}\) works, let

The facts *A*(1,1,2), *A*(1,2,2), *C*(3,3) do not match a key in \(\mathcal {S}\) and their type occurs in *Types*(*Q*). So they are broadcast. The fact *C*(3,4) is not broadcast as its type does not occur in *Types*(*Q*) (*C*(3,4) does not match *C*(*z*,*z*)). The fact **f**_{1}=*B*(1,1,2) matches *τ*_{B} and \(\mathit {Dep}(\mathbf {f}_{1},\{ \tau _{A}^{x=y}, \tau _{A}^{y=z}\})=\{A(1,1,2)\}\subseteq I\). Therefore, *B*(1,1,2) is static. Similarly, the fact **f**_{2}=*B*(1,2,2) matches *τ*_{B} and \(\mathit {Dep}(\mathbf {f}_{2},\{\tau _{A}^{x=y}, \tau _{A}^{y=z}\})=\{A(1,2,2)\}\subseteq I\). Therefore, *B*(1,2,2) is static as well. The fact **f**_{3}=*A*(1,2,3) is static as it matches \(\tau _{A}^{x\ne y}\) and \(\mathit {Dep}(\mathbf {f}_{3},\{\tau _{b}^{\neq }\})=\{B(1,2,3)\} \subseteq I\). The fact **f**_{4}=*B*(1,2,3) is static as it matches *τ*_{B} and \(\mathit {Dep}(\mathbf {f}_{4},\{\tau _{A}^{x=y}, \tau _{A}^{y=z}\})= \emptyset .\)

**Definition 6**

For a CQ *Q* and a BDS \(\mathcal {S}\) for *Q*, define \(f_{\mathcal {S}}\) as the function that maps every instance *J* to the set \(f_{\mathcal {S}}(J)\) of those facts **f**∈*J* for which (1) *t**y**p**e*(**f**)∈*T**y**p**e**s*(*Q*); and, (2) \(Dep(\mathbf {f},\mathcal {S})\) is undefined or \(Dep(\mathbf {f},\mathcal {S})\not \subseteq J\).

Intuitively, **f** is static only when *type*(**f**)∉*Types*(*Q*) (**f** cannot participate in any satisfying valuation) or the dependency fact set \(Dep(\mathbf {f},\mathcal {S})\) is present at the local instance. Notice that a fact **f** is thus broadcast when its type is defined but does not imply a key in \(\mathcal {S}\). This is because then \(Dep(\mathbf {f},\mathcal {S})\) is undefined.

*Example 6*

- (1) For a simple example of a BDS \(\mathcal {S}\) and OBF \(f_{\mathcal {S}}\), recall query *Q*_{1} from Example 1, being *Q*_{1}(*x*,*y*,*z*)←*A*(*x*,*y*),*B*(*y*,*x*),*C*(*x*,*z*). Let *φ*=(*∅*,*∅*), that is, *φ* imposes no restrictions. Let *τ*_{A}=(*A*,*φ*) and *τ*_{B}=(*B*,*φ*). Then, \(\mathcal {S} = \{(\tau _{B}, \{\tau _{A}\}), (\tau _{A}, \emptyset )\}\) is a BDS for *Q*_{1}. Indeed, every partial atomic type occurs at most once as a key. There is no (complete) atomic type that implies both *τ*_{A} and *τ*_{B}. Furthermore, the variable containment condition between *τ*_{A} and *τ*_{B} is satisfied. Notice that \(f_{\mathcal {S}}\) simulates exactly the broadcasting function which is described in Example 1.
- (2) For an example where condition (3) of Definition 5 does not reduce to ordinary variable containment, consider again query *Q*_{1} from Example 1. Let *τ*_{C}=(*C*,*x*=*z*), and *τ*_{A}=(*A*,true). Then, \(\mathcal {S} = \{(\tau _{A}, \{\tau _{C}\}), (\tau _{C}, \emptyset )\}\) is a BDS for *Q*_{1}. Notice that *Vars*(*τ*_{C})⊈*Vars*(*τ*_{A}) but \(\mathit {Vars}(\tau _{C})\subseteq _{\tau _{C}} \mathit {Vars}(\tau _{A})\).
- (3) Our final example shows that dependencies can be circular. Let $${Q_{3}(x,y,z)}\leftarrow{A(x,y), B(y,z), C(z,x)}. $$ Let *τ*_{A}=(*A*,*x*=*y*), *τ*_{B}=(*B*,*y*=*z*), and *τ*_{C}=(*C*,*z*=*x*), so that each type equates the two variables of its atom. Then, \(\mathcal {S} = \{(\tau _{A}, \{\tau _{B}\}), (\tau _{B}, \{\tau _{C}\}), (\tau _{C}, \{\tau _{A}\})\}\) is a BDS for *Q*_{3}. Though correctness of \(f_{\mathcal {S}}\) for *Q*_{3} follows from Lemma 5, we provide some intuition. Let *I*={*A*(1,1),*B*(1,1),*C*(1,1)} be a database instance. Consider a network containing the nodes *c*_{1}, *c*_{2}, and *c*_{3}. When *I*(*c*_{1})={*A*(1,1)}, *I*(*c*_{2})={*B*(1,1)}, and *I*(*c*_{3})={*C*(1,1)}, then all three facts will be broadcast. Now, assume one of the nodes contains two of the facts in *I*, w.l.o.g., say *I*(*c*_{1})={*A*(1,1),*B*(1,1)}. Then, exactly one of the facts in *I*(*c*_{1}) is broadcast, namely *B*(1,1). Now, suppose that *C*(1,1) is mapped on some node, say *c*_{2}, but that *C*(1,1) is not broadcast. Then it must be that *A*(1,1) is mapped on *c*_{2} as well. So, broadcasting *B*(1,1) indeed suffices to guarantee correctness.
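The semantics of Definition 6 can be replayed on the circular example (3). The sketch below is illustrative only: it assumes a query is a mapping from predicates to the variable tuple of their unique atom, a partial atomic type is a triple (predicate, equalities, inequalities), and that the compatibility test *type*(**f**)∼_{Q}*τ*^{′} can be carried out by extending *V*_{f} along the equalities of *τ*^{′} and checking the resulting fact against *τ*^{′}. That last step is an assumption of this sketch, not a procedure stated in the text, and all names are hypothetical.

```python
def valuation(atom_vars, values):
    """Return V_f as a dict, or None when the fact does not match the atom."""
    if len(atom_vars) != len(values):
        return None
    v = {}
    for var, val in zip(atom_vars, values):
        if v.setdefault(var, val) != val:
            return None
    return v


def satisfies(query, fact, tau):
    """f |= tau for tau = (predicate, equalities, inequalities)."""
    pred, values = fact
    tau_pred, eqs, neqs = tau
    if pred != tau_pred:
        return False
    v = valuation(query[pred], values)
    return (v is not None
            and all(v[x] == v[y] for x, y in eqs)
            and all(v[x] != v[y] for x, y in neqs))


def dependency_fact(query, fact, tau_dep):
    """Candidate dependency fact for tau_dep, or None when it does not exist."""
    pred, values = fact
    v = dict(valuation(query[pred], values))
    dep_pred, eqs, _ = tau_dep
    changed = True
    while changed:  # propagate known values along the equalities of tau_dep
        changed = False
        for x, y in eqs:
            for a, b in ((x, y), (y, x)):
                if a in v and b not in v:
                    v[b] = v[a]
                    changed = True
    dep_vars = query[dep_pred]
    if any(x not in v for x in dep_vars):
        return None
    g = (dep_pred, tuple(v[x] for x in dep_vars))
    return g if satisfies(query, g, tau_dep) else None


def broadcast(query, bds, instance):
    """Facts of the local instance that are broadcast under the given BDS."""
    instance = set(instance)
    out = set()
    for fact in instance:
        pred, values = fact
        if pred not in query or valuation(query[pred], values) is None:
            continue  # type(f) is undefined, so f is static
        deps = None  # None encodes "Dep(f, S) is undefined"
        for key, dep_types in bds:
            if satisfies(query, fact, key):
                candidates = [dependency_fact(query, fact, t) for t in dep_types]
                deps = {g for g in candidates if g is not None}
                break
        if deps is None or not deps <= instance:
            out.add(fact)
    return out


# Query Q3 and the circular BDS: each key equates the two variables of its atom.
Q3 = {"A": ("x", "y"), "B": ("y", "z"), "C": ("z", "x")}
TAU_A = ("A", [("x", "y")], [])
TAU_B = ("B", [("y", "z")], [])
TAU_C = ("C", [("z", "x")], [])
BDS = [(TAU_A, [TAU_B]), (TAU_B, [TAU_C]), (TAU_C, [TAU_A])]
```

For the local instance {*A*(1,1),*B*(1,1)}, `broadcast(Q3, BDS, ...)` returns only *B*(1,1), in line with the discussion of node *c*_{1} above; a fact such as *A*(1,2), which matches no key, is always broadcast.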

Note that not every BDS for *Q* induces an OBF which is correct for *Q*. Indeed, the following lemma provides equivalent semantic and syntactic conditions for an OBF \(f_{\mathcal {S}}\) to be correct for a query.

**Lemma 5**

*Let Q be a CQ and let* \(\mathcal {S}\) *be a BDS for Q. Then the following are equivalent*:

1. \(f_{\mathcal {S}}\) *is an OBF for Q*;
2. *there are no instances I, J, and facts* **f**, **g**, *with* **f**∼_{Q}**g**, **g**∉*I*, **f**∉*J such that* \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) *and* \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\); *and*
3. *there are no (complete) atomic types ω*_{1}, *ω*_{2}, *and pairs* \((\tau _{1}, T_{1}),(\tau _{2}, T_{2}) \in \mathcal {S}\), *with ω*_{1}∼_{Q}*ω*_{2}, *ω*_{1}⊧*τ*_{1}, *ω*_{2}⊧*τ*_{2} *such that ω*_{1}∉*Types*(*T*_{2}) *and ω*_{2}∉*Types*(*T*_{1}).

*Proof 5*

(1) ⇔(2) Because \(f_{\mathcal {S}}\) is an OBF, the equivalence follows immediately from Lemma 1.

(2) ⇒(3) The proof is by contraposition. So, assume that there are two (complete) atomic types *ω*_{1},*ω*_{2}, and pairs \((\tau _{1}, T_{1}),(\tau _{2}, T_{2}) \in \mathcal {S}\), with *ω*_{1}∼_{Q}*ω*_{2}, *ω*_{1}∈*Types*(*τ*_{1}), *ω*_{2}∈*Types*(*τ*_{2}) such that *ω*_{1}∉*Types*(*T*_{2}) and *ω*_{2}∉*Types*(*T*_{1}). Now, because *ω*_{1}∼_{Q}*ω*_{2}, there are facts **f** and **g**, with **f**∼_{Q}**g**, *type*(**f**)=*ω*_{1}, and *type*(**g**)=*ω*_{2}. Define \(I = Dep(\mathbf {f},\mathcal {S})\) and \(J = Dep(\mathbf {g},\mathcal {S})\). Observe that by definition of *Dep*, *ω*_{1}∉*Types*(*T*_{2}) implies \(\mathbf {f} \not \in Dep(\mathbf {g},\mathcal {S})\) and *ω*_{2}∉*Types*(*T*_{1}) implies \(\mathbf {g} \not \in Dep(\mathbf {f},\mathcal {S})\). Hence, **f**∉*J* and **g**∉*I*. Moreover, by definition of \(f_{\mathcal {S}}\), it is always the case that \(\mathbf {f} \not \in f_{\mathcal {S}}(Dep(\mathbf {f},\mathcal {S})\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(Dep(\mathbf {g},\mathcal {S})\cup \{\mathbf {g}\})\). Therefore, \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\), which contradicts condition (2).

(3) ⇒(2) Again, the proof is by contraposition. So, assume that there are instances *I* and *J* and facts **f** and **g** where **f**∼_{Q}**g**, **g**∉*I* and **f**∉*J*, but \(\mathbf {f}\not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\). As **f**∼_{Q}**g**, we have *ω*_{1}∼_{Q}*ω*_{2} for *ω*_{1}=*type*(**f**) and *ω*_{2}=*type*(**g**). Then, by construction of \(f_{\mathcal {S}}\) there are \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\) with *type*(**f**)∈*Types*(*τ*_{1}) and *type*(**g**)∈*Types*(*τ*_{2}). Now, \(\mathbf {f}\not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\) implies \(Dep(\mathbf {f},\mathcal {S}) \subseteq I\) and \(Dep(\mathbf {g},\mathcal {S}) \subseteq J\). If we assume that *type*(**g**)∈*Types*(*T*_{1}) then \(\mathbf {g}\in Dep(\mathbf {f},\mathcal {S})\) (as **g**=*V*_{f,type(g)}(*atom*(*type*(**g**)))), and therefore **g**∈*I* which leads to a contradiction. Hence, *type*(**g**)∉*Types*(*T*_{1}). A similar argument shows that *type*(**f**)∉*Types*(*T*_{2}). So, we have found *ω*_{1}, *ω*_{2}, (*τ*_{1},*T*_{1}), and (*τ*_{2},*T*_{2}) contradicting condition (3). □

Notice that the OBFs of Example 6 are all correct for the given query.

Two partial atomic types *τ*_{1},*τ*_{2} are said to be *equal*, denoted *τ*_{1}=*τ*_{2}, when *T**y**p**e**s*(*τ*_{1})=*T**y**p**e**s*(*τ*_{2}). We say that a BDS \(\mathcal {S}\) is *harmonious* when every two partial types in \(\mathcal {S}\) are either disjoint or equal. That is, when for every two partial atomic types \(\tau _{1}, \tau _{2} \in \mathit {Keys}(\mathcal {S}) \cup \{\tau ^{\prime } \in T \mid T \in \mathit {Values}(\mathcal {S})\}\), either *τ*_{1}=*τ*_{2} or *T**y**p**e**s*(*τ*_{1})∩*T**y**p**e**s*(*τ*_{2})=*∅*.

**Theorem 1**

*Let Q be a CQ and let* \(\mathcal {S}\) *be a BDS for Q. Deciding whether* \(f_{\mathcal {S}}\) *is correct for Q is* coNP-complete, *and in* PTIME *when* \(\mathcal {S}\) *is harmonious.*

*Proof 6*

(coNP-completeness) When \(f_{\mathcal {S}}\) is not an OBF for *Q*, Lemma 5 guarantees there exists a polynomial-size certificate, consisting of two compatible (complete) atomic types *ω*_{1},*ω*_{2}, two partial atomic types *τ*_{1},*τ*_{2}, and two sets *T*_{1},*T*_{2}, witnessing that \(f_{\mathcal {S}}\) is not an OBF for *Q*, where \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), *ω*_{1}∈*Types*(*τ*_{1}), *ω*_{2}∈*Types*(*τ*_{2}), *ω*_{1}∉*Types*(*T*_{2}), and *ω*_{2}∉*Types*(*T*_{1}). As the foregoing test can be done in polynomial time, deciding whether \(f_{\mathcal {S}}\) is correct for *Q* is in coNP. Particularly notice that *τ*^{′}⊧*τ* is polynomial time verifiable, for arbitrary (partial) atomic types *τ*^{′},*τ*, by taking the union of conditions implied by *τ*^{′} and *τ*, computing the closure over variable equalities, and then checking for explicit contradictions.

For the hardness proof, we rely on a reduction from the well-known NP-complete problem 3-colorability, which asks, given a graph *G*, whether there is a color assignment for the nodes in *G* such that only three colors are used and no two adjacent nodes are assigned the same color.

Let *G*=(*N*_{G},*E*_{G}) be an input for the problem, and *m*=|*N*_{G}|.

In what follows we will represent *G* by a partial-atomic type *τ*_{P}, which takes a variable for each node in the graph and an inequality between every pair of variables corresponding to adjacent nodes in the graph. Particularly observe that every (valid) coloring for *G* yields a (complete) atomic type implying *τ*_{P}, and vice versa, every atomic type implying *τ*_{P} implies a valid coloring for *G*.

Consider the database schema *σ*={*P*^{(m)},*A*^{(m)}} and conjunctive query *Q* over *σ*. Let *β* be a bijection from the nodes in *N*_{G} onto the set of variables {*x*_{1},…,*x*_{m}}. Then, *τ*_{P} takes the form (*P*,(*E*,*I*)) for *Q*, where *E*=*∅*, and *I*={(*β*(*n*_{1}),*β*(*n*_{2}))∣(*n*_{1},*n*_{2})∈*E*_{G}}.

Now consider partial type *τ*_{A}=(*A*,(*∅*,*∅*)), and for every *i*,*j*,*k*,*l*, where 1≤*i*<*j*<*k*<*l*≤*m*, partial type *τ*_{i,j,k,l}=(*P*,(*E*,*I*)), where *E*=*∅*, and *I*={(*x*_{s},*x*_{t})∣*s*,*t*∈{*i*,*j*,*k*,*l*},*s*≠*t*}. Intuitively, *τ*_{i,j,k,l} represents all color assignments where four specified nodes (those related to *x*_{i},*x*_{j},*x*_{k},*x*_{l}) are assigned distinct colors. Let *T*={*τ*_{i,j,k,l}∣1≤*i*<*j*<*k*<*l*≤*m*}. Notice that \(|T| \in \mathcal {O}(m^{4})\), and that these types can be constructed one by one by simply enumerating the possible values for *i*,*j*,*k*, and *l*. Now, let \(\mathcal {S} = \{(\tau _{P}, \emptyset ), (\tau _{A}, T)\}\).

We claim that \(\mathcal {S}\) is a BDS for *Q*. Indeed, every key and every dependency in \(\mathcal {S}\) is a (consistent) partial atomic type for *Q*, every (complete) atomic type implies at most one of the partial atomic types in \(\mathit {Keys}(\mathcal {S})\), and \(\mathit {Vars}(\mathit {atom}(\tau _{i,j,k,l})) \subseteq _{\tau _{i,j,k,l}} \mathit {Vars}(\mathit {atom}(\tau _{A}))\) for all *i*,*j*,*k*,*l*.

To show that the reduction works, we need to argue that there is a mapping assigning to every node of *G* one out of three colors in such a way that no two adjacent nodes receive the same color, if and only if, \(f_{\mathcal {S}}\) is not an OBF for *Q*.

(⇒) Let *α* be an assignment mapping the nodes in *G* onto a set of colors {*a*,*b*,*c*}, such that the above mentioned conditions are satisfied. Notice that there is a (complete) atomic type encoding exactly this solution, namely, atomic type *ω*=(*P*,(*E*,*I*)), where *E*={(*x*_{i},*x*_{j})∣*i*,*j*∈[*m*],*α*(*β*^{−1}(*x*_{i}))=*α*(*β*^{−1}(*x*_{j}))}, and *I*=*X*×*X*∖*E*. In particular, *ω* implies *τ*_{P}, *ω* does not imply any of the partial types in *T*, and *ω* is compatible with *τ*_{A}. Then, indeed, by Lemma 5 it immediately follows that \(f_{\mathcal {S}}\) is not an OBF for *Q*.

(⇐) If no such assignment exists, we have to show that \(f_{\mathcal {S}}\) is an OBF for *Q*. For this, we make use of the fact that every (complete) atomic type *ω*, where *ω*⊧*τ*_{P}, encodes a color assignment for *G*. Because there is no three-color assignment, it must be that all of these assignments use at least four different colors. In particular, then every *ω* has at least four variables that are pairwise unequal, say *x*_{i},*x*_{j},*x*_{k},*x*_{l}, where 1≤*i*<*j*<*k*<*l*≤*m*. Thus, *ω* implies *τ*_{i,j,k,l}. Therefore, condition (3) of Lemma 5 is satisfied, implying \(f_{\mathcal {S}}\) to be an OBF for *Q*.
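The core of the reduction can be tried out on small graphs. The sketch below is a brute-force illustration, not the polynomial-time construction itself: nodes 0,…,*m*−1 stand for the variables *x*_{1},…,*x*_{m}, a coloring given as a tuple of color ids stands for a complete atomic type over *P*, and a witness is a coloring that implies *τ*_{P} but none of the types *τ*_{i,j,k,l}. All function names are hypothetical, and the search is exponential in *m*.

```python
from itertools import combinations, product


def build_reduction(num_nodes, edges):
    """Inequalities of tau_P (one per edge) and the index quadruples behind T."""
    tau_p_inequalities = {tuple(sorted(edge)) for edge in edges}
    quadruples = list(combinations(range(num_nodes), 4))
    return tau_p_inequalities, quadruples


def implies_tau_p(coloring, tau_p_inequalities):
    """Adjacent nodes receive different colors."""
    return all(coloring[a] != coloring[b] for a, b in tau_p_inequalities)


def implies_quadruple(coloring, quad):
    """The four selected nodes are pairwise differently colored."""
    return len({coloring[n] for n in quad}) == 4


def find_witness(num_nodes, edges):
    """Return a coloring implying tau_P and no tau_{i,j,k,l}, or None if there is none."""
    inequalities, quadruples = build_reduction(num_nodes, edges)
    for coloring in product(range(num_nodes), repeat=num_nodes):
        if implies_tau_p(coloring, inequalities) and not any(
                implies_quadruple(coloring, q) for q in quadruples):
            return coloring
    return None
```

A witness exists exactly when the graph admits a proper coloring with at most three colors: a triangle or a 5-cycle yields one, whereas the complete graph on four nodes does not.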

(PTIME for harmonious BDSs) When \(\mathcal {S}\) is harmonious, condition (3) of Lemma 5 reduces to the following condition (‡): there are no partial types *τ*_{1},*τ*_{2} and pairs \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\) with *τ*_{1}∼_{Q}*τ*_{2} such that none of the types in *T*_{2} equals *τ*_{1} and none of the types in *T*_{1} equals *τ*_{2}.

To verify whether \(f_{\mathcal {S}}\) is correct for harmonious BDS \(\mathcal {S}\), we thus have to verify condition (‡). For this, consider every pair of compatible partial atomic types \(\tau _{1},\tau _{2} \in \mathit {Keys}(\mathcal {S})\). Compatibility is polynomial time verifiable by taking the union of the conditions in both types and verifying if the resulting partial type is still consistent. Then, for every \(\tau _{1}^{\prime } \in T_{1}\) verify if \(\tau _{2} = \tau _{1}^{\prime }\), and for every \(\tau _{2}^{\prime } \in T_{2}\) verify if \(\tau ^{\prime }_{2} = \tau _{1}\). If none of these tests succeed, then *τ*_{1},*τ*_{2},*T*_{1},*T*_{2} form a proof that condition (‡) fails. Notice that equality of partial types can be checked in polynomial time in the size of *Q* by making both the implicit and explicit conditions of the type visible (by means of *E* and *I*) and by comparing these conditions. Eventually, if no proof against (‡) is found, (‡) is satisfied and thus \(f_{\mathcal {S}}\) is an OBF for *Q*. □

### 5.2 Local Optimality

Next, we turn to locally optimal OBFs. The following lemma provides equivalent semantic and syntactic conditions for an OBF to be locally optimal. Regarding condition (3), the intuition is as follows. While condition (3c) is the syntactic counterpart of condition (2), conditions (3a) and (3b) specify optimality requirements which are inherent to the formalism of BDS. More specifically, condition (3a) specifies that every atomic type implying a partial type in a dependency set in \(\mathcal {S}\) must also imply a key in \(\mathcal {S}\). Indeed, when an atomic type does not imply a key, every local fact of this type is always broadcast and therefore present at every computing node. The atomic type can therefore be removed from every dependency set it occurs in. When Condition (3b) fails for an atomic type *ω*, \(\mathcal {S}\) can be adapted to broadcast less while preserving correctness for *Q* by adding the pair \((\omega , \{\tau \mid \tau \sim _{Q} \omega , \tau \in Types(\mathit {Keys}(\mathcal {S}))\})\).

**Lemma 6**

*Let Q be a CQ,* \(\mathcal {S}\) *a BDS for Q, and* \(f_{\mathcal {S}}\) *an OBF for Q. The following are equivalent*:

1. \(f_{\mathcal {S}}\) *is locally optimal*;
2. *for every instance I and fact* **f** *for which* \(\textbf {f} \in f_{\mathcal {S}}(I\cup \{\textbf {f}\})\), *there is an instance J and a fact* **g** *such that* **f**∼_{Q}**g**, **g**∉*I*, **f**∉*J, and* \(\textbf {g} \not \in f_{\mathcal {S}}(J\cup \{\textbf {g}\})\); *and*,
3. \(\mathcal {S}\) *satisfies the following conditions*:
   - (a) *for* \((\tau , T) \in \mathcal {S}\) *and ω*∈*Types*(*T*), *ω*∼_{Q}*τ implies ω*⊧*τ*^{′} *for some* \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\);
   - (b) *for every* \(\omega \in \mathit {Types}(Q)\setminus \mathit {Types}(\mathit {Keys}(\mathcal {S}))\), *there is a partial atomic type* \(\tau _{1}\in \mathit {Keys}(\mathcal {S})\) *and an ω*_{1}∈*Types*(*τ*_{1}) *such that ω*∼_{Q}*ω*_{1} *and* \(\mathit {Vars}(\omega _{1}) \not \subseteq _{\omega _{1}} \mathit {Vars}(\omega )\); *and*
   - (c) *for* \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), *where ω*_{1}∈*Types*(*τ*_{1}), *ω*_{2}∈*Types*(*τ*_{2}), *and ω*_{1}∼_{Q}*ω*_{2}: *ω*_{1}∈*Types*(*T*_{2}) *implies ω*_{2}∉*Types*(*T*_{1}).

*Proof 7*

The equivalence between (1) and (2) follows from Lemma 4.

We show that (2) implies all three conditions in (3) separately.

(2) ⇒(3a) Let \((\tau , T) \in \mathcal {S}\) and *ω*∈*Types*(*T*). Choose **f** with *type*(**f**)∈*Types*(*τ*) and set *I*=*Dep*(**f**,*T*), **g**=*V*_{f,ω}(*atom*(*ω*)), so that **f** and **g** witness *τ*∼_{Q}*ω*. Further, let *I*^{′}=*I*∖{**g**}. By definition of \(f_{\mathcal {S}}\), \(\mathbf {f} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\), and \(\mathbf {f} \in f_{\mathcal {S}}(I^{\prime }\cup \{\mathbf {f}\})\). By condition (2), the latter implies that there is an instance *J* and a fact **h**, such that **h**∉*I*^{′}, **h**∼_{Q}**f**, **f**∉*J*, and \(\mathbf {h} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {h}\})\). Therefore, there must be a key \(\tau ^{\prime } \in \mathit {Keys}(\mathcal {S})\) with *type*(**h**)∈*Types*(*τ*^{′}). However, as \(f_{\mathcal {S}}\) is an OBF for *Q*, Lemma 1 implies that **h**∈*I*. So, it must be that **h**=**g**. Hence, *type*(**g**)=*ω*∈*Types*(*τ*^{′}).

(2) ⇒(3b) Let \(\omega \in \mathit {Types}(Q)\setminus \mathit {Types}(\mathit {Keys}(\mathcal {S}))\) and let **f** be a fact of type *ω*. By definition of \(f_{\mathcal {S}}\), \(\mathbf {f}\in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\) for every instance *I*. Let *I*_{1} be such an instance. By condition (2) there is a compatible fact **g**_{1} and instance *J*_{1}, where **g**_{1}∉*I*_{1}, **f**∉*J*_{1}, and \(\mathbf {g}_{1} \not \in f_{\mathcal {S}}(J_{1}\cup \{\mathbf {g}_{1}\})\). Now, consider *I*_{i}=*I*_{i−1}∪{**g**_{i−1}}, for *i*≥2. Then, \(\mathbf {f}\in f_{\mathcal {S}}(I_{i}\cup \{\mathbf {f}\})\) for *i*≥2. Again, by condition (2) there is a fact **g**_{i} and instance *J*_{i}, where **g**_{i}∼_{Q}**f**, **g**_{i}∉*I*_{i}, **f**∉*J*_{i}, and \(\mathbf {g}_{i} \not \in f_{\mathcal {S}}(J_{i}\cup \{\mathbf {g}_{i}\})\) for *i*≥2. In particular, **g**_{i}∉{**g**_{1},…,**g**_{i−1}}. As there are infinitely many such **g**_{i}, but only finitely many atomic types in *Types*(*Q*), there is a type *ω*_{1} such that *type*(**g**_{i})=*ω*_{1} for infinitely many *i*. Let *G*={**g**_{i}∣*i*≥1,*type*(**g**_{i})=*ω*_{1}}. As \(\mathbf {g}_{i} \not \in f_{\mathcal {S}}(J_{i}\cup \{\mathbf {g}_{i}\})\) for every **g**_{i}∈*G*, by definition of \(f_{\mathcal {S}}\), there is a \(\tau _{1}\in \mathit {Keys}(\mathcal {S})\) with *ω*_{1}∈*Types*(*τ*_{1}). Notice that *ω*∼_{Q}*ω*_{1} as **f**∼_{Q}**g** for all **g**∈*G*. Towards a contradiction, assume \(\mathit {Vars}(\omega _{1}) \subseteq _{\omega _{1}} \mathit {Vars}(\omega )\). But then, *Adom*(**g**)⊆*Adom*(**f**) for every **g**∈*G*, which cannot be as *G* is infinite while only finitely many facts can be built from the values in *Adom*(**f**). Therefore, \(\mathit {Vars}(\omega _{1}) \not \subseteq _{\omega _{1}} \mathit {Vars}(\omega )\).

(2) ⇒(3c) Let \((\tau _{1}, T_{1}), (\tau _{2}, T_{2}) \in \mathcal {S}\), with *ω*_{1}∈*Types*(*τ*_{1}), *ω*_{2}∈*Types*(*τ*_{2}), *ω*_{1}∼_{Q}*ω*_{2}, and *ω*_{1}∈*Types*(*T*_{2}). As *ω*_{1}∼_{Q}*ω*_{2} there are facts **g** and **f**, with *type*(**g**)=*ω*_{1}, *type*(**f**)=*ω*_{2}, and **g**∼_{Q}**f**. Then, **g**∈*Dep*(**f**,*T*_{2}) as *ω*_{1}∈*Types*(*T*_{2}). Towards a contradiction, assume *ω*_{2}∈*Types*(*T*_{1}) which implies **f**∈*Dep*(**g**,*T*_{1}). Let *I*=*Dep*(**g**,*T*_{1}) and *I*^{′}=*I*∖{**f**}. Then, \(\mathbf {g} \not \in f_{\mathcal {S}}(I\cup \{\mathbf {g}\})\) and \(\mathbf {g} \in f_{\mathcal {S}}(I^{\prime }\cup \{\mathbf {g}\})\). By condition (2), there is a fact **h** and instance *J*, where **h**∉*I*^{′}, \(\mathbf {h} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {h}\})\), and **g**∉*J*. By Lemma 1, however, it must be that **h**∈*I*. So, **h**=**f**, which implies that \(\mathbf {f} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {f}\})\). But then, by definition of \(f_{\mathcal {S}}\), \(Dep(\mathbf {f},\mathcal {S}) \subseteq J\) and thus, as **g**∉*J*, \(\mathbf {g} \not \in Dep(\mathbf {f},\mathcal {S})\), which contradicts **g**∈*Dep*(**f**,*T*_{2}).

(3) ⇒(2) Let **f** be a fact and *I* an instance, with \(\mathbf {f} \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\). This means that **f** is broadcast. We make a distinction between two cases: (1) the type of **f** is in \(\mathcal {S}\) but not all the necessary facts in \(Dep(\mathbf {f},\mathcal {S})\) are present, and (2) the type of **f** is not in \(\mathcal {S}\).

- Case 1: Suppose there is a pair \((\tau , T) \in \mathcal {S}\) with *type*(**f**)∈*Types*(*τ*). Then, by definition of \(f_{\mathcal {S}}\) it must be that \(Dep(\mathbf {f},\mathcal {S}) \not \subseteq I\). In particular, there is a fact \(\mathbf {g} \in Dep(\mathbf {f},\mathcal {S}) \setminus I\), where *type*(**g**)∈*Types*(*T*). Notice that **f**∼_{Q}**g** because of the definition of \(Dep(\mathbf {f},\mathcal {S})\) and \(\mathcal {S}\). By condition (3a) there is a pair \((\tau _{2}, T_{2}) \in \mathcal {S}\) such that *type*(**g**)∈*Types*(*τ*_{2}). Because *type*(**g**)∈*Types*(*T*), and *τ*_{2}∼_{Q}*τ* (by **g**∼_{Q}**f**), condition (3c) implies that *type*(**f**)∉*Types*(*T*_{2}). Now, let \(J = Dep(\mathbf {g},\mathcal {S})\). Then, **f**∉*J* and \(\mathbf {g} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}\})\). So, facts **f**, **g** and instances *I* and *J* are as required by condition (2).
- Case 2: Suppose that \(\mathit {type}(\mathbf {f})\not \in \mathit {Types}(\mathit {Keys}(\mathcal {S}))\). Then, condition (3b) implies that there is a pair \((\tau _{1}, T_{1}) \in \mathcal {S}\) and atomic type *ω*_{1}∈*Types*(*τ*_{1}), where *ω*_{1}∼_{Q}*type*(**f**) and \(\mathit {Vars}(\omega _{1}) \not \subseteq _{\omega _{1}} \mathit {Vars}(\mathit {type}(\mathbf {f}))\). As *ω*_{1}∼_{Q}*type*(**f**), there is a fact **g**^{′} such that **g**^{′}∼_{Q}**f** and *type*(**g**^{′})=*ω*_{1}. Because \(\mathit {Vars}(\omega _{1}) \not \subseteq _{\omega _{1}} \mathit {Vars}(\mathit {type}(\mathbf {f}))\), there must be a variable, say *z*, in *Vars*(*ω*_{1}) that does not equal any of the variables in *Vars*(*type*(**f**)) according to the conditions in atomic type *ω*_{1}. That is, for no variable *x*∈*Vars*(*type*(**f**)), *ω*_{1}⊧*x*=*z*. Define *Z*={*y*∣*ω*_{1}⊧*y*=*z*} as the set of variables equal to *z* according to *ω*_{1}. Let for every *u*∈**dom**∖(*Adom*(**f**)∪*Adom*(**g**^{′})), *V*_{u} be the mapping where *V*_{u}(*x*)=*V*_{f}(*x*) for every *x*∈*Vars*(*type*(**f**)), \(V_{u}(x) = V_{\mathbf {g}^{\prime }}(x)\) for every *x*∈*Vars*(*type*(**g**^{′}))∖*Z*, and *V*_{u}(*x*)=*u* for every *x*∈*Z*. Notice that the above is well defined, because compatibility between **f** and **g**^{′} ensures that \(V_{\mathbf {g}^{\prime }}(x) = V_{\mathbf {f}}(x)\) for every shared variable. Now, every *V*_{u} induces a fact **g**_{u}=*V*_{u}(*atom*(*ω*_{1})) which has atomic type *ω*_{1} and is compatible with **f**. Further, \(\mathbf {g}_{u} \ne \mathbf {g}_{u^{\prime }}\) for distinct *u* and *u*^{′}. By the presence of (*τ*_{1},*T*_{1}) in \(\mathcal {S}\), and the definition of \(f_{\mathcal {S}}\), \(\mathbf {g}_{u} \not \in f_{\mathcal {S}}(Dep(\mathbf {g}_{u},T_{1})\cup \{\mathbf {g}_{u}\})\). In particular, condition (3a) implies *type*(**f**)∉*Types*(*T*_{1}) (because otherwise *type*(**f**) must be in \(\mathit {Types}(\mathit {Keys}(\mathcal {S}))\), which is a contradiction). Thus, **f**∉*Dep*(**g**_{u},*T*_{1}). As there are infinitely many such values *u*, for every finite instance *I* there is a *u* for which **g**_{u}∉*I*.

Hence, for every *I* where \(\mathbf {f} \in f_{\mathcal {S}}(I\cup \{\mathbf {f}\})\), there is indeed a fact **g**_{u} and instance *J*=*Dep*(**g**_{u},*T*_{1}), where **g**_{u}∉*I*, **f**∉*J*, and \(\mathbf {g}_{u} \not \in f_{\mathcal {S}}(J\cup \{\mathbf {g}_{u}\})\) as requested by condition (2).

Deciding whether \(f_{\mathcal {S}}\) is locally optimal for an arbitrarily given BDS \(\mathcal {S}\) turns out to be hard (cf. Theorem 2). Therefore, we also consider the special case of open BDSs. We say that a partial type *φ*=(*E*,*I*) is *open* when it enforces no restrictions, that is, when *E*=*I*=*∅*. A partial atomic type (*R*,*φ*) is *open* when *φ* is. We say that a BDS \(\mathcal {S}\) is *open* when it only contains open partial atomic types. Notice that a BDS that is open is also harmonious (but not vice versa).

Similarly to Theorem 1, we have the following decidability result for locally optimal OBFs.

**Theorem 2**

*Let Q be a CQ and let* \(\mathcal {S}\) *be a BDS for Q for which* \(f_{\mathcal {S}}\) *is correct for Q. Deciding whether* \(f_{\mathcal {S}}\) *is locally optimal is in* coNP*, and in* PTIME *when* \(\mathcal {S}\) *is open.*

*Proof 8*

\(Q\) is not locally optimal, where \(f_{\mathcal{S}}\) is correct for \(Q\), is easy when given the right gadgets. For these gadgets we rely on Lemma 6, which states that either,

- there is an atomic type \(\omega\), partial atomic type \(\tau\), and set of partial atomic types \(T\), where \((\tau, T) \in \mathcal{S}\), \(\omega \in Types(T)\), \(\omega \thicksim_Q \tau\), and for none of the keys \(\tau' \in \mathit{Keys}(\mathcal{S})\), \(\omega \models \tau'\);
- there is an atomic type \(\omega\), where \(\omega \in Types(Q) \setminus Types(\mathit{Keys}(\mathcal{S}))\), where for every \(\omega_1 \models \tau_1\), where \(\tau_1 \in \mathit{Keys}(\mathcal{S})\), and \(\omega \thicksim_Q \omega_1\), \(Vars(\omega_1) \subseteq_{\omega_1} Vars(\omega)\); or,
- there are atomic types \(\omega_1, \omega_2\), partial atomic types \(\tau_1, \tau_2\), and sets of partial atomic types \(T_1, T_2\), where \((\tau_1, T_1), (\tau_2, T_2) \in \mathcal{S}\), \(\omega_1 \models \tau_1\), \(\omega_2 \models \tau_2\), \(\omega_1 \thicksim_Q \omega_2\), and both \(\omega_1 \in Types(T_2)\) and \(\omega_2 \in Types(T_1)\).

- 1. \((\tau, T) \in \mathcal{S}\) and \(\tau' \in T\), where \(\tau \thicksim_Q \tau'\) implies \(\tau' \in \mathit{Keys}(\mathcal{S})\);
- 2. for every open partial type \(\tau\) not in \(\mathit{Keys}(\mathcal{S})\), there is a \(\tau_1 \in \mathit{Keys}(\mathcal{S})\), where \(\tau \thicksim_Q \tau_1\) and \(Vars(\tau_1) \subseteq_{\tau_1} Vars(\tau)\); and
- 3. \((\tau_1, T_1), (\tau_2, T_2) \in \mathcal{S}\), \(\tau_1 \thicksim_Q \tau_2\), \(\tau_1 \in T_2\) implies \(\tau_2 \notin T_1\).

\(Q\) and \(\mathcal{S}\). □

It remains open, though, whether deciding local optimality is coNP-complete or in PTIME (even for harmonious BDSs). For harmonious BDSs, conditions (3a) and (3c) of Lemma 6 are verifiable in polynomial time.

Next, we show that every locally optimal OBF can be represented by a BDS, thereby obtaining that BDSs (satisfying the conditions in Lemma 6) are a complete representation of locally optimal OBFs. Let \(Q\) be a CQ and let \(f\) be an OBF for \(Q\). We call a fact \(\mathbf{f}\) *semi-static* for \(f\) when there is an atomic type \(\omega\) and an instance \(I\) such that \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\) and \(\mathit{type}(\mathbf{f}) = \omega\). That is, \(\mathbf{f}\) has an atomic type and there is an instance for which \(\mathbf{f}\) is not broadcast. We say that a semi-static fact \(\mathbf{f}\) (for \(f\)) *depends* on a fact \(\mathbf{g}\) when, for every instance \(I\), \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\) implies \(\mathbf{g} \in I\). With every semi-static fact \(\mathbf{f}\), we associate the set \(D_{\mathbf{f}}\) containing exactly all facts on which \(\mathbf{f}\) depends. Thus, \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\) implies \(D_{\mathbf{f}} \subseteq I\).

We make use of the following lemma in the proof of Theorem 3.

**Lemma 7**

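*Let Q be a CQ, and f be a locally optimal OBF for Q. Let* \(\mathbf{f}\) *be semi-static for f. Then,* \(\mathbf{f} \notin f(D_{\mathbf{f}} \cup \{\mathbf{f}\})\). *Furthermore,* \(\mathbf{g} \in D_{\mathbf{f}}\) *implies*

- 1. \(\mathbf{g}\) *is semi-static and* \(\mathbf{g} \thicksim_Q \mathbf{f}\);
- 2. \(\mathit{Adom}(\mathbf{g}) \subseteq \mathit{Adom}(\mathbf{f})\);
- 3. \(Vars(\mathit{atom}(\mathbf{g})) \subseteq_{\mathit{type}(\mathbf{g})} Vars(\mathit{atom}(\mathbf{f}))\); *and*
- 4. \(\mathbf{g} = V_{\mathbf{f}, \mathit{type}(\mathbf{g})}(\mathit{atom}(\mathbf{g}))\).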

*Proof 9*

Before going to the actual proof, we first show the following auxiliary result:

**Lemma 8**

*If* \(\mathbf{f} \thicksim_Q \mathbf{g}\) *and both are semi-static for f, then* \(\mathbf{f}\) *depends on* \(\mathbf{g}\) *or* \(\mathbf{g}\) *depends on* \(\mathbf{f}\).

*Proof 10*

Assume towards a contradiction that both dependencies fail. Then, as \(\mathbf{f}\) and \(\mathbf{g}\) are semi-static, there is an instance \(I\) such that \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\) and \(\mathbf{g} \notin I\), and an instance \(J\) such that \(\mathbf{g} \notin f(J \cup \{\mathbf{g}\})\) and \(\mathbf{f} \notin J\). But then, by Lemma 1, \(\mathbf{f}\), \(\mathbf{g}\), \(I\), and \(J\) contradict \(f\) being an OBF for \(Q\). □

Next, we argue **f**∉*f*(*D*_{f}∪{**f**}). Towards a contradiction suppose **f**∈*f*(*D*_{f}∪{**f**}). Then, by Lemma 4 there must be some fact **h** and instance *H*, where **h**∼_{Q}**f**, **h**∉*D*_{f}, **f**∉*H*, and **h**∉*f*(*H*∪{**h**}). Because **f** is semi-static and **h**∉*D*_{f}, there must be some instance *J*, where **h**∉*J* and **f**∉*f*(*J*∪{**f**}). So, by Lemma 1, we have found **h**,**f**,*J*,*H* contradicting *f* being an OBF for *Q*.

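For (1) let \(I = D_{\mathbf{f}}\). Because \(\mathbf{f}\) is semi-static, by the above, \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\). Further, \(\mathbf{g} \in D_{\mathbf{f}}\) implies \(\mathbf{f} \in f(I \cup \{\mathbf{f}\} \setminus \{\mathbf{g}\})\). Then, by local optimality of \(f\) and Lemma 4, there is an instance \(H\) and a fact \(\mathbf{h}\), such that \(\mathbf{h} \notin f(H \cup \{\mathbf{h}\})\), \(\mathbf{h} \thicksim_Q \mathbf{f}\), \(\mathbf{h} \notin I \setminus \{\mathbf{g}\}\), and \(\mathbf{f} \notin H\). However, by Lemma 1, \(\mathbf{h} \in I\), implying \(\mathbf{h} = \mathbf{g}\). So, indeed, \(\mathbf{g}\) is compatible with \(\mathbf{f}\), and there is an instance for which \(\mathbf{g}\) is not broadcast.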

For (2), towards a contradiction suppose \(\mathit{Adom}(\mathbf{g}) \not\subseteq \mathit{Adom}(\mathbf{f})\), implying a value \(a \in \mathit{Adom}(\mathbf{g})\) which is not in \(\mathit{Adom}(\mathbf{f})\). Because \(\mathbf{f}\) is semi-static, there must be an instance \(J\), where \(\mathbf{f} \notin f(J \cup \{\mathbf{f}\})\). Now, let \(\pi\) be the permutation over \(\mathbf{dom}\) that maps \(a\) onto \(u\) (where \(u \in \mathbf{dom} \setminus \mathit{Adom}(J \cup \{\mathbf{f}\})\)), \(u\) onto \(a\), and is the identity for every other value. Notice that by construction, \(\pi(\mathbf{f}) = \mathbf{f}\), and \(\mathbf{g} \notin \pi(J \cup \{\mathbf{f}\})\). Then, by genericity of \(f\), \(\mathbf{f} \notin f(\pi(J) \cup \{\mathbf{f}\})\), implying \(D_{\mathbf{f}} \subseteq \pi(J)\), which contradicts the assumption that \(\mathbf{g} \in D_{\mathbf{f}}\). Thus, indeed \(\mathit{Adom}(\mathbf{g}) \subseteq \mathit{Adom}(\mathbf{f})\).

For (3), again towards a contradiction, suppose that \(Vars(\mathit{atom}(\mathbf{g})) \not\subseteq_{\mathit{type}(\mathbf{g})} Vars(\mathit{atom}(\mathbf{f}))\). So, there is a variable \(z \in Vars(\mathit{atom}(\mathbf{g})) \setminus Vars(\mathit{atom}(\mathbf{f}))\), and no variable \(y \in Vars(\mathit{atom}(\mathbf{g})) \cap Vars(\mathit{atom}(\mathbf{f}))\) exists for which \(V_{\mathbf{g}}(z) = V_{\mathbf{g}}(y)\). Recall that \(V_{\mathbf{g}}\) denotes the partial valuation implied by \(\mathbf{g}\) for \(\mathit{atom}(\mathbf{g})\). Let \(Z\) be the set of variables \(z'\) in \(Vars(\mathit{atom}(\mathbf{g}))\), where \(V_{\mathbf{g}}(z') = V_{\mathbf{g}}(z)\). Notice \(Z \cap Vars(\mathit{atom}(\mathbf{f})) = \emptyset\). Now, let \(u \in \mathbf{dom} \setminus \mathit{Adom}(\{\mathbf{f}\} \cup \{\mathbf{g}\})\). Consider the mapping \(V\), where \(V(x) = V_{\mathbf{f}}(x)\) for every \(x \in Vars(\mathit{atom}(\mathbf{g})) \cap Vars(\mathit{atom}(\mathbf{f}))\). Notice that by compatibility and (1): \(V(x) = V_{\mathbf{g}}(x)\) as well. Further, \(V(x) = V_{\mathbf{g}}(x)\) for every \(x \in Vars(\mathit{atom}(\mathbf{g})) \setminus (Vars(\mathit{atom}(\mathbf{f})) \cup Z)\), and \(V(z) = u\) for every \(z \in Z\). Notice that \(\mathbf{g}' = V(\mathit{atom}(\mathbf{g}))\) is compatible with \(\mathbf{f}\). So, because \(\mathbf{g} \in D_{\mathbf{f}}\), implying that \(\mathbf{g}\) is semi-static by (1), by genericity \(\mathbf{g}'\) is also semi-static for \(f\). By construction, \(\mathit{Adom}(\mathbf{g}') \not\subseteq \mathit{Adom}(\mathbf{f})\), implying \(\mathbf{g}' \notin D_{\mathbf{f}}\). So, by Lemma 8 it must be that \(\mathbf{f} \in D_{\mathbf{g}'}\). The latter implies \(\mathit{Adom}(\mathbf{f}) \subseteq \mathit{Adom}(\mathbf{g}')\). In particular, because \(u \notin \mathit{Adom}(\mathbf{f})\), we actually have \(\mathit{Adom}(\mathbf{f}) \subsetneq \mathit{Adom}(\mathbf{g}')\). However, \(\mathbf{g} \in D_{\mathbf{f}}\) implies \(\mathit{Adom}(\mathbf{g}) \subseteq \mathit{Adom}(\mathbf{f})\), and because \(\mathbf{g}\) and \(\mathbf{g}'\) have the same type, \(|\mathit{Adom}(\mathbf{g})| = |\mathit{Adom}(\mathbf{g}')|\), which is a contradiction.

Item (4) follows immediately from (1), (3) and the definition of *V*_{f,type(g)}. □

We are now ready to prove completeness. The proof of the following theorem shows that the formalism of BDS that only uses complete atomic types can already represent every locally optimal OBF.

**Theorem 3** (Completeness)

*Let Q be a CQ and f a locally optimal OBF for Q. Then, there is a BDS* \(\mathcal {S}\) *for Q such that* \(f=f_{\mathcal {S}}\).

*Proof 11*

We start by noting that if \(\mathbf{f}\) is semi-static for \(f\), then every \(\mathbf{g}\) with \(\mathit{type}(\mathbf{f}) = \mathit{type}(\mathbf{g})\) is semi-static for \(f\) as well. Therefore, we say that an atomic type \(\tau\) is semi-static for \(f\) when there is a semi-static fact \(\mathbf{f}\) with \(\mathit{type}(\mathbf{f}) = \tau\). The proof is by construction. Let \(\mathcal{S}\) be the set of pairs \((\tau, D_\tau)\) where \(\tau\) is semi-static for \(f\) and \(D_\tau = Types(D_{\mathbf{f}})\), where \(\mathbf{f}\) is a fact with atomic type \(\tau\).

We first show that \(\mathcal{S}\) is a BDS and then that \(f = f_{\mathcal{S}}\). Notice that \(\mathcal{S}\) has only finitely many pairs, because there are only finitely many distinct atomic types, and every set in \(\mathit{Values}(\mathcal{S})\) is finite by construction. Let \((\tau, T) \in \mathcal{S}\), and \(\tau' \in T\). By construction of \(\mathcal{S}\), \(\tau\) is a semi-static atomic type for \(f\) and for every atomic type \(\tau\) there is at most one pair \((\tau, T) \in \mathcal{S}\). Furthermore, \(T = D_\tau\). Let \(\mathbf{f}\) be a fact of type \(\tau\). Then, \(\mathbf{f}\) is a semi-static fact for \(f\) and there is a \(\mathbf{g} \in D_{\mathbf{f}}\), such that \(\mathit{type}(\mathbf{g}) = \tau'\). By Lemma 7(3), \(Vars(\mathit{atom}(\tau')) = Vars(\mathit{atom}(\mathbf{g})) \subseteq_{\mathit{type}(\mathbf{g})} Vars(\mathit{atom}(\mathbf{f})) = Vars(\mathit{atom}(\tau))\). So, \(\mathcal{S}\) is a broadcast dependency set for the query \(Q\).

Next, we show that \(f = f_{\mathcal{S}}\). For this, we assume \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\) (which is argued below) and show that \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\) iff \(\mathbf{f} \notin f_{\mathcal{S}}(I \cup \{\mathbf{f}\})\).

Let \(\mathbf{f}\) be a fact and \(I\) an instance, such that \(\mathbf{f} \notin f(I \cup \{\mathbf{f}\})\). If \(\mathbf{f}\) has no atomic type, then it is never broadcast by \(f_{\mathcal{S}}\). So, assume \(\mathbf{f}\) has an atomic type. Then it must be that \(D_{\mathbf{f}} \subseteq I\). Because \((\mathit{type}(\mathbf{f}), D_{\mathit{type}(\mathbf{f})}) \in \mathcal{S}\) and \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), also \(Dep(\mathbf{f}, \mathcal{S}) \subseteq I\). Hence, by definition of \(f_{\mathcal{S}}\), \(\mathbf{f} \notin f_{\mathcal{S}}(I \cup \{\mathbf{f}\})\).

For a fact \(\mathbf{f}\) and instance \(I\), where \(\mathbf{f} \in f(I \cup \{\mathbf{f}\})\), Lemma 4 implies that \(\mathbf{f}\) has an atomic type. Either \(\mathbf{f}\) is always broadcast by \(f\), or it is semi-static for \(f\). The former implies that there is no pair in \(\mathcal{S}\) of the form \((\mathit{type}(\mathbf{f}), T)\). So, \(\mathbf{f}\) is broadcast by \(f_{\mathcal{S}}\) as well. The latter implies by Lemma 7 that \(D_{\mathbf{f}} \not\subseteq I\) and there is a pair \((\mathit{type}(\mathbf{f}), D_{\mathit{type}(\mathbf{f})}) \in \mathcal{S}\). In particular, because \(\mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})}) = D_{\mathbf{f}}\), we have \(\mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})}) \not\subseteq I\), which implies that \(\mathbf{f} \in f_{\mathcal{S}}(I \cup \{\mathbf{f}\})\).

It remains to show that \(D_{\mathbf{f}} = \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\). For \(\mathbf{g} \in D_{\mathbf{f}}\), which implies \(\mathit{type}(\mathbf{g}) \in D_{\mathit{type}(\mathbf{f})}\), it follows by Lemma 7(4) that \(\mathbf{g} \in \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\). For the reverse direction, let \(\mathbf{g} \in \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), which implies \(\mathit{type}(\mathbf{g}) \in D_{\mathit{type}(\mathbf{f})}\). So, there must be some fact \(\mathbf{g}'\), which is of the same type as \(\mathbf{g}\), in \(D_{\mathbf{f}}\). In particular, because \(D_{\mathbf{f}} \subseteq \mathit{Dep}(\mathbf{f}, D_{\mathit{type}(\mathbf{f})})\), \(\mathbf{g}' = V_{\mathbf{f}, \mathit{type}(\mathbf{g}')}(\mathit{atom}(\mathbf{g}'))\). However, because \(\mathbf{g} = V_{\mathbf{f}, \mathit{type}(\mathbf{g})}(\mathit{atom}(\mathbf{g}))\), \(\mathit{atom}(\mathbf{g}) = \mathit{atom}(\mathbf{g}')\), and \(\mathit{type}(\mathbf{g}') = \mathit{type}(\mathbf{g})\), it must be that \(\mathbf{g} = \mathbf{g}'\). So, indeed \(\mathbf{g} \in D_{\mathbf{f}}\). □

*Remark 2*

The reader may wonder if a similar result exists for OBSs that are not necessarily locally optimal. Then, however, the behaviour of OBS is much less predictable and the BDS formalism falls short. For an example, recall query *Q*_{1} and OBF \(f_{\mathcal {S}}\) from Example 6(1). Now let *f* be the OBF defined by \(f(I) = f_{\mathcal {S}}(I)\) if |*I*| is even, and *f*(*I*)=*I* if |*I*| is odd. OBF *f* is clearly correct for *Q*_{1} (because \(f_{\mathcal {S}}(I) \subseteq I\)), but cannot be simulated through a BDS.

## 6 Algorithms for Constructing a BDS

\(Q\) by simply starting from \(S = \emptyset\) and adding new pairs in a one by one fashion till no more pairs can be added. More formally, we introduce the algorithm BDS-BUILD, given in Algorithm 1. As there are exponentially many (in the size of \(Q\)) partial atomic types, we parameterize BDS-BUILD by a sequence \(\mathcal {R}\) of partial atomic types.^{4} The algorithm then produces a set of pairs \((\tau, T) \in PTypes(Q) \times 2^{PTypes(Q)}\).
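The greedy construction described above can be sketched in Python for the open-type case. This is a minimal sketch rather than Algorithm 1 itself: it follows the loop structure used in the proof of Theorem 4 and in Example 7(1). The names `bds_build_open`, `Query`, and `BDS` are hypothetical, an open type is identified with its predicate name, and the sketch assumes that open types of distinct predicates are always compatible (a common valuation always exists for a query without self-joins).

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Hypothetical encoding: the query maps each predicate to the variables of
# its unique atom; an open type is identified by its predicate name.
Query = Dict[str, Tuple[str, ...]]
BDS = List[Tuple[str, FrozenSet[str]]]


def bds_build_open(query: Query, order: Sequence[str]) -> BDS:
    """Greedy construction over open types, processed in the given order."""
    bds: BDS = []
    for pred in order:
        keys = [key for key, _ in bds]
        if pred in keys:
            continue  # every type is considered at most once
        # Assumption: every existing key is compatible with the new type.
        candidates = keys
        # For open types, variable containment is plain set inclusion.
        if all(set(query[k]) <= set(query[pred]) for k in candidates):
            bds.append((pred, frozenset(candidates)))
    return bds


# Query of Example 7: Q(x,y,z,w) <- A(x,y,z), B(x,y,z), C(z,w).
example_query: Query = {
    "A": ("x", "y", "z"),
    "B": ("x", "y", "z"),
    "C": ("z", "w"),
}
result_abc = bds_build_open(example_query, ["A", "B", "C"])
result_cab = bds_build_open(example_query, ["C", "A", "B"])
```

With the order (A, B, C) the sketch yields the pairs for A and B only, as in Example 7(1); with the order (C, A, B) only the pair for C is produced, which illustrates how the order of the sequence influences the structure of the result.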

The following theorem obtains the correctness of BDS-BUILD. The complexity follows directly from the size of \(\mathcal {R}\) which is polynomial in the size of *Q* for open types and exponential for complete types.
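To make the size difference between the two kinds of sequences concrete, the following sketch counts both for a given query. It assumes that a complete atomic type of an atom corresponds to a partition of the atom's variables into equality classes, which matches the types listed in Example 7(2); the function names and the dictionary encoding of the query are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def partitions(items: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + smaller


def count_complete_types(query: Dict[str, Tuple[str, ...]]) -> int:
    """Number of (predicate, partition) pairs over all atoms of the query."""
    total = 0
    for variables in query.values():
        distinct = list(dict.fromkeys(variables))  # drop repeated variables
        total += sum(1 for _ in partitions(distinct))
    return total


# Query of Example 7: Q(x,y,z,w) <- A(x,y,z), B(x,y,z), C(z,w).
example_query = {"A": ("x", "y", "z"), "B": ("x", "y", "z"), "C": ("z", "w")}
num_open = len(example_query)                        # one open type per atom
num_complete = count_complete_types(example_query)   # partitions per atom, summed
```

For the query of Example 7 this gives three open types against twelve complete types (five each for A and B, two for C); the gap widens quickly as the number of variables per atom grows.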

**Theorem 4**

*For a conjunctive query Q and a sequence* \(\mathcal {R}\) *consisting of exactly the complete (respectively, open) types,* BDS-BUILD(*Q*) *computes a BDS S for Q in time exponential (respectively, polynomial) in the size of Q such that f*_{S} *is correct for Q and locally optimal.*

*Proof 12*

We show that (1) the complexity of BDS-BUILD(*Q*) is in time polynomial in the size of \(\mathcal {R}\) and *Q*, (2) BDS-BUILD(*Q*) computes a BDS *S* for *Q*, (3) *S* is correct for *Q*, and (4) *f*_{S} is locally optimal. Because there are exponentially many complete types (in the size of *Q*), and only polynomially many open types (in the size of *Q*), (1) implies the complexity claims of the theorem.

For (1), as every partial atomic type has a size that is polynomial in the size of *Q*, verifying variable containment and adding a pair to *S* can be done in polynomial time in *Q*.

These actions are repeated for iterations of the inner and outer loop, which iterate over every key in the partially constructed set *S*, and over every element of \(\mathcal {R}\), respectively. By construction, *S* can have at most \(|\mathcal {R}|\) keys, implying that both loops together perform at most \(|{\mathcal {R}}|^2\) iterations, which confirms the complexity of BDS-BUILD(*Q*) to be in time polynomial in the size of \(\mathcal {R}\) and *Q*.

For (2) and (3), observe that both conditions are satisfied when *S*=*∅*. Indeed, because there are no pairs in *S*, *S* is a BDS for *Q*, and every fact that can contribute to a satisfying valuation for *Q* is broadcast by *f*_{S}.

Next, we argue that (2) and (3) remain satisfied during each step of the outer loop. Because \(\mathcal {R}\) contains exactly the complete (respectively, open) types, every partial type that is considered in the outer loop is disjoint from every partial type considered before, implying that conditions (1) and (2) of Definition 5 remain satisfied during each iteration. Further, as only pairs \((\tau, T)\) are added to \(S\), where \(\mathit{Vars}(\tau') \subseteq_{\tau'} \mathit{Vars}(\tau)\) is satisfied for every \(\tau' \in T\), condition (3) of Definition 5 remains valid as well. For correctness, observe that every \(\tau' \in \mathit{Keys}(S)\), where \(\tau' \thicksim_Q \tau\), is added to \(T\), implying that condition (3) of Lemma 5 remains satisfied.

It remains to argue (4). We distinguish between the case of complete and open types.

For \(\mathcal {R}\) consisting of complete types, condition (3a) of Lemma 6 is satisfied, because only atomic types that are already a key are considered as a value, and because keys are never removed during the construction. Condition (3b) is satisfied because every atomic type \(\omega\) for \(Q\) is in \(\mathcal {R}\), and either \(\omega\) is added to \(S\) as a key, or it is not added because there is already a compatible atomic type \(\omega_1\) in \(S\), for which \(\mathit{Vars}(\omega_1) \not\subseteq_{\omega_1} Vars(\omega)\). So, again because keys are never removed during the construction, condition (3b) is satisfied. For condition (3c) it suffices to observe that value sets do not change during the construction of \(S\). Therefore, for \((\tau_1, T_1), (\tau_2, T_2) \in S\), \(\tau_2 \in T_1\) implies that \(\tau_2\) was already a key when \((\tau_1, T_1)\) was added, and thus \(\tau_1\) was not a key when \((\tau_2, T_2)\) was added, implying \(\tau_1 \notin T_2\).

For \(\mathcal {R}\) consisting of open types the proof is analogous. Every complete atomic type that implies an open type in *S* is added as a key during the construction, implying that condition (3a) holds.

For every complete atomic type \(\omega\), if \(\omega\) implies no key in \(S\), then the open type \(\tau\) for \(\mathit{pred}(\omega)\) must have been excluded from \(S\), implying that there is a key \(\tau' \in \mathit{Keys}(S)\), where \(\mathit{Vars}(\tau') \not\subseteq_{\tau'} \mathit{Vars}(\tau)\). Because \(\tau'\) itself must be open, and \(Q\) is a CQ, there must be some atomic type \(\omega' \models \tau'\) such that \(\omega' \thicksim_Q \omega\). The latter then implies \(\mathit{Vars}(\omega') \not\subseteq_{\omega'} \mathit{Vars}(\omega)\).

Condition (3c) is satisfied, because for every \((\tau_1, T_1), (\tau_2, T_2) \in S\), where \(\omega_1 \models \tau_1\), \(\omega_2 \models \tau_2\), and \(\omega_1 \thicksim_Q \omega_2\), \(\omega_1 \in Types(T_2)\) implies that \(\tau_1 \in T_2\). So, \(\tau_1\) was already a key in \(S\) before \(\tau_2\) was added. Thus, \(\tau_2 \notin T_1\). The result then follows because for two distinct open types \(\tau, \tau'\), \(Types(\tau)\) and \(Types(\tau')\) are always disjoint. □

Notice that, on arbitrary (not necessarily complete) sequences of partial atomic types, the above algorithm outputs BDSs that are correct but not necessarily locally optimal for the given query. Further notice that the correctness and local-optimality of the BDS returned by BDS-BUILD is independent of the order in which types are fed to the algorithm, but that the order can influence its structure and thus the behaviour of the OBF that it describes.

*Example 7*

We illustrate BDS-BUILD by means of an example.

$$Q(x, y, z, w) \leftarrow A(x, y, z), B(x, y, z), C(z, w).$$

- (1) **Open types**. Observe that query \(Q\) has three open types, being \(\tau_A = (A, \mathit{true})\), \(\tau_B = (B, \mathit{true})\), and \(\tau_C = (C, \mathit{true})\). Let \({\mathcal {R}} = (\tau _A, \tau _B, \tau _C)\). Then, BDS-BUILD computes a BDS by starting from \(S = \emptyset\), expanding \(S\) to \(\{(\tau_A, \emptyset)\}\) in the first iteration and to \(\{(\tau_A, \emptyset), (\tau_B, \{\tau_A\})\}\) in the second iteration. During the last iteration, \(S\) is not changed anymore, because \(\mathit{Vars}(\tau_A) \not\subseteq_{\tau_A} \mathit{Vars}(\tau_C)\).
- (2) **Complete types**. The (complete) atomic types for \(Q\) are

$$\begin{array}{@{}ll@{}} \tau_X^{\ne} = (X, x\ne y\land y\ne z \land x \ne z), & \tau_X^{x=z} = (X, x=z\land z\ne y\land y\ne z), \\ \tau_X^{x=y} = (X, x=y\land x\ne z\land y\ne z), & \tau_X^{y=z} = (X, x\ne y\land y=z\land z\ne x), \\ \tau_X^{=} = (X, x=y\land x= z\land y= z), & \tau_C^{=} = (C, z=w), \text{ and } \tau_C^{\ne} = (C, z\ne w), \end{array} $$

where \(X \in \{A, B\}\).^{5} Let

$$\begin{array}{@{}rcl@{}} {\mathcal{R}} = (\tau_B^{\ne}, \tau_C^{=}, \tau_C^{\ne}, \tau_B^{x=z}, \tau_A^{x=y}, \tau_A^{\ne}, \tau_A^{x=z}, \tau_A^=, \tau_B^=, \tau_A^{y=z}, \tau_B^{x=y}, \tau_B^{y=z}). \end{array} $$

Then, the output of algorithm BDS-BUILD(\(Q\)) is the BDS

$$\begin{array}{@{}rcl@{}} \mathit{S} = \{{}& (\tau_B^{\ne}, \emptyset), (\tau_B^{x=z}, \emptyset), (\tau_A^{x=y}, \emptyset), (\tau_A^{\ne}, \{\tau_B^{\ne}\}), (\tau_A^{x=z}, \{\tau_B^{x=z}\}),\\ & (\tau_A^=, \emptyset), (\tau_B^{=}, \{\tau_A^{=}\}), (\tau_A^{y=z}, \emptyset), (\tau_B^{x=y}, \{\tau_A^{x=y}\}), (\tau_B^{y=z}, \{\tau_A^{y=z}\})\}. \end{array} $$

Observe that the atomic types \(\tau _C^{=}\) and \(\tau _C^{\ne }\) are not part of \(S\) because the variable containment condition is not satisfied by the earlier included atomic type \(\tau _B^{\ne }\).

Observe that the constructed BDS \(S\) can be simplified by merging multiple atomic types into partial atomic types; e.g., for

$$\begin{array}{@{}rcl@{}} \mathit{S}^{\prime} = \{(\tau_A, \{\tau_B^{\ne}, \tau_B^{x=z}\}), (\tau_B, \{\tau_A^{x=y}, \tau_A^{{=}}, \tau_A^{y=z}\})\}, \end{array} $$

we have \(f_{\mathit {S}} = f_{\mathit {S}^{\prime }}\).
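The run in part (2) can be replayed with a small Python sketch. It is an illustration under stated assumptions, not the paper's Algorithm 1: a complete atomic type is encoded as a predicate together with a partition of its variables into equality classes, two distinct types are taken to be compatible when their partitions agree on the shared variables, and \(\mathit{Vars}(\omega_1) \subseteq_{\omega_1} \mathit{Vars}(\omega)\) is read as "every equality class of \(\omega_1\) contains a variable of \(\omega\)". All function and variable names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Partition = FrozenSet[FrozenSet[str]]
AType = Tuple[str, Partition]  # (predicate, equality classes of its variables)


def part(*blocks: str) -> Partition:
    """part('xz', 'y') encodes x = z, with y different from both."""
    return frozenset(frozenset(b) for b in blocks)


def variables(t: AType) -> FrozenSet[str]:
    return frozenset(v for block in t[1] for v in block)


def restrict(t: AType, shared: FrozenSet[str]) -> Set[FrozenSet[str]]:
    """Equality classes of t, restricted to the shared variables."""
    return {block & shared for block in t[1] if block & shared}


def compatible(t1: AType, t2: AType) -> bool:
    # Assumed test: distinct types that agree on their shared variables.
    shared = variables(t1) & variables(t2)
    return t1 != t2 and restrict(t1, shared) == restrict(t2, shared)


def contained(t1: AType, t: AType) -> bool:
    # Every variable of t1 equals, under t1, some variable of t.
    return all(block & variables(t) for block in t1[1])


def bds_build(types: List[AType]) -> Dict[AType, FrozenSet[AType]]:
    bds: Dict[AType, FrozenSet[AType]] = {}
    for t in types:
        if t in bds:
            continue
        deps = [k for k in bds if compatible(k, t)]
        if all(contained(k, t) for k in deps):
            bds[t] = frozenset(deps)
    return bds


def family(pred: str) -> Dict[str, AType]:
    """The five complete types of a ternary atom over x, y, z."""
    return {"ne": (pred, part("x", "y", "z")), "xz": (pred, part("xz", "y")),
            "xy": (pred, part("xy", "z")), "yz": (pred, part("x", "yz")),
            "eq": (pred, part("xyz"))}


A, B = family("A"), family("B")
C_eq, C_ne = ("C", part("zw")), ("C", part("z", "w"))
sequence = [B["ne"], C_eq, C_ne, B["xz"], A["xy"], A["ne"], A["xz"],
            A["eq"], B["eq"], A["yz"], B["xy"], B["yz"]]
S = bds_build(sequence)
```

Under these assumptions the sketch produces the ten pairs listed for \(S\) above and skips both \(C\)-types, since the class \(\{x\}\) of \(\tau_B^{\ne}\) contains no variable of the \(C\)-atom.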

Notice that when \(\mathcal {R}\) consists of the complete or open atomic types, adding pairs to a given BDS *S* as is done by BDS-BUILD(*Q*) results in a BDS *S*^{′} that describes an OBF that broadcasts strictly fewer facts, i.e., \(f_{\mathit {S}^{\prime }} \subsetneq f_{\mathit {S}}\). That is, adding pairs optimizes the OBF.

*Remark 5*

By construction, BDS-BUILD(Q) prevents any circular dependencies by stratifying the construction of *S* so that partial atomic types can only depend on partial atomic types that were added before. As illustrated in Example 6(4), dependencies in a BDS can also be circular. To allow for these, BDS-BUILD can be modified as follows: as an alternative to adding pairs (*τ*,*T*) where every existing key that is compatible with *τ* is included in *T*, we can allow adding pairs where some keys that are compatible with *τ* are in *T*, and for every other compatible key, its value set is expanded to contain *τ*; i.e., allowing pairs of the form (*ω*,*D*), where *D* is a subset of \(C = \{\omega ^{\prime } \in \mathit {Keys}(\mathit {S}) \mid \omega ^{\prime } \thicksim _{Q} \omega \}\) satisfying \(\mathit {Vars}(\omega ^{\prime }) \subseteq _{\omega ^{\prime }} \mathit {Vars}(\omega )\) for every *ω*^{′}∈*D*, and where every existing pair (*ω*^{′},*T*), where *ω*^{′}∈*C*∖*D*, is expanded to (*ω*^{′},*T*∪{*ω*}). Particularly notice that when a given BDS *S* is changed to *S*^{′} by adding a pair and expanding at least one of the existing pairs as described above, the inherent nature of the described OBF changes, so that not necessarily \(f_{\mathit {S}'} \subsetneq f_{\mathit {S}}\).

*Remark 6*

Although the machinery developed throughout this paper is motivated by gaining a better understanding of the spectrum of locally optimal OBFs, the reader may notice that when no (statistical) information on the actual distribution of the data is available, there is no basis to favor one locally optimal OBF over another.

\(Q\) which is as good as any locally optimal one (when no additional information on the distribution of the data is available). Indeed, consider an arbitrary order on the predicates of \(Q\):

for every local fact \(\mathbf{f}\), with predicate \(R\), if there is an earlier predicate \(S\) such that some variable in \(Vars(S)\) is not in \(Vars(R)\), fact \(\mathbf{f}\) is broadcast; otherwise, fact \(\mathbf{f}\) is broadcast only if not all the facts induced by \(V_{\mathbf{f}}\) on query \(Q\) are in the local instance.

Of course, not every locally optimal OBF can take this form.
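The order-based strategy of this remark can be sketched as follows. The sketch rests on two assumptions that go beyond the text: the facts induced by \(V_{\mathbf{f}}\) are taken over the atoms of earlier predicates only, and a fact stays local exactly when all of these induced facts are present in the local instance (the reading that agrees with the definition of \(f_{\mathcal{S}}\)). The encoding of facts and queries and the names `valuation` and `broadcast` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Fact = Tuple[str, Tuple[int, ...]]          # (predicate, values)
Query = List[Tuple[str, Tuple[str, ...]]]   # atoms, listed in the chosen order


def valuation(atom_vars: Sequence[str],
              values: Sequence[int]) -> Optional[Dict[str, int]]:
    """Partial valuation V_f, or None if the fact does not match the atom."""
    if len(atom_vars) != len(values):
        return None
    val: Dict[str, int] = {}
    for var, value in zip(atom_vars, values):
        if val.setdefault(var, value) != value:
            return None  # a repeated variable would need two values
    return val


def broadcast(query: Query, local: Set[Fact]) -> Set[Fact]:
    """Facts of the local instance that the order-based strategy sends."""
    out: Set[Fact] = set()
    for pred, values in local:
        index = next((i for i, (p, _) in enumerate(query) if p == pred), None)
        if index is None:
            continue  # predicate not in Q: the fact cannot contribute
        val = valuation(query[index][1], values)
        if val is None:
            continue  # no valuation maps the atom onto this fact
        earlier = query[:index]
        # Some earlier atom has a variable this atom lacks: always broadcast.
        if any(not set(vs) <= set(val) for _, vs in earlier):
            out.add((pred, values))
            continue
        # Otherwise broadcast unless all induced earlier facts are local.
        induced = {(p, tuple(val[v] for v in vs)) for p, vs in earlier}
        if not induced <= local:
            out.add((pred, values))
    return out


# Query of Example 7 with predicate order A, B, C.
query: Query = [("A", ("x", "y", "z")), ("B", ("x", "y", "z")), ("C", ("z", "w"))]
local: Set[Fact] = {("A", (1, 2, 3)), ("B", (1, 2, 3)),
                    ("B", (4, 5, 6)), ("C", (3, 7))}
sent = broadcast(query, local)
```

On this instance the \(A\)-fact and the matching \(B\)-fact stay local, while the unmatched \(B\)-fact and the \(C\)-fact are sent.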

## 7 Discussion

We investigated locally optimal oblivious broadcasting functions represented by the formalism of broadcast dependency sets. We obtained semantical and syntactical characterizations, showed completeness of BDSs for representing locally optimal OBFs, and gave an algorithm for constructing locally optimal OBFs for a given conjunctive query. We present several directions for future work: more expressive query languages, incorporating background knowledge, and non-oblivious broadcast functions.

An obvious question is how to generalize our results to the class of all conjunctive queries (possibly extended with negation) or even to (subsets of) Datalog. A first step would be to get rid of the fullness-restriction and to allow self-joins. When removing these restrictions, output facts may have non-unique valuations, which makes reasoning about local optimality much more complex. Of course, to evaluate non-monotonic queries in a coordination-free manner, computing nodes need more information on how data is distributed (cf. [6]).

We only discussed how to build a BDS when no information about the way data is distributed is available. Indeed, the best one can do is to let a BDS cover as many types as possible, but at the same time introduce as few dependencies as possible, as these are likely to fail when data is arbitrarily distributed. It would be interesting to devise optimal broadcasting algorithms that take more background knowledge into account, such as information about clustering of attributes, foreign keys, or cardinality of relations.

Another interesting direction for future work is to investigate non-oblivious broadcasting functions where over time, when new messages arrive, static facts can become broadcast facts (but not vice versa). Such functions are initially more conservative, keeping more facts static and only broadcasting facts when there is some evidence that they can be used at another computing node. For instance, consider the setting of Example 1. Rather than immediately sending *B*(*i*,*j*) whenever *A*(*j*,*i*) is locally absent, broadcasting is suspended until a *C*-fact of the form *C*(*j*,*k*) is received. The rationale is that a *B*-fact that cannot contribute to a locally satisfying valuation should only be broadcast when some evidence is received that it could potentially contribute to a satisfying valuation on a remote node. For our example this means that *c* waits to send *B*(2,1) until *C*(1,3) arrives. Moreover, *B*(4,4) is never sent. While non-oblivious strategies might seem more attractive as they transmit fewer tuples, such strategies, while remaining coordination-free, can increase the overall evaluation time.

## Footnotes

- 1.
- 2. To simplify notation, in the definition of *B* and *eval*, we do not mention *I* and \(\mathcal {N}\) as they are implied by *H*.
- 3. Notice that we abuse the notation and interpret variables as values.
- 4. We use a sequence rather than a set \(\mathcal {R}\) to keep BDS-BUILD deterministic.
- 5. For convenience we represent atomic types here by partial atomic types with sufficient (but not complete) conditions; e.g., we write (*C*,*x*=*y*) to denote (*C*,*x*=*y*∧*y*=*x*). Nevertheless, all of the listed pairs indeed correspond to a single (complete) atomic type.

## Notes

### Acknowledgment

We thank Phokion Kolaitis for raising the question whether it is always necessary to broadcast all the data in the context of the work in [5]. We thank the reviewers for their in-depth comments and numerous suggestions for improving the presentation of the results.

### References

- 1. Afrati, F.N., Koutris, P., Suciu, D., Ullman, J.D.: Parallel skyline queries. In: International Conference on Database Theory (ICDT), pp 274–284 (2012)
- 2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: International Conference on Extending Database Technology (EDBT), pp 99–110 (2010)
- 3. Alvaro, P., Conway, N., Hellerstein, J., Marczak, W.R.: Consistency analysis in bloom: a CALM and collected approach. In: Conference on Innovative Data Systems Research (CIDR), pp 249–260 (2011)
- 4. Alvaro, P., Conway, N., Hellerstein, J.M., Maier, D.: Blazes: Coordination analysis for distributed programs. In: International Conference on Data Engineering (ICDE), pp 52–63. IEEE (2014)
- 5. Ameloot, T.J., Ketsman, B., Neven, F., Zinn, D.: Weaker forms of monotonicity for declarative networking: a more fine-grained answer to the CALM-conjecture. In: Symposium on Principles of Database Systems (PODS), pp 64–75. ACM (2014)
- 6. Ameloot, T.J., Neven, F., Bussche, J.V.d.: Relational transducers for declarative networking. J. ACM **60**(2), 15 (2013)
- 7. Beame, P., Koutris, P., Suciu, D.: Communication steps for parallel query processing. In: Symposium on Principles of Database Systems (PODS), pp 273–284 (2013)
- 8. Beame, P., Koutris, P., Suciu, D.: Skew in parallel query processing. In: Symposium on Principles of Database Systems (PODS), pp 212–223 (2014)
- 9. Buneman, P., Cheney, J., Tan, W.C., Vansummeren, S.: Curated databases. In: Symposium on Principles of Database Systems (PODS), pp 1–12. ACM (2008)
- 10. Buneman, P., Khanna, S., Tan, W.C.: Why and where: A characterization of data provenance. In: International Conference on Database Theory (ICDT), volume 1973 of Lecture Notes in Computer Science, pp 316–330. Springer (2001)
- 11. Conway, N., Marczak, W.R., Alvaro, P., Hellerstein, J.M., Maier, D.: Logic and lattices for distributed programming. In: Symposium on Cloud Computing (SoCC), p 1. ACM (2012)
- 12. Fan, W., Geerts, F., Libkin, L.: On scale independence for querying big data. In: Symposium on Principles of Database Systems (PODS), pp 51–62. ACM (2014)
- 13. Ganguly, S., Silberschatz, A., Tsur, S.: Parallel bottom-up processing of datalog queries. J. Log. Program. **14**(1&2), 101–126 (1992)
- 14. Hellerstein, J.M.: The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Rec. **39**(1), 5–19 (2010)
- 15. Ketsman, B., Neven, F.: Optimal broadcasting strategies for conjunctive queries over distributed data. In: International Conference on Database Theory (ICDT), pp 291–307 (2015)
- 16. Koutris, P., Suciu, D.: Parallel evaluation of conjunctive queries. In: Symposium on Principles of Database Systems (PODS), pp 223–234 (2011)
- 17. Meliou, A., Gatterbauer, W., Halpern, J.Y., Koch, C., Moore, K.F., Suciu, D.: Causality in databases. IEEE Data Engineering Bulletin **33**(3), 59–67 (2010)
- 18. Meliou, A., Gatterbauer, W., Moore, K.F., Suciu, D.: The complexity of causality and responsibility for query answers and non-answers. Proceedings of the VLDB Endowment (PVLDB) **4**(1), 34–45 (2010)
- 19. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp 15–28. USENIX Association (2012)
- 20. Zinn, D., Green, T.J., Ludäscher, B.: Win-move is coordination-free (sometimes). In: International Conference on Database Theory (ICDT), pp 99–113 (2012)