Abstract
The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In prior work, we have proposed general dynamic Yannakakis (GDyn), a general framework for dynamically processing acyclic conjunctive queries with \(\theta \)-joins in the presence of data updates. Whereas traditional approaches face a trade-off between materialization of subresults (to avoid inefficient recomputation) and recomputation of subresults (to avoid the potentially large space overhead of materialization), GDyn is able to avoid this trade-off. It intelligently maintains a succinct data structure that supports efficient maintenance under updates and from which the full query result can quickly be enumerated. In this paper, we consolidate and extend the development of GDyn. First, we give a full formal proof of GDyn's correctness and complexity. Second, we present a novel algorithm for computing GDyn query plans. Finally, we instantiate GDyn to the case where all \(\theta \)-joins are inequalities and present an extended experimental comparison against state-of-the-art engines. Our approach performs consistently better than the competitor systems, with improvements of multiple orders of magnitude in both time and memory consumption.
Notes
 1.
Note that, in this framework, value modifications inside a tuple are modeled by deleting the tuple with the old value, and then reinserting the tuple, but now with the new value.
 2.
Note that such queries may also contain equijoins by sharing variables between atoms.
 3.
Strictly speaking, we described in section that R needs to be sorted lexicographically, first on \(\overline{x} \cap \overline{y}\) and then on x. The grouping + sorting of the enumeration index obtains the same result.
 4.
In the conference version of this paper [26], there was an incorrect claim: we stated that updates could be processed in time \(O(M\cdot \log (M))\) in the general case of multiple inequalities. We then found a bug in our proof and we currently do not know if this bound can be achieved.
 5.
Note that because we set \({\textit{out}}(\mathcal {I}) = \emptyset \) on the residual, new variables may become isolated and therefore more reduction steps may be possible on the normal form of \(\mathcal {I}\).
 6.
In the sense that batch updates are only supported by treating each update tuple in the batch individually.
 7.
Should \(X_2 {\setminus } X_1\) be empty, we don’t actually need to do anything on \(\mathcal {I}_1\): \(X_1 \cup X_2\) is already removed from it. A similar remark holds for \(\mathcal {I}_2\) when \(X_1 {\setminus } X_2\) is empty.
 8.
Note that, since \(e_1\) does not share variables with any predicate, the CSE operation also does not remove any predicates from \(\mathcal {H}_1\), similar to the ISO operation and hence yields \(\mathcal {I}_1\).
 9.
Note that all leaves have a parent since the root of \(T_1\) is an interior node labeled by \(\emptyset \).
References
 1.
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley Longman Publishing Co., Inc., Boston (1995)
 2.
Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In: Proceedings of PODS, pp. 13–28 (2016)
 3.
Agrawal, J., Diao, Y., Gyllstrom, D., Immerman, N.: Efficient pattern matching over event streams. Proc. SIGMOD 2008, 147–160 (2008)
 4.
Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: STREAM: the Stanford data stream management system. In: Data Stream Management—Processing High-Speed Data Streams, pp. 317–336 (2016)
 5.
Baader, F., Nipkow, T.: Term Rewriting and All That. Cambridge University Press, Cambridge (1998)
 6.
Bagan, G., Durand, A., Grandjean, E.: On acyclic conjunctive queries and constant delay enumeration. In: Proceedings of CSL, pp. 208–222 (2007)
 7.
Bakibayev, N., Kočiský, T., Olteanu, D., Závodný, J.: Aggregation and ordering in factorised databases. Proc. VLDB 6(14), 1990–2001 (2013)
 8.
Berkholz, C., Keppeler, J., Schweikardt, N.: Answering conjunctive queries under updates. In: Proceedings of PODS, pp. 303–318 (2017)
 9.
Bernstein, P.A., Goodman, N.: The power of inequality semijoins. Inf. Syst. 6(4), 255–265 (1981)
 10.
Brault-Baron, J.: De la pertinence de l’énumération: complexité en logiques. Ph.D. thesis, Université de Caen (2013)
 11.
Brenna, L., Demers, A.J., Gehrke, J., Hong, M., Ossher, J., Panda, B., Riedewald, M., Thatte, M., White, W.M.: Cayuga: a high-performance event processing engine. Proc. SIGMOD 2007, 1100–1102 (2007)
 12.
Chirkova, R., Yang, J.: Materialized views. Found. Trends Databases 4(4), 295–405 (2012)
 13.
Cormen, T.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
 14.
Cugola, G., Margara, A.: TESLA: a formally defined event specification language. Proc. DEBS 2010, 50–61 (2010)
 15.
Cugola, G., Margara, A.: Complex event processing with TREX. J. Syst. Softw. 85(8), 1709–1728 (2012)
 16.
Cugola, G., Margara, A.: Processing flows of information: from data stream to complex event processing. ACM Comput. Surv. 44(3), 15:1–15:62 (2012)
 17.
DeWitt, D.J., Naughton, J.F., Schneider, D.A.: An evaluation of non-equijoin algorithms. VLDB 1991, 443–452 (1991)
 18.
Enderle, J., Hampel, M., Seidl, T.: Joining interval data in relational databases. Proc. SIGMOD 2004, 683–694 (2004)
 19.
EsperTech. Esper complex event processing engine. http://www.espertech.com/
 20.
Golab, L., Özsu, M.T.: Processing sliding window multi-joins in continuous queries over data streams. In: Proceedings of VLDB, pp. 500–511 (2003)
 21.
Gupta, A., Mumick, I.S. (eds.): Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge (1999)
 22.
Gupta, A., Mumick, I.S., Subrahmanian, V.S.: Maintaining views incrementally. In: Proceedings of SIGMOD, pp. 157–166 (1993)
 23.
Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: VLDB’95, pp. 562–573 (1995)
 24.
Henzinger, M., Krinninger, S., Nanongkai, D., Saranurak, T.: Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In: Proceedings of STOC, pp. 21–30 (2015)
 25.
Idris, M., Ugarte, M., Vansummeren, S.: The dynamic Yannakakis algorithm: compact and efficient query processing under updates. In: Proceedings of SIGMOD 2017 (2017)
 26.
Idris, M., Ugarte, M., Vansummeren, S., Voigt, H., Lehner, W.: Conjunctive queries with inequalities under updates. PVLDB 11(7), 733–745 (2018)
 27.
Kang, J., Naughton, J.F., Viglas, S.: Evaluating window joins over unbounded streams. In: Proceedings of ICDE, pp. 341–352 (2003)
 28.
Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N., Kalnis, P.: Fast and scalable inequality joins. VLDB J. 26(1), 125–150 (2017)
 29.
Koch, C.: Incremental query evaluation in a ring of databases. In: Proceedings of PODS, pp. 87–98 (2010)
 30.
Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Lupei, D., Shaikhha, A.: DBToaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J. 23, 253–278 (2014)
 31.
Mei, Y., Madden, S.: ZStream: a cost-based query processor for adaptively detecting composite events. Proc. SIGMOD 2009, 193–206 (2009)
 32.
Nikolic, M., Olteanu, D.: Incremental view maintenance with triple lock factorization benefits. Proc. SIGMOD 2018, 365–380 (2018)
 33.
Olteanu, D., Závodný, J.: Size bounds for factorised representations of query results. ACM TODS 40(1), 2:1–2:44 (2015)
 34.
Roy, P., Teubner, J., Gemulla, R.: Low-latency handshake join. PVLDB 7(9), 709–720 (2014)
 35.
Sahay, B., Ranjan, J.: Real time business intelligence in supply chain analytics. Inf. Manage. Comput. Secur. 16(1), 28–48 (2008)
 36.
Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over factorized joins. In: Proceedings of SIGMOD, pp. 3–18 (2016)
 37.
Schultz-Møller, N.P., Migliavacca, M., Pietzuch, P.R.: Distributed complex event processing with query rewriting. In: Proceedings of DEBS 2009 (2009)
 38.
Segoufin, L.: Constant delay enumeration for conjunctive queries. SIGMOD Rec. 44(1), 10–17 (2015)
 39.
Stonebraker, M., Çetintemel, U., Zdonik, S.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)
 40.
Teubner, J., Müller, R.: How soccer players would do stream joins. In: Proceedings of SIGMOD, pp. 625–636 (2011)
 41.
Urhan, T., Franklin, M.J.: XJoin: a reactively-scheduled pipelined join operator. IEEE Data Eng. Bull. 23(2), 27–33 (2000)
 42.
Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: Proceedings of STOC, pp. 137–146 (1982)
 43.
Viglas, S., Naughton, J.F., Burger, J.: Maximizing the output rate of multi-way join queries over streaming information sources. In: Proceedings of VLDB, pp. 285–296 (2003)
 44.
Wang, W., Gao, J., Zhang, M., Wang, S., Chen, G., Ng, T.K., Ooi, B.C., Shao, J., Reyad, M.: Rafiki: machine learning as an analytics service system. PVLDB 12(2), 128–140 (2018)
 45.
Wilschut, A.N., Apers, P.M.G.: Dataflow query execution in a parallel main-memory environment. In: Proceedings of the First International Conference on Parallel and Distributed Information Systems (PDIS 1991), pp. 68–77. IEEE Computer Society (1991)
 46.
Wu, E., Diao, Y., Rizvi, S.: High-performance complex event processing over streams. Proc. SIGMOD 2006, 407–418 (2006)
 47.
Yannakakis, M.: Algorithms for acyclic database schemes. In: Proceedings of VLDB, pp. 82–94 (1981)
 48.
Yoshikawa, M., Kambayashi, Y.: Processing inequality queries based on generalized semijoins. In: VLDB, pp. 416–428 (1984)
 49.
Zhang, H., Diao, Y., Immerman, N.: On complexity and optimization of expensive queries in complex event processing. In: Proceedings of SIGMOD (2014)
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
M. Ugarte This work was done while the author was affiliated to ULB, Belgium.
H. Voigt This work was done while the author was affiliated to TU Dresden, Germany.
Dr. Sihem Amer-Yahia.
Appendices
Proofs of Sect. 4
Lemma 1 \(\rho _n = {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db})\), for every node \(n \in T\).
Proof
We proceed by induction on the number of descendants of n. If n has no descendants, then \(T_n\) is a single atom \(r(\overline{x})\) with \(\overline{x} = \textit{var}(n) = {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}})\). Then, \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db}) = (\pi _{\textit{var}(n)} r(\overline{x}))(\textit{db})= r(\overline{x}) (\textit{db})=\textit{db}_{r(\overline{x})}=\rho _n\), concluding the base case. Now, for the inductive case, we distinguish whether n has one or two children.
Assume n has a single child c. Then, \(\textit{at}(T_n) = \textit{at}(T_c)\) and \(\textit{pred}(T_n) = \textit{pred}(T_c) \cup {\textit{pred}}(n)\). Therefore, by definition of \({\mathcal {Q}}{\texttt {[}}\cdot {\texttt {]}}\), we have \({\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \sigma _{{\textit{pred}}(n)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}}\), which implies that \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}} = \pi _{\textit{var}(n)}{\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}}\). Furthermore, since \({\textit{pred}}(n)\) only mentions variables in \(\textit{var}(c) \cup \textit{var}(n)\) and \(\textit{var}(n)\subseteq \textit{var}(c)\), as c is a guard of n, this is equivalent to \(\pi _{\textit{var}(n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}\).
By induction, \(\pi _{\textit{var}(n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db}) = \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\rho _c = \rho _n\), showing that \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db})=\rho _n\).
Assume now that n has two children \(c_1\) and \(c_2\). We assume w.l.o.g. that \(c_1\) is a guard for n. Note that \(\textit{at}(T_n) = \textit{at}(T_{c_1}) \cup \textit{at}(T_{c_2})\) and \(\textit{pred}(T_n)= \textit{pred}(T_{c_1}) \cup \textit{pred}(T_{c_2}) \cup \textit{pred}(n)\). Therefore,
Here, we abuse notation and write \(\textit{at}(T_i)\) for the natural join of all atoms in \(T_{c_i}\). Since \(\textit{pred}(T_{c_i})\) only mentions variables of atoms in \(T_{c_i}\) (for \(i\in \{1,2\}\)), we can push the selections:
Therefore,
Since \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c_1)\cup \textit{var}(c_2)\cup \textit{var}(n)\) and \(\textit{var}(n)\subseteq \textit{var}(c_1)\), we have \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c_1)\cup \textit{var}(c_2)\). This is combined with the fact that, due to the connectedness property of T, we have \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}) \subseteq \textit{var}(c_i)\) for \(i\in \{1,2\}\), we can add the following projections
Hence, by induction hypothesis, we have
concluding our proof. \(\square \)
Lemma 3

1.
\(Q(\textit{db})\) is a positive GMR, for any GCQ \(Q\) and any database \(\textit{db}\).

2.
If \(R\) is a positive GMR over \(\overline{x}\) and \(\overline{y} \subseteq \overline{x}\), then \(\mathbf {t}[\overline{y}] \in \pi _{\overline{y}} R\) for every tuple \(\mathbf {t} \in R\).
Proof
(1) Follows by straightforward induction on Q, using the fact that the GMRs in \(\textit{db}\) are themselves positive by definition. (2) Is a standard result in relational algebra, which hence transfers to the case of positive GMRs. \(\square \)
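To make the notion of positive GMR used in Lemmas 3 and 10 concrete, here is a small sketch. The encoding (a GMR as a dict from tuples, represented as frozensets of variable/value pairs, to non-zero multiplicities) is an assumption for illustration, not the paper's implementation; it shows why Lemma 3(2) holds: projection sums multiplicities, so in a positive GMR no cancellation occurs and every projected tuple survives.

```python
# Illustrative model (an assumption, not the paper's code): a GMR maps tuples
# (frozensets of variable/value pairs) to non-zero integer multiplicities; it
# is positive when all multiplicities are > 0.

def project(gmr, ys):
    """Compute pi_ys(gmr): restrict each tuple to ys and sum multiplicities."""
    out = {}
    for tup, mult in gmr.items():
        key = frozenset((x, v) for (x, v) in tup if x in ys)
        out[key] = out.get(key, 0) + mult
    # tuples whose multiplicities cancel to 0 are absent from the result
    return {t: m for t, m in out.items() if m != 0}

R = {frozenset({("x", 1), ("y", 2)}): 2,
     frozenset({("x", 1), ("y", 3)}): 1}
P = project(R, {"x"})
# P maps the tuple (x = 1) to multiplicity 3: since R is positive, every
# tuple t in R has t[x] present in pi_x R, as claimed by Lemma 3(2)
```

With negative multiplicities allowed, the two multiplicities could instead cancel and the projected tuple would disappear, which is exactly why positivity is assumed.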
Lemma 10
Let R be a positive GMR over \(\overline{x}\), S a positive GMR over \(\overline{y}\) and \(\mathbf {t}\) a tuple over \(\overline{z}\). If \(\overline{z} \subseteq \overline{y} \subseteq \overline{x}\), then \(R \ltimes (S \ltimes \mathbf {t}) = (R \ltimes S) \ltimes \mathbf {t}\).
Proof
This result is well known in standard relational algebra, and its proof transfers to the case of positive GMRs. \(\square \)
Lemma 2 For every node \(n\in N\) and every tuple \(\mathbf {t}\) in \(\rho _n\), \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) enumerates \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) \ltimes \mathbf {t}\).
Proof
Let \(n \in N\) and \(\mathbf {t}\in \rho _n\). We need to show that executing \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) outputs all (tuple, multiplicity) pairs of \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) \ltimes \mathbf {t}\) exactly once. We proceed by induction on the number of nodes in \(N_n\). If \(N_n = \{n\}\), then \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} = {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}\). Therefore, by Lemma 1, \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) = \rho _n\). Since \(\mathbf {t}\in \rho _n\), this implies that the only tuple in \({\mathcal {Q}}{\texttt {[}}T_n,N_n{\texttt {]}}(\textit{db})\) that is compatible with \(\mathbf {t}\) is \(\mathbf {t}\) itself. Furthermore, since \(N_n = \{n\}\), n must be in the frontier of N. Therefore, \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) will output precisely \(\{(\mathbf {t}, \rho _n(\mathbf {t}))\}\) (line 4), which concludes the base case.
For the inductive step we need to consider two cases depending on the number of children of n.
Case 1 If n has a single child c, then necessarily c is a guard of n, i.e., \(\textit{var}(n) \subseteq \textit{var}(c)\). In this case, Algorithm 1 will call \(\textsc {enum}_{T, N}(c, \mathbf {s}, \rho )\) for each tuple \(\mathbf {s}\in \left( \rho _c\ltimes _{{\textit{pred}}(n)} \mathbf {t}\right) \). By induction hypothesis and Lemma 1, this will correctly enumerate and output the elements of \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\ltimes \mathbf {s}\), for every \(\mathbf {s}\) in \({\mathcal {Q}}{\texttt {[}}T_c, c{\texttt {]}}(\textit{db})\ltimes _{{\textit{pred}}(n)} \mathbf {t}\). Note that the sets \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\ltimes \mathbf {s}\) are disjoint for different values of \(\mathbf {s}\). Thus, no element is output twice. Hence, \(\textsc {enum}_{T, N}(n, \mathbf {t}, \rho )\) enumerates the GMR
Since \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c) \cup \textit{var}(n) = \textit{var}(c) = {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}})\), we can pull out the selection:
Subsequently, because \(\textit{var}({\textit{pred}}(n)) \subseteq \textit{var}(c) \subseteq {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}})\), we can pull out the selection again:
Because the variables in \(\mathbf {t}\) are a subset of \(\textit{var}(c)\), because \(\textit{var}(c) \subseteq \textit{var}(N_c)\), and because \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\) and \({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db})\) are positive (Lemma 3(1)), we can apply Lemma 10:
Next, observe that, since \(\textit{var}(c) \subseteq \textit{var}(N_c)\) as \(c \in N_c\), we have
Then, because \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\) is positive, we obtain from Lemma 3(2) that
Finally, because \(\textit{var}({\textit{pred}}(n)) \subseteq \textit{var}(c) \subseteq \textit{var}(N_c)\), we push the selection again and obtain
Here, the last equality is due to the fact that \(\textit{var}(N_n) = \textit{var}(n) \cup \textit{var}(N_c) = \textit{var}(N_c)\), as \(\textit{var}(n) \subseteq \textit{var}(c)\) and \(c \in N_c\), which implies that projecting on \(\textit{var}(N_n)\) does not modify the result. The result then follows from the observation that \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} \equiv \pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}\).
Case 2 Otherwise, n has two children \(c_1\) and \(c_2\). We assume w.l.o.g. that \(c_1\) is a guard of n, i.e., \(\textit{var}(n) \subseteq \textit{var}(c_1)\). Since \(|N_n| > 1\) and N is sibling closed, we have \(\{c_1,c_2\}\subseteq N\). In this case, Algorithm 1 will first enumerate \(\mathbf {t_i} \in \rho _{c_i}\ltimes _{{\textit{pred}}(n\rightarrow c_i)} \mathbf {t}\) for \(i\in \{1, 2\}\). By Lemma 1, this is equivalent to enumerating every \(\mathbf {t_i}\) in \({\mathcal {Q}}{\texttt {[}}T_{c_i}, c_i{\texttt {]}}(\textit{db})\ltimes _{{\textit{pred}}(n\rightarrow c_i)} \mathbf {t}\). Then, for each such \(\mathbf {t_i}\) the algorithm will enumerate every pair \((\mathbf {s_i}, \mu _i)\) generated by \(\textsc {enum}_{T, N}(c_i, \mathbf {t_i}, \rho )\), which by induction is the same as enumerating every \((\mathbf {s_i}, \mu _i)\) in \({\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db})\ltimes \mathbf {t_i}\). Note that the sets \({\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db})\ltimes \mathbf {t_i}\) are disjoint for distinct \(\mathbf {t_i}\). Therefore, no \((\mathbf {s_i},\mu _i)\) is generated twice. The algorithm is hence enumerating
By the same reasoning as in Case (1), this is equivalent to enumerating every \((\mathbf {s_i}, \mu _i)\) in \((\sigma _{{\textit{pred}}(n\rightarrow c_i)}{\mathcal {Q}}{\texttt {[}}T_{c_i}{\texttt {]}}(\textit{db}))\ltimes \mathbf {t}.\) From the connectedness property of T, it follows that \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}) \subseteq \textit{var}(n)\). Thus, \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}})\) is a subset of the variables of \(\mathbf {t}\). Hence, every tuple \(\mathbf {s_1}\) will be compatible with every tuple \(\mathbf {s_2}\), and therefore, enumeration of every pair \((\mathbf {s_1}\cup \mathbf {s_2},\mu _1\times \mu _2)\) is the same as the enumeration of
The semijoin with \(\mathbf {t}\) factors out of the join:
We can now pull out the selections and obtain
Here, the last equality is due to the fact that \(\textit{var}(N_n) = \textit{var}(n) \cup \textit{var}(N_{c_1}) \cup \textit{var}(N_{c_2}) = \textit{var}(N_{c_1}) \cup \textit{var}(N_{c_2})\), as \(\textit{var}(n) \subseteq \textit{var}(c_1) \subseteq \textit{var}(N_{c_{1}})\). This implies that
Hence, projecting the join result on \(\textit{var}(N_n)\) does not modify the result. The result then follows from the observation that \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} \equiv \pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} ({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}})\). \(\square \)
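The combination step in Case 2 above has a simple operational reading: because every \(\mathbf {s_1}\) is compatible with every \(\mathbf {s_2}\), the outputs of the two recursive calls are cross-combined with multiplied multiplicities. The sketch below is a hedged illustration of just that step (tuples as dicts; the names and encoding are assumptions, not the paper's Algorithm 1):

```python
def combine(left, right):
    """Cross-combine two streams of (tuple, multiplicity) pairs.

    Assumes every tuple of `left` is compatible with every tuple of `right`
    (they agree on shared variables), as guaranteed in Case 2 of Lemma 2's
    proof by the connectedness property of T.
    """
    for s1, m1 in left:
        for s2, m2 in right:
            merged = {**s1, **s2}   # union of the two tuples
            yield merged, m1 * m2   # multiplicities multiply

left = [({"x": 1}, 2)]
right = [({"y": 3}, 1), ({"y": 4}, 5)]
out = list(combine(left, right))
# [({'x': 1, 'y': 3}, 2), ({'x': 1, 'y': 4}, 10)]
```

Note that each output pair is produced with constant work once the inputs are available, which is the property the enumeration argument relies on.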
Proposition 4 Assume that all join indexes in the (T, N)-rep have access time \(g\), and that all indexes (join and enumeration) have update time \(h\), where \(g\) and \(h\) are monotone functions. Further assume that, during the entire execution of \(\textsc {update}\), \(K\) and \(U\) bound the size of \(\rho _n\) and \(\Delta _n\), respectively, for all \(n\). Then, \(\textsc {update}_{T,N}(\rho ,{ u })\) runs in time \({\mathcal {O}}\left( |T| \cdot \left( U + h(K, U) + g(K,U) \right) \right) \).
Proof
First, note that the initialization of \(\Delta _n\) in line 15 can be done in \({\mathcal {O}}(U)\) time (by copying \({ u }_{r(\overline{x})}\) to \(\Delta _n\) tuple by tuple) and the initialization of \(\Delta _n\) in line 17 in \({\mathcal {O}}(1)\) time. Therefore, lines 14–17 run in \({\mathcal {O}}(|T|\cdot U)\) time, which falls within the claimed bounds. We next show that the for loop of lines 18–23 also runs within the claimed bounds. Since the body of this for loop is executed \(|T|\) times, it suffices to show that each of the lines 19–23 runs in time \({\mathcal {O}}(U + h(K, U) + g(K,U))\). Since \(|\Delta _n| \le U\) by assumption, the statement \(\rho _n += \Delta _n\) of line 19 can be executed in \({\mathcal {O}}(U)\) time by iterating over the tuples \(\mathbf {t} \in \Delta _n\) and updating \(\rho _n(\mathbf {t})\) for each such tuple. (Recall that multiplicity lookup and modification in a GMR are \({\mathcal {O}}(1)\) operations.) The indexes associated with \(\rho _n\) (if any) are updated in time h(K, U). Therefore, the total time required to execute line 19 is \({\mathcal {O}}(U + h(K,U))\). We next bound the complexity of line 21. Computing \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) using the join index on \(\rho _m\) takes \({\mathcal {O}}(g(K,U))\) time. Furthermore, the number of tuples in \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) can be at most 2U. This is because \(|\Delta _p| \le U\) at any time during the execution: in the worst case, \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) can at most delete the tuples already present in \(\Delta _p\) (accounting for U tuples) and subsequently insert U new tuples (another U tuples), for at most 2U tuples in total. For each of the 2U resulting tuples, we update \(\Delta _p\) accordingly in \({\mathcal {O}}(1)\) time.
The total time to execute line 21 is hence \({\mathcal {O}}(2 \cdot U + g(K,U))\). Finally, using similar reasoning, the complexity of line 23 can be shown to be \({\mathcal {O}}(U)\). \(\square \)
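The \({\mathcal {O}}(U)\) cost of the step \(\rho _n += \Delta _n\) rests on \({\mathcal {O}}(1)\) multiplicity lookups and modifications. A minimal sketch, under the assumption (for illustration only) that a GMR is a dict from tuples to non-zero multiplicities:

```python
def add_delta(rho, delta):
    """Apply the delta GMR to rho in O(|delta|) multiplicity updates."""
    for tup, mult in delta.items():
        new_mult = rho.get(tup, 0) + mult  # O(1) lookup and modification
        if new_mult == 0:
            rho.pop(tup, None)             # multiplicity 0 means "absent"
        else:
            rho[tup] = new_mult

rho = {("x",): 2}
add_delta(rho, {("x",): -2, ("y",): 1})
# ("x",) cancels to multiplicity 0 and disappears; ("y",) is inserted
```

A deletion is simply a delta with negative multiplicity, matching the update model of the paper; the real (T, N)-representation additionally maintains the join and enumeration indexes, which is where the \(h(K,U)\) term comes from.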
Proofs of Sect. 5.1
Proof of Proposition 7
Because no infinite sequences of reduction steps are possible, it suffices to demonstrate local confluence:
Proposition 14
If \(\mathcal {H} \rightsquigarrow \mathcal {I}_1\) and \(\mathcal {H} \rightsquigarrow \mathcal {I}_2\), then there exists \(\mathcal {J}\) such that both \(\mathcal {I}_1 \rightsquigarrow ^* \mathcal {J}\) and \(\mathcal {I}_2 \rightsquigarrow ^* \mathcal {J}\).
Indeed, it is a standard result in the theory of rewriting systems that confluence (Proposition 7) and local confluence (Proposition 14) coincide when infinite sequences of reduction steps are impossible [5].
Before proving Proposition 14, we observe that the property of being isolated or being a conditional subset is preserved under reductions, in the following sense.
Lemma 11
Assume that \(\mathcal {H} \rightsquigarrow \mathcal {I}\). Then, \(\textit{pred}(\mathcal {I}) \subseteq \textit{pred}(\mathcal {H})\) and for every hyperedge e, we have \(\textit{ext}_{\mathcal {I}}(e) \subseteq \textit{ext}_{\mathcal {H}}(e)\), \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e)\), and \({\textit{isol}}_{\mathcal {H}}(e) \subseteq {\textit{isol}}_{\mathcal {I}}(e)\). Furthermore, if \(e \sqsubseteq _{\mathcal {H}} f\) then also \(e \sqsubseteq _{\mathcal {I}} f\).
Proof
First, observe that \(\textit{pred}(\mathcal {I}) \subseteq \textit{pred}(\mathcal {H})\), since reduction operators only remove predicates. This implies that \(\textit{ext}_{\mathcal {I}}(e) \subseteq \textit{ext}_{\mathcal {H}}(e)\) for every hyperedge e. Furthermore, because reduction operators only remove hyperedges and never add them, it is easy to see that \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e)\). Hence, if \(x \in {\textit{isol}}_{\mathcal {H}}(e)\) then \(x \not \in {\textit{jv}}_{\mathcal {H}}(e) \supseteq {\textit{jv}}_{\mathcal {I}}(e)\) and \(x \not \in \textit{var}(\textit{pred}(\mathcal {H})) \supseteq \textit{var}(\textit{pred}(\mathcal {I}))\). Therefore, \(x \in {\textit{isol}}_{\mathcal {I}}(e)\). As such, \({\textit{isol}}_{\mathcal {H}}(e) \subseteq {\textit{isol}}_{\mathcal {I}}(e)\).
Next, assume that \(e \sqsubseteq _{\mathcal {H}} f\). We need to show that \({\textit{jv}}_{\mathcal {I}}(e) \subseteq f\) and \(\textit{ext}_{\mathcal {I}}(e {\setminus } f) \subseteq f\). The first condition follows since \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e) \subseteq f\) where the last inclusion is due to \(e \sqsubseteq _{\mathcal {H}} f\). The second also follows since \(\textit{ext}_{\mathcal {I}}(e {\setminus } f) \subseteq \textit{ext}_{\mathcal {H}}(e {\setminus } f) \subseteq f\) where the last inclusion is due to \(e \sqsubseteq _{\mathcal {H}} f\). \(\square \)
Proof of Proposition 14
If \(\mathcal {I}_1 = \mathcal {I}_2\), then it suffices to take \(\mathcal {J} =\mathcal {I}_1 = \mathcal {I}_2\). Therefore, assume in the following that \(\mathcal {I}_1 \not = \mathcal {I}_2\). Then, necessarily \(\mathcal {I}_1\) and \(\mathcal {I}_2\) are obtained by applying two different reduction operations on \(\mathcal {H}\). We make a case analysis on the types of reductions applied.
(1) \({\textit{Case (ISO, ISO)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from hyperedge \(e_1\), while \(\mathcal {I}_2\) is obtained by removing nonempty \(X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_2)\) from \(e_2\) with \(X_1 \not = X_2\). There are two possibilities.
(1a) \(e_1 \not = e_2\). Then, \(e_2\) is still a hyperedge in \(\mathcal {I}_1\) and \(e_1\) is still a hyperedge in \(\mathcal {I}_2\). By Lemma 11, \({\textit{isol}}_{\mathcal {H}}(e_1) \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_1)\) and \({\textit{isol}}_{\mathcal {H}}(e_2) \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_2)\). Therefore, we can still remove \(X_2\) from \(\mathcal {I}_1\) by means of rule ISO and similarly remove \(X_1\) from \(\mathcal {I}_2\). Let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the result of removing \(X_2\) from \(\mathcal {I}_1\) (resp. \(\mathcal {I}_2\)). Then, \(\mathcal {J}_1 = \mathcal {J}_2\), and we can take this common triplet as \(\mathcal {J}\).
(1b) \(e_1 = e_2\). We show that \(X_2 {\setminus } X_1 \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\) and similarly \(X_1 {\setminus } X_2 \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_2 {\setminus } X_2)\). This suffices because we can then apply ISO to remove \(X_2 {\setminus } X_1\) from \(\mathcal {I}_1\) and \(X_1 {\setminus } X_2\) from \(\mathcal {I}_2\). In both cases, we reach the same triplet as removing \(X_1 \cup X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from \(\mathcal {H}\) (see Note 7).
To see that \(X_2 {\setminus } X_1 \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\), let \(x \in X_2 {\setminus } X_1\). We need to show \(x \not \in {\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\) and \(x \not \in \textit{var}(\textit{pred}(\mathcal {I}_1))\). Because \(x \in X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\), we know \(x \not \in {\textit{jv}}_{\mathcal {H}}(e_1)\). Then, since \(x \not \in X_1\), also \(x \not \in {\textit{jv}}_{\mathcal {H}}(e_1 {\setminus } X_1)\). By Lemma 11, \({\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1) \subseteq {\textit{jv}}_{\mathcal {H}}(e_1 {\setminus } X_1)\). Therefore, \(x \not \in {\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\). Furthermore, because \(x \in {\textit{isol}}_{\mathcal {H}}(e_1)\), we know \(x \not \in \textit{var}(\textit{pred}(\mathcal {H}))\). Since \(\textit{var}(\textit{pred}(\mathcal {I}_1)) \subseteq \textit{var}(\textit{pred}(\mathcal {H}))\) by Lemma 11, also \(x \not \in \textit{var}(\textit{pred}(\mathcal {I}_1))\).
\(X_1 {\setminus } X_2 \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_2 {\setminus } X_2)\) is shown similarly.
(2) \({\textit{Case (CSE, CSE)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing hyperedge \(e_1\) because it is a conditional subset of hyperedge \(f_1\), while \(\mathcal {I}_2\) is obtained by removing \(e_2\), a conditional subset of \(f_2\). Since \(\mathcal {I}_1 \not = \mathcal {I}_2\), necessarily \(e_1 \not = e_2\). We need to further distinguish the following cases.
(2a) \(e_1 \not = f_2\) and \(e_2 \not = f_1\). In this case, \(e_2\) and \(f_2\) remain hyperedges in \(\mathcal {I}_1\) while \(e_1\) and \(f_1\) remain hyperedges in \(\mathcal {I}_2\). Then, by Lemma 11, \(e_2 \sqsubseteq _{\mathcal {I}_1} f_2\) and \(e_1 \sqsubseteq _{\mathcal {I}_2} f_1\). Let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the triplet obtained by removing \(e_2\) from \(\mathcal {I}_1\) (resp. \(e_1\) from \(\mathcal {I}_2\)). Then, \(\mathcal {J}_1 = \mathcal {J}_2\), since clearly \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {J}_2)\) and both triplets result from removing \(e_1\) and \(e_2\) from \(\mathcal {H}\).
From this the result follows by taking \(\mathcal {J} = \mathcal {J}_1 = \mathcal {J}_2\).
(2b) \(e_1 \not = f_2\) but \(e_2 = f_1\). Then, \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and \(e_2 \sqsubseteq _{\mathcal {H}} f_2\) with \(f_2 \not = e_1\). It suffices to show that \(e_1 \sqsubseteq _{\mathcal {H}} f_2\) and \(e_1 {\setminus } f_2 = e_1 {\setminus } f_1\), because then the CSE step due to \(e_1 \sqsubseteq _{\mathcal {H}} f_1\) has the same effect as CSE on \(e_1 \sqsubseteq _{\mathcal {H}} f_2\), and we can apply the reasoning of case (2a) because \(e_1 \not = f_2\) and \(e_2 \not = f_2\).
We first show \(e_1 {\setminus } f_2 = e_1 {\setminus } f_1\). Let \(x \in e_1 {\setminus } f_2\) and suppose for the purpose of contradiction that \(x \in e_2 = f_1\). Then, since \(e_1 \not = e_2\), \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq f_2\) where the last inclusion is due to \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), contradicting \(x \not \in f_2\). Hence, \(e_1 {\setminus } f_2 \subseteq e_1 {\setminus } f_1\). Conversely, let \(x \in e_1 {\setminus } f_1\). Since \(f_1 = e_2\), \(x \not \in e_2\). Suppose for the purpose of contradiction that \(x \in f_2\). Because \(e_1 \not = f_2\), \(x \in {\textit{jv}}_{\mathcal {H}}(e_1) \subseteq e_2\) where the last inclusion is due to \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), contradicting \(x \not \in e_2\). Therefore, \(e_1 {\setminus } f_1 = e_1 {\setminus } f_2\).
To show that \(e_1 \sqsubseteq _{\mathcal {H}} f_2\), let \(x \in {\textit{jv}}_{\mathcal {H}}(e_1)\). Because \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), \(x \in e_2\). Because x occurs in two distinct hyperedges in \(\mathcal {H}\), also \(x \in {\textit{jv}}_{\mathcal {H}}(e_2)\). Then, because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), \(x \in f_2\). Hence, \({\textit{jv}}_{\mathcal {H}}(e_1) \subseteq f_2\). It remains to show \(\textit{ext}_{\mathcal {H}}(e_1 {\setminus } f_2) \subseteq f_2\). To this end, let \(x \in \textit{ext}_{\mathcal {H}}(e_1 {\setminus } f_2)\) and suppose for the purpose of contradiction that \(x \not \in f_2\). By definition of \(\textit{ext}\) there exists \(\theta \in \textit{pred}(\mathcal {H})\) and \(y \in \textit{var}(\theta ) \cap (e_1 {\setminus } f_2)\) such that \(x \in \textit{var}(\theta ) {\setminus } (e_1 {\setminus } f_2)\). In particular, \(y \not \in f_2\). Since \(e_1 {\setminus } f_2 = e_1 {\setminus } e_2\), \(y \in \textit{var}(\theta ) \cap (e_1 {\setminus } e_2)\) and \(x \in \textit{var}(\theta ) {\setminus } (e_1 {\setminus } e_2)\). Thus, \(x \in \textit{ext}_{\mathcal {H}}(e_1 {\setminus } e_2)\). Then, since \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), \(x \in e_2\). Thus, \(x \in e_2 {\setminus } f_2\) since \(x \not \in f_2\). Hence, \(x \in \textit{var}(\theta ) \cap (e_2 {\setminus } f_2)\). Furthermore, since \(y \not \in e_2\) also \(y \not \in e_2 {\setminus } f_2\). Hence, \(y \in \textit{var}(\theta ) {\setminus } (e_2 {\setminus } f_2)\). But then \(\theta \) shows that \(y \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } f_2)\). Then, because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), also \(y \in f_2\), which yields the desired contradiction.
(2c) \(e_1 = f_2\) but \(e_2 \not = f_1\). Similar to case (2b).
(2d) \(e_1 = f_2\) and \(e_2 = f_1\). Then, \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and \(e_2 \sqsubseteq _{\mathcal {H}} e_1\) and \(e_1 \not = e_2\). Let \(\mathcal {K}_1\) (resp. \(\mathcal {K}_2\)) be the triplet obtained by applying (FLT) to remove all \(\theta \in \textit{pred}(\mathcal {I}_1)\) for which \(\textit{var}(\theta ) \subseteq e_2\) (resp. all \(\theta \in \textit{pred}(\mathcal {I}_2)\) for which \(\textit{var}(\theta ) \subseteq e_1\)). Furthermore, let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the triplet obtained by applying (ISO) to remove \({\textit{isol}}_{\mathcal {K}_1}(e_2)\) from \(\mathcal {K}_1\) (resp. \({\textit{isol}}_{\mathcal {K}_2}(e_1)\) from \(\mathcal {K}_2\)). Here, we take \(\mathcal {J}_1 = \mathcal {K}_1\) if \({\textit{isol}}_{\mathcal {K}_1}(e_2)\) is empty (and similarly for \(\mathcal {J}_2\)). Then, clearly \(\mathcal {H} \rightsquigarrow \mathcal {I}_1 \rightsquigarrow ^* \mathcal {K}_1 \rightsquigarrow ^* \mathcal {J}_1\) and \(\mathcal {H} \rightsquigarrow \mathcal {I}_2 \rightsquigarrow ^* \mathcal {K}_2 \rightsquigarrow ^* \mathcal {J}_2\). The result then follows by showing that \(\mathcal {J}_1 = \mathcal {J}_2\). Toward this end, first observe that \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {K}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {K}_2) = {\textit{out}}(\mathcal {J}_2)\). Next, we show that \(\textit{pred}(\mathcal {J}_1) = \textit{pred}(\mathcal {J}_2)\). We first observe that \(\textit{pred}(\mathcal {J}_1) = \textit{pred}(\mathcal {K}_1)\) and \(\textit{pred}(\mathcal {J}_2) = \textit{pred}(\mathcal {K}_2)\) since the (ISO) operation does not remove predicates. Then, observe that \(\textit{pred}(\mathcal {K}_1) = \textit{pred}(\mathcal {K}_2)\).
We only show the reasoning for \(\textit{pred}(\mathcal {K}_1) \subseteq \textit{pred}(\mathcal {K}_2)\), the other direction being similar. Let \(\theta \in \textit{pred}(\mathcal {K}_1)\). Then, \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \) and \(\textit{var}(\theta ) \not \subseteq e_2\). Since \(\textit{var}(\theta ) \not \subseteq e_2\), there exists \(y \in \textit{var}(\theta ) {\setminus } e_2\). Then, because \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \), \(y \not \in e_1\). Thus, \(\textit{var}(\theta ) \not \subseteq e_1\). Now, suppose for the purpose of obtaining a contradiction that \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) \not = \emptyset \), and take \(z \in \textit{var}(\theta ) \cap (e_2 {\setminus } e_1)\). But then \(\theta \) witnesses \(y \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1)\). Hence, \(y \in e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\), which yields the desired contradiction with \(y \not \in e_1\). Therefore, \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), as desired. Hence, \(\theta \in \textit{pred}(\mathcal {K}_2)\).
It remains to show that \(\textit{hyp}(\mathcal {J}_1) = \textit{hyp}(\mathcal {J}_2)\). To this end, first observe that \(\textit{hyp}(\mathcal {J}_1) = (\textit{hyp}(\mathcal {H}) {\setminus } \{e_1, e_2\}) \cup \{e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2)\}\) and \(\textit{hyp}(\mathcal {J}_2) = (\textit{hyp}(\mathcal {H}) {\setminus } \{e_1, e_2\}) \cup \{e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\}\).
Clearly, \(\textit{hyp}(\mathcal {J}_1) = \textit{hyp}(\mathcal {J}_2)\) if \(e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) = e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).
We only show \(e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) \subseteq e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\), the other inclusion being similar. Let \(x \in e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2)\). Since \(x \not \in {\textit{isol}}_{\mathcal {K}_1}(e_2)\), one of the following holds.
\(x \in {\textit{out}}(\mathcal {K}_1)\). But then, \(x \in {\textit{out}}(\mathcal {K}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {K}_2)\). In particular, x is an equijoin variable in \(\mathcal {H}\) and \(\mathcal {K}_2\). Then, \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). From this and the fact that x remains an equijoin variable in \(\mathcal {K}_2\), we obtain \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).
x occurs in \(e_2\) and in some hyperedge g in \(\mathcal {K}_1\) with \(g \not = e_2\). Since \(e_1\) is not in \(\mathcal {K}_1\) also \(g \not = e_1\). Since every hyperedge in \(\mathcal {K}_1\) is in \(\mathcal {I}_1\) and every hyperedge in \(\mathcal {I}_1\) is in \(\mathcal {H}\), also g is in \(\mathcal {H}\). But then, x occurs in two distinct hyperedges in \(\mathcal {H}\), namely \(e_2\) and g, and hence \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). However, because x also occurs in g, which must also be in \(\mathcal {I}_2\) and therefore also in \(\mathcal {K}_2\), x also occurs in two distinct hyperedges in \(\mathcal {K}_2\), namely \(e_1\) and g. Therefore, \(x \in {\textit{jv}}_{\mathcal {K}_2}(e_1)\) and hence \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\), as desired.
\(x \in \textit{var}(\textit{pred}(\mathcal {K}_1))\). Then, there exists \(\theta \in \textit{pred}(\mathcal {K}_1)\) such that \(x \in \textit{var}(\theta )\). Since \(\textit{pred}(\mathcal {K}_1) =\textit{pred}(\mathcal {K}_2)\), \(\theta \in \textit{pred}(\mathcal {K}_2)\). As such, \(\theta \in \textit{pred}(\mathcal {H})\), \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), and \(\textit{var}(\theta ) \not \subseteq e_1\). But then, since \(x \in \textit{var}(\theta )\), \(x \in e_2\), and \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), it must be the case that \(x \in e_1\). As such, \(x \in e_1\) and \(x \in \textit{var}(\textit{pred}(\mathcal {K}_2))\). Hence, \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).
(3) \({\textit{Case (ISO, CSE)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set of isolated variables \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from \(e_1\), while \(\mathcal {I}_2\) is obtained by removing hyperedge \(e_2\), a conditional subset of hyperedge \(f_2\). We may assume w.l.o.g. that \(e_1 \not = {\textit{isol}}_{\mathcal {H}}(e_1)\): if \(e_1 = {\textit{isol}}_{\mathcal {H}}(e_1)\), then the (ISO) operation removes the complete hyperedge \(e_1\). However, because no predicate in \(\mathcal {H}\) shares any variable with \(e_1\), it is readily verified that \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and thus the removal of \(e_1\) can also be seen as an application of (CSE) on \(e_1\), and we are hence back in case (2).
Now reason as follows. Because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\) and because isolated variables of \(e_1\) occur in no other hyperedge in \(\mathcal {H}\), it must be the case that \(e_2 \cap X_1 = \emptyset \). In particular, \(e_1\) and \(e_2\) must hence be distinct. Therefore, \(e_1 \in \textit{hyp}(\mathcal {I}_2)\) and \(e_2 \in \textit{hyp}(\mathcal {I}_1)\). By Lemma 11, we can apply (ISO) on \(\mathcal {I}_2\) to remove \(X_1\) from \(e_1\). It then suffices to show that \(e_2\) remains a conditional subset of some hyperedge \(f'_2\) in \(\mathcal {I}_1\) with \(e_2 {\setminus } f_2 = e_2 {\setminus } f'_2\). Indeed, we can then use (CSE) to remove \(e_2\) from \(\textit{hyp}(\mathcal {I}_1)\) as well as predicates \(\theta \) with \(\textit{var}(\theta ) \cap (e_2 {\setminus } f_2) \not = \emptyset \) from \(\textit{pred}(\mathcal {I}_1)\). This clearly yields the same triplet as the one obtained by removing \(X_1\) from \(e_1\) in \(\mathcal {I}_2\). We need to distinguish two cases.
(3a) \(f_2 \not = e_1\). Then, \(f_2 \in \textit{hyp}(\mathcal {I}_1)\) and hence \(e_2 \sqsubseteq _{\mathcal {I}_1} f_2\) by Lemma 11. We hence take \(f'_2 = f_2\).
(3b) \(f_2 = e_1\). Then, we take \(f'_2 = e_1 {\setminus } X_1\). Since \(e_1 \not = {\textit{isol}}_{\mathcal {H}}(e_1)\), it follows that \(e_1 {\setminus } X_1 \not = \emptyset \). Therefore, \(f'_2 = e_1 {\setminus } X_1 \in \textit{hyp}(\mathcal {I}_1)\). Furthermore, since \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\), no variable in \(X_1\) occurs in any other hyperedge in \(\mathcal {H}\). In particular, \(X_1 \cap e_2 = \emptyset \). Therefore, \(e_2 {\setminus } f'_2 = e_2 {\setminus } (e_1 {\setminus } X_1) = (e_2 {\setminus } e_1) \cup (e_2 \cap X_1) = e_2 {\setminus } e_1 = e_2 {\setminus } f_2\). It remains to show that \(e_2 \sqsubseteq _{\mathcal {I}_1} e_1 {\setminus } X_1\).
\({\textit{jv}}_{\mathcal {I}_1}(e_2) \subseteq e_1 {\setminus } X_1\). Let \(x \in {\textit{jv}}_{\mathcal {I}_1}(e_2)\). By Lemma 11, \(x \in {\textit{jv}}_{\mathcal {I}_1}(e_2) \subseteq {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\), where the last inclusion is due to \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). In particular, x is an equijoin variable in \(\mathcal {H}\). But then it cannot be an isolated variable in any hyperedge. Therefore, \(x \not \in X_1\) and hence \(x \in e_1 {\setminus } X_1\).
\(\textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } e_1) \subseteq e_1 {\setminus } X_1\). Let \(x \in \textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } e_1)\). Then, \(x \in \textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } e_1) \subseteq \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1) \subseteq e_1\), where the first inclusion is by Lemma 11 and the second by \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). Then, because \(x \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1)\), it follows from the definition of \(\textit{ext}\) that x occurs in some predicate in \(\textit{pred}(\mathcal {H})\). However, \(X_1\) is disjoint with \(\textit{var}(\textit{pred}(\mathcal {H}))\) since it consists only of isolated variables. Therefore, \(x \not \in X_1\) and hence \(x \in e_1 {\setminus } X_1\).
(4) \({\textit{Case (ISO, FLT)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from hyperedge \(e_1\), while \(\mathcal {I}_2\) is obtained by removing all predicates in the nonempty set \(\Theta \subseteq \textit{pred}(\mathcal {H})\) with \(\textit{var}(\Theta ) \subseteq e_2\) for some hyperedge \(e_2\) in \(\textit{hyp}(\mathcal {H})\). Observe that \(e_1 \in \textit{hyp}(\mathcal {I}_2)\). By Lemma 11, \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1) \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_1)\). Therefore, we may apply reduction operation (ISO) on \(\mathcal {I}_2\) to remove \(X_1\) from \(e_1\). We will now show that, similarly, we may still apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1) = \textit{pred}(\mathcal {H})\). The two operations hence commute, and clearly, the resulting triplets in both cases are the same. We distinguish two possibilities. (i) \(e_1 \not = e_2\). Then, \(e_2 \in \textit{hyp}(\mathcal {I}_1)\), \(\textit{var}(\Theta ) \subseteq e_2\), and, since (ISO) does not remove predicates, \(\Theta \subseteq \textit{pred}(\mathcal {H}) = \textit{pred}(\mathcal {I}_1)\). As such, the (FLT) operation indeed applies to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1)\). (ii) \(e_1 = e_2\). Then, since \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) and isolated variables do not occur in any predicate, \(X_1 \cap \textit{var}(\Theta ) = \emptyset \). Then, since \(\textit{var}(\Theta ) \subseteq e_2 = e_1\), it follows that also \(\textit{var}(\Theta ) \subseteq e_1 {\setminus } X_1\). In particular, since we disallow nullary predicates and \(\Theta \) is nonempty, \(e_1 {\setminus } X_1 \not = \emptyset \).
Thus, \(e_1 {\setminus } X_1 \in \textit{hyp}(\mathcal {I}_1)\) and hence operation (FLT) indeed applies to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1)\).
(5) \({\textit{Case (CSE, FLT)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing hyperedge \(e_1\), a conditional subset of \(e_2\) in \(\mathcal {H}\), while \(\mathcal {I}_2\) is obtained by removing all predicates in the nonempty set \(\Theta \subseteq \textit{pred}(\mathcal {H})\) with \(\textit{var}(\Theta ) \subseteq e_3\) for some hyperedge \(e_3 \in \textit{hyp}(\mathcal {H})\). Since the (FLT) operation does not remove any hyperedges, \(e_1\) and \(e_2\) are in \(\textit{hyp}(\mathcal {I}_2)\). Then, since \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), also \(e_1 \sqsubseteq _{\mathcal {I}_2} e_2\) by Lemma 11. Therefore, we may apply reduction operation (CSE) on \(\mathcal {I}_2\) to remove \(e_1\) from \(\textit{hyp}(\mathcal {I}_2)\) as well as all predicates \(\theta \in \textit{pred}(\mathcal {I}_2)\) for which \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) \not = \emptyset \). Let \(\mathcal {J}_2\) be the triplet resulting from this operation. We will show that, similarly, we may apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\) from \(\textit{pred}(\mathcal {I}_1)\), resulting in a triplet \(\mathcal {J}_1\). Observe that necessarily \(\mathcal {J}_1 = \mathcal {J}_2\) (and hence they form the triplet \(\mathcal {J}\)). Indeed, \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {J}_2)\) since reduction operations never modify output variables. Moreover, \(\textit{hyp}(\mathcal {J}_1) = \textit{hyp}(\mathcal {I}_1) = \textit{hyp}(\mathcal {H}) {\setminus } \{e_1\} = \textit{hyp}(\mathcal {I}_2) {\setminus } \{e_1\} = \textit{hyp}(\mathcal {J}_2)\), where the first and third equality are due to the fact that (FLT) does not modify the hypergraph of the triplet it operates on. Finally, observe that \(\textit{pred}(\mathcal {J}_1) = \textit{pred}(\mathcal {H}) {\setminus } (\Theta \cup \{\theta \in \textit{pred}(\mathcal {H}) \mid \textit{var}(\theta ) \cap (e_1 {\setminus } e_2) \not = \emptyset \}) = \textit{pred}(\mathcal {J}_2)\).
It remains to show that we may apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\), resulting in a triplet \(\mathcal {J}_1\). There are two possibilities.
\(e_3 \not = e_1\). Then, \(e_3 \in \textit{hyp}(\mathcal {I}_1)\), \(\Theta \cap \textit{pred}(\mathcal {I}_1) \subseteq \textit{pred}(\mathcal {I}_1)\), and \(\textit{var}(\Theta \cap \textit{pred}(\mathcal {I}_1)) \subseteq \textit{var}(\Theta ) \subseteq e_3\). Hence, the (FLT) operation indeed applies to \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\).
\(e_3 = e_1\). In this case, we claim that for every \(\theta \in \Theta \cap \textit{pred}(\mathcal {I}_1)\), we have \(\textit{var}(\theta ) \subseteq e_2\). As such, \(\textit{var}(\Theta \cap \textit{pred}(\mathcal {I}_1)) \subseteq e_2\). Since \(e_2 \in \textit{hyp}(\mathcal {I}_1)\) and \(\Theta \cap \textit{pred}(\mathcal {I}_1) \subseteq \textit{pred}(\mathcal {I}_1)\), we may hence apply (FLT) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\) from \(\mathcal {I}_1\). Concretely, let \(\theta \in \Theta \cap \textit{pred}(\mathcal {I}_1)\). Because, in order to obtain \(\mathcal {I}_1\), (CSE) removes all predicates from \(\mathcal {H}\) that share a variable with \(e_1 {\setminus } e_2\), we have \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \). Moreover, because \(\theta \in \Theta \), \(\textit{var}(\theta ) \subseteq e_1\). Hence, \(\textit{var}(\theta ) \subseteq e_2\), as desired.
The remaining cases, (CSE, ISO), (FLT, ISO), and (FLT, CSE), are symmetric to cases (3), (4), and (5), respectively. \(\square \)
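To make the commutation arguments above easier to follow, the three reduction operations can be sketched concretely. The following is an illustrative sketch (not the paper's implementation): a triplet \((\textit{hyp}, {\textit{out}}, \textit{pred})\) is modeled with hyperedges and predicate variable sets as Python frozensets, and all helper names (`jv`, `ext`, `isol`, `cse`, `iso`, `flt`, `cond_subset`) are our own.

```python
# Illustrative sketch of triplet reductions: a triplet is (hyp, out, preds),
# with hyperedges and predicate variable sets modeled as frozensets.
from itertools import chain

def jv(triplet, e):
    """Join variables of hyperedge e: variables of e that are output
    variables or also occur in another hyperedge of the triplet."""
    hyp, out, _ = triplet
    others = set(chain.from_iterable(f for f in hyp if f != e))
    return e & (others | out)

def ext(triplet, xs):
    """Variables outside xs that some predicate links to a variable in xs."""
    _, _, preds = triplet
    return set(chain.from_iterable(th - xs for th in preds if th & xs))

def cond_subset(triplet, e, f):
    """e is a conditional subset of f (written e ⊑ f in the text)."""
    return jv(triplet, e) <= f and ext(triplet, e - f) <= f

def cse(triplet, e, f):
    """(CSE): remove e ⊑ f together with every predicate meeting e \\ f."""
    hyp, out, preds = triplet
    assert e != f and e in hyp and f in hyp and cond_subset(triplet, e, f)
    return (hyp - {e}, out, {th for th in preds if not th & (e - f)})

def flt(triplet, e):
    """(FLT): remove all filter predicates whose variables lie within e."""
    hyp, out, preds = triplet
    assert e in hyp
    return (hyp, out, {th for th in preds if not th <= e})

def isol(triplet, e):
    """Isolated variables of e: in no other hyperedge, in no predicate,
    and not output variables."""
    hyp, out, preds = triplet
    others = set(chain.from_iterable(f for f in hyp if f != e))
    return e - others - out - set(chain.from_iterable(preds))

def iso(triplet, e, xs):
    """(ISO): remove the isolated variables xs from e (e disappears
    entirely when xs == e)."""
    hyp, out, preds = triplet
    assert xs and xs <= isol(triplet, e)
    rest = e - xs
    return ((hyp - {e}) | ({rest} if rest else set()), out, preds)
```

For example, with \(e_1 = \{a, b\}\), \(e_2 = \{a, b, c\}\), output variable c, and one predicate over \(\{b, c\}\), `cse` removes \(e_1\) without touching the predicate (since \(e_1 {\setminus } e_2 = \emptyset \)), after which `flt` and `iso` empty the triplet further, mirroring the \(\mathcal {K}\) and \(\mathcal {J}\) steps of case (2d).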
Proof of Proposition 9
Proposition 9 For every GJT pair there exists an equivalent canonical pair.
Proof
Let T be a GJT. The proof proceeds in three steps.
Step 1 Let \(T_1\) be the GJT obtained from T by (i) removing all predicates from T, and (ii) creating a new root node r that is labeled by \(\emptyset \) and attaching the root of T to it, labeled by the empty set of predicates. \(T_1\) satisfies the first canonicality condition, but is not equivalent to T because it has none of T's predicates. Now re-add the predicates in T to \(T_1\) as follows. For each edge \(m \rightarrow n\) in T and each predicate \(\theta \in \textit{pred}_T(m \rightarrow n)\), if \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not =\emptyset \), then add \(\theta \) to \(\textit{pred}_{T_1}(m \rightarrow n)\). Otherwise, if \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \), do the following. First, observe that, by definition of GJTs, \(\textit{var}(\theta ) \subseteq \textit{var}(n) \cup \textit{var}(m)\). Because \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \), this implies \(\textit{var}(\theta ) \subseteq \textit{var}(m)\). Because we disallow nullary predicates, \(\textit{var}(m) \not = \emptyset \). Let a be the first ancestor of m in \(T_1\) such that \(\textit{var}(\theta ) \not \subseteq \textit{var}(a)\). Such an ancestor exists because the root of \(T_1\) is labeled \(\emptyset \). Let b be the child of a on the path from a to m (possibly m itself). Since a is the first ancestor of m with \(\textit{var}(\theta ) \not \subseteq \textit{var}(a)\), \(\textit{var}(\theta ) \subseteq \textit{var}(b)\). Therefore, \(\textit{var}(\theta ) \subseteq \textit{var}(b) \cup \textit{var}(a)\) and \(\textit{var}(\theta ) \cap (\textit{var}(b) {\setminus } \textit{var}(a)) \not = \emptyset \). As such, add \(\theta \) to \(\textit{pred}_{T_1}(a \rightarrow b)\). After having done this for all predicates in T, \(T_1\) becomes equivalent to T and satisfies canonicality conditions (1) and (4). Then, take \(N_1 = N \cup \{r\}\).
Clearly, \(N_1\) is a connex subset of \(T_1\) and \(\textit{var}(N_1) = \textit{var}(N)\).
Therefore, \((T_1,N_1)\) is equivalent to (T, N).
Step 2 Let \(T_2\) be obtained from \(T_1\) by adding, for each leaf node l in \(T_1\), a new interior node \(n_l\) labeled by \(\textit{var}(l)\) and inserting it in between l and its parent in \(T_1\), i.e., if l has parent p in \(T_1\), then we have \(p \rightarrow n_l \rightarrow l\) in \(T_2\) with \(\textit{pred}_{T_2}(p \rightarrow n_l) = {\textit{pred}}_{T_1}(p \rightarrow l)\) and \(\textit{pred}_{T_2}(n_l \rightarrow l)= \emptyset \). Furthermore, let \(N_2\) be the connex subset of \(T_2\) obtained by replacing every leaf node l in \(N_1\) by its newly inserted node \(n_l\). Clearly, \(\textit{var}(N_2) = \textit{var}(N_1) = \textit{var}(N)\) because \(\textit{var}(l) = \textit{var}(n_l)\) for every leaf l of \(T_1\). By our construction, \((T_2, N_2)\) is equivalent to (T, N); \(T_2\) satisfies canonicality conditions (1), (2), and (4); and \(N_2\) is canonical.
Step 3 It remains to enforce condition (3). To this end, observe that, by the connectedness condition of GJTs, \(T_2\) violates canonicality condition (3) if and only if there exist internal nodes m and n where m is the parent of n such that \(\textit{var}(m) = \textit{var}(n)\). In this case, we call n a culprit node. We will now show how to obtain an equivalent pair (U, M) that removes a single culprit node; the final result is then obtained by iterating this reasoning until all culprit nodes have been removed.
The culprit removal procedure is essentially the reverse of the binarization procedure of Fig. 9. Concretely, let n be a culprit node with parent m and let \(n_1,\ldots , n_k\) be the children of n in \(T_2\). Let U be the GJT obtained from \(T_2\) by removing n and attaching all children \(n_i\) of n as children to m with edge label \(\textit{pred}_U(m \rightarrow n_i) = \textit{pred}_{T_2}(n \rightarrow n_i)\), for \(1 \le i \le k\). Because \(\textit{var}(n) = \textit{var}(m)\), the result is still a valid GJT. Moreover, because \(\textit{var}(n) = \textit{var}(m)\) and \(T_2\) satisfied condition (4), we had \({\textit{pred}}_{T_2}(m \rightarrow n) = \emptyset \), so no predicate was lost by the removal of n. Finally, define M as follows. If \(n \in N_2\), then set \(M = N_2 {\setminus } \{n\}\); otherwise, set \(M = N_2\). In the former case, since \(N_2\) is connex and \(n \in N_2\), m must also be in \(N_2\). It is hence in M. Therefore, in both cases, \(\textit{var}(N) = \textit{var}(N_2) = \textit{var}(M)\). Furthermore, it is straightforward to check that M is a connex subset of U. Finally, since \(N_2\) consisted only of interior nodes of \(T_2\), M consists only of interior nodes of U and hence remains canonical. \(\square \)
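The culprit-removal step can be sketched as a small tree transformation. The following is a hypothetical illustration (class and function names are ours, not the paper's): a GJT node carries its variable label, the predicates on the edge to its parent, and its children; a culprit child is spliced out and its subtrees are re-attached, keeping their edge labels.

```python
# Hypothetical sketch of culprit removal: splice out interior children
# whose variable label equals that of their parent.
from dataclasses import dataclass, field

@dataclass
class Node:
    var: frozenset                       # variable label var(n)
    preds: frozenset = frozenset()       # predicates on edge parent -> n
    children: list = field(default_factory=list)

def remove_culprits(m: Node) -> None:
    """Splice out every interior child n of m with var(n) == var(m),
    re-attaching n's children (with their edge labels) directly to m."""
    new_children = []
    for n in m.children:
        if n.children and n.var == m.var:
            # By canonicality condition (4), the edge m -> n carries no
            # predicates, so nothing is lost by dropping it.
            assert not n.preds
            new_children.extend(n.children)
        else:
            new_children.append(n)
    m.children = new_children
    for c in m.children:
        remove_culprits(c)
```

Re-attachment preserves validity precisely because \(\textit{var}(n) = \textit{var}(m)\): every relocated edge label remains well-formed with respect to its new parent.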
Proof of Lemma 5
We first require a number of auxiliary results.
We first make the following observations regarding canonical GJT pairs.
Lemma 12
Let (T, N) be a canonical GJT pair, let n be a frontier node of N and let m be the parent of n in T.
 1.
\(x \not \in \textit{var}(N {\setminus } \{n\})\), for every \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\).
 2.
\(\textit{hyp}(T, N {\setminus } \{n\}) = \textit{hyp}(T, N) {\setminus } \{\textit{var}(n)\}\).
 3.
\(\theta \not \in \textit{pred}(m \rightarrow n)\), for every \(\theta \in \textit{pred}(T, N {\setminus } \{n\})\).
 4.
\(\textit{pred}(T, N {\setminus } \{n\}) = \textit{pred}(T, N) {\setminus } \textit{pred}(m \rightarrow n)\).
 5.
\(\textit{pred}(m \rightarrow n) = \{ \theta \in {\textit{pred}}(T, N) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \}\).
 6.
\({\textit{pred}}(T, N {\setminus } \{n\}) = \{ \theta \in {\textit{pred}}(T, N) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \}\).
Proof

(1)
Let \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\) and let c be a node in \(N {\setminus } \{n\}\). Clearly the unique undirected path between c and n in T must pass through m. Because \(x \not \in \textit{var}(m)\), it follows from the connectedness condition of GJTs that also \(x \not \in \textit{var}(c)\). As such, \(x \not \in \textit{var}(N {\setminus } \{n\})\).

(2)
The \(\supseteq \) direction is trivial. For the \(\subseteq \) direction, let \(c \in N {\setminus } \{n\}\) with \(\textit{var}(c) \not = \emptyset \). Then, clearly \(c \in N\) and hence \(\textit{var}(c) \in \textit{hyp}(T,N)\). Furthermore, because N is canonical, both c and n are interior nodes in T. Then, because T is canonical and \(c \not = n\), we have \(\textit{var}(c) \not =\textit{var}(n)\). Therefore, \(\textit{var}(c) \in \textit{hyp}(T,N) {\setminus } \{\textit{var}(n)\}\).

(3)
Let \(\theta \in {\textit{pred}}(T, N {\setminus } \{n\})\). Then, \(\theta \) occurs on the edge between two nodes in \(N {\setminus } \{n\}\), say \(m' \rightarrow n'\). By definition of GJTs, \(\textit{var}(\theta ) \subseteq \textit{var}(n') \cup \textit{var}(m') \subseteq \textit{var}(N {\setminus } \{n\})\). Now suppose for the purpose of contradiction that also \(\theta \in {\textit{pred}}(m \rightarrow n)\). Because T is canonical, there is some \(x \in \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m))\). Hence, by (1), \(x \not \in \textit{var}(N {\setminus } \{n\})\), which contradicts \(\textit{var}(\theta ) \subseteq \textit{var}(N {\setminus } \{n\})\).

(4)
Clearly, \(\textit{pred}(T,N) {\setminus } \textit{pred}(m \rightarrow n) \subseteq \textit{pred}(T, N {\setminus } \{n\})\). The converse inclusion follows from (3).

(5)
The \(\subseteq \) direction follows from the fact that m and n are in N and T is canonical. To also see \(\supseteq \), let \(\theta \in {\textit{pred}}(T, N)\) with \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \). There exists \(x \in \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m))\). By (1), \(x \not \in \textit{var}(N {\setminus } \{n\})\). Therefore, \(\theta \) cannot occur on an edge between two nodes in \(N {\setminus } \{n\}\) in T. Since it nevertheless occurs in \({\textit{pred}}(T,N)\), it must hence occur in \(\textit{pred}(m \rightarrow n)\).

(6)
Follows directly from (4) and (5).
\(\square \)
Lemma 13
Let (T, N) be a canonical GJT pair, let n be a frontier node of N and let m be the parent of n in T. Let \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\).
 1.
\(\textit{var}(n) \sqsubseteq _{{\mathcal {H}}(T, N, \overline{z})} \textit{var}(m)\).
 2.
\(x \not \in {\textit{jv}}({\mathcal {H}}(T, N, \overline{z}))\), for every \(x \in (\textit{var}(n) {\setminus } \textit{var}(m))\).
Proof
For reasons of parsimony, let \(\mathcal {H} = {\mathcal {H}}(T, N, \overline{z})\). We first prove (2) and then (1).
(2) Let \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\). By Lemma 12(1), \(x \not \in \textit{var}(N {\setminus } \{n\})\). Therefore, x occurs in \(\textit{var}(n)\) in \(\mathcal {H}\) and in no other hyperedge. Furthermore, because \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\), also \(x \not \in \overline{z}\). Hence, \(x \not \in {\textit{jv}}_{\mathcal {H}}(\textit{var}(n))\).
(1) We need to show that \({\textit{jv}}_{\mathcal {H}}(\textit{var}(n)) \subseteq \textit{var}(m)\) and \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\). Let \(x \in {\textit{jv}}_{\mathcal {H}}(\textit{var}(n))\). By contraposition of (2), we know that \(x \not \in (\textit{var}(n) {\setminus } \textit{var}(m))\). Therefore, \(x \in \textit{var}(m)\) and thus \({\textit{jv}}_{\mathcal {H}}(\textit{var}(n)) \subseteq \textit{var}(m)\). To show \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\), let \(y \in \textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m))\). Then, \(y \not \in \textit{var}(n) {\setminus } \textit{var}(m)\) and there exists \(\theta \in {\textit{pred}}(T, N)\) with \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \) and \(y \in \textit{var}(\theta )\). By Lemma 12(5), \(\theta \in {\textit{pred}}_T(m \rightarrow n)\). Thus, \(y \in \textit{var}(m) \cup \textit{var}(n)\). Since also \(y \not \in \textit{var}(n) {\setminus } \textit{var}(m)\), it follows that \(y \in \textit{var}(m)\). Therefore, \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\). \(\square \)
Lemma 14
Let (T, N) be a canonical GJT pair and let n be a frontier node of N. Then, \({\mathcal {H}}(T,N, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T,N {\setminus }\{n\}, \overline{z})\) for every \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\).
Proof
For reasons of parsimony, let us abbreviate \(\mathcal {H}_1 = {\mathcal {H}}(T, N, \overline{z})\) and \(\mathcal {H}_2 = {\mathcal {H}}(T, N {\setminus }\{n\}, \overline{z})\). We make the following case analysis.
Case 1 Node n is the root in N. Because the root of a canonical tree is labeled by \(\emptyset \), we have \(\textit{var}(n) = \emptyset \). Since n is a frontier node of N, \(N = \{n\}\). Thus, \(\textit{hyp}(T, N) = \emptyset \) and \(\textit{hyp}(T, N {\setminus } \{n\}) = \emptyset \). Furthermore, \(\textit{pred}(T, N) = \textit{pred}(T, N {\setminus } \{n\}) = \emptyset \) and \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\}) = \textit{var}(\emptyset ) = \emptyset \). As such, both \(\mathcal {H}_1\) and \(\mathcal {H}_2\) are the empty triplet \((\emptyset , \emptyset , \emptyset )\). Therefore, \(\mathcal {H}_1 \rightsquigarrow ^* \mathcal {H}_2\).
Case 2 n has parent m in N and \(\textit{var}(m) \not = \emptyset \). Then, \(\textit{var}(n) \not = \emptyset \) since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. Therefore, \(\textit{var}(n) \in \textit{hyp}(T, N)\), \(\textit{var}(m) \in \textit{hyp}(T, N)\), and \(\textit{var}(n) \sqsubseteq _{\mathcal {H}_1} \textit{var}(m)\) by Lemma 13(1). We can hence apply reduction (CSE) to remove \(\textit{var}(n)\) from \(\textit{hyp}(\mathcal {H}_1)\) and all predicates that intersect with \(\textit{var}(n) {\setminus } \textit{var}(m)\) from \(\textit{pred}(\mathcal {H}_1)\). By Lemma 12(2) and 12(6), the result is exactly \(\mathcal {H}_2\).
Case 3 n has parent m in N and \(\textit{var}(m) = \emptyset \). Then, \(\textit{var}(n) \not = \emptyset \) since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. By definition of GJTs, it follows that for every \(\theta \in {\textit{pred}}(m \rightarrow n)\), we have \(\textit{var}(\theta ) \subseteq \textit{var}(n) \cup \textit{var}(m) = \textit{var}(n)\). In other words: all \(\theta \in {\textit{pred}}(m \rightarrow n)\) are filters. As such, we can use reduction (FLT) to remove all predicates in \({\textit{pred}}(m \rightarrow n)\) from \(\mathcal {H}_1\). This yields a triplet \(\mathcal {I}\) with the same hypergraph as \(\mathcal {H}_1\), same set of output variables as \(\mathcal {H}_1\), and \(\textit{pred}(\mathcal {I}) = \textit{pred}(\mathcal {H}_1) {\setminus } \textit{pred}(m \rightarrow n) = \textit{pred}(T, N) {\setminus } \textit{pred}(m \rightarrow n) = \textit{pred}(T, N {\setminus } \{n\}) = \textit{pred}(\mathcal {H}_2)\),
where the third equality is due to Lemma 12(4). We claim that every variable in \(\textit{var}(n)\) is isolated in \(\mathcal {I}\). From this the result follows, because then we can apply (ISO) to remove the entire hyperedge \(\textit{var}(n)\) from \(\textit{hyp}(\mathcal {I}) = \textit{hyp}(\mathcal {H}_1)\) while preserving \({\textit{out}}(\mathcal {I})\) and \(\textit{pred}(\mathcal {I})\). The resulting triplet hence equals \(\mathcal {H}_2\). To see that \(\textit{var}(n) \subseteq {\textit{isol}}_{\mathcal {I}}(\textit{var}(n))\), observe that no predicate in \(\textit{pred}(\mathcal {I}) = \textit{pred}(T, N {\setminus } \{n\})\) shares a variable with \(\textit{var}(n) = \textit{var}(n) {\setminus } \textit{var}(m)\) by Lemma 12(6). Therefore, \(\textit{var}(n) \cap \textit{var}(\textit{pred}(\mathcal {I})) = \emptyset \). Furthermore, \(\textit{var}(n) \cap {\textit{jv}}(\mathcal {I}) = \emptyset \) because \({\textit{jv}}(\mathcal {I}) = {\textit{jv}}(\mathcal {H}_1)\) and no \(x \in \textit{var}(n) = \textit{var}(n) {\setminus } \textit{var}(m)\) is in \({\textit{jv}}(\mathcal {H}_1)\) by Lemma 13(2). \(\square \)
Lemma 5 Let \((T,N_1)\) and \((T,N_2)\) be canonical GJT pairs with \(N_2 \subseteq N_1\). Then, \({\mathcal {H}}(T,N_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T,N_2, \overline{z})\) for every \(\overline{z} \subseteq \textit{var}(N_2)\).
Proof
By induction on k, the number of nodes in \(N_1 {\setminus } N_2\). In the base case where \(k = 0\), the result trivially holds since then \(N_1 = N_2\) and the two triplets are identical. For the induction step, assume that \(k > 0\) and the result holds for \(k-1\). Because both \(N_1\) and \(N_2\) are connex subsets of the same tree T, there exists a node \(n \in N_1\) that is a frontier node in \(N_1\) and which is not in \(N_2\). Then, define \(N'_1 = N_1 {\setminus } \{n\}\). Clearly, \((T, N'_1)\) is again canonical, and \(|N'_1 {\setminus } N_2| = k-1\). Therefore, \({\mathcal {H}}(T, N'_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T, N_2, \overline{z})\) by the induction hypothesis. Furthermore, \({\mathcal {H}}(T, N_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T, N'_1, \overline{z})\) by Lemma 14, from which the result follows. \(\square \)
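Read procedurally, this induction is a peeling loop: while the current connex set differs from \(N_2\), pick a frontier node outside \(N_2\) and remove it, each removal being one application of Lemma 14. A small illustrative sketch (our own naming, with the tree given as a child-list mapping):

```python
# Illustrative sketch of the Lemma 5 induction: peel frontier nodes of
# the connex set n1 until only n2 remains.
def peel(children, n1, n2):
    """Shrink connex set n1 down to n2 (with n2 ⊆ n1), one frontier
    node at a time; `children` maps each node to its child list."""
    current = set(n1)
    while current != set(n2):
        # A frontier node of `current`: none of its children lie in it.
        n = next(v for v in current - set(n2)
                 if all(c not in current for c in children.get(v, [])))
        current.remove(n)  # one application of Lemma 14
    return current
```

Connexity guarantees the `next(...)` call always finds a frontier node outside \(N_2\), so the loop terminates after exactly \(|N_1 {\setminus } N_2|\) iterations.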
Proof of Lemma 6
Lemma 6 Let \(H_1\) and \(H_2\) be two hypergraphs such that for all \(e \in H_2\) there exists \(\ell \in H_1\) such that \(e \subseteq \ell \). Then, \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1, \overline{z}, \Theta )\), for every hyperedge \(\overline{z}\) and set of predicates \(\Theta \).
Proof
The proof is by induction on k, the number of hyperedges in \(H_2 {\setminus } H_1\). In the base case where \(k = 0\), the result trivially holds since \(H_1 \cup H_2 = H_1\) and the two triplets are hence identical. For the induction step, assume that \(k > 0\) and the result holds for \(k-1\). Fix some \(e \in H_2 {\setminus } H_1\) and define \(H'_2 = H_2 {\setminus } \{e\}\). Then, \(|H'_2 {\setminus } H_1| = k-1\). We show that \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1 \cup H'_2, \overline{z}, \Theta )\), from which the result follows since \((H_1 \cup H'_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1, \overline{z}, \Theta )\) by the induction hypothesis. To this end, we observe that there exists \(\ell \in H_1\) with \(e \subseteq \ell \); since \(e \not \in H_1\), necessarily \(\ell \not = e\). Therefore, \({\textit{jv}}_{(H_1 \cup H_2, \overline{z}, \Theta )}(e) \subseteq e \subseteq \ell \). Moreover, \(e {\setminus } \ell = \emptyset \). Therefore, \(\textit{ext}_{(H_1\cup H_2, \overline{z}, \Theta )}(e {\setminus } \ell ) = \emptyset \subseteq \ell \). Thus, \(e \sqsubseteq _{(H_1 \cup H_2, \overline{z}, \Theta )} \ell \). We may, therefore, apply (CSE) to remove e from \(H_1 \cup H_2\), yielding \(H_1 \cup H'_2\). Since no predicate shares variables with \(e {\setminus } \ell = \emptyset \), this does not modify \(\Theta \). Therefore, \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1 \cup H'_2, \overline{z}, \Theta )\). \(\square \)
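Operationally, the induction above simply deletes the subsumed hyperedges of \(H_2\) one at a time, each deletion being a (CSE) step whose witness \(\ell \) lies in \(H_1\); because \(e {\setminus } \ell = \emptyset \), the predicate set \(\Theta \) is never touched. A small sketch with hyperedges as frozensets (the function name is ours):

```python
# Illustrative sketch of Lemma 6 read as a procedure: reduce H1 ∪ H2
# to H1 by repeated (CSE) steps on subsumed hyperedges.
def drop_subsumed(h1, h2):
    """Reduce H1 ∪ H2 to H1, assuming every e ∈ H2 has some ℓ ∈ H1
    with e ⊆ ℓ (the hypothesis of the lemma)."""
    hyp = set(h1) | set(h2)
    for e in set(h2) - set(h1):
        assert any(e <= l for l in h1)  # the witness ℓ of the proof
        hyp.discard(e)                  # one (CSE) application; e \ ℓ = ∅,
                                        # so no predicate is removed
    return hyp
```

Since each removed hyperedge is contained in a hyperedge of \(H_1\), which itself is never removed, every intermediate step remains a valid (CSE) application.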
Proofs of Section 5.2
Lemma 7 Let n be a violator of type 1 in (T, N) and assume \((T,N) \xrightarrow {1,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to (T, N). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in (T, N).
Proof
The lemma follows from the following observations. (1) It is straightforward to verify that \(T'\) is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node p) continue to have a guard child; preserves the connectedness condition for the relocated children of n, because every variable of n is present on the entire path between n and p; and keeps all edge labels valid (for the relocated nodes this is because \(\textit{var}(p) = \textit{var}(g) \subseteq \textit{var}(n)\)).
(2) \(N'\) is a connex subset of \(T'\) because the subtree of T induced by N equals the subtree of \(T'\) induced by \(N'\), modulo the replacement of l by p in case l was in N (and p is hence in \(N'\)).
(3) (T, N) is equivalent to \((T', N')\) because the construction leaves leaf atoms untouched, preserves edge labels, and ensures \(\textit{var}(N) = \textit{var}(N')\). The latter is clear if \(l \not \in N\), since then \(N = N'\); if \(l \in N\), in which case \(N'= (N{\setminus } \{l\}) \cup \{p\}\), it follows from the fact that \(\textit{var}(l) = \textit{var}(p)\).
(4) All nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) (and their descendants) are relocated to p in \(T'\). Therefore, n is no longer a violator in \((T', N')\). Because we do not introduce new violators, the number of violators of \((T', N')\) is strictly smaller than the number of violators of (T, N). \(\square \)
Lemma 8 Let n be a violator of type 2 in (T, N) and assume \((T,N) \xrightarrow {2,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to (T, N). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in (T, N).
Proof
The lemma follows from the following observations. (1) It is straightforward to verify that \(T'\) is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node p) continue to have a guard child; preserves the connectedness condition for the relocated children of n, because every variable of n is also present in p, their new parent; and keeps all edge labels valid (for the relocated nodes this is because \(\textit{var}(p) = \textit{var}(n)\)).
(2) \(N'\) is a connex subset of \(T'\) because (i) the subtree of T induced by N equals the subtree of \(T'\) induced by \(N' {\setminus } \{p\}\), (ii) \(n \in N\), and (iii) p is a child of n in \(T'\). Therefore, \(N'\) must be connex.
(3) (T, N) is equivalent to \((T', N')\) because the construction leaves leaf atoms untouched, preserves edge labels, and \(\textit{var}(N) = \textit{var}(N')\). The latter follows because \(\textit{var}(N') = \textit{var}(N \cup \{p\})\) and because \(\textit{var}(p) = \textit{var}(n) \subseteq \textit{var}(N)\) since \(n \in N\).
(4) All nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) (and their descendants) are relocated to p in \(T'\). Therefore, n is no longer a violator in \((T', N')\). Because we do not introduce new violators, the number of violators of \((T', N')\) is strictly smaller than the number of violators of (T, N). \(\square \)
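Lemmas 7 and 8 together yield a terminating rewriting procedure: each type-1 or type-2 step strictly decreases the number of violators, so a GJT pair without violators is reached after at most as many steps as there are violators initially. A minimal sketch of this fixpoint loop (in Python; `count_violators` and `apply_step` are hypothetical stand-ins for the violator count and the type-1/type-2 rewrite steps, which are not reproduced here):

```python
def eliminate_violators(pair, count_violators, apply_step):
    """Repeatedly apply a violator-eliminating rewrite step. By
    Lemmas 7 and 8, every step strictly decreases the violator count,
    so the loop terminates in at most count_violators(pair) steps."""
    steps = 0
    while count_violators(pair) > 0:
        before = count_violators(pair)
        pair = apply_step(pair)  # type-1 or type-2 rewrite of Sect. 5.2
        # This is exactly the progress guarantee of Lemmas 7 and 8.
        assert count_violators(pair) < before
        steps += 1
    return pair, steps

# Toy instantiation: the "pair" is abstracted to its violator count,
# and each rewrite step removes exactly one violator.
pair, steps = eliminate_violators(3, lambda p: p, lambda p: p - 1)
assert pair == 0 and steps == 3
```

The strict-decrease assertion is what makes the measure argument go through: no rewrite sequence can loop, and the output is a violator-free (hence simple) GJT pair equivalent to the input.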
Description of competing systems
DBToaster DBToaster (henceforth denoted DBT) is a state-of-the-art implementation of HIVM. It operates in pull-based mode and can deal with randomly ordered update streams. DBT is careful to materialize only useful views, which makes it an interesting implementation to compare against. It has been extensively tested on equi-join queries and has proven to be more efficient than a commercial database management system, a commercial stream processing system, and an IVM implementation [30]. DBT compiles given SQL statements into executable trigger programs in different programming languages. We compare against the trigger programs generated in Scala by DBToaster Release 2.2,^{Footnote 10} which use actors^{Footnote 11} to generate events from the input files. During our experiments, however, we found that these actors create unnecessary memory overhead. For a fair memory-wise comparison, we have therefore removed them.
Esper Esper (E) is a CER engine with a relational model based on Stanford STREAM [4]. It is push-based and can deal with randomly ordered update streams. We use the Java-based open-source implementation^{Footnote 12} for our comparisons. Esper processes queries expressed in the Esper event processing language (EPL).
SASE SASE (SE) is an automaton-based CER system. It operates in push-based mode and can deal with temporally ordered update streams only. We use the publicly available Java-based implementation of SASE.^{Footnote 13} This implementation does not support projections. Furthermore, since SASE requires queries to specify a match semantics (any match, next match, partition contiguity) but does not allow combinations of such semantics, we can only express queries \(Q_1\), \(Q_2\), and \(Q_4\) in SASE. Hence, we compare against SASE for these queries only. To be consistent with our semantics, the corresponding SASE expressions use the any match semantics [3].
Tesla/T-Rex Tesla/T-Rex (T) is also an automaton-based CER system. It operates in push-based mode only and supports temporally ordered update streams only. We use the publicly available C-based implementation.^{Footnote 14} This implementation operates in a publish-subscribe model where events are published by clients to the server, known as TRexServer. Clients can subscribe to receive recognized composite events. Tesla cannot deal with queries involving inequalities on multiple attributes (e.g., \(Q_3\)); therefore, we do not show results for \(Q_3\). Since Tesla works in a decentralized manner, we measure the update processing time by logging the time at the TRexServer from the start of the stream until its end.
ZStream ZStream (Z) is a CER system based on a relational internal architecture. It operates in push-based mode and can deal with temporally ordered update streams only. ZStream is not publicly available. Hence, we have created our own implementation following the lazy evaluation algorithm described in the original paper [31]. That paper does not describe how to treat projections; as such, we compare against ZStream only for the full join queries \(Q_1\)–\(Q_8\).
Cite this article
Idris, M., Ugarte, M., Vansummeren, S. et al. General dynamic Yannakakis: conjunctive queries with theta joins under updates. The VLDB Journal (2019). https://doi.org/10.1007/s00778-019-00590-9
Keywords
 Incremental view maintenance
 Dynamic query processing
 Complex event processing
 Theta joins
 Inequalities
 Acyclic joins