
General dynamic Yannakakis: conjunctive queries with theta joins under updates

Special Issue Paper, published in The VLDB Journal.

Abstract

The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In prior work, we have proposed general dynamic Yannakakis (GDyn), a general framework for dynamically processing acyclic conjunctive queries with \(\theta \)-joins in the presence of data updates. Whereas traditional approaches face a trade-off between materialization of subresults (to avoid inefficient recomputation) and recomputation of subresults (to avoid the potentially large space overhead of materialization), GDyn is able to avoid this trade-off. It intelligently maintains a succinct data structure that supports efficient maintenance under updates and from which the full query result can quickly be enumerated. In this paper, we consolidate and extend the development of GDyn. First, we give a full formal proof of GDyn's correctness and complexity. Second, we present a novel algorithm for computing GDyn query plans. Finally, we instantiate GDyn to the case where all \(\theta \)-joins are inequalities and present an extended experimental comparison against state-of-the-art engines. Our approach consistently outperforms the competitor systems, with multiple orders of magnitude improvements in both time and memory consumption.


[Figures 1–12 not reproduced here.]


Notes

  1. Note that, in this framework, value modifications inside a tuple are modeled by deleting the tuple with the old value, and then reinserting the tuple, but now with the new value.

  2. Note that such queries may also contain equijoins by sharing variables between atoms.

  3. Strictly speaking, we described in the body of the paper that R needs to be sorted lexicographically, first on \(\overline{x} \cap \overline{y}\) and then on x. The grouping and sorting of the enumeration index obtains the same result.

  4. In the conference version of this paper [26], there was an incorrect claim: we stated that updates could be processed in time \(O(M\cdot \log (M))\) in the general case of multiple inequalities. We then found a bug in our proof and we currently do not know if this bound can be achieved.

  5. Note that because we set \({\textit{out}}(\mathcal {I}) = \emptyset \) on the residual, new variables may become isolated and therefore more reduction steps may be possible on the normal form of \(\mathcal {I}\).

  6. In the sense that batch updates are only supported by treating each update tuple in the batch individually.

  7. Should \(X_2 {\setminus } X_1\) be empty, we don’t actually need to do anything on \(\mathcal {I}_1\): \(X_1 \cup X_2\) is already removed from it. A similar remark holds for \(\mathcal {I}_2\) when \(X_1 {\setminus } X_2\) is empty.

  8. Note that, since \(e_1\) does not share variables with any predicate, the CSE operation, like the ISO operation, does not remove any predicates from \(\mathcal {H}_1\) and hence yields \(\mathcal {I}_1\).

  9. Note that all leaves have a parent since the root of \(T_1\) is an interior node labeled by \(\emptyset \).

  10. https://dbtoaster.github.io/.

  11. https://doc.akka.io/docs/akka/2.5/.

  12. http://www.espertech.com/esper/esper-downloads/.

  13. https://github.com/haopeng/sase.

  14. https://github.com/deib-polimi/TRex.

References

  1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley Longman Publishing Co., Inc., Boston (1995)

  2. Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In: Proceedings of PODS, pp. 13–28 (2016)

  3. Agrawal, J., Diao, Y., Gyllstrom, D., Immerman, N.: Efficient pattern matching over event streams. In: Proceedings of SIGMOD, pp. 147–160 (2008)

  4. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: STREAM: the Stanford data stream management system. In: Data Stream Management—Processing High-Speed Data Streams, pp. 317–336 (2016)

  5. Baader, F., Nipkow, T.: Term Rewriting and All That. Cambridge University Press, Cambridge (1998)

  6. Bagan, G., Durand, A., Grandjean, E.: On acyclic conjunctive queries and constant delay enumeration. In: Proceedings of CSL, pp. 208–222 (2007)

  7. Bakibayev, N., Kočiský, T., Olteanu, D., Závodný, J.: Aggregation and ordering in factorised databases. PVLDB 6(14), 1990–2001 (2013)

  8. Berkholz, C., Keppeler, J., Schweikardt, N.: Answering conjunctive queries under updates. In: Proceedings of PODS, pp. 303–318 (2017)

  9. Bernstein, P.A., Goodman, N.: The power of inequality semijoins. Inf. Syst. 6(4), 255–265 (1981)

  10. Brault-Baron, J.: De la pertinence de l'énumération: complexité en logiques [On the relevance of enumeration: complexity in logics]. Ph.D. thesis, Université de Caen (2013)

  11. Brenna, L., Demers, A.J., Gehrke, J., Hong, M., Ossher, J., Panda, B., Riedewald, M., Thatte, M., White, W.M.: Cayuga: a high-performance event processing engine. In: Proceedings of SIGMOD, pp. 1100–1102 (2007)

  12. Chirkova, R., Yang, J.: Materialized views. Found. Trends Databases 4(4), 295–405 (2012)

  13. Cormen, T.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

  14. Cugola, G., Margara, A.: TESLA: a formally defined event specification language. In: Proceedings of DEBS, pp. 50–61 (2010)

  15. Cugola, G., Margara, A.: Complex event processing with T-REX. J. Syst. Softw. 85(8), 1709–1728 (2012)

  16. Cugola, G., Margara, A.: Processing flows of information: from data stream to complex event processing. ACM Comput. Surv. 44(3), 15:1–15:62 (2012)

  17. DeWitt, D.J., Naughton, J.F., Schneider, D.A.: An evaluation of non-equijoin algorithms. In: Proceedings of VLDB, pp. 443–452 (1991)

  18. Enderle, J., Hampel, M., Seidl, T.: Joining interval data in relational databases. In: Proceedings of SIGMOD, pp. 683–694 (2004)

  19. EsperTech: Esper complex event processing engine. http://www.espertech.com/

  20. Golab, L., Özsu, M.T.: Processing sliding window multi-joins in continuous queries over data streams. In: Proceedings of VLDB, pp. 500–511 (2003)

  21. Gupta, A., Mumick, I.S. (eds.): Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge (1999)

  22. Gupta, A., Mumick, I.S., Subrahmanian, V.S.: Maintaining views incrementally. In: Proceedings of SIGMOD, pp. 157–166 (1993)

  23. Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of VLDB, pp. 562–573 (1995)

  24. Henzinger, M., Krinninger, S., Nanongkai, D., Saranurak, T.: Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In: Proceedings of STOC, pp. 21–30 (2015)

  25. Idris, M., Ugarte, M., Vansummeren, S.: The dynamic Yannakakis algorithm: compact and efficient query processing under updates. In: Proceedings of SIGMOD (2017)

  26. Idris, M., Ugarte, M., Vansummeren, S., Voigt, H., Lehner, W.: Conjunctive queries with inequalities under updates. PVLDB 11(7), 733–745 (2018)

  27. Kang, J., Naughton, J.F., Viglas, S.: Evaluating window joins over unbounded streams. In: Proceedings of ICDE, pp. 341–352 (2003)

  28. Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N., Kalnis, P.: Fast and scalable inequality joins. VLDB J. 26(1), 125–150 (2017)

  29. Koch, C.: Incremental query evaluation in a ring of databases. In: Proceedings of PODS, pp. 87–98 (2010)

  30. Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Lupei, D., Shaikhha, A.: DBToaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J. 23, 253–278 (2014)

  31. Mei, Y., Madden, S.: ZStream: a cost-based query processor for adaptively detecting composite events. In: Proceedings of SIGMOD, pp. 193–206 (2009)

  32. Nikolic, M., Olteanu, D.: Incremental view maintenance with triple lock factorization benefits. In: Proceedings of SIGMOD, pp. 365–380 (2018)

  33. Olteanu, D., Závodný, J.: Size bounds for factorised representations of query results. ACM TODS 40(1), 2:1–2:44 (2015)

  34. Roy, P., Teubner, J., Gemulla, R.: Low-latency handshake join. PVLDB 7(9), 709–720 (2014)

  35. Sahay, B., Ranjan, J.: Real time business intelligence in supply chain analytics. Inf. Manage. Comput. Secur. 16(1), 28–48 (2008)

  36. Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over factorized joins. In: Proceedings of SIGMOD, pp. 3–18 (2016)

  37. Schultz-Møller, N.P., Migliavacca, M., Pietzuch, P.R.: Distributed complex event processing with query rewriting. In: Proceedings of DEBS (2009)

  38. Segoufin, L.: Constant delay enumeration for conjunctive queries. SIGMOD Rec. 44(1), 10–17 (2015)

  39. Stonebraker, M., Çetintemel, U., Zdonik, S.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)

  40. Teubner, J., Müller, R.: How soccer players would do stream joins. In: Proceedings of SIGMOD, pp. 625–636 (2011)

  41. Urhan, T., Franklin, M.J.: XJoin: a reactively-scheduled pipelined join operator. IEEE Data Eng. Bull. 23(2), 27–33 (2000)

  42. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: Proceedings of STOC, pp. 137–146 (1982)

  43. Viglas, S., Naughton, J.F., Burger, J.: Maximizing the output rate of multi-way join queries over streaming information sources. In: Proceedings of VLDB, pp. 285–296 (2003)

  44. Wang, W., Gao, J., Zhang, M., Wang, S., Chen, G., Ng, T.K., Ooi, B.C., Shao, J., Reyad, M.: Rafiki: machine learning as an analytics service system. PVLDB 12(2), 128–140 (2018)

  45. Wilschut, A.N., Apers, P.M.G.: Dataflow query execution in a parallel main-memory environment. In: Proceedings of the First International Conference on Parallel and Distributed Information Systems (PDIS 1991), pp. 68–77. IEEE Computer Society (1991)

  46. Wu, E., Diao, Y., Rizvi, S.: High-performance complex event processing over streams. In: Proceedings of SIGMOD, pp. 407–418 (2006)

  47. Yannakakis, M.: Algorithms for acyclic database schemes. In: Proceedings of VLDB, pp. 82–94 (1981)

  48. Yoshikawa, M., Kambayashi, Y.: Processing inequality queries based on generalized semi-joins. In: Proceedings of VLDB, pp. 416–428 (1984)

  49. Zhang, H., Diao, Y., Immerman, N.: On complexity and optimization of expensive queries in complex event processing. In: Proceedings of SIGMOD (2014)


Author information

Correspondence to Martín Ugarte.

Additional information

Dr. Sihem Amer-Yahia.


M. Ugarte: This work was done while the author was affiliated with ULB, Belgium.

H. Voigt: This work was done while the author was affiliated with TU Dresden, Germany.

Appendices

Proofs of Sect. 4

Lemma 1   \(\rho _n= {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db})\), for every node \(n \in T\).

Proof

We proceed by induction on the number of descendants of n. If n has no descendants, then \(T_n\) is a single atom \(r(\overline{x})\) with \(\overline{x} = \textit{var}(n) = {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}})\). Then, \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db}) = (\pi _{\textit{var}(n)} r(\overline{x}))(\textit{db})= r(\overline{x}) (\textit{db})=\textit{db}_{r(\overline{x})}=\rho _n\), concluding the base case. Now, for the inductive case, we distinguish whether n has one or two children.

Assume n has a single child c. Then, \(\textit{at}(T_n) = \textit{at}(T_c)\) and \(\textit{pred}(T_n) = \textit{pred}(T_c) \cup {\textit{pred}}(n)\). Therefore, by definition of \({\mathcal {Q}}{\texttt {[}}\cdot {\texttt {]}}\), we have \({\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \sigma _{{\textit{pred}}(n)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}}\), which implies that \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}} = \pi _{\textit{var}(n)}{\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}}\). Furthermore, since \({\textit{pred}}(n)\) only mentions variables in \(\textit{var}(c) \cup \textit{var}(n)\) and \(\textit{var}(n)\subseteq \textit{var}(c)\), as c is a guard of n, this is equivalent to

$$\begin{aligned} \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}}&\equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\pi _{\textit{var}(c)}{\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}} \\&= \pi _{\textit{var}(n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}. \end{aligned}$$

By induction, \(\pi _{\textit{var}(n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db}) = \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\rho _c = \rho _n\), showing that \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db})=\rho _n\).

Assume now that n has two children \(c_1\) and \(c_2\). We assume w.l.o.g. that \(c_1\) is a guard for n. Note that \(\textit{at}(T_n) = \textit{at}(T_{c_1}) \cup \textit{at}(T_{c_2})\) and \(\textit{pred}(T_n)= \textit{pred}(T_{c_1}) \cup \textit{pred}(T_{c_2}) \cup \textit{pred}(n)\). Therefore,

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \sigma _{{\textit{pred}}(n)}\sigma _{{\textit{pred}}(T_{c_1})}\sigma _{{\textit{pred}}(T_{c_2})}\left( \textit{at}(T_{c_1}) \bowtie \textit{at}(T_{c_2}) \right) . \end{aligned}$$

Here, we abuse notation and write \(\textit{at}(T_i)\) for the natural join of all atoms in \(T_{c_i}\). Since \(\textit{pred}(T_{c_i})\) only mentions variables of atoms in \(T_{c_i}\) (for \(i\in \{1,2\}\)), we can push the selections:

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}}&\equiv \sigma _{{\textit{pred}}(n)} \left( \sigma _{{\textit{pred}}(T_{c_1})} \textit{at}(T_{c_1}) \bowtie \sigma _{{\textit{pred}}(T_{c_2})}\textit{at}(T_{c_2}) \right) \\&\equiv \sigma _{{\textit{pred}}(n)} \left( {\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}\right) . \end{aligned}$$

Therefore,

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}} {=} \pi _{\textit{var}(n)}{\mathcal {Q}}{\texttt {[}}T_n{\texttt {]}} \equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\left( {\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}\right) . \end{aligned}$$

Since \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c_1)\cup \textit{var}(c_2)\cup \textit{var}(n)\) and \(\textit{var}(n)\subseteq \textit{var}(c_1)\), we have \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c_1)\cup \textit{var}(c_2)\). Combining this with the fact that, due to the connectedness property of T, \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}) \subseteq \textit{var}(c_i)\) for \(i\in \{1,2\}\), we can add the following projections:

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}&\equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\left( \pi _{\textit{var}(c_1)} {\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}}\bowtie \pi _{\textit{var}(c_2)} {\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}\right) \\&\equiv \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\left( {\mathcal {Q}}{\texttt {[}}T_{c_1}, c_1{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, c_2{\texttt {]}}\right) . \end{aligned}$$

Hence, by induction hypothesis, we have

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db}) =\pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\left( \rho _{c_1}\bowtie \rho _{c_2}\right) =\rho _n, \end{aligned}$$

concluding our proof. \(\square \)
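To make the inductive step concrete, the following sketch (our own toy encoding in Python, not code from the paper) computes \(\rho _n = \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}(\rho _{c_1} \bowtie \rho _{c_2})\) for a node n with children \(c_1 = r(x,y)\) and \(c_2 = s(y,z)\), \(\textit{var}(n) = \{y\}\) and \({\textit{pred}}(n) = \{x < z\}\); GMRs are represented as dictionaries from tuples to multiplicities.

```python
# Toy encoding (ours, not the paper's): GMRs as dicts from tuples to
# multiplicities. Inductive step of Lemma 1 for a node n with children
# c1 = r(x, y) and c2 = s(y, z), var(n) = {y}, pred(n) = {x < z}:
#   rho_n = pi_{y} sigma_{x<z} (rho_c1 join rho_c2)

R = {(1, "a"): 1, (4, "a"): 2}   # rho_c1: GMR for r(x, y)
S = {("a", 3): 1, ("a", 5): 1}   # rho_c2: GMR for s(y, z)

rho_n = {}
for (x, y), m1 in R.items():
    for (y2, z), m2 in S.items():
        if y == y2 and x < z:                           # join on y, then pred(n)
            rho_n[(y,)] = rho_n.get((y,), 0) + m1 * m2  # project onto {y}

# (1,"a") satisfies x < z for z in {3, 5}; (4,"a") only for z = 5 (mult. 2)
assert rho_n == {("a",): 4}
```

Multiplicities multiply across the join and sum under projection, matching the GMR semantics used throughout the proof.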

Lemma 3

  1. \(Q(\textit{db})\) is a positive GMR, for any \(GCQ \) Q and any database \(\textit{db}\).

  2. If R is a positive GMR over \(\overline{x}\) and \(\overline{y} \subseteq \overline{x}\), then \(\mathbf {t}[\overline{y}] \in \pi _{\overline{y}} R\) for every tuple \(\mathbf {t} \in R\).

Proof

Claim (1) follows by straightforward induction on Q, using the fact that the GMRs in \(\textit{db}\) are themselves positive by definition. Claim (2) is a standard result in relational algebra, which hence transfers to the case of positive GMRs. \(\square \)
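Positivity is what makes Lemma 3(2) work: with only positive multiplicities, projection cannot cancel tuples away. A minimal sketch (our own encoding; the function and variable names are ours):

```python
# Our own representation: a positive GMR over variables ("x", "y") is a dict
# from tuples to strictly positive multiplicities.
def project(gmr, in_vars, out_vars):
    """Projection pi_{out_vars}: sum multiplicities of agreeing tuples."""
    idx = [in_vars.index(v) for v in out_vars]
    out = {}
    for t, m in gmr.items():
        key = tuple(t[i] for i in idx)
        out[key] = out.get(key, 0) + m
    return out

R = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 3}   # positive GMR over (x, y)
P = project(R, ["x", "y"], ["x"])

# Lemma 3(2): since no multiplicities are negative, nothing cancels, so
# t[x] survives the projection for every tuple t of R.
assert P == {(1,): 3, (2,): 3}
assert all((t[0],) in P for t in R)
```

With signed multiplicities (as in general delta processing), the two tuples over x = 1 could carry multiplicities +1 and -1 and vanish from the projection, which is exactly the situation the positivity hypothesis excludes.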

Lemma 10

Let R be a positive GMR over \(\overline{x}\), S a positive GMR over \(\overline{y}\) and \(\mathbf {t}\) a tuple over \(\overline{z}\). If \(\overline{z} \subseteq \overline{y} \subseteq \overline{x}\), then \(R \ltimes (S \ltimes \mathbf {t}) = (R \ltimes S) \ltimes \mathbf {t}\).

Proof

This result is well known in standard relational algebra, and its proof transfers to the case of positive GMRs. \(\square \)

Lemma 2   For every node \(n\in N\) and every tuple \(\mathbf {t}\) in \(\rho _n\), \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) enumerates \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) \ltimes \mathbf {t}\).

Proof

Let \(n \in N\) and \(\mathbf {t}\in \rho _n\). We need to show that executing \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) outputs all (tuple, multiplicity) pairs of \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) \ltimes \mathbf {t}\) exactly once. We proceed by induction on the number of nodes in \(N_n\). If \(N_n = \{n\}\), then \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} = {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}\). Therefore, by Lemma 1, \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) = \rho _n\). Since \(\mathbf {t}\in \rho _n\), this implies that the only tuple in \({\mathcal {Q}}{\texttt {[}}T_n,N_n{\texttt {]}}(\textit{db})\) that is compatible with \(\mathbf {t}\) is \(\mathbf {t}\) itself. Furthermore, since \(N_n = \{n\}\), n must be in the frontier of N. Therefore, \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) will output precisely \(\{(\mathbf {t}, \rho _n(\mathbf {t}))\}\) (line 4), which concludes the base case.

For the inductive step we need to consider two cases depending on the number of children of n.

Case 1 If n has a single child c, then necessarily c is a guard of n, i.e., \(\textit{var}(n) \subseteq \textit{var}(c)\). In this case, Algorithm 1 will call \(\textsc {enum}_{T, N}(c, \mathbf {s}, \rho )\) for each tuple \(\mathbf {s}\in \left( \rho _c\ltimes _{{\textit{pred}}(n)} \mathbf {t}\right) \). By induction hypothesis and Lemma 1, this will correctly enumerate and output the elements of \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\ltimes \mathbf {s}\), for every \(\mathbf {s}\) in \({\mathcal {Q}}{\texttt {[}}T_c, c{\texttt {]}}(\textit{db})\ltimes _{{\textit{pred}}(n)} \mathbf {t}\). Note that the sets \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\ltimes \mathbf {s}\) are disjoint for different values of \(\mathbf {s}\). Thus, no element is output twice. Hence, \(\textsc {enum}_{T, N}(n, \mathbf {t}, \rho )\) enumerates the GMR

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db}) \ltimes ({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db})\ltimes _{_{{\textit{pred}}(n)}}\mathbf {t}). \end{aligned}$$
(2)

Since \(\textit{var}({\textit{pred}}(n))\subseteq \textit{var}(c) \cup \textit{var}(n) = \textit{var}(c) = {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}})\), we can pull out the selection:

$$\begin{aligned} (2) = {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db}) \ltimes \sigma _{{\textit{pred}}(n)} ({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db})\ltimes \mathbf {t}). \end{aligned}$$
(3)

Subsequently, because \(\textit{var}({\textit{pred}}(n)) \subseteq \textit{var}(c) \subseteq {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}})\), we can pull out the selection again:

$$\begin{aligned} (3) = \sigma _{{\textit{pred}}(n)}\left( {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db}) \ltimes ({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db})\ltimes \mathbf {t})\right) . \end{aligned}$$
(4)

Because the variables in \(\mathbf {t}\) are a subset of \(\textit{var}(c)\), because \(\textit{var}(c) \subseteq \textit{var}(N_c)\), and because \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\) and \({\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db})\) are positive (Lemma 3(1)), we can apply Lemma 10:

$$\begin{aligned} (4) =\sigma _{{\textit{pred}}(n)}\left( ({\mathcal {Q}}{\texttt {[}}T_c,N_c{\texttt {]}}(\textit{db})\ltimes {\mathcal {Q}}{\texttt {[}}T_c,c{\texttt {]}}(\textit{db}) ) \ltimes \mathbf {t}\right) .\qquad \end{aligned}$$
(5)

Next, observe that, since \(\textit{var}(c) \subseteq \textit{var}(N_c)\) as \(c \in N_c\), we have

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_c, c{\texttt {]}}&= \pi _{\textit{var}(c)} {\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}} \\&\equiv \pi _{\textit{var}(c)} \pi _{\textit{var}(N_c)} {\mathcal {Q}}{\texttt {[}}T_c{\texttt {]}} \\&\equiv \pi _{\textit{var}(c)} {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}} \end{aligned}$$

Then, because \({\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}(\textit{db})\) is positive, we obtain from Lemma 3(2) that

$$\begin{aligned} (5)=\sigma _{{\textit{pred}}(n)}({\mathcal {Q}}{\texttt {[}}T_c,N_c{\texttt {]}}(\textit{db})\ltimes \mathbf {t}). \end{aligned}$$
(6)

Finally, because \(\textit{var}({\textit{pred}}(n)) \subseteq \textit{var}(c) \subseteq \textit{var}(N_c)\), we push the selection again and obtain

$$\begin{aligned} (6)&= (\sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,N_c{\texttt {]}}(\textit{db})) \ltimes \mathbf {t} \end{aligned}$$
(7)
$$\begin{aligned}&= (\pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c,N_c{\texttt {]}})(\textit{db}) \ltimes \mathbf {t}. \end{aligned}$$
(8)

Here, the last equality is due to the fact that \(\textit{var}(N_n) = \textit{var}(n) \cup \textit{var}(N_c) = \textit{var}(N_c)\), as \(\textit{var}(n) \subseteq \textit{var}(c)\) and \(c \in N_c\), which implies that projecting on \(\textit{var}(N_n)\) does not modify the result. The result then follows from the observation that \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} \equiv \pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} {\mathcal {Q}}{\texttt {[}}T_c, N_c{\texttt {]}}\).

Case 2 Otherwise, n has two children \(c_1\) and \(c_2\). We assume w.l.o.g. that \(c_1\) is a guard of n, i.e., \(\textit{var}(n) \subseteq \textit{var}(c_1)\). Since \(|{N_n}|>1\) and N is sibling closed, we have \(\{c_1,c_2\}\subset N\). In this case, Algorithm 1 will first enumerate \(\mathbf {t_i} \in \rho _{c_i}\ltimes _{{\textit{pred}}(n\rightarrow c_i)} \mathbf {t}\) for \(i\in \{1, 2\}\). By Lemma 1, this is equivalent to enumerating every \(\mathbf {t_i}\) in \({\mathcal {Q}}{\texttt {[}}T_{c_i}, c_i{\texttt {]}}(\textit{db})\ltimes _{{\textit{pred}}(n\rightarrow c_i)} \mathbf {t}\). Then, for each such \(\mathbf {t_i}\), the algorithm will enumerate every pair \((\mathbf {s_i}, \mu _i)\) generated by \(\textsc {enum}_{T, N}(c_i, \mathbf {t_i}, \rho )\), which by induction is the same as enumerating every \((\mathbf {s_i}, \mu _i)\) in \({\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db})\ltimes \mathbf {t_i}\). Note that the sets \({\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db})\ltimes \mathbf {t_i}\) are disjoint for distinct \(\mathbf {t_i}\). Therefore, no \((\mathbf {s_i},\mu _i)\) is generated twice; the algorithm hence enumerates

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db}) \ltimes \left( {\mathcal {Q}}{\texttt {[}}T_{c_i}, c_i{\texttt {]}}(\textit{db})\ltimes _{{\textit{pred}}(n\rightarrow c_i)} \mathbf {t}\right) \end{aligned}$$

By the same reasoning as in Case 1, this is equivalent to enumerating every \((\mathbf {s_i}, \mu _i)\) in \((\sigma _{{\textit{pred}}(n\rightarrow c_i)}{\mathcal {Q}}{\texttt {[}}T_{c_i}, N_{c_i}{\texttt {]}}(\textit{db}))\ltimes \mathbf {t}\). From the connectedness property of T, it follows that \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}}) \subseteq \textit{var}(n)\). Thus, \(\textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_1}{\texttt {]}})\cap \textit{var}({\mathcal {Q}}{\texttt {[}}T_{c_2}{\texttt {]}})\) is a subset of the variables of \(\mathbf {t}\). Hence, every tuple \(\mathbf {s_1}\) will be compatible with every tuple \(\mathbf {s_2}\), and therefore, enumerating every pair \((\mathbf {s_1}\cup \mathbf {s_2},\mu _1\times \mu _2)\) is the same as enumerating

$$\begin{aligned}&\left( (\sigma _{{\textit{pred}}(n\rightarrow c_1)}{\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}}(\textit{db}))\ltimes \mathbf {t}\right) \nonumber \\&\quad \bowtie \left( (\sigma _{{\textit{pred}}(n\rightarrow c_2)}{\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}}(\textit{db}))\ltimes \mathbf {t}\right) . \end{aligned}$$
(9)

The semijoin with \(\mathbf {t}\) factors out of the join:

$$\begin{aligned} \begin{aligned} (9)&= \big (\sigma _{{\textit{pred}}(n\rightarrow c_1)}{\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \\&\quad \bowtie \sigma _{{\textit{pred}}(n\rightarrow c_2)}{\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}}\big )(\textit{db}) \ltimes \mathbf {t} \end{aligned} \end{aligned}$$
(10)

We can now pull out the selections and obtain

$$\begin{aligned} (10)&= \big ( \sigma _{{\textit{pred}}(n\rightarrow c_1)}\sigma _{{\textit{pred}}(n\rightarrow c_2)} ({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \\&\quad \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}})(\textit{db}) \big )\ltimes \mathbf {t}.\\&= \left( \sigma _{{\textit{pred}}(n)} ({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}})(\textit{db})\right) \ltimes \mathbf {t}.\\&= \left( \pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} ({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}})(\textit{db})\right) \ltimes \mathbf {t} \end{aligned}$$

Here, the last equality is due to the fact that \(\textit{var}(N_n) = \textit{var}(n) \cup \textit{var}(N_{c_1}) \cup \textit{var}(N_{c_2}) = \textit{var}(N_{c_1}) \cup \textit{var}(N_{c_2})\), as \(\textit{var}(n) \subseteq \textit{var}(c_1) \subseteq \textit{var}(N_{c_{1}})\). This implies that

$$\begin{aligned} \textit{var}(N_n) = {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}}) \cup {\textit{out}}({\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}}) \end{aligned}$$

Hence, projecting the join result on \(\textit{var}(N_n)\) does not modify the result. The result then follows from the observation that \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}} \equiv \pi _{\textit{var}(N_n)} \sigma _{{\textit{pred}}(n)} ({\mathcal {Q}}{\texttt {[}}T_{c_1}, N_{c_1}{\texttt {]}} \bowtie {\mathcal {Q}}{\texttt {[}}T_{c_2}, N_{c_2}{\texttt {]}})\). \(\square \)
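As a toy illustration of the two-child case of \(\textsc {enum}\) (our own Python encoding, with hypothetical atoms r(x, y) and s(y, z)): since the two subtrees share only variables already fixed by \(\mathbf {t}\), every compatible pair of subtree tuples joins, and the output multiplicity is the product.

```python
# Toy encoding (ours): rho_c1 and rho_c2 for hypothetical atoms r(x, y) and
# s(y, z) below a node n with var(n) = {y}. For a tuple t over {y}, ENUM
# pairs each compatible tuple of the left subtree with each compatible tuple
# of the right subtree; multiplicities multiply.
R = {(1, "a"): 2, (2, "a"): 1}   # rho_c1
S = {("a", 7): 3}                # rho_c2

def enum(t_y):
    for (x, y1), m1 in R.items():
        if y1 == t_y:                      # compatible with t on y
            for (y2, z), m2 in S.items():
                if y2 == t_y:
                    yield (x, t_y, z), m1 * m2

assert sorted(enum("a")) == [((1, "a", 7), 6), ((2, "a", 7), 3)]
```

Because the subtrees only agree on y, no cross-pair can fail to join, which is why each result is produced exactly once without deduplication.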

Proposition 4   Assume that all join indexes in the (T, N)-rep have access time g, and that all indexes (join and enumeration) have update time h, where g and h are monotone functions. Further assume that, during the entire execution of \(\textsc {update}\), K and U bound the size of \(\rho _n\) and \(\Delta _n\), respectively, for all n. Then, \(\textsc {update}_{T,N}(\rho ,{ u })\) runs in time \({\mathcal {O}}\left( |{T}| \cdot \left( U + h(K, U) + g(K,U) \right) \right) \).

Proof

First, note that the initialization of \(\Delta _n\) in line 15 can be done in \({\mathcal {O}}(U)\) time (by copying \({ u }_{r(\overline{x})}\) to \(\Delta _n\) tuple by tuple) and the initialization of \(\Delta _n\) in line 17 in \({\mathcal {O}}(1)\) time. Therefore, lines 14–17 run in \({\mathcal {O}}(|T|\cdot U)\) time, which falls within the claimed bounds. We next show that the for loop of lines 18–23 also runs within the claimed bounds. Since the body of this for loop is executed \(|{T}|\) times, it suffices to show that each of the lines 19–23 runs in time \({\mathcal {O}}(U + h(K, U) + g(K,U))\). Since \(|{\Delta _n}| \le U\) by assumption, the statement \(\rho _n += \Delta _n\) of line 19 can be executed in \({\mathcal {O}}(U)\) time by iterating over the tuples \(\mathbf {t} \in \Delta _n\) and updating \(\rho _n(\mathbf {t})\) for each such tuple. (Recall that multiplicity lookup and modification in a GMR are \({\mathcal {O}}(1)\) operations.) The indexes associated with \(\rho _n\) (if any) are updated in time h(K, U). Therefore, the total time required to execute line 19 is \({\mathcal {O}}(U + h(K,U))\). We next bound the complexity of line 21. Computing \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) using the join index on \(\rho _m\) takes \({\mathcal {O}}(g(K,U))\) time. Furthermore, the number of tuples in \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) can be at most 2U. This is because \(|{\Delta _p}| \le U\) at any time during the execution. In the worst case, therefore, \(\pi _{\textit{var}(p)} (\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) can at most delete the tuples already present in \(\Delta _p\) (at most U tuples) and subsequently insert U new tuples, for at most 2U tuples in total. For each of the 2U resulting tuples, we update \(\Delta _p\) accordingly in \({\mathcal {O}}(1)\) time. The total time to execute line 21 is hence \({\mathcal {O}}(2 \cdot U + g(K,U))\). Finally, using similar reasoning, the complexity of line 23 can be shown to be \({\mathcal {O}}(U)\). \(\square \)
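As an illustration of the \({\mathcal {O}}(U)\) bound for line 19, here is a sketch (our own encoding, not the system's code) of applying a delta GMR to \(\rho _n\) tuple by tuple; a tuple whose multiplicity reaches zero is removed, which also matches how footnote 1 models value modifications as a deletion followed by an insertion.

```python
# Sketch (our own encoding) of line 19, rho_n += Delta_n: iterate over the at
# most U tuples of the delta; each multiplicity lookup/update is O(1).
def apply_delta(rho, delta):
    for t, dm in delta.items():
        m = rho.get(t, 0) + dm
        if m != 0:
            rho[t] = m
        else:
            rho.pop(t, None)   # multiplicity 0: the tuple disappears

rho = {("a",): 2, ("b",): 1}
apply_delta(rho, {("a",): -2, ("c",): 3})   # delete ("a",), insert ("c",)
assert rho == {("b",): 1, ("c",): 3}
```

The loop touches only the delta's tuples, never all of \(\rho _n\), which is what keeps the per-node cost proportional to U rather than K.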

Proofs of Sect. 5.1

1.1 Proof of Proposition 7

Because no infinite sequences of reduction steps are possible, it suffices to demonstrate local confluence:

Proposition 14

If \(\mathcal {H} \rightsquigarrow \mathcal {I}_1\) and \(\mathcal {H} \rightsquigarrow \mathcal {I}_2\), then there exists \(\mathcal {J}\) such that both \(\mathcal {I}_1 \rightsquigarrow ^* \mathcal {J}\) and \(\mathcal {I}_2 \rightsquigarrow ^* \mathcal {J}\).

Indeed, it is a standard result in the theory of rewriting systems that confluence (Proposition 7) and local confluence (Proposition 14) coincide when infinite sequences of reduction steps are impossible [5].

Before proving Proposition 14, we observe that the property of being isolated or being a conditional subset is preserved under reductions, in the following sense.

Lemma 11

Assume that \(\mathcal {H} \rightsquigarrow \mathcal {I}\). Then, \(\textit{pred}(\mathcal {I}) \subseteq \textit{pred}(\mathcal {H})\) and for every hyperedge e, we have \(\textit{ext}_{\mathcal {I}}(e) \subseteq \textit{ext}_{\mathcal {H}}(e)\), \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e)\), and \({\textit{isol}}_{\mathcal {H}}(e) \subseteq {\textit{isol}}_{\mathcal {I}}(e)\). Furthermore, if \(e \sqsubseteq _{\mathcal {H}} f\) then also \(e \sqsubseteq _{\mathcal {I}} f\).

Proof

First, observe that \(\textit{pred}(\mathcal {I}) \subseteq \textit{pred}(\mathcal {H})\), since reduction operators only remove predicates. This implies that \(\textit{ext}_{\mathcal {I}}(e) \subseteq \textit{ext}_{\mathcal {H}}(e)\) for every hyperedge e. Furthermore, because reduction operators only remove hyperedges and variables and never add them, it is easy to see that \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e)\). Hence, if \(x \in {\textit{isol}}_{\mathcal {H}}(e)\), then \(x \not \in {\textit{jv}}_{\mathcal {H}}(e) \supseteq {\textit{jv}}_{\mathcal {I}}(e)\) and \(x \not \in \textit{var}(\textit{pred}(\mathcal {H})) \supseteq \textit{var}(\textit{pred}(\mathcal {I}))\). Therefore, \(x \in {\textit{isol}}_{\mathcal {I}}(e)\). As such, \({\textit{isol}}_{\mathcal {H}}(e) \subseteq {\textit{isol}}_{\mathcal {I}}(e)\).

Next, assume that \(e \sqsubseteq _{\mathcal {H}} f\). We need to show that \({\textit{jv}}_{\mathcal {I}}(e) \subseteq f\) and \(\textit{ext}_{\mathcal {I}}(e {\setminus } f) \subseteq f\). The first condition follows since \({\textit{jv}}_{\mathcal {I}}(e) \subseteq {\textit{jv}}_{\mathcal {H}}(e) \subseteq f\) where the last inclusion is due to \(e \sqsubseteq _{\mathcal {H}} f\). The second also follows since \(\textit{ext}_{\mathcal {I}}(e {\setminus } f) \subseteq \textit{ext}_{\mathcal {H}}(e {\setminus } f) \subseteq f\) where the last inclusion is due to \(e \sqsubseteq _{\mathcal {H}} f\). \(\square \)
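The monotonicity claims of Lemma 11 can be checked on a concrete instance. The sketch below uses our own set-based encoding of a triplet; `jv` and `isol` are simplified re-creations of the paper's definitions (we fold the output variables into an explicit `out` set), and the ISO step is applied by hand:

```python
# Toy model (our encoding, not the paper's): hyperedges are frozensets of
# variables, predicates are represented only by their variable sets.

def jv(hyp, e):
    """Equijoin variables of e: those shared with some other hyperedge."""
    return {x for x in e for f in hyp if f != e and x in f}

def isol(hyp, pred, out, e):
    """Isolated variables of e: in no other hyperedge, in no predicate,
    and not an output variable."""
    pvars = set().union(*pred) if pred else set()
    return set(e) - jv(hyp, e) - pvars - out

# Triplet: two hyperedges, one predicate over {x, z}, no output variables.
hyp = {frozenset({'x', 'y'}), frozenset({'x', 'z'})}
pred = {frozenset({'x', 'z'})}
out = set()

e = frozenset({'x', 'y'})
X = isol(hyp, pred, out, e)          # {'y'}: not shared, not in a predicate
assert X == {'y'}

hyp2 = (hyp - {e}) | {frozenset(e - X)}   # apply ISO: remove X from e

# Lemma 11 on this instance: jv only shrinks, isol only grows.
f = frozenset({'x', 'z'})
assert jv(hyp2, f) <= jv(hyp, f)
assert isol(hyp, pred, out, f) <= isol(hyp2, pred, out, f)
```

Running the same checks after a CSE or FLT step would exercise the \(\textit{pred}\) and \(\textit{ext}\) inclusions of the lemma as well.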

Proof of Proposition 14

If \(\mathcal {I}_1 = \mathcal {I}_2\), then it suffices to take \(\mathcal {J} =\mathcal {I}_1 = \mathcal {I}_2\). Therefore, assume in the following that \(\mathcal {I}_1 \not = \mathcal {I}_2\). Then, necessarily \(\mathcal {I}_1\) and \(\mathcal {I}_2\) are obtained by applying two different reduction operations on \(\mathcal {H}\). We make a case analysis on the types of reductions applied.

(1) \({\textit{Case (ISO, ISO)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from hyperedge \(e_1\), while \(\mathcal {I}_2\) is obtained by removing nonempty \(X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_2)\) from \(e_2\) with \(X_1 \not = X_2\). There are two possibilities.

(1a) \(e_1 \not = e_2\). Then, \(e_2\) is still a hyperedge in \(\mathcal {I}_2 \) and \(e_1\) is still a hyperedge in \(\mathcal {I}_1\). By Lemma 11, \({\textit{isol}}_{\mathcal {H}}(e_1) \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_1)\) and \({\textit{isol}}_{\mathcal {H}}(e_2) \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_2)\). Therefore, we can still remove \(X_2\) from \(\mathcal {I}_1\) by means of rule ISO and similarly remove \(X_1\) from \(\mathcal {I}_2\). Let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the result of removing \(X_2\) from \(\mathcal {I}_1\) (resp. \(X_1\) from \(\mathcal {I}_2\)). Then, \(\mathcal {J}_1 = \mathcal {J}_2\) (and hence can be taken as the triplet \(\mathcal {J}\)):

$$\begin{aligned} \textit{hyp}(\mathcal {J}_1)&= \textit{hyp}(\mathcal {H}) {\setminus } \{e_1,e_2\} \cup \{ e_1 {\setminus } X_1 \mid e_1 {\setminus } X_1 \not = \emptyset \} \\&\quad \cup \{ e_2 {\setminus } X_2 \mid e_2 {\setminus } X_2 \not = \emptyset \} \\&= \textit{hyp}(\mathcal {J}_2) \\ \textit{pred}(\mathcal {J}_1)&= \textit{pred}(\mathcal {H}) = \textit{pred}(\mathcal {J}_2) \end{aligned}$$

(1b) \(e_1 = e_2\). We show that \(X_2 {\setminus } X_1 \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\) and similarly \(X_1 {\setminus } X_2 \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_2 {\setminus } X_2)\). This suffices because we can then apply ISO to remove \(X_2 {\setminus } X_1\) from \(\mathcal {I}_1\) and \(X_1 {\setminus } X_2\) from \(\mathcal {I}_2\). In both cases, we reach the same triplet as removing \(X_1 \cup X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from \(\mathcal {H}\).Footnote 7

To see that \(X_2 {\setminus } X_1 \subseteq {\textit{isol}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\), let \(x \in X_2 {\setminus } X_1\). We need to show \(x \not \in {\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\) and \(x \not \in \textit{var}(\textit{pred}(\mathcal {I}_1))\). Because \(x \in X_2 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\), we know \(x \not \in {\textit{jv}}_{\mathcal {H}}(e_1)\). Then, since \(x \not \in X_1\), also \(x \not \in {\textit{jv}}_{\mathcal {H}}(e_1 {\setminus } X_1)\). By Lemma 11, \({\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1) \subseteq {\textit{jv}}_{\mathcal {H}}(e_1 {\setminus } X_1)\). Therefore, \(x \not \in {\textit{jv}}_{\mathcal {I}_1}(e_1 {\setminus } X_1)\). Furthermore, because \(x \in {\textit{isol}}_{\mathcal {H}}(e_1)\), we know \(x \not \in \textit{var}(\textit{pred}(\mathcal {H}))\). Since \(\textit{var}(\textit{pred}(\mathcal {I}_1)) \subseteq \textit{var}(\textit{pred}(\mathcal {H}))\) by Lemma 11, also \(x \not \in \textit{var}(\textit{pred}(\mathcal {I}_1))\).

\(X_1 {\setminus } X_2 \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_2 {\setminus } X_2)\) is shown similarly.
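The joinability argument in case (1b) ultimately rests on a simple set identity: removing \(X_1\) and then \(X_2 {\setminus } X_1\) yields the same hyperedge as removing \(X_1 \cup X_2\) in one step. The following sketch checks this on one arbitrary instance (the variable names are illustrative only):

```python
# Set identity behind case (1b): both reduction orders, and the single
# combined removal, leave the same hyperedge behind.

e = {'a', 'b', 'c', 'x'}
X1, X2 = {'a', 'b'}, {'b', 'c'}

via_I1 = (e - X1) - (X2 - X1)   # H ~> I1 ~> J
via_I2 = (e - X2) - (X1 - X2)   # H ~> I2 ~> J
one_step = e - (X1 | X2)        # remove X1 ∪ X2 directly

assert via_I1 == via_I2 == one_step == {'x'}
```

The proof's real content is therefore not the identity itself but that \(X_2 {\setminus } X_1\) remains isolated after the first removal, so the second ISO step is actually applicable.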

(2) \({\textit{Case (CSE, CSE)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing hyperedge \(e_1\) because it is a conditional subset of hyperedge \(f_1\), while \(\mathcal {I}_2\) is obtained by removing \(e_2\), conditional subset of \(f_2\). Since \(\mathcal {I}_1 \not = \mathcal {I}_2\), it must be that \(e_1 \not = e_2\). We need to further distinguish the following cases.

(2a) \(e_1 \not = f_2\) and \(e_2 \not = f_1\). In this case, \(e_2\) and \(f_2\) remain hyperedges in \(\mathcal {I}_1\) while \(e_1\) and \(f_1\) remain hyperedges in \(\mathcal {I}_2\). Then, by Lemma 11, \(e_2 \sqsubseteq _{\mathcal {I}_1} f_2\) and \(e_1 \sqsubseteq _{\mathcal {I}_2} f_1\). Let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the triplet obtained by removing \(e_2\) from \(\mathcal {I}_1\) (resp. \(e_1\) from \(\mathcal {I}_2\)). Then, \(\mathcal {J}_1 = \mathcal {J}_2\) since clearly \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {J}_2)\) and

$$\begin{aligned} \textit{hyp}(\mathcal {J}_1)&= \textit{hyp}(\mathcal {H}) {\setminus } \{e_1,e_2\} = \textit{hyp}(\mathcal {J}_2) \\ \textit{pred}(\mathcal {J}_1)&= \{ \theta \in \textit{pred}(\mathcal {H}) \mid \textit{var}(\theta ) \cap (e_1 {\setminus } f_1)= \emptyset , \\&\quad \textit{var}(\theta ) \cap (e_2 {\setminus } f_2) = \emptyset \}\\&= \textit{pred}(\mathcal {J}_2) \end{aligned}$$

From this the result follows by taking \(\mathcal {J} = \mathcal {J}_1 = \mathcal {J}_2\).

(2b) \(e_1 \not = f_2\) but \(e_2 = f_1\). Then, \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and \(e_2 \sqsubseteq _{\mathcal {H}} f_2\) with \(f_2 \not = e_1\). It suffices to show that \(e_1 \sqsubseteq _{\mathcal {H}} f_2\) and \(e_1 {\setminus } f_2 = e_1 {\setminus } f_1\), because then CSE applied to \(e_1\sqsubseteq _{\mathcal {H}} f_1\) has the same effect as CSE applied to \(e_1 \sqsubseteq _{\mathcal {H}} f_2\), and we can apply the reasoning of case (2a) because \(e_1 \not = f_2\) and \(e_2 \not = f_2\).

We first show \(e_1 {\setminus } f_2 = e_1 {\setminus } f_1\). Let \(x \in e_1 {\setminus } f_2\) and suppose for the purpose of contradiction that \(x \in e_2 = f_1\). Then, since \(e_1 \not = e_2\), \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq f_2\) where the last inclusion is due to \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), contradicting \(x \not \in f_2\). Hence, \(e_1 {\setminus } f_2 \subseteq e_1 {\setminus } f_1\). Conversely, let \(x \in e_1 {\setminus } f_1\). Since \(f_1 = e_2\), \(x \not \in e_2\). Suppose for the purpose of contradiction that \(x \in f_2\). Because \(e_1 \not = f_2\), \(x \in {\textit{jv}}_{\mathcal {H}}(e_1) \subseteq e_2\) where the last inclusion is due to \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), contradicting \(x \not \in e_2\). Therefore, \(e_1 {\setminus } f_1 \subseteq e_1 {\setminus } f_2\), and hence \(e_1 {\setminus } f_2 = e_1 {\setminus } f_1\).

To show that \(e_1 \sqsubseteq _{\mathcal {H}} f_2\), let \(x \in {\textit{jv}}_{\mathcal {H}}(e_1)\). Because \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), \(x \in e_2\). Because x occurs in two distinct hyperedges in \(\mathcal {H}\), also \(x \in {\textit{jv}}_{\mathcal {H}}(e_2)\). Then, because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), \(x \in f_2\). Hence, \({\textit{jv}}_{\mathcal {H}}(e_1) \subseteq f_2\). It remains to show \(\textit{ext}_{\mathcal {H}}(e_1 {\setminus } f_2) \subseteq f_2\). To this end, let \(x \in \textit{ext}_{\mathcal {H}}(e_1 {\setminus } f_2)\) and suppose for the purpose of contradiction that \(x \not \in f_2\). By definition of \(\textit{ext}\), there exist \(\theta \in \textit{pred}(\mathcal {H})\) and \(y \in \textit{var}(\theta ) \cap (e_1 {\setminus } f_2)\) such that \(x \in \textit{var}(\theta ) {\setminus } (e_1 {\setminus } f_2)\). In particular, \(y \not \in f_2\). Since \(e_1 {\setminus } f_2 = e_1 {\setminus } e_2\), \(y \in \textit{var}(\theta ) \cap (e_1 {\setminus } e_2) \) and \(x \in \textit{var}(\theta ) {\setminus } (e_1 {\setminus } e_2)\). Thus, \(x \in \textit{ext}_{\mathcal {H}}(e_1 {\setminus } e_2)\). Then, since \(e_1 \sqsubseteq _{\mathcal {H}} e_2\), \(x \in e_2\). Thus, \(x \in e_2 {\setminus } f_2\) since \(x \not \in f_2\). Hence, \(x \in \textit{var}(\theta ) \cap (e_2 {\setminus } f_2)\). Furthermore, since \(y \not \in e_2\), also \(y \not \in e_2 {\setminus } f_2\). Hence, \(y \in \textit{var}(\theta ) {\setminus } (e_2 {\setminus } f_2)\). But then \(\theta \) shows that \(y \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } f_2)\). Then, because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\), also \(y \in f_2\), which yields the desired contradiction.

(2c) \(e_1 = f_2\) but \(e_2 \not = f_1\). Similar to case (2b).

(2d) \(e_1 = f_2\) and \(e_2 = f_1\). Then, \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and \(e_2 \sqsubseteq _{\mathcal {H}} e_1\) and \(e_1 \not = e_2\). Let \(\mathcal {K}_1\) (resp. \(\mathcal {K}_2\)) be the triplet obtained by applying (FLT) to remove all \(\theta \in \textit{pred}(\mathcal {I}_1)\) for which \(\textit{var}(\theta ) \subseteq e_2\) (resp. all \(\theta \in \textit{pred}(\mathcal {I}_2)\) for which \(\textit{var}(\theta ) \subseteq e_1\)). Furthermore, let \(\mathcal {J}_1\) (resp. \(\mathcal {J}_2\)) be the triplet obtained by applying ISO to remove \({\textit{isol}}_{\mathcal {K}_1}(e_2)\) from \(\mathcal {K}_1\) (resp. \({\textit{isol}}_{\mathcal {K}_2}(e_1)\) from \(\mathcal {K}_2\)). Here, we take \(\mathcal {J}_1 = \mathcal {K}_1\) if \({\textit{isol}}_{\mathcal {K}_1}(e_2)\) is empty (and similarly for \(\mathcal {J}_2\)). Then, clearly \(\mathcal {H} \rightsquigarrow \mathcal {I}_1 \rightsquigarrow ^* \mathcal {K}_1 \rightsquigarrow ^* \mathcal {J}_1\) and \(\mathcal {H} \rightsquigarrow \mathcal {I}_2 \rightsquigarrow ^* \mathcal {K}_2 \rightsquigarrow ^* \mathcal {J}_2\). The result then follows by showing that \(\mathcal {J}_1 = \mathcal {J}_2\). Toward this end, first observe that \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {K}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {K}_2) = {\textit{out}}(\mathcal {J}_2)\). Next, we show that \(\textit{pred}(\mathcal {J}_1) = \textit{pred}(\mathcal {J}_2)\). We first observe that \(\textit{pred}(\mathcal {J}_1) = \textit{pred}(\mathcal {K}_1)\) and \(\textit{pred}(\mathcal {J}_2) = \textit{pred}(\mathcal {K}_2)\) since the ISO operation does not remove predicates. Then, observe that

$$\begin{aligned} \textit{pred}(\mathcal {K}_1)&= \{ \theta \in \textit{pred}(\mathcal {I}_1) \mid \textit{var}(\theta ) \not \subseteq e_2 \} \\&= \{ \theta \in \textit{pred}(\mathcal {H}) \mid \textit{var}(\theta ) \cap (e_1{\setminus } e_2) = \emptyset \quad \text { and } \\&\quad \textit{var}(\theta ) \not \subseteq e_2 \}, \\ \textit{pred}(\mathcal {K}_2)&= \{ \theta \in \textit{pred}(\mathcal {I}_2) \mid \textit{var}(\theta ) \not \subseteq e_1 \} \\&= \{ \theta \in \textit{pred}(\mathcal {H}) \mid \textit{var}(\theta ) \cap (e_2{\setminus } e_1) = \emptyset \quad \text { and }\\&\quad \textit{var}(\theta ) \not \subseteq e_1 \}. \end{aligned}$$

We only show the reasoning for \(\textit{pred}(\mathcal {K}_1) \subseteq \textit{pred}(\mathcal {K}_2)\), the other direction being similar. Let \(\theta \in \textit{pred}(\mathcal {K}_1)\). Then, \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \) and \(\textit{var}(\theta ) \not \subseteq e_2\). Since \(\textit{var}(\theta ) \not \subseteq e_2\), there exists \(y \in \textit{var}(\theta ) {\setminus } e_2\). Then, because \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \), \(y \not \in e_1\). Thus, \(\textit{var}(\theta ) \not \subseteq e_1\). Now, suppose for the purpose of obtaining a contradiction that \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) \not = \emptyset \) and take \(z \in \textit{var}(\theta ) \cap (e_2 {\setminus } e_1)\). But then \(y \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1)\). Hence, \(y \in e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\), which yields the desired contradiction with \(y \not \in e_1\). Therefore, \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), as desired. Hence, \(\theta \in \textit{pred}(\mathcal {K}_2)\).

It remains to show that \(\textit{hyp}(\mathcal {J}_1) = \textit{hyp}(\mathcal {J}_2)\). To this end, first observe

$$\begin{aligned} \textit{hyp}(\mathcal {J}_1)&= \textit{hyp}(\mathcal {K}_1) {\setminus } \{e_2\} \cup \{ e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) \},\\&= \textit{hyp}(\mathcal {H}) {\setminus } \{ e_1 \} {\setminus } \{e_2\} \cup \{ e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) \},\\ \textit{hyp}(\mathcal {J}_2)&= \textit{hyp}(\mathcal {K}_2) {\setminus } \{e_1 \} \cup \{ e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1) \} \\&= \textit{hyp}(\mathcal {H}) {\setminus } \{ e_2 \} {\setminus } \{e_1 \} \cup \{ e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1) \}. \end{aligned}$$

Clearly, \(\textit{hyp}(\mathcal {J}_1) = \textit{hyp}(\mathcal {J}_2)\) if \(e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) = e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).

We only show \(e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2) \subseteq e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\), the other inclusion being similar. Let \(x \in e_2 {\setminus } {\textit{isol}}_{\mathcal {K}_1}(e_2)\). Since \(x \not \in {\textit{isol}}_{\mathcal {K}_1}(e_2)\) one of the following hold.

  • \(x \in {\textit{out}}(\mathcal {K}_1)\). But then, \(x \in {\textit{out}}(\mathcal {K}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {K}_2)\). In particular, x is an equijoin variable in \(\mathcal {H}\) and \(\mathcal {K}_2\). Then, \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). From this and the fact that x remains an equijoin variable in \(\mathcal {K}_2\), we obtain \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).

  • x occurs in \(e_2\) and in some hyperedge g in \(\mathcal {K}_1\) with \(g \not = e_2\). Since \(e_1\) is not in \(\mathcal {K}_1\), also \(g \not = e_1\). Since every hyperedge in \(\mathcal {K}_1\) is in \(\mathcal {I}_1\) and every hyperedge in \(\mathcal {I}_1\) is in \(\mathcal {H}\), also g is in \(\mathcal {H}\). But then, x occurs in two distinct hyperedges in \(\mathcal {H}\), namely \(e_2\) and g, and hence \(x \in {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\) because \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). Moreover, because x also occurs in g, which is also in \(\mathcal {I}_2\) and therefore in \(\mathcal {K}_2\), x occurs in two distinct hyperedges in \(\mathcal {K}_2\), namely \(e_1\) and g. Therefore, \(x \in {\textit{jv}}_{\mathcal {K}_2}(e_1)\) and hence \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\), as desired.

  • \(x \in \textit{var}(\textit{pred}(\mathcal {K}_1))\). Then, there exists \(\theta \in \textit{pred}(\mathcal {K}_1)\) such that \(x \in \textit{var}(\theta )\). Since \(\textit{pred}(\mathcal {K}_1) = \textit{pred}(\mathcal {K}_2)\), \(\theta \in \textit{pred}(\mathcal {K}_2)\). As such, \(\theta \in \textit{pred}(\mathcal {H})\), \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), and \(\textit{var}(\theta ) \not \subseteq e_1\). But then, since \(x \in \textit{var}(\theta )\), \(x \in e_2\), and \(\textit{var}(\theta ) \cap (e_2 {\setminus } e_1) = \emptyset \), it must be the case that \(x \in e_1\). As such, \(x \in e_1\) and \(x \in \textit{var}(\textit{pred}(\mathcal {K}_2))\). Hence, \(x \in e_1 {\setminus } {\textit{isol}}_{\mathcal {K}_2}(e_1)\).

(3) \({\textit{Case (ISO, CSE)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set of isolated variables \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from \(e_1\), while \(\mathcal {I}_2\) is obtained by removing hyperedge \(e_2\), conditional subset of hyperedge \(f_2\). We may assume w.l.o.g. that \(e_1 \not = {\textit{isol}}_{\mathcal {H}}(e_1)\): if \(e_1 = {\textit{isol}}_{\mathcal {H}}(e_1)\), then the ISO operation removes the complete hyperedge \(e_1\). However, because no predicate in \(\mathcal {H}\) shares any variable with \(e_1\), it is readily verified that \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) and thus the removal of \(e_1\) can also be seen as an application of CSE on \(e_1\),Footnote 8 and we are hence back in case (2).

Now, reason as follows. Because \(e_2 \sqsubseteq _{\mathcal {H}} f_2\) and because isolated variables of \(e_1\) occur in no other hyperedge in \(\mathcal {H}\), it must be the case that \(e_2 \cap X_1 = \emptyset \). In particular, \(e_1\) and \(e_2\) must hence be distinct. Therefore, \(e_1 \in \textit{hyp}(\mathcal {I}_2)\) and \(e_2 \in \textit{hyp}(\mathcal {I}_1)\). By Lemma 11, we can apply ISO on \(\mathcal {I}_2\) to remove \(X_1\) from \(e_1\). It then suffices to show that \(e_2\) remains a conditional subset of some hyperedge \(f'_2\) in \(\mathcal {I}_1\) with \(e_2 {\setminus } f_2 = e_2 {\setminus } f'_2\). Indeed, we can then use CSE to remove \(e_2\) from \(\textit{hyp}(\mathcal {I}_1)\) as well as predicates \(\theta \) with \(\textit{var}(\theta ) \cap (e_2 {\setminus } f_2) \not = \emptyset \) from \(\textit{pred}(\mathcal {I}_1)\). This clearly yields the same triplet as the one obtained by removing \(X_1\) from \(e_1\) in \(\mathcal {I}_2\). We need to distinguish two cases.

(3a) \(f_2 \not = e_1\). Then, \(f_2 \in \textit{hyp}(\mathcal {I}_1)\) and hence \(e_2 \sqsubseteq _{\mathcal {I}_1} f_2\) by Lemma 11. We hence take \(f'_2 = f_2\).

(3b) \(f_2 = e_1\). Then, we take \(f'_2 = e_1 {\setminus } X_1\). Since \(e_1 \not = {\textit{isol}}_{\mathcal {H}}(e_1)\), it follows that \(e_1 {\setminus } X_1 \not = \emptyset \). Therefore, \(f'_2 = e_1 {\setminus } X_1 \in \textit{hyp}(\mathcal {I}_1)\). Furthermore, since \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\), no variable in \(X_1\) occurs in any other hyperedge in \(\mathcal {H}\). In particular, \(X_1 \cap e_2 = \emptyset \). Therefore, \(e_2 {\setminus } f'_2 = e_2 {\setminus } (e_1 {\setminus } X_1) = (e_2 {\setminus } e_1) \cup (e_2 \cap X_1) = e_2 {\setminus } e_1 = e_2 {\setminus } f_2\). It remains to show that \(e_2 \sqsubseteq _{\mathcal {I}_1} e_1 {\setminus } X_1\).

  • \({\textit{jv}}_{\mathcal {I}_1}(e_2) \subseteq e_1 {\setminus } X_1\). Let \(x \in {\textit{jv}}_{\mathcal {I}_1}(e_2)\). By Lemma 11, \(x \in {\textit{jv}}_{\mathcal {I}_1}(e_2) \subseteq {\textit{jv}}_{\mathcal {H}}(e_2) \subseteq e_1\) where the last inclusion is due to \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). In particular, x is an equijoin variable in \(\mathcal {H}\). But then it cannot be an isolated variable in any hyperedge. Therefore, \(x \not \in X_1\) and hence \(x \in e_1 {\setminus } X_1\).

  • \(\textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } (e_1 {\setminus } X_1)) \subseteq e_1 {\setminus } X_1\). Since \(X_1 \cap e_2 = \emptyset \), we have \(e_2 {\setminus } (e_1 {\setminus } X_1) = e_2 {\setminus } e_1\). Let \(x \in \textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } e_1)\). Then, \(x \in \textit{ext}_{\mathcal {I}_1}(e_2 {\setminus } e_1) \subseteq \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1) \subseteq e_1\) where the first inclusion is by Lemma 11 and the second by \(e_2 \sqsubseteq _{\mathcal {H}} e_1\). Then, because \(x \in \textit{ext}_{\mathcal {H}}(e_2 {\setminus } e_1)\), it follows from the definition of \(\textit{ext}\) that x occurs in some predicate in \(\textit{pred}(\mathcal {H})\). However, \(X_1\) is disjoint from \(\textit{var}(\textit{pred}(\mathcal {H}))\) since it consists only of isolated variables. Therefore, \(x \not \in X_1\) and hence \(x \in e_1 {\setminus } X_1\).

(4) \({\textit{Case (ISO, FLT)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing the nonempty set \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) from hyperedge \(e_1\), while \(\mathcal {I}_2\) is obtained by removing all predicates in the nonempty set \(\Theta \subseteq \textit{pred}(\mathcal {H})\) with \(\textit{var}(\Theta ) \subseteq e_2\) for some hyperedge \(e_2\) in \(\textit{hyp}(\mathcal {H})\). Observe that \(e_1 \in \textit{hyp}(\mathcal {I}_2)\). By Lemma 11, \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1) \subseteq {\textit{isol}}_{\mathcal {I}_2}(e_1)\). Therefore, we may apply reduction operation (ISO) on \(\mathcal {I}_2\) to remove \(X_1\) from \(e_1\). We will now show that, similarly, we may still apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1) = \textit{pred}(\mathcal {H})\). The two operations hence commute, and clearly, the resulting triplet is the same in both cases. We distinguish two possibilities. (i) \(e_1 \not = e_2\). Then, \(e_2 \in \textit{hyp}(\mathcal {I}_1)\), \(\textit{var}(\Theta ) \subseteq e_2\) and, since (ISO) does not remove predicates, \(\Theta \subseteq \textit{pred}(\mathcal {H}) = \textit{pred}(\mathcal {I}_1)\). As such, the (FLT) operation indeed applies to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1)\). (ii) \(e_1 = e_2\). Then, since \(X_1 \subseteq {\textit{isol}}_{\mathcal {H}}(e_1)\) and isolated variables do not occur in any predicate, \(X_1 \cap \textit{var}(\Theta ) = \emptyset \). Then, since \(\textit{var}(\Theta ) \subseteq e_2 = e_1\), it follows that also \(\textit{var}(\Theta ) \subseteq e_1 {\setminus } X_1\). In particular, since we disallow nullary predicates and \(\Theta \) is nonempty, \(e_1 {\setminus } X_1 \not = \emptyset \). Thus, \(e_1 {\setminus } X_1 \in \textit{hyp}(\mathcal {I}_1)\) and hence operation (FLT) indeed applies to remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {I}_1)\).

(5) \({\textit{Case (CSE, FLT)}}\) Assume that \(\mathcal {I}_1\) is obtained by removing hyperedge \(e_1\), conditional subset of \(e_2\) in \(\mathcal {H}\), while \(\mathcal {I}_2\) is obtained by removing all predicates in the nonempty set \(\Theta \subseteq \textit{pred}(\mathcal {H})\) with \(\textit{var}(\Theta ) \subseteq e_3\) for some hyperedge \(e_3 \in \textit{hyp}(\mathcal {H})\). Since the (FLT) operation does not remove any hyperedges, \(e_1\) and \(e_2\) are in \(\textit{hyp}(\mathcal {I}_2)\). Then, since \(e_1 \sqsubseteq _{\mathcal {H}} e_2\) also \(e_1 \sqsubseteq _{\mathcal {I}_2} e_2\) by Lemma 11. Therefore, we may apply reduction operation (CSE) on \(\mathcal {I}_2\) to remove \(e_1\) from \(\textit{hyp}(\mathcal {I}_2)\) as well as all predicates \(\theta \in \textit{pred}(\mathcal {I}_2)\) for which \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) \not = \emptyset \). Let \(\mathcal {J}_2\) be the triplet resulting from this operation. We will show that, similarly, we may apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\) from \(\textit{pred}(\mathcal {I}_1)\), resulting in a triplet \(\mathcal {J}_1\). Observe that necessarily \(\mathcal {J}_1 = \mathcal {J}_2\) (and hence they form the triplet \(\mathcal {J}\)). Indeed, \({\textit{out}}(\mathcal {J}_1) = {\textit{out}}(\mathcal {I}_1) = {\textit{out}}(\mathcal {H}) = {\textit{out}}(\mathcal {I}_2) = {\textit{out}}(\mathcal {J}_2)\) since reduction operations never modify output variables. Moreover,

$$\begin{aligned} \textit{hyp}(\mathcal {J}_1)&= \textit{hyp}(\mathcal {I}_1) \\&= \textit{hyp}(\mathcal {H}) {\setminus } \{ e_1 \} \\&= \textit{hyp}(\mathcal {I}_2) {\setminus } \{ e_1 \} \\&= \textit{hyp}(\mathcal {J}_2) \end{aligned}$$

where the first and third equalities are due to the fact that (FLT) does not modify the hypergraph of the triplet it operates on. Finally, observe

$$\begin{aligned} \textit{pred}(\mathcal {J}_1)&= \textit{pred}(\mathcal {I}_1) {\setminus } (\Theta \cap \textit{pred}(\mathcal {I}_1)) \\&= \textit{pred}(\mathcal {I}_1) {\setminus } \Theta \\&= \{ \theta \in \textit{pred}(\mathcal {H}) \mid \textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \} {\setminus } \Theta \\&= \{ \theta \in \textit{pred}(\mathcal {H}) {\setminus } \Theta \mid \textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \} \\&= \{ \theta \in \textit{pred}(\mathcal {I}_2) \mid \textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \} \\&= \textit{pred}(\mathcal {J}_2) \end{aligned}$$

It remains to show that we may apply (FLT) on \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\), resulting in a triplet \(\mathcal {J}_1\). There are two possibilities.

  • \(e_3 \not = e_1\). Then, \(e_3 \in \textit{hyp}(\mathcal {I}_1)\), \(\Theta \cap \textit{pred}(\mathcal {I}_1) \subseteq \textit{pred}(\mathcal {I}_1)\), and \(\textit{var}(\Theta \cap \textit{pred}(\mathcal {I}_1)) \subseteq \textit{var}(\Theta ) \subseteq e_3\). Hence, the (FLT) operation indeed applies to \(\mathcal {I}_1\) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\).

  • \(e_3 = e_1\). In this case, we claim that for every \(\theta \in \Theta \cap \textit{pred}(\mathcal {I}_1)\), we have \(\textit{var}(\theta ) \subseteq e_2\). As such, \(\textit{var}(\Theta \cap \textit{pred}(\mathcal {I}_1)) \subseteq e_2\). Since \(e_2 \in \textit{hyp}(\mathcal {I}_1)\) and \(\Theta \cap \textit{pred}(\mathcal {I}_1) \subseteq \textit{pred}(\mathcal {I}_1)\), we may hence apply (FLT) to remove all predicates in \(\Theta \cap \textit{pred}(\mathcal {I}_1)\) from \(\mathcal {I}_1\). Concretely, let \(\theta \in \Theta \cap \textit{pred}(\mathcal {I}_1)\). Because, in order to obtain \(\mathcal {I}_1\), (CSE) removes all predicates from \(\mathcal {H}\) that share a variable with \(e_1 {\setminus } e_2\), we have \(\textit{var}(\theta ) \cap (e_1 {\setminus } e_2) = \emptyset \). Moreover, because \(\theta \in \Theta \), \(\textit{var}(\theta ) \subseteq e_1\). Hence, \(\textit{var}(\theta ) \subseteq e_2\), as desired.

The remaining cases, (CSE, ISO), (FLT, ISO), and (FLT, CSE), are symmetric to cases (3), (4), and (5), respectively. \(\square \)

1.2 Proof of Proposition 9

Proposition 9 For every GJT pair, there exists an equivalent canonical pair.

Proof

Let \((T, N)\) be a GJT pair. The proof proceeds in three steps.

Step 1 Let \(T_1\) be the GJT obtained from T by (i) removing all predicates from T, and (ii) creating a new root node r that is labeled by \(\emptyset \) and attaching the root of T to it, labeled by the empty set of predicates. \(T_1\) satisfies the first canonicality condition, but is not equivalent to T because it has none of T’s predicates. Now, re-add the predicates in T to \(T_1\) as follows. For each edge \(m \rightarrow n\) in T and each predicate \(\theta \in \textit{pred}_T(m \rightarrow n)\), if \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not =\emptyset \), then add \(\theta \) to \(\textit{pred}_{T_1}(m \rightarrow n)\). Otherwise, if \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \), do the following. First, observe that, by definition of GJTs, \(\textit{var}(\theta ) \subseteq \textit{var}(n) \cup \textit{var}(m)\). Because \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \), this implies \(\textit{var}(\theta ) \subseteq \textit{var}(m)\). Because we disallow nullary predicates, \(\textit{var}(m) \not = \emptyset \). Let a be the first ancestor of m in \(T_1\) such that \(\textit{var}(\theta ) \not \subseteq \textit{var}(a)\). Such an ancestor exists because the root of \(T_1\) is labeled \(\emptyset \). Let b be the child of a on the path from a to m in \(T_1\). Since a is the first ancestor of m with \(\textit{var}(\theta ) \not \subseteq \textit{var}(a)\), \(\textit{var}(\theta ) \subseteq \textit{var}(b)\). Therefore, \(\textit{var}(\theta ) \subseteq \textit{var}(b) \cup \textit{var}(a)\) and \(\textit{var}(\theta ) \cap (\textit{var}(b) {\setminus } \textit{var}(a)) \not = \emptyset \). As such, add \(\theta \) to \(\textit{pred}_{T_1}(a \rightarrow b)\). After having done this for all predicates in T, \(T_1\) is equivalent to T and satisfies canonicality conditions (1) and (4). Then, take \(N_1 = N \cup \{r\}\). 
Clearly, \(N_1\) is a connex subset of \(T_1\) and \(\textit{var}(N_1) = \textit{var}(N)\).

Therefore, \((T_1,N_1)\) is equivalent to \((T, N)\).

Step 2 Let \(T_2\) be obtained from \(T_1\) by adding, for each leaf node l in \(T_1\), a new interior node \(n_l\) labeled by \(\textit{var}(l)\) and inserting it in-between l and its parent in \(T_1\), i.e., if l has parent p in \(T_1\), then we have \(p \rightarrow n_l \rightarrow l\) in \(T_2\) with \(\textit{pred}_{T_2}(p \rightarrow n_l) = {\textit{pred}}_{T_1}(p \rightarrow l)\) and \(\textit{pred}_{T_2}(n_l \rightarrow l)= \emptyset \).Footnote 9 Furthermore, let \(N_2\) be the connex subset of \(T_2\) obtained by replacing every leaf node l in \(N_1\) by its newly inserted node \(n_l\). Clearly, \(\textit{var}(N_2) = \textit{var}(N_1) = \textit{var}(N)\) because \(\textit{var}(l) = \textit{var}(n_l)\) for every leaf l of \(T_1\). By our construction, \((T_2, N_2)\) is equivalent to \((T, N)\); \(T_2\) satisfies canonicality conditions (1), (2), and (4); and \(N_2\) is canonical.

Step 3 It remains to enforce condition (3). To this end, observe that, by the connectedness condition of GJTs, \(T_2\) violates canonicality condition (3) if and only if there exist internal nodes m and n, where m is the parent of n, such that \(\textit{var}(m) = \textit{var}(n)\). In this case, we call n a culprit node. We will now show how to obtain an equivalent pair \((U, M)\) that removes a single culprit node; the final result is then obtained by iterating this reasoning until all culprit nodes have been removed.

The culprit removal procedure is essentially the reverse of the binarization procedure of Fig. 9. Concretely, let n be a culprit node with parent m and let \(n_1,\ldots , n_k\) be the children of n in \(T_2\). Let U be the GJT obtained from \(T_2\) by removing n and attaching all children \(n_i\) of n as children to m with edge label \(\textit{pred}_U(m \rightarrow n_i) = \textit{pred}_{T_2}(n \rightarrow n_i)\), for \(1 \le i \le k\). Because \(\textit{var}(n) = \textit{var}(m)\), the result is still a valid GJT. Moreover, because \(\textit{var}(n) = \textit{var}(m)\) and \(T_2\) satisfied condition (4), we had \({\textit{pred}}_{T_2}(m \rightarrow n) = \emptyset \), so no predicate was lost by the removal of n. Finally, define M as follows. If \(n \in N_2\), then set \(M = N_2 {\setminus } \{n\}\); otherwise, set \(M = N_2\). In the former case, since \(N_2\) is connex and \(n \in N_2\), m must also be in \(N_2\). It is hence in M. Therefore, in both cases, \(\textit{var}(N) = \textit{var}(N_2) = \textit{var}(M)\). Furthermore, it is straightforward to check that M is a connex subset of U. Finally, since \(N_2\) consisted only of interior nodes of \(T_2\), M consists only of interior nodes of U and hence remains canonical. \(\square \)
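The culprit-removal step can be sketched on a toy tree. The encoding below (a child-to-parent map plus a label map) is ours, not the paper's data structures; it only shows the splicing of a node whose label equals its parent's:

```python
# Sketch (assumed encoding): a tree as child -> parent map, with a
# variable label per node. A culprit n (var(n) == var(parent(n))) is
# spliced out; its children are reattached to its parent.

parent = {'n': 'm', 'n1': 'n', 'n2': 'n', 'm': 'r'}
var = {'r': set(), 'm': {'x'}, 'n': {'x'},
       'n1': {'x', 'y'}, 'n2': {'x', 'z'}}

def remove_culprit(parent, var, n):
    """Remove culprit n: children of n become children of n's parent."""
    m = parent[n]
    assert var[n] == var[m], "n is only a culprit if the labels coincide"
    return {c: (m if p == n else p) for c, p in parent.items() if c != n}

parent2 = remove_culprit(parent, var, 'n')
assert parent2 == {'n1': 'm', 'n2': 'm', 'm': 'r'}
```

Edge predicates are not modeled here; as the proof notes, condition (4) guarantees that the removed edge \(m \rightarrow n\) carried no predicates, so nothing is lost by the splice.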

1.3 Proof of Lemma 5

We first require a number of auxiliary results, beginning with the following observations regarding canonical GJT pairs.

Lemma 12

Let (TN) be a canonical GJT pair, let n be a frontier node of N and let m be the parent of n in T.

  1. \(x \not \in \textit{var}(N {\setminus } \{n\})\), for every \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\).

  2. \(\textit{hyp}(T, N {\setminus } \{n\}) = \textit{hyp}(T, N) {\setminus } \{\textit{var}(n)\}\).

  3. \(\theta \not \in \textit{pred}(m \rightarrow n)\), for every \(\theta \in \textit{pred}(T, N {\setminus } \{n\})\).

  4. \(\textit{pred}(T, N {\setminus } \{n\}) = \textit{pred}(T, N) {\setminus } \textit{pred}(m \rightarrow n)\).

  5. \(\textit{pred}(m \rightarrow n) = \{ \theta \in {\textit{pred}}(T, N) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \}\).

  6. \({\textit{pred}}(T, N {\setminus } \{n\}) = \{ \theta \in {\textit{pred}}(T, N) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \}\).

Proof

  (1) Let \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\) and let c be a node in \(N {\setminus } \{n\}\). Clearly, the unique undirected path between c and n in T must pass through m. Because \(x \not \in \textit{var}(m)\), it follows from the connectedness condition of GJTs that also \(x \not \in \textit{var}(c)\). As such, \(x \not \in \textit{var}(N {\setminus } \{n\})\).

  (2) The \(\supseteq \) direction is trivial. For the \(\subseteq \) direction, let \(m'\) be a node in \(N {\setminus } \{n\}\) with \(\textit{var}(m') \not = \emptyset \). Then, clearly \(m' \in N\) and hence \(\textit{var}(m') \in \textit{hyp}(T,N)\). Furthermore, because N is canonical, both \(m'\) and n are interior nodes in T. Then, because T is canonical and \(m' \not = n\), we have \(\textit{var}(m') \not = \textit{var}(n)\). Therefore, \(\textit{var}(m') \in \textit{hyp}(T,N) {\setminus } \{\textit{var}(n)\}\).

  (3) Let \(\theta \in {\textit{pred}}(T, N {\setminus } \{n\})\). Then, \(\theta \) occurs on the edge between two nodes in \(N {\setminus } \{n\}\), say \(m' \rightarrow n'\). By definition of GJTs, \(\textit{var}(\theta ) \subseteq \textit{var}(n') \cup \textit{var}(m') \subseteq \textit{var}(N {\setminus } \{n\})\). Now suppose, for the purpose of contradiction, that also \(\theta \in {\textit{pred}}(m \rightarrow n)\). Because T is nice, \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \); let x be an element of this set. Hence, by (1), \(x \not \in \textit{var}(N {\setminus } \{n\})\), which contradicts \(\textit{var}(\theta ) \subseteq \textit{var}(N {\setminus } \{n\})\).

  (4) Clearly, \(\textit{pred}(T,N) {\setminus } \textit{pred}(m \rightarrow n) \subseteq \textit{pred}(T, N {\setminus } \{n\})\). The converse inclusion follows from (3).

  (5) The \(\subseteq \) direction follows from the fact that m and n are in N, and T is nice. For \(\supseteq \), let \(\theta \in {\textit{pred}}(T, N)\) with \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \). There exists \(x \in \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m))\). By (1), \(x \not \in \textit{var}(N {\setminus } \{n\})\). Therefore, \(\theta \) cannot occur on an edge between nodes in \(N {\setminus } \{n\}\). Since it nevertheless occurs in \({\textit{pred}}(T,N)\), it must hence occur in \(\textit{pred}(m \rightarrow n)\).

  (6) Follows directly from (4) and (5).

\(\square \)

Lemma 13

Let (TN) be a canonical GJT pair, let n be a frontier node of N and let m be the parent of n in T. Let \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\).

  1. \(\textit{var}(n) \sqsubseteq _{{\mathcal {H}}(T, N, \overline{z})} \textit{var}(m)\).

  2. \(x \not \in {\textit{jv}}({\mathcal {H}}(T, N, \overline{z}))\), for every \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\).

Proof

For reasons of parsimony, let \(\mathcal {H} = {\mathcal {H}}(T, N, \overline{z})\). We first prove (2) and then (1).

(2) Let \(x \in \textit{var}(n) {\setminus } \textit{var}(m)\). By Lemma 12(1), \(x \not \in \textit{var}(N {\setminus } \{n\})\). Therefore, x occurs in \(\textit{var}(n)\) in \(\mathcal {H}\) and in no other hyperedge. Furthermore, because \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\), also \(x \not \in \overline{z}\). Hence, \(x \not \in {\textit{jv}}_{\mathcal {H}}(\textit{var}(n))\).

(1) We need to show that \({\textit{jv}}_{\mathcal {H}}(\textit{var}(n)) \subseteq \textit{var}(m)\) and \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\). Let \(x \in {\textit{jv}}_{\mathcal {H}}(\textit{var}(n))\). By contraposition of (2), we know that \(x \not \in (\textit{var}(n) {\setminus } \textit{var}(m))\). Therefore, \(x \in \textit{var}(m)\) and thus \({\textit{jv}}_{\mathcal {H}}(\textit{var}(n)) \subseteq \textit{var}(m)\). To show \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\), let \(y \in \textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m))\). Then, \(y \not \in \textit{var}(n) {\setminus } \textit{var}(m)\) and there exists \(\theta \in {\textit{pred}}(T, N)\) with \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \) and \(y \in \textit{var}(\theta )\). By Lemma 12(5), \(\theta \in {\textit{pred}}_T(m \rightarrow n)\). Thus, \(y \in \textit{var}(m) \cup \textit{var}(n)\). Since also \(y \not \in \textit{var}(n) {\setminus } \textit{var}(m)\), it follows that \(y \in \textit{var}(m)\). Therefore, \(\textit{ext}_{\mathcal {H}}(\textit{var}(n) {\setminus } \textit{var}(m)) \subseteq \textit{var}(m)\). \(\square \)
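The operators \(\textit{jv}\), \(\textit{ext}\), and the subsumption relation \(\sqsubseteq \) used throughout these proofs can be read as plain set computations. The Python helpers below are a hypothetical rendering inferred from how the proofs use these operators (hyperedges and predicates modeled as variable sets); they are illustrative, not the paper's definitions verbatim.

```python
# Triplet H = (hyp, out, preds): hyp is a set of hyperedges (frozensets
# of variables), out is the set of output variables, and preds is a list
# of predicate variable sets.

def jv(hyp, out, e):
    """Join variables of hyperedge e: variables of e that are shared
    with another hyperedge of hyp or appear among the output variables."""
    return {x for x in e
            if x in out or any(x in f for f in hyp if f != e)}

def ext(preds, s):
    """Variables outside s that co-occur with s in some predicate."""
    return {y for theta in preds if theta & s for y in theta} - s

def subsumed(hyp, out, preds, e, l):
    """e is subsumed by l (premise of rule (CSE)):
    jv(e) is contained in l and ext(e \\ l) is contained in l."""
    return jv(hyp, out, e) <= l and ext(preds, e - l) <= l
```

With these helpers, Lemma 13(1) says that for a frontier node n, the hyperedge \(\textit{var}(n)\) is subsumed by \(\textit{var}(m)\) of its parent.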

Lemma 14

Let \((T, N)\) be a canonical GJT pair and let n be a frontier node of N. Then, \({\mathcal {H}}(T,N, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T,N {\setminus }\{n\}, \overline{z})\) for every \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\})\).

Proof

For reasons of parsimony, let us abbreviate \(\mathcal {H}_1 = {\mathcal {H}}(T, N, \overline{z})\) and \(\mathcal {H}_2 = {\mathcal {H}}(T, N {\setminus }\{n\}, \overline{z})\). We make the following case analysis.

Case 1 Node n is the root in N. Because the root of a canonical tree is labeled by \(\emptyset \), we have \(\textit{var}(n) = \emptyset \). Since n is a frontier node of N, \(N = \{n\}\). Thus, \(\textit{hyp}(T, N) = \emptyset \) and \(\textit{hyp}(T, N {\setminus } \{n\}) = \emptyset \). Furthermore, \(\textit{pred}(T, N) = \textit{pred}(T, N {\setminus } \{n\}) = \emptyset \) and \(\overline{z} \subseteq \textit{var}(N {\setminus } \{n\}) = \textit{var}(\emptyset ) = \emptyset \). As such, both \(\mathcal {H}_1\) and \(\mathcal {H}_2\) are the empty triplet \((\emptyset , \emptyset , \emptyset )\). Therefore, \(\mathcal {H}_1 \rightsquigarrow ^* \mathcal {H}_2\).

Case 2 Node n has parent m in N and \(\textit{var}(m) \not = \emptyset \). Then, \(\textit{var}(n) \not = \emptyset \) since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. Therefore, \(\textit{var}(n) \in \textit{hyp}(T, N)\), \(\textit{var}(m) \in \textit{hyp}(T, N)\), and \(\textit{var}(n) \sqsubseteq _{\mathcal {H}_1} \textit{var}(m)\) by Lemma 13(1). We can hence apply reduction (CSE) to remove \(\textit{var}(n)\) from \(\textit{hyp}(\mathcal {H}_1)\) and all predicates that intersect with \(\textit{var}(n) {\setminus } \textit{var}(m)\) from \(\textit{pred}(\mathcal {H}_1)\). By Lemmas 12(2) and 12(6), the result is exactly \(\mathcal {H}_2\):

$$\begin{aligned}&\textit{hyp}(\mathcal {H}_2) \\&\quad = \textit{hyp}(T, N {\setminus } \{n\}) \\&\quad = \textit{hyp}(T, N) {\setminus } \{ \textit{var}(n) \} = \textit{hyp}(\mathcal {H}_1) {\setminus } \{ \textit{var}(n) \} \\&\textit{pred}(\mathcal {H}_2) \\&\quad = \textit{pred}(T, N {\setminus } \{ n\}) \\&\quad = \{ \theta \in \textit{pred}(T,N) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \}\\&\quad = \{ \theta \in \textit{pred}(\mathcal {H}_1) \mid \textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) = \emptyset \} \end{aligned}$$

Case 3 Node n has parent m in N and \(\textit{var}(m) = \emptyset \). Then, \(\textit{var}(n) \not = \emptyset \) since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. By definition of GJTs, it follows that for every \(\theta \in {\textit{pred}}(m \rightarrow n)\), we have \(\textit{var}(\theta ) \subseteq \textit{var}(n) \cup \textit{var}(m) = \textit{var}(n)\). In other words: all \(\theta \in {\textit{pred}}(m \rightarrow n)\) are filters. As such, we can use reduction (FLT) to remove all predicates in \({\textit{pred}}(m \rightarrow n)\) from \(\mathcal {H}_1\). This yields a triplet \(\mathcal {I}\) with the same hypergraph as \(\mathcal {H}_1\), same set of output variables as \(\mathcal {H}_1\), and

$$\begin{aligned} \textit{pred}(\mathcal {I})&= \textit{pred}(\mathcal {H}_1) {\setminus } {\textit{pred}}_T(m \rightarrow n) \\&= \textit{pred}(T, N) {\setminus } {\textit{pred}}_T(m \rightarrow n) \\&= \textit{pred}(T, N {\setminus } \{n\}) = \textit{pred}(\mathcal {H}_2), \end{aligned}$$

where the third equality is due to Lemma 12(4). We claim that every variable in \(\textit{var}(n)\) is isolated in \(\mathcal {I}\). From this, the result follows, because then we can apply (ISO) to remove the entire hyperedge \(\textit{var}(n)\) from \(\textit{hyp}(\mathcal {I}) = \textit{hyp}(\mathcal {H}_1)\) while preserving \({\textit{out}}(\mathcal {I})\) and \(\textit{pred}(\mathcal {I})\). The resulting triplet hence equals \(\mathcal {H}_2\). To see that \(\textit{var}(n) \subseteq {\textit{isol}}(\mathcal {I})\), observe that no predicate in \(\textit{pred}(\mathcal {I}) = \textit{pred}(T, N {\setminus } \{n\})\) shares a variable with \(\textit{var}(n) = \textit{var}(n) {\setminus } \textit{var}(m)\) by Lemma 12(6). Therefore, \(\textit{var}(n) \cap \textit{var}(\textit{pred}(\mathcal {I})) = \emptyset \). Furthermore, \(\textit{var}(n) \cap {\textit{jv}}(\mathcal {I}) = \emptyset \) because \({\textit{jv}}(\mathcal {I}) = {\textit{jv}}(\mathcal {H}_1)\) and no \(x \in \textit{var}(n) = \textit{var}(n) {\setminus } \textit{var}(m)\) is in \({\textit{jv}}(\mathcal {H}_1)\) by Lemma 13(2). \(\square \)
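The case analysis above reads as a one-step reduction procedure. The sketch below is a hypothetical Python rendering in which hyperedges and predicate variable sets are plain frozensets; it mirrors the three cases of the proof but is not the paper's implementation.

```python
# One-step reduction removing frontier node n (variables var_n) with
# parent m (variables var_m) from the triplet (hyp, out, preds).
# (CSE) drops a subsumed hyperedge, (FLT) drops filter predicates,
# and (ISO) drops a hyperedge consisting of isolated variables.

def remove_frontier(hyp, out, preds, var_n, var_m):
    if not var_n:                                  # Case 1: n is the root
        return hyp, out, preds                     # triplets already equal
    if var_m:                                      # Case 2: apply (CSE)
        hyp = hyp - {var_n}
        preds = [p for p in preds if not (p & (var_n - var_m))]
        return hyp, out, preds
    # Case 3: var(m) is empty, so all predicates on m -> n are filters
    preds = [p for p in preds if not (p & var_n)]  # (FLT)
    hyp = hyp - {var_n}                            # (ISO)
    return hyp, out, preds
```

In Case 2, dropping the predicates that touch \(\textit{var}(n) {\setminus } \textit{var}(m)\) is exactly the set described by Lemma 12(6).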

Lemma 5 Let \((T,N_1)\) and \((T,N_2)\) be canonical GJT pairs with \(N_2 \subseteq N_1\). Then, \({\mathcal {H}}(T,N_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T,N_2, \overline{z})\) for every \(\overline{z} \subseteq \textit{var}(N_2)\).

Proof

By induction on k, the number of nodes in \(N_1 {\setminus } N_2\). In the base case where \(k = 0\), the result trivially holds since then \(N_1 = N_2\) and the two triplets are identical. For the induction step, assume that \(k > 0\) and the result holds for \(k-1\). Because both \(N_1\) and \(N_2\) are connex subsets of the same tree T, there exists a node \(n \in N_1\) that is a frontier node in \(N_1\) and which is not in \(N_2\). Then, define \(N'_1 = N_1 {\setminus } \{n\}\). Clearly \((T, N'_1)\) is again canonical, and \(|N'_1{\setminus } N_2| = k-1\). Therefore, \({\mathcal {H}}(T, N'_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T, N_2, \overline{z})\) by the induction hypothesis. Furthermore, \({\mathcal {H}}(T, N_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T, N'_1, \overline{z})\) by Lemma 14, from which the result follows. \(\square \)
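Operationally, the induction of Lemma 5 peels frontier nodes off \(N_1\) one at a time until \(N_2\) remains. A minimal sketch, assuming the tree is encoded as parent pointers (all names are illustrative):

```python
# Shrink the connex set n1 down to the connex subset n2 by repeatedly
# removing a frontier node of n1 that is not in n2; such a node always
# exists when n2 is a strict connex subset of n1 (as argued in the proof).

def shrink_connex(parent, n1, n2):
    n1 = set(n1)
    while n1 != set(n2):
        # a frontier node of n1: none of its children lies in n1
        n = next(x for x in n1 - set(n2)
                 if all(parent.get(c) != x for c in n1))
        n1.remove(n)
    return n1
```

Each removal corresponds to one application of Lemma 14 on the associated triplets.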

1.4 Proof of Lemma 6

Lemma 6 Let \(H_1\) and \(H_2\) be two hypergraphs such that for all \(e \in H_2\) there exists \(\ell \in H_1\) such that \(e \subseteq \ell \). Then, \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1, \overline{z}, \Theta )\), for every hyperedge \(\overline{z}\) and set of predicates \(\Theta \).

Proof

The proof is by induction on k, the number of hyperedges in \(H_2 {\setminus } H_1\). In the base case where \(k = 0\), the result trivially holds since \(H_1 \cup H_2 = H_1\) and the two triplets are hence identical. For the induction step, assume that \(k > 0\) and the result holds for \(k -1\). Fix some \(e \in H_2 {\setminus } H_1\) and define \(H'_2 = H_2 {\setminus } \{e\}\). Then, \(|H'_2 {\setminus } H_1| = k -1\). We show that \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1 \cup H'_2, \overline{z}, \Theta )\), from which the result follows since \((H_1 \cup H'_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1, \overline{z}, \Theta )\) by the induction hypothesis. To this end, we observe that, by assumption, there exists \(\ell \in H_1\) with \(e \subseteq \ell \); note that \(\ell \not = e\) since \(e \not \in H_1\). Therefore, \({\textit{jv}}_{(H_1 \cup H_2, \overline{z}, \Theta )}(e) \subseteq e \subseteq \ell \). Moreover, \(e {\setminus } \ell = \emptyset \). Therefore, \(\textit{ext}_{(H_1\cup H_2, \overline{z}, \Theta )}(e {\setminus } \ell ) = \emptyset \subseteq \ell \). Thus \(e \sqsubseteq _{(H_1 \cup H_2, \overline{z}, \Theta )} \ell \). We may, therefore, apply (CSE) to remove e from \(H_1 \cup H_2\), yielding \(H_1 \cup H'_2\). Since no predicate shares variables with \(e {\setminus } \ell = \emptyset \), this does not modify \(\Theta \). Therefore, \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1 \cup H'_2, \overline{z}, \Theta )\). \(\square \)
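The induction in this proof amounts to deleting the subsumed hyperedges of \(H_2\) one by one. A minimal sketch, assuming hyperedges are modeled as frozensets of variables (the function name is our own):

```python
# Repeatedly apply (CSE) to drop every hyperedge of h2 \ h1 from
# h1 ∪ h2; by the lemma's premise, each such edge e is contained in
# some l in h1, so the removal is sound and no predicate is affected.

def drop_subsumed(h1, h2):
    result = set(h1) | set(h2)
    for e in set(h2) - set(h1):
        assert any(e <= l for l in h1), "premise of Lemma 6 violated"
        result.discard(e)
    return result
```

The end result equals \(H_1\), matching the lemma's conclusion.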

Proofs of Section 5.2

Lemma 7 Let n be a violator of type 1 in \((T,N)\) and assume \((T,N) \xrightarrow {1,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to \((T,N)\). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in \((T,N)\).

Proof

The lemma follows from the following observations. (1) It is straightforward to verify that \(T'\) is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node p) continue to have a guard child; preserves the connectedness condition for the relocated children of n, because every variable in n is present on the entire path between n and p; and keeps all edge labels valid (for the relocated nodes, this is because \(\textit{var}(p) = \textit{var}(g) \subseteq \textit{var}(n)\)).

(2) \(N'\) is a connex subset of \(T'\) because the subtree of T induced by N equals the subtree of \(T'\) induced by \(N'\), modulo the replacement of l by p in case l was in N (and p is hence in \(N'\)).

(3) \((T, N)\) is equivalent to \((T', N')\) because the construction leaves leaf atoms untouched, preserves edge labels, and \(\textit{var}(N) = \textit{var}(N')\). The latter is clear if \(l \not \in N\), because then \(N = N'\). Otherwise, it follows from the fact that \(\textit{var}(l) = \textit{var}(p)\), in which case \(N'=N{\setminus } \{l\} \cup \{p\}\).

(4) All nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) (and their descendants) are relocated to p in \(T'\). Therefore, n is no longer a violator in \((T', N')\). Because we do not introduce new violators, the number of violators of \((T', N')\) is strictly smaller than the number of violators of \((T, N)\). \(\square \)

Lemma 8 Let n be a violator of type 2 in \((T,N)\) and assume \((T,N) \xrightarrow {2,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to \((T,N)\). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in \((T,N)\).

Proof

The lemma follows from the following observations. (1) It is straightforward to verify that \(T'\) is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node p) continue to have a guard child; preserves the connectedness condition for the relocated children of n, because every variable in n is also present in p, their new parent; and keeps all edge labels valid (for the relocated nodes, this is because \(\textit{var}(p) = \textit{var}(n)\)).

(2) \(N'\) is a connex subset of \(T'\) because (i) the subtree of T induced by N equals the subtree of \(T'\) induced by \(N' {\setminus } \{p\}\), (ii) \(n \in N\), and (iii) p is a child of n in \(T'\). Therefore, \(N'\) must be connex.

(3) \((T, N)\) is equivalent to \((T', N')\) because the construction leaves leaf atoms untouched, preserves edge labels, and \(\textit{var}(N) = \textit{var}(N')\). The latter follows because \(\textit{var}(N') = \textit{var}(N \cup \{p\})\) and because \(\textit{var}(p) = \textit{var}(n) \subseteq \textit{var}(N)\) since \(n \in N\).

(4) All nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) (and their descendants) are relocated to p in \(T'\). Therefore, n is no longer a violator in \((T', N')\). Because we do not introduce new violators, the number of violators of \((T', N')\) is strictly smaller than the number of violators of \((T, N)\). \(\square \)

Description of competing systems

DBToaster DBToaster (henceforth denoted DBT) is a state-of-the-art implementation of HIVM. It operates in pull-based mode and can deal with randomly ordered update streams. DBT is particularly meticulous in that it materializes only useful views, which makes it an interesting implementation for comparison. It has been extensively tested on equijoin queries and has proven to be more efficient than a commercial database management system, a commercial stream processing system, and an IVM implementation [30]. DBT compiles given SQL statements into executable trigger programs in different programming languages. We compare against the programs generated in Scala from DBToaster Release 2.2, which uses actors to generate events from the input files. During our experiments, however, we found that this creates unnecessary memory overhead. For a fair memory-wise comparison, we have, therefore, removed these actors.

Esper Esper (E) is a CER engine with a relational model based on Stanford STREAM [4]. It is push-based and can deal with randomly ordered update streams. We use the Java-based open-source implementation for our comparisons. Esper processes queries expressed in the Esper event processing language (EPL).

SASE SASE (SE) is an automaton-based CER system. It operates in push-based mode and can deal with temporally ordered update streams only. We use the publicly available Java-based implementation of SASE. This implementation does not support projections. Furthermore, since SASE requires queries to specify a match semantics (any match, next match, partition contiguity) but does not allow combinations of such semantics, we can only express queries \(Q_1\), \(Q_2\), and \(Q_4\) in SASE. Hence, we compare against SASE for these queries only. To be coherent with our semantics, the corresponding SASE expressions use the any-match semantics [3].

Tesla/T-Rex Tesla/T-Rex (T) is also an automaton-based CER system. It operates in push-based mode only and supports temporally ordered update streams only. We use the publicly available C-based implementation. This implementation operates in a publish-subscribe model in which events are published by clients to a server, known as TRexServer. Clients can subscribe to receive recognized composite events. Tesla cannot deal with queries involving inequalities on multiple attributes (e.g., \(Q_3\)); therefore, we do not show results for \(Q_3\). Since Tesla works in a decentralized manner, we measure the update processing time by logging the time at the TRexServer from the start of the stream until its end.

ZStream ZStream (Z) is a CER system based on a relational internal architecture. It operates in push-based mode and can deal with temporally ordered update streams only. ZStream is not available publicly. Hence, we have created our own implementation following the lazy evaluation algorithm described in the original paper [31]. This paper does not describe how to treat projections, and as such, we compare against ZStream only for the full join queries \(Q_1\)–\(Q_8\).

Cite this article

Idris, M., Ugarte, M., Vansummeren, S. et al. General dynamic Yannakakis: conjunctive queries with theta joins under updates. The VLDB Journal 29, 619–653 (2020). https://doi.org/10.1007/s00778-019-00590-9
