F-IVM: Analytics over Relational Databases under Updates

This article describes F-IVM, a unified approach for maintaining analytics over changing relational data. We exemplify its versatility in four disciplines: processing queries with group-by aggregates and joins; learning linear regression models using the covariance matrix of the input features; building Chow-Liu trees using pairwise mutual information of the input features; and matrix chain multiplication. F-IVM has three main ingredients: higher-order incremental view maintenance; factorized computation; and ring abstraction. F-IVM reduces the maintenance of a task to that of a hierarchy of simple views. Such views are functions mapping keys, which are tuples of input values, to payloads, which are elements from a ring. F-IVM also supports efficient factorized computation over keys, payloads, and updates. Finally, F-IVM treats uniformly seemingly disparate tasks. In the key space, all tasks require joins and variable marginalization. In the payload space, tasks differ in the definition of the sum and product ring operations. We implemented F-IVM on top of DBToaster and show that it can outperform classical first-order and fully recursive higher-order incremental view maintenance by orders of magnitude while using less memory.


Introduction
Supporting modern applications that rely on accurate and real-time analytics computed over large and continuously evolving databases is a challenging data management problem [7]. Special cases are the classical problems of incremental view maintenance (IVM) [18,33] and stream query processing [4,37].
Recent efforts studied the problem of computing machine learning (ML) tasks over static databases. The predominant approach loosely integrates the database system with the statistical packages [27,57,40,56,35]: First, the database system computes the input to the statistical package by joining the database relations. It then exports the join result to the statistical package for training ML models. This approach precludes real-time analytics due to the expensive export/import steps. Systems like Morpheus [36] and LMFAO [61] push the ML task inside the database and learn ML models over static normalized data. In particular, LMFAO, and its precursors F [60] and AC/DC [31], decompose the task of learning classification and regression models over arbitrary joins into factorized computation of aggregates over joins and fixpoint computation of model parameters. This factorization may significantly lower the complexity by avoiding the computation of Cartesian products lurking within joins [8,54]. Both the tight integration of the database computation step and of the statistical computation step and the factorized computation are prerequisites for real-time analytics.
This article describes F-IVM, a unified approach for maintaining analytics over changing relational data. We exemplify its versatility in four disciplines: processing queries with group-by aggregates and joins; learning linear regression models using the covariance matrix of the input features; building Chow-Liu trees using the pairwise mutual information matrix of the input features; and matrix chain multiplication. F-IVM was introduced in prior work [47]. This article revisits and extends this prior work with: a more refined analysis of F-IVM for q-hierarchical and free-connex acyclic queries in the presence of functional dependencies; the covariance ring over continuous and categorical features; an overview of the design of F-IVM; and further experiments on: the covariance matrix; end-to-end linear regression models; Chow-Liu trees; q-hierarchical queries with eager and lazy approaches and payloads carrying the listing or the factorized representation of the query result; and path queries of increasing length on graph data to stress-test the scalability of the IVM engines.
F-IVM has three main ingredients: higher-order incremental view maintenance (IVM); factorized computation and data representation; and ring abstraction.
The first ingredient reduces the maintenance task to that of a hierarchy of simple views. Such views are functions mapping keys, which are tuples of input values, to payloads, which are elements from a ring. In contrast to classical (first-order) IVM, which computes changes to the query result on the fly and does not use extra views, F-IVM can significantly speed up the maintenance task and lower its complexity by using carefully chosen views. Yet F-IVM can use substantially fewer views than the fully-recursive IVM, which is used by the state-of-the-art IVM system DBToaster [33]. In our experiments, F-IVM outperforms first-order and higher-order IVM by up to two orders of magnitude in both runtime and memory requirements.
The second ingredient supports efficient computation and representation for keys, payloads, and updates. F-IVM exploits insights from query evaluation algorithms with best known complexity and optimizations that push aggregates past joins [8,54,6]. It can process bulk updates expressed as low-rank decompositions [34,63] and maintain a factorized representation of query results, which is essential to achieve low complexity for free-connex acyclic and q-hierarchical queries.
The third ingredient allows F-IVM to treat uniformly seemingly disparate tasks. In the key space, all tasks require joins and variable marginalization. In the payload space, tasks differ in the ring operations. To maintain linear regression models and Chow-Liu trees under updates, F-IVM uses a new ring that captures the maintenance of a covariance matrix over continuous and categorical features from the input database. Furthermore, it composes rings to capture the data-dependent computation for complex analytics. Thanks to the ring abstraction, F-IVM is highly extensible: efficient maintenance for new analytics over relational databases is readily available as long as they come with appropriate sum and product ring operations.

F-IVM by Example
Consider the following SQL query over a database D with relations R(A, B), S(A, C, E), and T(C, D):

SELECT A, C, SUM(B * D * E)
FROM R NATURAL JOIN S NATURAL JOIN T
GROUP BY A, C;
A naïve query evaluation approach first computes the join and then the aggregate. This takes O(N^3) time, where N is the size of D. An alternative approach exploits the distributivity of SUM over multiplication to partially push the aggregate past joins and then combine the partial aggregates. For instance, one such partial sum over S can be expressed as the view V_S:

SELECT A, C, SUM(E) AS S_E FROM S GROUP BY A, C;

In the view V_S, we identify keys, which are tuples over (A, C), and payloads, which are aggregate values S_E. Similarly, we compute partial sums over R and T as views V_R and V_T. These views are joined as depicted by the view tree in Figure 1, which is akin to a query plan with aggregates pushed past joins. This view tree computes the result of Q in O(N) time.
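As an illustration, this partial-sum strategy can be sketched in Python, with views as dictionaries mapping keys to payloads; the relation contents below are hypothetical:

```python
from collections import defaultdict

# Hypothetical contents for R(A, B), S(A, C, E), T(C, D); B, D, E are numeric.
R = [("a1", 1.0), ("a1", 2.0), ("a2", 5.0)]
S = [("a1", "c1", 2.0), ("a1", "c1", 3.0), ("a2", "c2", 1.0)]
T = [("c1", 4.0), ("c2", 7.0)]

# Partial sums pushed past the joins: each view maps keys to SUM payloads.
V_R = defaultdict(float)            # A      -> SUM(B)
for a, b in R:
    V_R[a] += b
V_S = defaultdict(float)            # (A, C) -> SUM(E)
for a, c, e in S:
    V_S[(a, c)] += e
V_T = defaultdict(float)            # C      -> SUM(D)
for c, d in T:
    V_T[c] += d

# Join the partial aggregates: Q[(a, c)] = SUM(B * D * E) per group (a, c).
Q = {(a, c): V_R[a] * se * V_T[c]
     for (a, c), se in V_S.items() if a in V_R and c in V_T}
print(Q)  # {('a1', 'c1'): 60.0, ('a2', 'c2'): 35.0}
```

Each view is built in one linear pass, so the whole computation stays O(N).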
Consider now the problem of learning, for each pair (a, c) of (A, C)-values in the natural join of R, S, and T, a linear function f_{a,c} with parameters θ_0, θ_D, and θ_E that predicts the label B given the features D and E:

f_{a,c}(D, E) = θ_0 + θ_D · D + θ_E · E

Our insight is that the same view tree in Figure 1 can compute the gradient vector used for learning f_{a,c}, where we replace the SQL SUM and * operators. As shown in Section 8.1, the gradient of the square loss objective function needs the computation of three types of aggregates: the scalar c that is the count aggregate SUM(1); the vector s of linear aggregates SUM(i), for i ∈ {B, D, E}; and the matrix Q of quadratic aggregates SUM(i * j), for i, j ∈ {B, D, E}. These aggregates capture the correlation between the features and the label.
We treat these aggregates as one compound aggregate (c, s, Q) so we can share computation across them. This compound aggregate can be partially pushed past joins similarly to the SUM aggregate discussed before. Its values are carried in the key payloads of views in the view tree from Figure 1. For instance, the partial compound aggregate (c_T, s_T, Q_T) at the view V_T computes, for each C-value, the count, sum, and sum of squares of the D-values in T. Similarly, the partial aggregate (c_S, s_S, Q_S) at the view V_S computes, for each pair (A, C), the count, sum, and sum of squares of E-values in S. In the view V_ST, which is the join of V_T and V_S, each key (a, c) is associated with the multiplication of the payloads for the keys c in V_T and (a, c) in V_S. This multiplication works on compound aggregates: the scalar c_ST is the arithmetic product of c_T and c_S; the vector of linear aggregates s_ST is the sum of the scalar-vector products c_T·s_S and c_S·s_T; finally, the matrix Q_ST of quadratic aggregates is the sum of the scalar-matrix products c_T·Q_S and c_S·Q_T and of the outer products s_T·s_S^⊤ and s_S·s_T^⊤. Our approach shares the computation across the aggregates: the scalar aggregates are used to scale up the linear and quadratic aggregates, while the linear aggregates are used to compute the quadratic aggregates.
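A minimal sketch of this payload multiplication, assuming compound aggregates are Python tuples (count, vector-as-list, matrix-as-nested-lists); the symmetric rule mirrors the one described above:

```python
def cov_mult(x, y):
    """Product of two compound aggregates (c, s, Q): c is a scalar count,
    s a vector of linear sums, Q a matrix of quadratic sums."""
    c1, s1, Q1 = x
    c2, s2, Q2 = y
    n = len(s1)
    c = c1 * c2
    # Scalars scale up the linear aggregates of the other operand.
    s = [c1 * s2[i] + c2 * s1[i] for i in range(n)]
    # Scalars scale the quadratic aggregates; linear aggregates contribute
    # the two symmetric outer products.
    Q = [[c1 * Q2[i][j] + c2 * Q1[i][j] + s1[i] * s2[j] + s2[i] * s1[j]
          for j in range(n)] for i in range(n)]
    return (c, s, Q)

# Payloads of one T-tuple with D = 3 and one S-tuple with E = 2, over the
# feature order (D, E):
pT = (1, [3.0, 0.0], [[9.0, 0.0], [0.0, 0.0]])
pS = (1, [0.0, 2.0], [[0.0, 0.0], [0.0, 4.0]])
print(cov_mult(pT, pS))  # (1, [3.0, 2.0], [[9.0, 6.0], [6.0, 4.0]])
```

The product payload carries exactly the count, linear, and quadratic sums of the single joint tuple (D = 3, E = 2), e.g., SUM(D * E) = 6.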
We now turn to incremental view maintenance. F-IVM operates over view trees. Whereas for non-incremental computation we only materialize the top view in the tree and the input relations, for incremental computation we may materialize additional views to speed up the maintenance task. Our approach is an instance of higher-order IVM, where an update to one relation may trigger the maintenance of several views.
Figure 1 shows the leaf-to-root maintenance paths under changes to S and T. For updates δS to S, each delta view δV_S, δV_ST, and δQ is computed using delta rules. An update may consist of both inserts and deletes, which are encoded as keys with positive and, respectively, negative payloads. For the count aggregate, the payload is 1 for an insert and −1 for a delete. For the compound aggregate, the payload is (1, 0_{5×1}, 0_{5×5}) for an insert and (−1, 0_{5×1}, 0_{5×5}) for a delete, where 0_{n×m} is the n-by-m matrix with all zero values. F-IVM materializes and maintains views depending on the update workload. For updates to all input relations, it materializes each view in the view tree. For updates to R only, it materializes V_ST; for updates to S only, it materializes V_R and V_T; for updates to T only, it materializes V_R and V_S. F-IVM takes constant time for updates to S and linear time for updates to R and T; these complexities are in the number of distinct keys in the views. In contrast, first-order IVM computes one delta query per updated relation and uses no extra views. It takes linear time for updates to any of the three relations for our example query. The fully-recursive higher-order IVM constructs a view tree for each delta query, and hence overall more views, including the view materializing the join of V_R, V_S, and δT.
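The encoding of inserts and deletes as signed payloads can be sketched as follows for a count-aggregate view (names and contents are illustrative):

```python
def apply_delta(view, delta):
    """Union a count-aggregate view with a delta relation: payloads add up,
    and keys whose payload reaches 0 disappear from the view."""
    for key, payload in delta.items():
        view[key] = view.get(key, 0) + payload
        if view[key] == 0:
            del view[key]
    return view

V = {("a1",): 2}                              # two tuples with A = a1
apply_delta(V, {("a1",): -1, ("a2",): 1})     # delete one a1, insert one a2
print(V)  # {('a1',): 1, ('a2',): 1}
```

A single union operation thus handles mixed batches of inserts and deletes uniformly.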
F-IVM thus needs the same view tree and views for our query with one SUM aggregate and even for the learning task with its ten SUM aggregates. In contrast, first-order IVM needs to compute a distinct delta query for each of these aggregates for updates to any of the three relations. DBToaster, the state-of-the-art fully recursive IVM, computes 31 views: ten top views and 21 auxiliary ones. Whereas F-IVM shares the computation across these aggregates, the other IVM approaches do not. This significantly widens the performance gap between F-IVM and its competitors.

Overview of the F-IVM System
Figure 2 overviews the main components of F-IVM, annotated with the numbers of the sections where they are discussed. Applications, e.g., database analytics, training linear regression models and Chow-Liu trees, and linear algebra computation, rely on queries with natural joins and group-by aggregates, where each aggregate is expressed using the sum and product operations in a ring. In particular, Section 8 introduces the covariance ring over continuous and categorical features. Queries and rings serve as input to F-IVM, together with a stream of updates (tuple inserts and deletes) to the underlying database. Section 3 details the data model, the query language supported by F-IVM, and the ring algebraic structure.
The logical optimizer creates a variable order for the input query (Section 4). This is akin to a query plan, albeit expressed as a partial order on the query variables as opposed to a partial order on the relations to join. Classical query evaluation uses query plans that dictate the order in which the relations are joined. F-IVM uses variable orders that dictate the order in which the variables are marginalized. For each join variable, all relations with that variable are joined. This choice is motivated by the observation that relation-at-a-time query plans are suboptimal in general, whereas evaluation by variable orders is worst-case optimal [43].
Finding a good variable order for a given query is a computationally hard problem. For q-hierarchical queries [12], we can efficiently find variable orders that allow for maintenance with the best guarantees in terms of update time and the time to present the updated query result to the user (Section 7). This also applies to queries that become q-hierarchical on databases satisfying functional dependencies.
Given a variable order for a query, the physical optimizer creates a view tree (Section 4), which is a tree of views to support the maintenance and output enumeration of the query. Updates to base relations are propagated bottom-up in the tree, while output enumeration requires top-down access in the view tree. Depending on which base relations are updatable (dynamic) or non-updatable (static), F-IVM decides to materialize and maintain views in the view tree to support efficient propagation of the updates and avoid recomputation. Section 5 discusses the view materialization problem, whereas Section 6 discusses efficient update propagation.
Each view is accessed via indices with key-payload entries. Its primary index is a hash map over all its keys (Section 3). F-IVM may also need secondary and even tertiary indices, which are hash maps over different subsets of its keys. Such indices are updated lazily: the index updates are buffered and only executed when index access is required. The views for q-hierarchical queries require the primary indices to support updates that are propagated bottom-up in the view tree, and secondary indices to support output enumeration that proceeds top-down in the view tree (Section 7). F-IVM implements equality-based joins using in-memory hash-based join operators. Aggregation is performed using variable marginalization. To marginalize a variable, F-IVM enumerates the entries with the same key, except for the marginalized variable, and applies the aggregation on these entries on the fly.
Given a view tree and a ring specification for each variable to be marginalized, the compiler outputs code in DBToaster's intermediate representation language M3. DBToaster has its own optimizer and compiler that turns M3 code into highly optimized C++ code. This code takes the stream of input data updates, maintains the views, and enumerates the query output, relying on DBToaster's runtime library for data ingestion.

Data Model and Query Language
The data model of F-IVM is based on relations over rings, and its query language allows for natural joins and group-by aggregates over such relations.

Definition 1. A ring (D, +, *, 0, 1) is a set D with two closed binary operations + and *, the additive identity 0, and the multiplicative identity 1 such that for all a, b, c ∈ D, the following axioms are satisfied: + is associative and commutative with identity 0, and each a ∈ D has an additive inverse −a; * is associative with identity 1; and * distributes over +, i.e., a * (b + c) = a * b + a * c and (a + b) * c = a * c + b * c.

A schema S is a set of variables. Let Dom(X) denote the domain of a variable X. A tuple t over schema S has the domain Dom(S) = ×_{X∈S} Dom(X). The empty tuple () is the tuple over the empty schema. Let (D, +, *, 0, 1) be a ring. A relation R over schema S and the ring D is a function R : Dom(S) → D mapping tuples over schema S to values in D such that R[t] ≠ 0 for finitely many tuples t. The tuple t is called a key, while its mapping R[t] is the payload of t in R. We use sch(R) to denote the schema of R. The statement t ∈ R tests whether R[t] ≠ 0. The size |R| of R is the size of the set {t | t ∈ R}, which consists of all keys with non-0 payloads. A database D is a collection of relations over the same ring. Its size |D| is the sum of the sizes of its relations. This data model is in line with prior work on K-relations over provenance semirings [25], generalized multiset relations [32], and factors over semirings [6].
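Under these definitions, a relation over the Z ring can be sketched as a Python dictionary from key tuples to non-zero integer payloads (contents hypothetical):

```python
# A relation over the ring (Z, +, *, 0, 1): a finite map from keys (tuples)
# to non-zero integer payloads.
R = {("a1", "b1"): 1, ("a1", "b2"): 2}

def member(R, t):
    """t ∈ R holds iff R[t] != 0; absent keys implicitly map to 0."""
    return R.get(t, 0) != 0

size = len(R)   # |R| counts the keys with non-zero payloads
print(member(R, ("a1", "b1")), member(R, ("a2", "b1")), size)  # True False 2
```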
Each relation or materialized view R over schema S is implemented as a hash map or a multidimensional array that stores key-payload entries (t, R[t]) for each tuple t with R[t] ≠ 0. The data structure can: (1) look up, insert, and delete entries in amortized constant time, and (2) enumerate all stored entries in R with constant delay, i.e., the following times are constant: (i) the time between the start of the enumeration and outputting the first tuple, (ii) the time between outputting any two consecutive tuples, and (iii) the time between outputting the last tuple and the end of the enumeration [21]. For a schema X ⊂ S, we use an index data structure that for any t ∈ Dom(X) can: (3) enumerate all tuples in σ_{X=t} R with constant delay, (4) check t ∈ π_X R in amortized constant time, and (5) insert and delete index entries in amortized constant time.
We give a hash-based example data structure that supports the above operations with the stated complexities. Consider a relation R over schema S. A hash table with chaining stores key-payload entries of the form (t, R[t]) for each tuple t over S with R[t] ≠ 0. The entries are doubly linked to support enumeration with constant delay. The hash table can report the number of its entries in constant time and supports lookups, inserts, and deletes in amortized constant time. To support index operations on a schema X ⊂ S, we create another hash table with chaining where each table entry stores an X-value t as key and a doubly-linked list of pointers to the entries in R having t as X-value. Looking up an index entry given t takes amortized constant time, and its doubly-linked list enables enumeration of the matching entries in R with constant delay. Inserting an index entry into the hash table additionally prepends a new pointer to the doubly-linked list for a given t; overall, this operation takes amortized constant time. For efficient deletion of index entries, each entry in R also stores back-pointers to its index entries (one back-pointer per index for R). When an entry is deleted from R, locating and deleting its index entries in doubly-linked lists takes constant time per index.
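A simplified Python sketch of this indexing scheme; Python lists stand in for the doubly-linked lists with back-pointers, so deletes here are not constant-time as in the scheme above:

```python
from collections import defaultdict

class IndexedRelation:
    """A relation with a secondary index on a subset of its schema. The index
    maps each projection value to the matching keys, so sigma_{X=t} R can be
    enumerated with constant delay."""
    def __init__(self, schema, index_on):
        self.schema = schema
        self.pos = [schema.index(x) for x in index_on]
        self.data = {}                      # key tuple -> payload
        self.index = defaultdict(list)      # X-value   -> keys with that value

    def insert(self, key, payload):
        self.data[key] = self.data.get(key, 0) + payload
        self.index[tuple(key[p] for p in self.pos)].append(key)

    def matches(self, t):                   # enumerate sigma_{X=t} R
        for key in self.index.get(t, []):
            if self.data.get(key, 0) != 0:
                yield key

S = IndexedRelation(("A", "C", "E"), index_on=("A",))
S.insert(("a1", "c1", 2), 1)
S.insert(("a1", "c2", 3), 1)
S.insert(("a2", "c1", 4), 1)
print(list(S.matches(("a1",))))  # [('a1', 'c1', 2), ('a1', 'c2', 3)]
```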
Query Language. We consider queries with natural joins and group-by aggregates of the form:

SELECT X_1, ..., X_f, SUM(g_{f+1}(X_{f+1}) * ... * g_m(X_m))
FROM R_1 NATURAL JOIN ... NATURAL JOIN R_n
GROUP BY X_1, ..., X_f;

The group-by variables X_1, ..., X_f are free, while the other variables X_{f+1}, ..., X_m are bound. The SUM aggregate values are from a ring (D, +, *, 0, 1). The SUM operator uses the addition + from D. Further aggregates can be expressed using the sum and product operations from the ring. A lifting function g_k : Dom(X_k) → D, for f < k ≤ m, maps X_k-values to elements in D: when marginalizing X_k, we aggregate the values g_k(x) from D and not the values x from Dom(X_k).
Instead of the verbose SQL notation, we use the following more compact encoding:

Q[X_1, ..., X_f] = ⊕_{X_{f+1}} ... ⊕_{X_m} (R_1[S_1] ⊗ ... ⊗ R_n[S_n])

where ⊗ is the join operator, ⊕_{X_{f+1}} is the aggregation operator that marginalizes over the variable X_{f+1}, and each relation R_i is a function mapping keys over schema S_i to payloads in D. We also need a union operator ⊎ to express updates (inserts and deletes) to relations. Given a ring (D, +, *, 0, 1), relations R and S over schema S_1 and a relation T over schema S_2, a variable X ∈ S_1, and a lifting function g_X : Dom(X) → D, we define the three operators as follows:

union: (R ⊎ S)[t] = R[t] + S[t], for t ∈ D_1
join: (R ⊗ T)[t] = R[π_{S_1}(t)] * T[π_{S_2}(t)], for t ∈ D_2
aggregation by marginalization: (⊕_X R)[t] = Σ_{t' ∈ D_1 : π_{S_1 \ {X}}(t') = t} R[t'] * g_X(π_X(t')), for t ∈ D_3

where D_1 = Dom(S_1), D_2 = Dom(S_1 ∪ S_2), and D_3 = Dom(S_1 \ {X}), and π_S(t) is a tuple representing the projection of tuple t on the schema S.
Example 4. Consider relations R, S, and T over a ring (D, +, *, 0, 1), where the values r_1, r_2, s_1, s_2, t_1, t_2 are non-0 payloads from D. The operators ⊎, ⊗, and ⊕ are akin to union, join, and aggregation (g_A : Dom(A) → D is the lifting function for A). When marginalizing over the bound variables, we apply the same lifting function to these variables. The SQL query can be expressed in our formalism, with the computation of the aggregate SUM(R.B * T.D * S.E) now happening over payloads. □

By using relations over rings, we avoid the intricacies of incremental computation under multiset semantics caused by the non-commutativity of inserts and deletes. We simplify delta processing by representing both inserts and deletes as tuples, with the distinction that they map to positive and, respectively, negative ring values. This uniform treatment allows for simple delta rules for the three operators of our query language.
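The three operators can be sketched over dictionary-encoded relations with integer (Z-ring) payloads; schemas are passed explicitly as lists of variable names, a simplification of the formalism above:

```python
from collections import defaultdict

def union(R, S):                      # (R ⊎ S)[t] = R[t] + S[t]
    out = dict(R)
    for t, p in S.items():
        out[t] = out.get(t, 0) + p
        if out[t] == 0:
            del out[t]
    return out

def join(R, schR, T, schT):           # (R ⊗ T)[t] = R[π_S1 t] * T[π_S2 t]
    sch = schR + [x for x in schT if x not in schR]
    out = {}
    for t1, p1 in R.items():
        for t2, p2 in T.items():
            # Match on the shared variables, multiply the payloads.
            if all(t1[schR.index(x)] == t2[schT.index(x)]
                   for x in schR if x in schT):
                key = tuple(t1[schR.index(x)] if x in schR
                            else t2[schT.index(x)] for x in sch)
                out[key] = out.get(key, 0) + p1 * p2
    return out

def marg(R, sch, X, g=lambda v: 1):   # (⊕_X R)[t] = Σ R[t'] * g(t'.X)
    i = sch.index(X)
    out = defaultdict(int)
    for t, p in R.items():
        out[t[:i] + t[i + 1:]] += p * g(t[i])
    return dict(out)

R = {("a1", "b1"): 1, ("a1", "b2"): 1}     # R(A, B)
T = {("a1", "c1"): 1}                      # T(A, C)
V = marg(join(R, ["A", "B"], T, ["A", "C"]), ["A", "B", "C"], "B")
print(V)  # {('a1', 'c1'): 2}
```

With the lifting function g mapping all values to 1, marginalization computes counts, as in the COUNT query discussed earlier.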

Factorized Ring Computation
This section introduces a framework for query evaluation based on factorized computation and data rings. The next section extends it to incremental maintenance.

Variable Orders. Classical query evaluation makes use of query plans that dictate the order in which the relations are joined. We use a different evaluation approach based on variable orders, which dictate the order in which we marginalize each join variable. This approach may require joining several relations at a time if they share the same variable. Our choice is motivated by the complexity of the evaluation problem for join queries: standard (relation-at-a-time) query plans are provably suboptimal, whereas evaluation by variable orders can be worst-case optimal [43]. Given a join query Q, a variable X depends on a variable Y if both are in the schema of a relation in Q.
Definition 6 (adapted from [54]). A variable order ω for a join query Q is a pair (F, dep), where F is a rooted forest with one node per variable in Q, and dep is a function mapping each variable X to a set of variables in F. It satisfies the following constraints:

• For each relation in Q, all of its variables lie along a root-to-leaf path in F.
• For each variable X, dep(X) is the subset of its ancestors in F on which the variables in the subtree rooted at X depend.

For a query Q with free variables, a variable order is free-top if no bound variable is an ancestor of a free variable [29]. Variable orders are a different syntax [54] for hypertree decompositions [24]. They are more natural for algorithms that proceed one variable at a time.
View Trees. Our framework relies on a variable order ω for the input query Q to describe the structure of the computation and to indicate which variable marginalizations are pushed past joins. Based on ω, we construct a tree of views that represents F-IVM's data structure to support query maintenance and enumeration.
Figure 4 gives a function τ that constructs a view tree for a variable order ω and the set F of free variables of the query Q. Without loss of generality, we assume that ω is a single rooted tree; otherwise, we apply τ to each tree in ω to obtain a set of view trees. For simplicity, we assume that ω has first been extended with relations as children under their lowest variable.
The function τ maps the variable order to a view tree of the same tree structure, yet with each variable X replaced by a view V@X_rels[keys]. This notation states that the view V is (recursively) defined over the input relations rels, has free variables keys, and corresponds to the variable X in ω; in the case of a view for an input relation R, we use the simplified notation R[sch(R)].
The base case (leaf in the extended variable order) is that of an input relation: we construct a view that is the relation itself. At a variable X (inner node), we distinguish two cases. If X is a bound variable, then we construct a view that marginalizes out X in the natural join of the views that are children of the current view; we thus first join on X, then apply the lifting function for X on its values, and aggregate X away. If X is a free variable, however, then we retain it in the view schema without applying the lifting function to its values. The schema of the view consists of dep(X) and the free variables in the subtree of ω rooted at X.

Figure 4: Creating a view tree τ(ω, F) for a variable order ω and a set of free variables F.

Example 8. Figure 3 shows the view tree constructed by the function τ from Figure 4 over the variable order ω and the empty set of free variables. Figure 5 depicts the view tree constructed over the same variable order but for the set F = {A, C} of free variables. Figure 6 gives the contents of the views in the view tree from Figure 3, where R, S, and T are relations over a ring D with payloads p_i ∈ D for i ∈ [12]. Assume that D is the Z ring, each tuple in these relations is mapped to 1, i.e., p_i = 1 for i ∈ [12], and the lifting functions map all values to 1. Then, the view tree computes the COUNT query from Example 3, and the root view V@A_RST maps the empty tuple to the overall count 10, which is the number of tuples in the natural join of R, S, and T. □

By default, the function τ in Figure 4 constructs one view per variable in the variable order ω. A wide relation (with many variables) leads to long branches in ω with variables that are only local to this relation. This is, for instance, the case for our retailer dataset used in Section 9. Such long branches create long chains of views, where each view marginalizes one bound variable over its child view in the chain. For practical reasons, we compose such long chains into a single view that marginalizes several variables at a time.
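The construction τ can be sketched as follows, under simplifying assumptions: a variable-order node is a (variable, children) pair with relation names as string leaves, and the view names omit the rels subscript:

```python
def free_vars_below(node, free):
    """Free variables in the subtree rooted at node (relations are leaves)."""
    if isinstance(node, str):
        return set()
    X, children = node
    below = set().union(*(free_vars_below(c, free) for c in children))
    return below | ({X} if X in free else set())

def tau(node, free, dep):
    """Build a view tree: the view at variable X has schema dep(X) plus the
    free variables in the subtree rooted at X; bound variables are
    marginalized away at their own node."""
    if isinstance(node, str):          # leaf: an input relation keeps its schema
        return (node, [])
    X, children = node
    keys = tuple(sorted(set(dep[X]) | free_vars_below((X, children), free)))
    return (f"V@{X}{keys}", [tau(c, free, dep) for c in children])

# Variable order for the running query over R(A,B), S(A,C,E), T(C,D):
omega = ("A", [("B", ["R"]), ("C", [("D", ["T"]), ("E", ["S"])])])
dep = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"C"}, "E": {"A", "C"}}
tree = tau(omega, {"A", "C"}, dep)
print(tree[0])  # V@A('A', 'C')
```

With free variables {A, C}, the root view keeps A and C in its schema, mirroring the view tree of Figure 5.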

Factorized Higher-Order IVM
We introduce incremental view maintenance in our factorized ring computation framework. Unlike evaluation, the incremental maintenance of the query result may require the materialization and maintenance of views. An update to a relation R triggers changes in all views on the path from the leaf R to the root of the view tree.
Updates. The insertion (deletion) of a tuple t into (from) a relation R is expressed as a delta relation δR that maps t to 1 (respectively, −1). In general, δR can be a relation, thus a collection of tuples mapped to payloads. The updated relation is then the union of the old relation and the delta relation: R := R ⊎ δR.
Delta Views. For each view V affected by an update, a delta view δV defines the change in the view content. In case the view V represents a relation R, then δV = δR if there are updates to R and δV = ∅ otherwise. If the view is defined using operators on other views, δV is derived using the following delta rules:

δ(V_1 ⊎ V_2) = δV_1 ⊎ δV_2
δ(V_1 ⊗ V_2) = (δV_1 ⊗ V_2) ⊎ (V_1 ⊗ δV_2) ⊎ (δV_1 ⊗ δV_2)
δ(⊕_X V) = ⊕_X δV

The correctness of the rules follows from the associativity of ⊎ and the distributivity of ⊗ over ⊎; ⊕_X is equivalent to the repeated application of ⊎ for the possible values of X. The derived delta views are subject to standard simplifications: if V is not defined over the updated relation R, then its delta view δV is empty, and we propagate this information using the identities V ⊎ ∅ = V and V ⊗ ∅ = ∅.

Delta Trees. Under updates to one relation, a view tree becomes a delta tree where the affected views become delta views. The function ∆ in Figure 7 replaces the views along the path from the updated relation to the root with delta views. The Optimize method rewrites delta view expressions to exploit factorized updates by avoiding the materialization of Cartesian products and pushing marginalization past joins (see Section 6).
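On a single pair of matching keys, the delta rule for ⊗ reduces to a scalar identity over ring payloads, which a few lines verify:

```python
# Payloads multiply in the ring, so the change of a product obeys
#   δ(a * b) = δa * b + a * δb + δa * δb   (the join delta rule).
a, b = 3, 4        # old payloads of matching keys in V1 and V2
da, db = 2, -1     # their changes under an update
old, new = a * b, (a + da) * (b + db)
delta = da * b + a * db + da * db
print(new - old == delta)  # True
```

The cross term δa * δb is what distinguishes this exact rule from a first-order approximation.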
Example 9. Consider again the query from Example 3, its view tree in Figure 3, and the same relations over the Z ring and the lifting functions that map all values to 1 as in Example 8. An update δT[C, D] mapping a single (C, D)-pair to 3 triggers delta computation at each view from the leaf T to the root of the view tree, up to δV@A_RST = {() → 5}. A single-tuple update to T fixes the values for C and D. Computing δV@D_T then takes constant time. The delta view δV@C_ST iterates over all possible A-values for a fixed C-value, which takes linear time; δV@A_RST incurs the same linear-time cost. A single-tuple update to R or S fixes all variables on a leaf-to-root path in the delta view tree, giving a constant view maintenance cost. □

Figure 7: Creating a delta view tree ∆(τ, δR) for a view tree τ to process an update δR to relation R.

In contrast to classical (first-order) IVM that only requires maintenance of the query result [18], our approach is higher-order IVM, as updates may trigger maintenance of several interrelated views. The fully-recursive IVM scheme of DBToaster [32,33] creates one materialization hierarchy per relation in the query, whereas we use one view tree for all relations. This view tree relies on variable orders to decompose the query into views and factorize its computation and maintenance.
Which Views to Materialize and Maintain? The answer to this question depends on which relations may change. The set of updatable relations determines the possible delta propagation paths in a view tree, and these paths may use materialized views.
Propagating changes along a leaf-to-root path is computationally most effective if each delta view joins with sibling views that are already materialized. Figure 8 gives an algorithm that reflects this idea: given a view tree τ and a set of updatable relations U, the algorithm traverses the tree top-down to discover the views that need to be materialized. The root of the view tree τ is always stored, as it represents the query result. Every other view V_i is stored only if there exists a sibling view V_j defined over an updatable relation.
Example 10. We continue with our query from Example 9. For updates to T only, i.e., U = {T}, we store the root V@A_RST and the views V@E_S and V@B_R used to compute the deltas δV@C_ST and δV@A_RST. Only the root view is affected: it is not necessary to maintain other views. To also support updates to R and S, we need to materialize V@C_ST and V@D_T. If no updates are supported, then only the root view is stored. □

Figure 8: Deciding which views in a view tree τ to materialize in order to support updates to a set of relations U. The notation rels(V_j) denotes the relations under the view V_j in τ.
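The materialization decision of Figure 8 can be sketched as follows, assuming each node carries the set of relations below it (the tree encoding is hypothetical):

```python
def mu(tree, updatable):
    """A node is (name, rels, children), where rels is the set of relations
    below that view. The root is always stored; any other view is stored only
    if some sibling is defined over an updatable relation."""
    stored = {tree[0]}                     # the root is the query result
    def visit(node):
        _, _, children = node
        for i, (name_i, _, _) in enumerate(children):
            if any(rels_j & updatable
                   for j, (_, rels_j, _) in enumerate(children) if j != i):
                stored.add(name_i)
        for c in children:
            visit(c)
    visit(tree)
    return stored

# View tree mirroring Figure 3.
tree = ("V@A_RST", {"R", "S", "T"}, [
    ("V@B_R", {"R"}, [("R", {"R"}, [])]),
    ("V@C_ST", {"S", "T"}, [
        ("V@D_T", {"T"}, [("T", {"T"}, [])]),
        ("V@E_S", {"S"}, [("S", {"S"}, [])])])])
print(sorted(mu(tree, {"T"})))  # ['V@A_RST', 'V@B_R', 'V@E_S']
```

For U = {T}, the sketch stores exactly the views named in Example 10; with no updatable relations, only the root survives.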
For queries with free variables, several views in their (delta) view trees may be identical: this can happen when all variables in their keys are free and thus cannot be marginalized. For instance, a variable order ω for the query from Example 5 may have the variables A and C above all other variables, in which case their views are the same in the view tree for ω. We then store only the top view out of these identical views.
IVM Triggers. For each updatable relation R, F-IVM constructs a trigger procedure that takes as input an update δR and implements the maintenance schema of the corresponding delta view tree. This procedure also maintains all materialized views needed for the given update workload.
A bulk update to several relations is handled as a sequence of updates, one per relation. Update sequences can also arise when updating a relation R that occurs several times in the query. The instances representing the same relation are at different leaves in the delta tree and lead to changes along multiple leaf-to-root paths.

Factorizable Updates
Our focus so far has been on supporting updates represented by delta relations. We next consider an alternative approach that decomposes a delta relation into a union of factorizable relations. The cumulative size of the decomposed relations can be much smaller than the size of the original delta relation. Also, the complexity of propagating a factorized update can be much lower than that of its unfactorized (listing) representation, since the factorization makes explicit the independence between query variables and enables optimizations of delta propagation such as pushing marginalization past joins. Besides factorized view computation, this is the second instance where F-IVM exploits factorization.
Factorizable updates arise in many domains such as linear algebra and machine learning. Section 8 demonstrates how our framework can be used for the incremental evaluation of matrix chain multiplication, recovering prior work on this [45]. Matrix chain computation can be phrased in our language of joins and aggregates, where matrices are binary relations. Changes to one row/column in an input matrix may be expressed as a product of two vectors. In general, an arbitrary update matrix can be decomposed into a sum of rank-1 matrices, each of them expressible as a product of vectors, using low-rank tensor decomposition methods [34,63].
Example 11. Arbitrary relations can be decomposed into a union of factorizable relations. Consider, for instance, the relation R = {(a i, b j) → 1 | i ∈ [n], j ∈ [m]} over schema (A, B): it is the product of the two unary relations {(a i) → 1 | i ∈ [n]} and {(b j) → 1 | j ∈ [m]}. We thus reduced a relation of size nm to two relations of cumulative size n + m. If R were a delta relation, the delta views on top of it would now be expressed over the factorized form of R, and their computation can be factorized as done for queries in Section 4. Product decomposition of relations can be done in linearithmic time in both the number of variables and the size of the relation [50].
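A minimal sketch of this product decomposition, with a relation represented as a Python dict from key tuples to integer payloads (the names `product`, `Ra`, `Rb` are illustrative, not F-IVM's API):

```python
# Sketch: a "rank-1" relation of size n*m stored as the product of two
# unary relations of cumulative size n + m. Payloads are ring elements
# (here plain integers under + and *).

def product(r1, r2):
    """Cartesian product of two keyed relations; payloads multiply."""
    return {k1 + k2: p1 * p2 for k1, p1 in r1.items() for k2, p2 in r2.items()}

n, m = 4, 3
# Listing representation: R = {(a_i, b_j) -> 1} with n*m entries.
R = {(f"a{i}", f"b{j}"): 1 for i in range(n) for j in range(m)}

# Factorized representation: only n + m entries.
Ra = {(f"a{i}",): 1 for i in range(n)}
Rb = {(f"b{j}",): 1 for j in range(m)}

assert product(Ra, Rb) == R          # same relation,
assert len(Ra) + len(Rb) < len(R)    # smaller representation
```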
Consider now a delta relation R′ with R as above. We can decompose each of the two terms in R′ similarly to R, yielding overall n + 2m values instead of nm + m − 1. A different decomposition, with n + m + 3 values, is given by a factorizable over-approximation of R′ compensated by a small product with negative payload. □

The Optimize method used in the delta view tree algorithm in Figure 7 exploits the distributivity of the join ⊗ over the marginalization ⊕X to push the latter past the former and down to the views with variable X. This optimization is reminiscent of pushing aggregates past joins in databases and of variable elimination in probabilistic graphical models [6]. In case the delta views express Cartesian products, they are not materialized but instead kept factorized.
Example 12. Consider the query Q from Example 9 and its view tree in Figure 3. In the delta view tree derived for updates to S, a single-tuple update δS binds the variables A, C, and E, and computing δV @A RST requires O(1) lookups in V @D T and V @B R. An arbitrary-sized update δS can then be processed in O(|δS|) time. Assume now that δS is factorizable as a product of three relations, one per variable A, C, and E. In the construction of the delta view tree, the Optimize method exploits this factorization to push the marginalization past joins at each variable and transforms the top-level delta into a product of three views, one per relation in the factorization of δS. The computation time for this delta is proportional to the sizes of the three views representing the update.

F-IVM for Special Query Classes
This section shows how F-IVM maintains free-connex (α-)acyclic queries [28] and q-hierarchical queries [12]. The analysis for these queries is refined into: (i) the preprocessing phase, where the view tree is constructed; (ii) the enumeration phase, where we present the query result one tuple at a time; and (iii) the update phase, where we update the view tree. The following data complexity claims assume that the ring operations require constant time; otherwise, the complexity results stated in this section have an extra multiplicative factor to account for the complexity of the ring operations.
Theorem 13. Given a database of size N, F-IVM maintains a query Q with O(N) preprocessing time, O(N) single-tuple update time, and O(1) enumeration delay in case Q is free-connex acyclic, and with O(N) preprocessing time, O(1) enumeration delay, and O(1) single-tuple update in case Q is q-hierarchical. Section 7.4 discusses an important extension of our view tree construction to better support cyclic queries.

Free-Connex Acyclic Queries
We first introduce the class of free-connex acyclic queries and then explain how F-IVM maintains them. (Data complexity is a function of the database size.)
Figure 9: ν(free-top variable order ω) : view tree. Creating a view tree for a free-top variable order.
Definition 14 ([68, 13]). A join tree for a query is a tree where each node is a relation and, if any two nodes have variables in common, then all nodes along the path between them also have these variables.
A query is (α-)acyclic if it admits a join tree. A query is free-connex acyclic if it is acyclic and remains acyclic after adding a new relation whose schema consists of the free variables of the query.

Example 15. Consider the query Q[A, B, C] over the relations R(A, B), S(A, C, E), and T(C, D), with free variables A, B, and C. It admits the join tree R − S − T,
where "−" denotes the parent-child relationship. Hence, Q is acyclic.
Consider the triangle query Q △ joining three relations over the schemas (A, B), (B, C), and (A, C), and a candidate tree listing them in this order. The variable A occurs in the first and last relations but not in the middle relation; thus, this tree is not a join tree for Q △. One can show that no tree built from the relations of Q △ is a join tree; hence, Q △ is cyclic. In contrast, the tree T − S − U − R, where U is a fresh relation whose schema (A, B, C) consists of the free variables of Q, is a join tree of Q extended with U; hence, Q is free-connex acyclic. Consider now the variant Q ′ of Q where only the variables B and C are free. Adding a fresh relation U ′ with schema (B, C) to Q ′ turns it into a cyclic query Q ′′ that does not admit a join tree. □ We next detail how F-IVM achieves the complexity from Theorem 13 for a free-connex acyclic query Q.
Figure 10: (left) View tree constructed by the function ν in Figure 9 for the variable order ω in Figure 3; (right) Delta view tree for a single-tuple update to T.
Preprocessing. In the preprocessing phase, we create a view tree that compactly represents the result of Q. Given a variable order, the function τ in Figure 4 constructs a view tree where the root view consists of all tuples over the free variables. While this view allows for constant enumeration delay, it may require superlinear computation and maintenance time, as the free variables may originate from different input relations. We would like to avoid this superlinearity.
To keep the preprocessing and update times linear, we proceed as follows. We construct view trees such that the query result is kept and maintained factorized over several views at the top of the view tree. This approach still allows for constant enumeration delay, using a known enumeration approach for factorized representations [54]. We construct the view tree following a free-top variable order of the query Q and materialize a view over the schema {X} ∪ dep(X) for each variable X in the variable order. A key insight is that every free-connex acyclic query admits a free-top variable order where, for each variable X, the set {X} ∪ dep(X) is covered by the variables of a single relation [11]. This ensures linear preprocessing and maintenance time for all views in view trees following such variable orders.
The function ν in Figure 9 constructs a view tree for a given free-top variable order of a free-connex query. If a variable X has at least two children, it proceeds as follows. It creates at X a view H @X rels with schema {X} ∪ dep(X) that joins the child views of X. If X has at least one sibling, it additionally creates a view V @X rels on top of H @X rels, obtained from H @X rels by marginalizing X. The first view enables efficient enumeration of X-values in the query result given a value tuple over dep(X); the second view enables efficient updates coming from the subtrees rooted at siblings of X. If X has only one child, the creation of the view H @X rels is not needed for efficient enumeration. In this case, the function creates a view V @X rels marginalizing X in the child view if X has siblings.
Example 16. Consider the free-connex acyclic query Q from Example 15. Figure 3 gives a free-top variable order ω for Q. Figure 10 (left) depicts the view tree ν(ω). The view H @C ST can be computed by iterating over the (A, C)-tuples in V @E S and multiplying the payload of each such tuple with the payload of the matching C-value in V @D T. Since each such (A, C)-tuple must be in S, we need to iterate over only linearly many such tuples. Similarly, the view H @A RST can be computed by iterating over the A-values in one of the child views and doing lookups in the other child view to retrieve the payloads. For the computation of both views H @C ST and H @A RST, we iterate over linearly many tuples and do a constant-time lookup for each such tuple. All other views are obtained by marginalizing one variable from their child views. Hence, all views can be computed in linear time. □

Updates. The construction of delta view trees under single-tuple updates is exactly as described by the function ∆ in Figure 7 (Section 5). Since the view trees can be constructed in linear time, the delta view trees can also be constructed in linear time.

Example 17. Consider a single-tuple update to T with C-value c. To compute δH @C ST, we iterate over all A-values paired with c in V @E S. This operation takes linear time with the support of an index on variable C built for this view. We obtain δV @C ST from δH @C ST by marginalizing the variable C. This requires constant time because C is fixed to the constant c. The top delta view δH @A RST is obtained by intersecting the two child views, e.g., by iterating over δV @C ST and doing lookups in V @B R. This requires linear time. We conclude that the delta views can be computed in linear time. □

Figure 11: payload(view tree τ, tuple t) : payload. Computing the payload of a tuple from a view tree.

Enumeration. Consider a view tree τ constructed using the function ν from Figure 9 for a free-top variable order of a query Q. We first describe how to enumerate with constant delay the distinct tuples in the result of Q using τ. Then, we explain how to compute the payload of each result tuple in constant time.
Let X 1, . . ., X n be an ordering of the free variables of the query that is compatible with a top-down traversal of the free-top variable order. We use the views V 1, . . ., V n to enumerate the distinct tuples in the result of Q, where V j is H @X j rels if X j has at least two children and the child view of X j otherwise. We retrieve from V 1 the first X 1-value in the result. When we arrive at a view V j with j > 1, we have already fixed the values of the variables above X j in the variable order. We retrieve from V j the first X j-value paired with these values. Once the values over all free variables are fixed, we have a complete result tuple that we output. Then, we iterate over the remaining distinct X n-values in V n paired with the fixed values over the ancestor variables of X n and output a new tuple for each such value. After all X n-values are exhausted, we backtrack, i.e., we move to the next X n−1-value and restart the iteration over the matching X n-values.
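The nested iteration above can be sketched over toy dict-backed views; the query shape and the views `HA`, `RB`, `HC` are illustrative assumptions, not the paper's data structures:

```python
# Sketch: constant-delay enumeration of a result Q[A, B, C] kept factorized
# in per-variable views. Each next tuple costs O(1): one iterator step.

HA = ["a1", "a2"]                       # top view over A
RB = {"a1": ["b1", "b2"], "a2": ["b3"]}  # B-values per A-value
HC = {"a1": ["c1"], "a2": ["c2", "c3"]}  # C-values per A-value

def enumerate_result():
    for a in HA:            # fix an A-value, then backtrack when exhausted
        for b in RB[a]:     # B-values paired with a
            for c in HC[a]: # C-values paired with a
                yield (a, b, c)

assert list(enumerate_result()) == [
    ("a1", "b1", "c1"), ("a1", "b2", "c1"),
    ("a2", "b3", "c2"), ("a2", "b3", "c3"),
]
```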
Given a complete tuple t constructed from the view tree τ, we use the function payload from Figure 11 to compute its payload. The function first checks whether the schema of the root view is exactly the schema sch(t) of t. If so, it returns the payload of t in this view. Otherwise, the root view covers only a subset of the schema of the tuple. In this case, the function recursively computes the payload for each subtree τ i of the root view and the projection of t onto the variables in τ i. The final payload is the product of the payloads returned for the subtrees. The returned payloads are from the lowest views in the view tree whose schemas consist of free variables only. If all variables are free, then these lowest views are the input relations themselves.
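A sketch of this recursive payload computation, mirroring the function from Figure 11 over a toy view tree; the node encoding (schema, view, children) and all names are illustrative:

```python
# Sketch: payload(tree, t) multiplies the payloads of the lowest views
# whose schemas consist of free variables only.

def payload(node, t):
    """Payload of tuple t (a dict var -> value) computed from a view tree.
    A node is (schema, view, children); a view maps key tuples to payloads."""
    schema, view, children = node
    if set(schema) == set(t):
        return view[tuple(t[x] for x in schema)]  # exact cover: one lookup
    # Otherwise multiply the payloads computed recursively in the subtrees,
    # each on the projection of t onto the subtree's variables. The root's
    # own payload is not used (ancestors only track multiplicities).
    result = 1
    for child in children:
        result *= payload(child, {x: t[x] for x in child[0]})
    return result

# Toy tree: root over {A}; children hold payloads for (A,B) and (A,C) keys.
R_view = (("A", "B"), {("a1", "b1"): 2, ("a1", "b2"): 3}, [])
S_view = (("A", "C"), {("a1", "c1"): 5}, [])
root = (("A",), {("a1",): 1}, [R_view, S_view])

assert payload(root, {"A": "a1", "B": "b1", "C": "c1"}) == 10  # 2 * 5
```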
Remark 18. The enumeration procedure needs the payloads of the lowest views whose schemas consist of free variables. The payloads from the views above these views thus need not be maintained, beyond keeping track of the multiplicities of each of their tuples. The maintenance of multiplicities is important for correctness, as it tells whether a tuple is to be removed from a view or still has at least one derivation from the input. For expensive payloads, such as those introduced in Section 8, it is therefore more efficient to only maintain them for the views from the input relations up to the views used to compute the payloads. Their ancestor views only need maintenance of tuple multiplicities. □

Example 19. We enumerate the distinct result tuples of the query Q[A, B, C] from Example 15 using the view tree in Figure 10 (left). We iterate with constant delay over the A-values in H @A RST [A]. For each such A-value a, we iterate with constant delay over the B-values in R[a, B] and over the C-values in H @C ST [a, C].

To efficiently support enumeration and updates, we may need several indices for the views in a view tree for a free-connex acyclic query. Each view (and input relation) in the view tree in Figure 10 (left) needs an index that can retrieve the payload for a given tuple of values over its variables. This is a primary index. For (top-down) enumeration, we may also need a secondary index per view to look up the tuples that have as prefix a tuple of values over the variables shared with the parent view. In case of some views, we may also need a tertiary index to support updates, which are propagated bottom-up. For instance, the view V @E S [A, C] requires: a primary index to retrieve the payload for each (A, C)-tuple; a secondary index to enumerate the C-values paired with a given A-value fixed by the parent view; and a tertiary index to obtain all A-values paired with a given C-value c fixed by the delta of its left sibling δV @D T [c]. All other views only require primary and secondary indices and no tertiary index. □

Q-Hierarchical Queries
Q-hierarchical queries form a strict subclass of the free-connex acyclic queries. They admit linear preprocessing time, constant update time, and constant enumeration delay [12]. Under widely-held complexity-theoretic assumptions, there is no algorithm that achieves constant update time and enumeration delay for queries that are not q-hierarchical and have no repeating relation symbols [12]. F-IVM recovers the aforementioned complexities using exactly the same approach as for free-connex acyclic queries, detailed in Section 7.1. This directly implies linear preprocessing time and constant enumeration delay. Constant update time follows from the following observation: every q-hierarchical query admits a free-top variable order where each root-to-leaf path consists of variables that represent precisely the schema of a relation in the query. A single-tuple update to that relation then sets all these variables to constants, effectively making each delta view along that path of constant size. Our view tree construction also ensures that the computation of each delta view only requires one constant-time lookup per child view. We first define q-hierarchical queries and then show how F-IVM achieves constant-time update for them. For a variable X in a query, we denote by rels(X) the set of relations that contain X in their schema.
A query is hierarchical if for any two variables X and Y, either rels(X) ⊆ rels(Y), rels(Y) ⊆ rels(X), or rels(X) ∩ rels(Y) = ∅. A query is q-hierarchical if it is hierarchical and for any two variables X and Y it holds: if rels(X) ⊃ rels(Y) and Y is free, then X is free.
Every q-hierarchical query admits a canonical free-top variable order, where (1) each root-to-leaf path consists of variables that form the schema of a relation and (2) no bound variable is above a free variable [29]. We can construct such a variable order in polynomial time in the query size as follows. We start with the empty variable order. For each relation R, we add to the variable order a root-to-leaf path made up of R's variables ordered as follows: a variable X is before a variable Y if (1) rels(X) ⊃ rels(Y), or (2) rels(X) ̸⊃ rels(Y), rels(X) ̸⊂ rels(Y), X is free, and Y is bound.
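The comparator behind this per-relation ordering can be sketched as follows, using as an illustrative example a query with relations R(A, B), S(A, C, E), T(A, C, D) and free variables A, B, C (the helper names are assumptions, not the paper's notation):

```python
# Sketch: ordering one relation's variables for the canonical free-top
# variable order, following rules (1) and (2) above.
from functools import cmp_to_key

rels = {"A": {"R", "S", "T"}, "B": {"R"}, "C": {"S", "T"},
        "D": {"T"}, "E": {"S"}}
free = {"A", "B", "C"}

def before(x, y):
    if rels[x] > rels[y]:          # (1) strict superset comes first
        return -1
    if rels[y] > rels[x]:
        return 1
    # (2) incomparable rels sets: free variables come before bound ones
    if x in free and y not in free:
        return -1
    if y in free and x not in free:
        return 1
    return 0

# The root-to-leaf path contributed by relation S(A, C, E):
path_S = sorted(["A", "C", "E"], key=cmp_to_key(before))
assert path_S == ["A", "C", "E"]
```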

Example 22. The free-connex acyclic query Q from Example 15 is not hierarchical: the sets rels(A) = {R, S} and rels(C) = {S, T} are neither disjoint, nor is one included in the other. By extending the schema of T with A, we obtain a q-hierarchical query Q h whose canonical free-top variable order is given in Figure 12 (left).
Figure 13: Delta view trees derived from the view tree in Figure 12 for single-tuple updates to relations R (left) and T (right).
The variant of the query where variable A is bound is hierarchical but not q-hierarchical, because the set rels(A) = {R, S, T} for the bound variable A is a strict superset of the set rels(B) = {R} for the free variable B. □ We next exemplify how F-IVM achieves constant-time update for a q-hierarchical query.
Example 23. Figure 12 shows the view tree (right) modeled on the canonical free-top variable order (left) of the q-hierarchical query Q h in Example 22. Figure 13 shows the delta view trees under single-tuple updates to R and T.
In the delta view tree for R, the delta view δH @A RST can be computed by a constant-time lookup in V @C ST. In the delta view tree for T, the delta views δH @C ST and δH @A RST can be computed by constant-time lookups in V @E S and V @B R, respectively. All other delta views are computed by marginalizing a variable with a single value. □

Remark 24. Q-hierarchical queries admit view trees whose views only need primary indices to support payload lookups and updates, and possibly secondary indices to support enumeration. Consider the view tree in Figure 12. Enumeration proceeds top-down: we iterate over the A-values in the top view and, for each such value a, we look up in R[a, B] to enumerate all the B-values paired with a, and also look up in H @C ST [a, C] to enumerate all C-values paired with a. All these lookups require primary or secondary indices.
Figure 13 shows the delta view trees for single-tuple updates to R and T. To compute a delta view along the path from the delta relation to the root of the delta view tree, we either perform a projection on a delta view or a lookup in the primary index of a sibling view (so with all keys of the index set to constants).□

Queries under Functional Dependencies
Non-hierarchical queries may become hierarchical under functional dependencies (fds) [49]. Given a set Σ of fds, we denote by CLOSURE Σ (S) the closure of the set S of variables under Σ [5]. For instance, given the fds Σ = {A → D; BD → E}, we have CLOSURE Σ ({A, B, C}) = {A, B, C, D, E}.
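The closure computation can be sketched with a simple fixpoint loop; the helper name is illustrative:

```python
# Sketch: CLOSURE_Sigma(attrs) via repeated application of the fds.

def closure(fds, attrs):
    """fds: list of (lhs, rhs) set pairs; attrs: the starting variable set."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is covered, add the right-hand side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# The example above: Sigma = {A -> D; BD -> E}.
sigma = [({"A"}, {"D"}), ({"B", "D"}, {"E"})]
assert closure(sigma, {"A", "B", "C"}) == {"A", "B", "C", "D", "E"}
```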
Definition 25 (adapted from [49]). Given a set Σ of fds and a query Q, the Σ-reduct of Q is the query obtained from Q by replacing the schema S of each relation with CLOSURE Σ (S).

The Σ-reduct of a query is thus another query, where the schema of each relation is extended to include all variables in the closure of this schema under Σ. Since the added variables are functionally determined by the original schema, they do not add more information. So, we could extend these schemas and the underlying database without increasing the number of tuples in the relations. For any database D with fds Σ and a query Q, the query result Q(D) is the same as the result of its Σ-reduct over the extended database. The benefit of this rewriting is that queries may admit free-connex acyclic or even q-hierarchical Σ-reducts. We need not physically extend the database to reap this benefit. Instead, we use the Σ-reduct of Q to infer a free-top variable order, or even a canonical free-top variable order, for Q in case the Σ-reduct is free-connex acyclic or q-hierarchical, respectively. Using this variable order, we construct a view tree for Q that enjoys the same preprocessing, update, and enumeration times as its Σ-reduct. Theorem 13 can be generalized to account for fds.
Figure 14 depicts the hypergraphs of Q and Q ′ (left), a free-top variable order for Q that is also canonical for Q ′ (middle), and the view tree for Q modeled on this variable order (right). Since Q is free-connex acyclic, we can compute the view tree in linear time and enumerate the result tuples of Q with constant delay, as explained in Section 7.1. We next describe how to achieve constant-time updates by exploiting the fds. Figure 15 shows the delta view trees obtained from the view tree for Q for single-tuple updates to R, S, and T.

Cyclic Queries
Our framework supports arbitrary conjunctive queries. Whereas for an acyclic join query the size of each view is asymptotically upper-bounded by the size of the query result, for a cyclic query views may be larger in size than the query result. In prior work [47], we show how to reduce the size of intermediate views for cyclic queries by extending view trees with indicator projections [6]. Such projections have no effect on the query result but can constrain view definitions (e.g., create cycles) and bring asymptotic savings in space and time.
Example 28. We consider the triangle query Q △ joining three relations R, S, and T over the schemas (A, B), (B, C), and (A, C), respectively. Figure 16 shows the hypergraph of Q △ and the view tree constructed for the variable order A − B − C by placing each relation directly under its lowest variable. We assume all relations are of size O(N). Computing the triangle query from scratch using a worst-case optimal join algorithm takes O(N^{3/2}) time [43].
In the given view tree (without the view in red), we first join S and T and then marginalize out C. The resulting view at node C may contain O(N^2) pairs of (A, B)-values, which is larger than the worst-case size O(N^{3/2}). However, by materializing the view at C, we enable single-tuple updates to R in constant time; single-tuple updates to the other relations take O(N) time.
To avoid the large intermediate result at variable C, we can change the view tree by placing the relation R under variable C. Then, joining all three relations at node C takes O(N^{3/2}) time. Updates to any relation now cause recomputation of a 3-way join, as in first-order IVM. For single-tuple updates, recomputing deltas takes O(N) time, as only two of the three variables are bound to constants. In contrast, the first approach trades space for time: we need O(N^2) space but then support O(1) updates to one of the three relations. □

The above example demonstrates how placing a relation under a different node in a view tree can create a cycle of relations and constrain the size of a view. This strategy, however, might not always be feasible or efficient: one relation might form multiple cycles of relations in different parts of a view tree. For example, in a cyclic 4-loop query with a chord, the chord relation W is part of two triangle subqueries. Since this relation cannot be duplicated in multiple subtrees (for correctness, so as to avoid multiplying the same payload several times instead of using it once), one would have to evaluate these subqueries in sequence, which yields a view tree that is higher and more expensive to maintain.
Indicator Projections. Instead of moving relations in a view tree, we extend the tree with indicator projections that identify the active domains of these relations [6].
We define a new unary operation ∃A R that, given a relation R over schema S with payloads from a ring (D, +, *, 0, 1) and a set of attributes A ⊆ S, projects the tuples from R with non-0 payloads onto A and assigns to these tuples the payload 1.
Definition 29 (Indicator Projection). For a relation R over schema S and A ⊆ S, the indicator projection ∃A R is a relation over A such that, for all t ∈ Dom(A): (∃A R)[t] = 1 if there is a tuple t′ in R with non-0 payload that projects onto t over A, and (∃A R)[t] = 0 otherwise.

Indicator projections may change with updates to input relations. For instance, adding a tuple with a new A-value to R enlarges the result of ∃A R; similarly, deleting the last tuple with a given A-value shrinks it. One change in the input may cause at most one change in the output. To facilitate the computation of δ(∃A R), we keep track of how many tuples with non-0 payloads project onto each A-value. When the payload of a tuple in R changes from 0 to non-0 (or vice versa), we increase (decrease) the count of the corresponding A-value. If this count changes from 0 to 1 (meaning the A-value is new) or from 1 to 0 (meaning there are no more tuples with this A-value), then δ(∃A R) contains a tuple of A-values with the payload 1 or −1, respectively; otherwise, the delta is empty.
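The counting scheme for maintaining δ(∃A R) can be sketched as follows; the class and method names are illustrative:

```python
# Sketch: maintaining an indicator projection with per-A-value counts of
# tuples having non-0 payload, emitting a delta only on 0<->1 transitions.
from collections import defaultdict

class IndicatorProjection:
    def __init__(self):
        self.count = defaultdict(int)  # A-value -> # tuples with non-0 payload

    def apply(self, a_value, became_nonzero):
        """Register one tuple of R whose payload turned non-0 (True) or 0
        (False); return the delta on the indicator, or None if it is empty."""
        self.count[a_value] += 1 if became_nonzero else -1
        if became_nonzero and self.count[a_value] == 1:
            return (a_value, +1)   # A-value appears: payload +1
        if not became_nonzero and self.count[a_value] == 0:
            return (a_value, -1)   # last tuple gone: payload -1
        return None                # no change in the indicator

p = IndicatorProjection()
assert p.apply("a1", True) == ("a1", +1)   # first tuple with key a1
assert p.apply("a1", True) is None         # second tuple: no delta
assert p.apply("a1", False) is None        # one tuple still remains
assert p.apply("a1", False) == ("a1", -1)  # count drops to 0
```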
Example 30. Consider a relation R over schema {A, B} with payloads from a ring (D, +, *, 0, 1), and suppose we want to maintain the result of the query Q = ∃A R under updates to R. To process updates efficiently, we count the tuples from R with non-0 payloads for each A-value, denoted by CNT Q [A]. For example, let R = {(a 1, b 1) → r 1, (a 1, b 2) → r 2, (a 2, b 1) → r 3}, where r 1, r 2, and r 3 are non-0 payloads from D. An update δR = {(a 1, b 2) → −r 2} removes the tuple (a 1, b 2) from R, which in turn decreases CNT Q [a 1] by 1. Since there is still a tuple in R that projects onto a 1, the result of Q remains unchanged. A subsequent update {(a 1, b 1) → −r 1} to R drops the count for a 1 to 0, which triggers a change in the output, δQ = {(a 1) → −1}. □

View Trees with Indicator Projections. Figure 17 gives an algorithm that traverses a given view tree recursively and extends it with indicator projections. At each view V @X rels, the algorithm first computes a set I of indicator projections for those relations that share common variables with V @X rels and do not appear in rels, hence do not take part in the view definition. Then, it chooses from this set those indicator projections that form a cycle with the relations in the subtree rooted at V @X rels. To achieve this, it uses a variant of the GYO reduction [10]. Given the hypergraph formed by the hyperedges representing the indicator projections I and the relations rels, GYO repeatedly applies two rules until it reaches a fixpoint: (1) remove a node that only appears in one hyperedge; (2) remove a hyperedge that is included in another hyperedge. If the result of GYO is a hypergraph with no nodes and one empty hyperedge, then the input hypergraph is acyclic. Otherwise, the input hypergraph is cyclic and the output of GYO is a hypergraph with cycles. The GYO variant, dubbed GYO* in the procedure in Figure 17, returns the hyperedges that originated from the indicator projections in I and contribute to this non-empty output hypergraph. The chosen indicator projections become children of V @X rels.
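The acyclicity test underlying this step can be sketched as a plain GYO reduction; the helper name is illustrative, and the GYO* variant would additionally track which surviving hyperedges originated from indicator projections:

```python
# Sketch: GYO reduction on a hypergraph given as a list of variable sets.

def gyo_is_acyclic(edges):
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop nodes occurring in exactly one hyperedge.
        for e in edges:
            solo = {v for v in e if sum(v in f for f in edges) == 1}
            if solo:
                e -= solo
                changed = True
        # Rule 2: drop a hyperedge contained in another hyperedge.
        for i, e in enumerate(edges):
            if any(j != i and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    # Acyclic iff the fixpoint has only empty hyperedges left.
    return all(not e for e in edges)

# The join of R(A,B), S(A,C,E), T(C,D) is acyclic ...
assert gyo_is_acyclic([{"A", "B"}, {"A", "C", "E"}, {"C", "D"}])
# ... while the triangle query is cyclic.
assert not gyo_is_acyclic([{"A", "B"}, {"B", "C"}, {"A", "C"}])
```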
In a view tree with indicator projections, changes in one relation may propagate along multiple leaf-to-root paths. We propagate them in sequence, that is, updates to one relation are followed by a sequence of updates to its indicator projections.

Example 31. The algorithm from Figure 17 extends the view tree of the triangle query with an indicator projection ∃A,B R[A, B] placed below the view V @C ST. This view at C is now a cyclic join of the three relations, which can be computed in O(N^{3/2}) time. The indicator projection also reduces the size of this view to O(N).
Single-tuple updates to S and T still take linear time; however, bulk updates of size O(N) can now be processed in O(N^{3/2}) time, the same as reevaluation. Updates to R might affect the indicator projection: if a single-tuple update δR causes no change in the projection, then incremental maintenance takes constant time; otherwise, joining a tuple δ(∃A,B R) with S and T at node C takes linear time. Bulk updates δR of size O(N) can also be processed in O(N^{3/2}) time. We conclude that using indicator projections in this query takes the best of both approaches from Example 28, namely faster incremental maintenance and a more succinct view representation. □

Applications
This section highlights four applications of F-IVM: learning regression models, building Chow-Liu trees, computing listing or factorized representations of the results of conjunctive queries, and multiplying a sequence of matrices. They behave the same in the key space, yet differ in the rings used to define the payloads.

Covariance Matrix and Linear Regression
We next introduce the covariance matrix ring used for training linear regression models.
Linear Regression. Consider a training dataset that consists of k samples with features (X i) i∈[m−1] and a label X m, arranged into a design matrix M of size k × m; in our setting, this design matrix is the result of a join query. The goal of linear regression is to learn the parameters θ that minimize the least-squares loss, i.e., that bring Mθ as close as possible to 0 k×1, where 0 k×1 is the zero matrix of size k × 1. We can solve this optimization problem using batch gradient descent. This method iteratively updates the model parameters in the direction opposite to the gradient to decrease the squared error loss and eventually converge to the optimal value. Each convergence step iterates over the entire training dataset to update the parameters, θ := θ − αM^T Mθ, where α is an adjustable step size. The complexity of each step is O(mk). The covariance matrix M^T M quantifies the degree of correlation for each pair of features (or feature and label) in the data. Its computation can be done once for all convergence steps [60]. This is crucial for performance in case m ≪ k, as each iteration step now avoids processing the entire training dataset and takes time O(m^2).
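The saving can be sketched in pure Python on a toy design matrix; the tiny data and helper names are illustrative assumptions:

```python
# Sketch: batch gradient descent reusing a precomputed covariance matrix
# M^T M, so each step costs O(m^2) instead of O(m k).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Toy design matrix: k = 4 samples, m = 2 columns.
M = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 7.0]]
C = matmul(transpose(M), M)          # covariance matrix, computed once

theta = [[0.5], [0.5]]
alpha = 0.001
for _ in range(100):
    grad = matmul(C, theta)          # same as M^T (M theta), but O(m^2)
    theta = [[t[0] - alpha * g[0]] for t, g in zip(theta, grad)]

# One step via C matches one step over the full data, up to rounding.
full = matmul(transpose(M), matmul(M, theta))
fast = matmul(C, theta)
assert all(abs(a[0] - b[0]) < 1e-9 for a, b in zip(fast, full))
```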
We next show how to compute the covariance matrix assuming all features have continuous domains; we consider the case with categorical features later on.
The covariance matrix M^T M accounts for the interactions SUM(X * Y) of variables X and Y with continuous domains. We can factorize their computation over training datasets defined by arbitrary join queries [60]. We can further share their computation by casting the covariance matrix computation as the computation of one compound aggregate. This compound aggregate is a triple (c, s, Q), where c is the number of tuples in the training dataset (the size k of the design matrix), s is an m × 1 matrix (or vector) with one sum of values per variable, and Q is an m × m matrix of sums of products of values for any two variables. The covariance matrix computation can be captured by a ring.
Definition 32. Let D be a ring and m ∈ N. Consider the set C of triples (c, s, Q), where c ∈ D, s is an m × 1 matrix over D, and Q is an m × m matrix over D, with the operations
(c1, s1, Q1) +C (c2, s2, Q2) = (c1 + c2, s1 + s2, Q1 + Q2)
(c1, s1, Q1) *C (c2, s2, Q2) = (c1 c2, c2 s1 + c1 s2, c2 Q1 + c1 Q2 + s1 s2^T + s2 s1^T)
0C = (0, 0 m×1, 0 m×m) and 1C = (1, 0 m×1, 0 m×m),
using matrix addition, scalar multiplication, and matrix multiplication over D. We refer to (C, +C, *C, 0C, 1C) as the covariance structure of degree m over D.
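A pure-Python sketch of the covariance structure's operations over scalars (D = R, degree m = 2): the cross terms in *C make the product of two lifted values accumulate all pairwise sums of products. Function names are illustrative, not F-IVM's API:

```python
# Sketch: covariance ring operations over floats, degree m = 2.

def add_C(x, y):                      # componentwise addition
    (c1, s1, Q1), (c2, s2, Q2) = x, y
    return (c1 + c2,
            [a + b for a, b in zip(s1, s2)],
            [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)])

def mul_C(x, y):                      # counts multiply; sums and products mix
    (c1, s1, Q1), (c2, s2, Q2) = x, y
    m = len(s1)
    s = [c2 * s1[i] + c1 * s2[i] for i in range(m)]
    Q = [[c2 * Q1[i][j] + c1 * Q2[i][j] + s1[i] * s2[j] + s2[i] * s1[j]
          for j in range(m)] for i in range(m)]
    return (c1 * c2, s, Q)

def lift(j, x, m=2):                  # g_{X_j}: one scalar into the ring
    s = [0.0] * m
    Q = [[0.0] * m for _ in range(m)]
    s[j], Q[j][j] = x, x * x
    return (1, s, Q)

one_C = (1, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])

# One tuple (x1, x2) = (2, 3): count 1, sums (2, 3), all pairwise products.
p = mul_C(lift(0, 2.0), lift(1, 3.0))
assert p == (1, [2.0, 3.0], [[4.0, 6.0], [6.0, 9.0]])
assert mul_C(one_C, p) == p           # 1_C is the multiplicative identity
```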
Theorem 33.For m ∈ N and a ring D, the covariance structure of degree m over D forms a commutative ring.
Definition 34.The continuous covariance ring of degree m is the covariance structure of degree m over R.
We next show how to use this ring to compute the covariance matrix over a training dataset defined by a join with relations (R i) i∈[n] over variables (X j) j∈[m]. The payload of each tuple in a relation is the identity 1C from the continuous covariance ring of degree m. The query computing the covariance matrix joins the relations and marginalizes all variables, lifting the values of each variable X j with the function g Xj. For each X j-value x, the lifting function is g Xj (x) = (1, s, Q), where s is an m × 1 vector with all zeros except the value x at position j, i.e., s j = x, and Q is an m × m matrix with all zeros except the value x^2 at position (j, j).

Example 35. We show how to compute the covariance matrix using the join and view tree from Figure 3 and the database from Figure 6. We assume the alphabetical order of the five variables in the covariance matrix. The leaf relations R, S, and T map tuples to 1C from the continuous covariance ring of degree 5.
In the view V @D T, each D-value d is lifted to a triple (1, s, Q), where s is a 5 × 1 vector with one non-zero element s 4 = d, and Q is a 5 × 5 matrix with one non-zero element Q (4,4) = d^2. The covariance triples with the same key c are summed up. The views V @B R and V @E S are computed similarly. The view V @C ST joins V @D T and V @E S and marginalizes C; the payload for each key, e.g., a 2, is a sum of products of such triples. The root view V @A RST maps the empty tuple to a single ring element that holds the aggregates for the entire join result: the count of tuples in the result, the vector with one sum of values per variable, and the covariance matrix. □

Linear Regression with Categorical Variables. Real-world datasets consist of both continuous and categorical variables. The latter take on values from predefined sets of possible values (categories). It is common practice to one-hot encode categorical variables as indicator vectors, but this encoding can blow up the size of the covariance matrix and increase its sparsity. Instead, we can capture the interactions between continuous and categorical variables as group-by queries: SUM(X) group by Y, when X is continuous and Y is categorical, and SUM(1) group by X and Y, when X and Y are categorical. Using the group-by queries ensures a compact representation of such interactions by considering only those categories and interactions that exist in the join result. We can encode those interactions as values from the relational data ring, introduced next.

Definition 36. Let F[R] denote the set of relations over the ring R. The zero 0 in F[R] is the empty relation {}, which maps every tuple to 0 ∈ R, and the identity 1 is the relation {() → 1}, which maps the empty tuple to 1 ∈ R and all other tuples to 0 ∈ R. The structure (F[R], ⊎, ⊗, 0, 1) forms the relational data ring.
We generalize the continuous covariance ring from Definition 34 to uniformly treat continuous and categorical variables as follows: we use relations from the relational data ring as values in c, s, and Q instead of scalars; we use union and join instead of scalar addition and multiplication; and we use the empty relation 0 instead of the zero scalar. The operations +_C and *_C over triples (c, s, Q) remain unchanged. Definition 37. The generalized covariance ring of degree m is the covariance structure of degree m over the relational data ring. For clarity, we show the operations +_C and *_C of the generalized covariance ring C of degree m.
The lifting function g_{X_j} now depends on whether X_j is continuous or categorical. For each X_j-value x, g_{X_j}(x) = (1, s, Q), where 1 = {() → 1}, s is an m × 1 vector with all 0s except s_j = {() → x} if X_j is continuous and s_j = {x → 1} otherwise, and Q is an m × m matrix with all 0s except Q_(j,j) = {() → x²} if X_j is continuous and Q_(j,j) = {x → 1} otherwise.
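To make the payload operations concrete, the following is a minimal Python sketch of the continuous covariance ring of degree m (cf. Definition 34). The sum and product rules follow the standard degree-m covariance triples (c, s, Q); the relation contents are toy data, not the running example.

```python
# Continuous covariance ring of degree m: payloads are triples (c, s, Q)
# of a count, an m x 1 vector of sums, and an m x m matrix of sums of products.
m = 2  # two variables X1 and X2

def zero():
    return (0, [0.0] * m, [[0.0] * m for _ in range(m)])

def add(a, b):  # the ring sum is component-wise
    (c1, s1, Q1), (c2, s2, Q2) = a, b
    return (c1 + c2,
            [x + y for x, y in zip(s1, s2)],
            [[Q1[i][j] + Q2[i][j] for j in range(m)] for i in range(m)])

def mul(a, b):  # the ring product scales the sums and adds cross terms
    (c1, s1, Q1), (c2, s2, Q2) = a, b
    s = [c2 * s1[j] + c1 * s2[j] for j in range(m)]
    Q = [[c2 * Q1[i][j] + c1 * Q2[i][j] + s1[i] * s2[j] + s2[i] * s1[j]
          for j in range(m)] for i in range(m)]
    return (c1 * c2, s, Q)

def lift(j, x):  # g_Xj(x) = (1, s, Q) with s_j = x and Q_(j,j) = x^2
    s = [0.0] * m
    s[j] = x
    Q = [[0.0] * m for _ in range(m)]
    Q[j][j] = x * x
    return (1, s, Q)

# Join of R(A, X1) and S(A, X2) on A: the payload of each joined tuple is
# the product of its lifted X1- and X2-values; the query payload is their sum.
R = {"a1": 1.0, "a2": 3.0}  # A -> X1
S = {"a1": 2.0, "a2": 5.0}  # A -> X2
total = zero()
for a in R.keys() & S.keys():
    total = add(total, mul(lift(0, R[a]), lift(1, S[a])))
c, s, Q = total
# c = COUNT(*); s = [SUM(X1), SUM(X2)];
# Q = [[SUM(X1*X1), SUM(X1*X2)], [SUM(X1*X2), SUM(X2*X2)]]
```

Summing the products of lifted values over the join yields all count, sum, and sum-of-products aggregates in one pass.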
Example 38. We compute the covariance matrix using the tree and database from Example 35, assuming that C is categorical. Since B, D, and E are continuous, the contents of V@B_R, V@D_T, and V@E_S are similar to those of Example 35, except that every scalar value x in their payloads is replaced by the relation {() → x}. The view V@C_ST marginalizes C, lifting every C-value c to a triple (1, s, Q) whose entries for C are the singleton relation {c → 1}; the remaining entries of V@C_ST[a_2] are relations mapping the empty tuple to the same scalar value as in V@C_ST[a_2] from Example 35. The root view V@A_RST computes the payload associated with the empty tuple in the same manner as in the continuous-only case but under the generalized covariance ring. □

Remark 39. For performance reasons, we only store as payloads blocks of matrices with non-zero values and assemble larger matrices as the computation progresses towards the root of the view tree. We further exploit the symmetry of the covariance matrix to compute only the entries above and including the diagonal. For the generalized covariance ring, we store relations that have the empty tuple as key as scalar values.
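The following sketch, with toy data and simplified to a single interaction entry rather than a full triple, shows how relational values capture a continuous-categorical interaction as the group-by aggregate SUM(Y) GROUP BY X:

```python
# Relational values: dicts mapping tuples to numbers. Join multiplies
# multiplicities and concatenates keys; union sums multiplicities.
def rjoin(p, q):
    return {s + t: m * n for s, m in p.items() for t, n in q.items()}

def runion(p, q):
    out = dict(p)
    for t, m in q.items():
        out[t] = out.get(t, 0) + m
    return out

def lift(x, categorical):
    # categorical value -> indicator relation; continuous value -> scalar payload
    return {(x,): 1} if categorical else {(): x}

# Interaction of a categorical X and a continuous Y over tuples (x, y):
# joining the lifted values and unioning over all tuples yields exactly
# SUM(Y) GROUP BY X, over only the categories present in the data.
tuples = [("red", 2.0), ("red", 3.0), ("blue", 1.0)]
acc = {}
for x, y in tuples:
    acc = runion(acc, rjoin(lift(x, True), lift(y, False)))
# acc == {("red",): 5.0, ("blue",): 1.0}
```

Only categories that actually occur in the join result appear as keys, which is the compactness argument made above.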

Mutual Information and Chow-Liu Tree
The mutual information (MI) of two random variables X and Y quantifies their degree of correlation [42]:

I(X; Y) = Σ_{x,y} p_XY(x, y) · log [ p_XY(x, y) / (p_X(x) · p_Y(y)) ]

where p_XY(x, y) is the joint probability of X = x and Y = y, and p_X(x) and p_Y(y) are the marginal probabilities of X = x and Y = y, respectively. A value close to 0 means the variables are almost independent, while a large value means they are highly correlated. MI can be used to identify variables that predict a given label variable and can thus be used for model selection [42].
In our case, we are given the joint probability of several categorical variables as a relation, or the join of several relations. The probabilities defining the MI of any pair of variables can be computed as group-by aggregates over this relation. Let C_∅ = SUM(1), C_X = SUM(1) GROUP BY X, C_Y = SUM(1) GROUP BY Y, and C_XY = SUM(1) GROUP BY X, Y. The aggregates C_∅, C_X, and C_XY define the covariance matrix over categorical variables, so we can use the generalized covariance ring to compute and maintain them (Section 8.1). To compute the MI for continuous variables, we first discretize their domains into finitely many bins, thereby turning them into categorical variables. Mutual information is used for learning the structure of Bayesian networks. Given a graph with one node per variable and one edge per pair of variables weighted by their MI, a Chow-Liu tree is a maximum-weight spanning tree of this graph. The Chow-Liu algorithm [19] constructs such a tree in several rounds: it starts with a single node in the tree and in each round it connects a new node to a node already in the tree such that their pairwise MI is maximal among all pairs of variables not chosen yet.
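A compact sketch of this construction over a toy dataset: `mutual_information` estimates probabilities from counts (the aggregates C_X, C_Y, and C_XY above), and the tree is grown greedily as the algorithm describes. The data and helper names are illustrative, not from the paper.

```python
import math
from collections import Counter

def mutual_information(rows, i, j):
    """MI of columns i and j, with probabilities estimated from the
    count aggregates C_X, C_Y, and C_XY over the rows."""
    n = len(rows)
    c_x = Counter(r[i] for r in rows)
    c_y = Counter(r[j] for r in rows)
    c_xy = Counter((r[i], r[j]) for r in rows)
    return sum((nxy / n) * math.log(nxy * n / (c_x[x] * c_y[y]))
               for (x, y), nxy in c_xy.items())

def chow_liu_tree(rows, num_cols):
    """Grow a maximum-weight spanning tree greedily: in each round, attach
    the outside node with maximal MI to some node already in the tree."""
    in_tree, edges = {0}, []
    while len(in_tree) < num_cols:
        _, u, v = max((mutual_information(rows, u, v), u, v)
                      for u in in_tree for v in range(num_cols)
                      if v not in in_tree)
        edges.append((u, v))
        in_tree.add(v)
    return edges

# Toy data: columns 0 and 1 are perfectly correlated, column 2 is noise,
# so the edge (0, 1) must be part of the Chow-Liu tree.
rows = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
tree = chow_liu_tree(rows, 3)
```

Under F-IVM, only the count aggregates are maintained incrementally; the MI matrix and the spanning tree are recomputed on top of them after each update batch, as reported in the experiments.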

Factorized Representation of Query Results
Our framework can also support scenarios where the view payloads are themselves relations representing results of conjunctive queries, or even their factorized representations.Factorized representations can be much smaller than the listing representation of a query result [54], with orders of magnitude size gaps reported in practice [60].They nevertheless remain lossless and support constant-delay enumeration of the tuples in the query result as well as subsequent aggregate processing in one pass.Besides the factorized view computation and the factorizable updates, this is the third instance where our framework exploits factorization.
We store entire relations as payloads using a variant of the relational data ring (cf. Definition 36) where values are relations over the Z ring. We denote this ring by F[Z]. When marginalizing a variable, we move its values from the key space to the payload space. The tuple payloads in a view are then relations over the same schema. These relations have themselves payloads in the Z ring, used to maintain the multiplicities of their tuples.
We model conjunctive queries as count queries that marginalize every variable but use different lifting functions for the free and bound variables. For a free variable X and any of its values x, we define g_X(x) = {x → 1}, i.e., the lifting function maps x to the unary relation that consists of the single value x whose payload is 1. In case X is bound, we define g_X(x) = 1 = {() → 1}, i.e., the lifting function maps x to the identity element 1 of the relational data ring. This element is the unique relation that consists of the empty tuple whose payload is 1. We thus have relational operations occurring at two levels: for keys, we join views and marginalize variables as before; for payloads, we interpret multiplication and addition of payloads as join and union of relations.

Example 40. Consider the conjunctive query Q(A, B, C, D) = R(A, B), S(A, C, E), T(C, D) over the three relations from Figure 6, where each tuple gets the identity payload {() → 1} ∈ F[Z]. The corresponding view marginalizes all the variables. The lifting function for E maps each value to {() → 1}, while the lifting functions for all other variables map value x to {x → 1}.
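The two payload operations can be sketched in a few lines of Python (dict-based relations over toy data); `union` and `join` play the roles of ⊎ and ⊗ in F[Z]:

```python
# Payloads are relations over Z: dicts mapping tuples to integer multiplicities.

def union(p, q):  # addition in F[Z]: sum multiplicities of equal tuples
    out = dict(p)
    for t, m in q.items():
        out[t] = out.get(t, 0) + m
        if out[t] == 0:
            del out[t]  # drop cancelled tuples (deletions have negative payloads)
    return out

def join(p, q):   # multiplication in F[Z]: concatenate keys, multiply multiplicities
    return {s + t: m * n for s, m in p.items() for t, n in q.items()}

ONE = {(): 1}             # identity payload: lifting a bound variable
def lift_free(x):         # lifting a free variable
    return {(x,): 1}

# Marginalizing B over R(A, B): group by A; the payload of each A-value is
# the union of its lifted B-values joined with the tuple payloads.
R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
view = {}
for a, b in R:
    view[a] = union(view.get(a, {}), join(lift_free(b), ONE))
# view["a1"] == {("b1",): 1, ("b2",): 1}
```

Marginalization thus moves B-values from the key space into the payload space, exactly as described above.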
Figure 18 shows the contents of the views with relational data payloads (in black and red) for the view tree from Figure 3 and the database from Figure 6. The view keys gradually move to payloads as the computation progresses towards the root. The view definitions are identical to those of the COUNT query (but under a different ring!). The view V@D_T lifts each D-value d from T to the relation {d → 1} over schema {D}, multiplies (joins) it with the payload 1 of each tuple, and sums up (unions) all payloads with the same C-value. The views V@B_R and V@E_S are computed similarly, except that the latter lifts E-values to 1 since E is a bound variable. The view V@C_ST assigns to each A-value a payload that is a union of Cartesian products of the payloads of its children and the lifted C-value. The root view V@A_RST similarly computes the payload of the empty tuple, which represents the query result (both views are shown at the right). □

We next show how to construct a factorized representation of the query result. In contrast to the scenarios discussed above, this representation is not available as one payload at the root view, but distributed over the payloads of all views. This hierarchy of payloads, linked via the keys of the views, becomes the factorized representation. A further difference lies with the multiplication operation. For the listing representation, the multiplication is the Cartesian product; for a given view, it is used to concatenate payloads from its child views. For the factorized representation, we further project away values for all but the marginalized variable. More precisely, for each view V@X_rels[S] and each of its keys a_S, let P[T] = V@X_rels[a_S] be the corresponding payload relation. Then, instead of computing this payload, we compute its projection onto X by marginalizing the variables in T − {X} and summing up the multiplicities of the tuples in P[T] with the same X-value.
Example 41. We continue Example 40. Figure 18 shows the contents of the views with factorized payloads (first two columns in black and blue). Each view stores relational payloads that have the schema of the marginalized variable. Together, these payloads form a factorized representation over the variable order ω used to define the view tree in Figure 3. At the top of the factorization, we have a union of two A-values: a1 and a2; this is stored in the payload of the root view. The payloads of (middle) V@C_ST[A] store a union of the C-values c1 and c2 under a1, and a singleton union of c2 under a2. The payloads of V@B_R[A] store a union of the B-values b1 and b2 under a1 and a singleton union of b3 under a2. Note the (conditional) independence of the variables B and C given a value for A; this is key to the succinctness of the factorization. In contrast, the listing representation explicitly materializes all pairings of B- and C-values for each A-value, as shown in the payload of (right) V@A_RST[]:

A B C D → multiplicity
a1 b1 c1 d1 → 2
a1 b1 c2 d2 → 1
a1 b1 c2 d3 → 1
a1 b2 c1 d1 → 2
a1 b2 c2 d2 → 1
a1 b2 c2 d3 → 1
a2 b3 c2 d2 → 1
a2 b3 c2 d3 → 1

Furthermore, the variable D is independent of the other variables given C. This is a further source of succinctness in the factorization: even though c2 occurs under both a1 and a2, the relation under c2, in this case the union of d2 and d3, is only stored once in V@D_T[C]. Each value in the factorization keeps a multiplicity, that is, the number of its derivations from the input data. This is necessary for maintenance.
This factorization is over a variable order that can be used for all queries with the same body and different free variables: as long as their free variables sit on top of the bound variables, the variable order is valid and so is the factorization over it. For instance, if the variable D were not free, then the factorization for the new query would be the same except that we would discard the D-values from the payload of the view V@D_T. □

Matrix Chain Multiplication
Consider the problem of computing the product of a sequence of matrices A_1, ..., A_n over some ring D, where matrix A_i has dimensions p_i × p_{i+1}. The product A = A_1 · · · A_n can be formulated as follows: We model a matrix A_i as a relation A_i[X_i, X_{i+1}] with the payload carrying matrix values. The query that computes the matrix A joins these relations and marginalizes the inner variables X_2, ..., X_n, where each of the lifting functions {g_{X_j}}_{j∈[2,n]} maps any key value to the payload 1 ∈ D. Different variable orders lead to different evaluation plans for matrix chain multiplication. The optimal variable order corresponds to the optimal sequence of matrix multiplications that minimizes the overall multiplication cost, which is the textbook Matrix Chain Multiplication problem [20].

Example 42. Consider a multiplication chain of 4 matrices of equal size p × p encoded as relations A_1[X_1, X_2], A_2[X_2, X_3], A_3[X_3, X_4], and A_4[X_4, X_5].
Let F = {X_1, X_5} be the set of free variables and ω be the variable order X_1 − X_5 − X_3 − {X_2, X_4}, i.e., X_2 and X_4 are children of X_3, with the matrix relations placed below the leaf variables in ω. The view tree τ(ω, F) has the following views (from bottom to top; the views at X_5 and X_1 are analogous to the view at X_3): the view at X_3 marginalizes X_3, mapping each key (x_2, x_4) to the payload Σ_{x_3} A_2[x_2, x_3] · A_3[x_3, x_4]. Recomputing these views from scratch for each update to an input matrix takes O(p³) time. A single-value change in any input matrix causes changes in one row or column of the parent view, and propagating them to compute the final delta view takes O(p²) time. Updates to A_2 and A_3 change every value in A. In case of a longer matrix chain, propagating δA further requires O(p³) matrix multiplications, the same as recomputation.
We exploit factorization to contain the effect of such changes. For instance, if δA_2 is a factorizable update expressible as an outer product δA_2 = u vᵀ of two vectors (cf. Section 6), then we can propagate deltas more efficiently, as products of subexpressions in which one factor is always a vector, e.g., δ(A_1 A_2) = (A_1 u) vᵀ. Using such factorizable updates enables incremental computation in O(p²) time. The final delta is also in factorized form, suitable for further propagation.
In general, for a chain of k matrices of size p × p, using a binary view tree of the lowest depth, incremental maintenance with factorizable updates takes O(p² log k) time, while reevaluation takes O(p³ k) time. The space needed in both cases is O(p² k). □

The above example recovers the main idea of LINVIEW [45]: use factorization in the incremental computation of linear algebra programs where matrix changes are encoded as vector outer products, δA = u vᵀ. Such rank-1 updates can capture many practical update patterns, such as perturbations of one complete row or column, or even changes of the whole matrix when the same vector is added to every row or column. F-IVM generalizes this idea to arbitrary join-aggregate queries.
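To sanity-check the complexity argument, here is a plain-Python sketch (toy 3×3 integer matrices, hypothetical helper names) that propagates a rank-1 update δA_2 = u vᵀ through a chain A_1 A_2 A_3 using only matrix-vector products, each O(p²); the assertion compares against full recomputation:

```python
# Rank-1 delta propagation: with dA2 = outer(u, v), the chain delta is
# delta(A1 A2 A3) = A1 (u v^T) A3 = outer(A1 u, v^T A3), no p x p matmuls.

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def vecmat(x, M):
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def outer(x, y):
    return [[xi * yj for yj in y] for xi in x]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A1 = [[1, 2, 0], [0, 1, 1], [2, 0, 1]]
A2 = [[1, 0, 1], [1, 1, 0], [0, 2, 1]]
A3 = [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
u, v = [1, 0, 2], [0, 1, 1]          # the update: dA2 = outer(u, v)

# Factorized delta: left factor A1 u, right factor v^T A3, both O(p^2).
delta = outer(matvec(A1, u), vecmat(v, A3))

# Check against recomputation with the updated A2.
A2_new = matadd(A2, outer(u, v))
full_delta = matadd(matmul(matmul(A1, A2_new), A3),
                    [[-x for x in row] for row in matmul(matmul(A1, A2), A3)])
assert delta == full_delta
```

The delta itself stays in factorized (outer-product) form, so it can be propagated further up a longer chain at the same O(p²) cost per step.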

Experiments
This section reports our experimental findings with our system F-IVM and three competitors: first-order IVM (1-IVM), DBToaster's higher-order IVM (DBT), and Apache Flink.We first summarize our findings.
1. For maintaining covariance matrices over continuous variables, F-IVM outperforms DBT and 1-IVM by up to three orders of magnitude. This is primarily due to the use of the covariance ring in F-IVM, which can capture the maintenance of an entire covariance matrix of 100-800 entries with under ten views. In contrast, DBT requires 600-3,000 views, while 1-IVM needs as many delta queries as matrix entries (136-820). A similar conclusion holds for maintaining covariance matrices over continuous and categorical variables and also only over categorical variables, although the performance gap becomes smaller. Thanks to the covariance ring, F-IVM also has a low memory footprint, on par with 1-IVM and 4-16x less than DBT.
2. Maintaining linear regression models over the covariance matrices takes insignificant time if the batch gradient descent resumes with the values for the model parameters computed after the previous update batch.
3. Maintaining mutual information and Chow-Liu trees over the covariance matrices requires recomputation after every update batch, which can decrease the throughput of F-IVM by up to one order of magnitude.
4. For q-hierarchical queries, F-IVM is the fastest approach in case the updates are followed occasionally by a request to enumerate the query result.F-IVM pushes the updates from the leaves to the root view in the view tree, yet keeps the result factorized.This ensures update time and enumeration delay per tuple proportional to the payload size.We confirmed experimentally that DBT and 1-IVM cannot achieve constant time for both update and enumeration.
5. For path queries of up to 20 joins over the Twitter and TikTok graph datasets, F-IVM's throughput remains at least an order of magnitude larger than that of its competitors. 1-IVM and Apache Flink do not manage to process one 1K-batch within four hours for paths of more than 10 joins.
Our conference paper [46] reports further experiments with F-IVM showing that: (1) F-IVM outperforms competitors in maintaining one sum aggregate over joins; (2) using batches of 1,000-10,000 tuples performs best in maintaining the covariance matrix; (3) factorized updates lead to two orders of magnitude speedup for F-IVM over competitors for matrix chain multiplication; and (4) for conjunctive query evaluation, factorized payloads can speed up view maintenance and reduce memory by up to two orders of magnitude compared to the listing representation of payloads. For convenience, we include these experiments in Sections 9.2 (last two paragraphs), 9.6, 9.7, and 9.8.

Experimental Settings
Competitors.The three maintenance strategies use DBToaster v2.3 [33], a system that compiles SQL queries into code that maintains the query result under updates to input relations.The generated code represents an in-memory stream processor that is standalone and independent of any database system.DBToaster's performance on decision support and financial workloads can be several orders of magnitude better than state-of-the-art commercial databases and stream processing systems [33].DBToaster natively supports DBT and 1-IVM.We use the intermediate language of DBToaster to encode F-IVM that maintains a set of materialized views for a given variable order and a set of updatable relations.We feed this encoding into the code generator of DBToaster.Unless stated otherwise, all approaches use the same runtime and store views as multi-indexed maps with memory-pooled records.The algorithms and record types used in these approaches can differ greatly.We also report on the performance of Apache Flink v1.17.1 [15] (via Table API), configured to utilize all cores and main memory of our machine.
Datasets.Figure 19 summarizes our datasets: • Housing is a synthetic dataset modeling a house price market [60].It consists of six relations: House, Shop, Institution, Restaurant, Demographics, and Transport, arranged into a star schema.The natural join of all relations is on the common variable (postcode) and has 26 non-join variables, 14 continuous and 12 categorical.We consider a variable order where each root-to-leaf path consists of variables of one relation.
• Retailer is a real-world dataset used by a retailer to inform decision-making and forecast user demands [60]. It has a snowflake schema with one large fact relation Inventory storing information about the inventory units for products in a location, at a given date. This relation joins along three dimension hierarchies: Item (on product id), Weather (on location and date), and Location (on location) with its lookup relation Census (on zip). The natural join of these relations is acyclic and has 33 continuous and 6 categorical non-join variables. We use a variable order, where the variables of each relation form a distinct root-to-leaf path, and the partial order on join variables is: location -{ date -{ product id }, zip }.

Figure 20: The average throughput (in thousands of tuples/sec) and, in parentheses, the number of materialized views for the maintenance of the covariance matrix over datasets where features are treated as all continuous (CONT) and as a mix of continuous and categorical (MIXED). The symbol * denotes the one-hour timeout.
• Favorita is a real-world dataset comprising sales data of items sold in grocery stores in Ecuador [2].It has a star schema with one large fact relation Sales storing information on sales transactions, including the date, store, item, and item quantity.This relation joins with five dimension tables: Stores (on store id), Item (on item id), Transaction (on date and store id), Holiday (on date), and Oil (on date).The natural join has 3 continuous and 12 categorical non-join variables.We consider a variable order where the order on join variables is: date -store id -item id.
We evaluate the maintenance strategies over data streams synthesized from the above datasets by interleaving insertions to the input relations in a round-robin fashion.These insertions arrive sorted following a top-down order of F-IVM's variable orders.This leads to improved runtimes of all systems relative to out-of-order insertions.We group insertions into batches of 1000 tuples and place no restriction on the order of records in input relations.In all experiments, we use payloads defined over rings with additive inverse, thus processing deletions is similar to processing insertions.
Queries. We consider the following queries: • Covariance Matrix: For F-IVM, we use one query per dataset to compute one covariance aggregate over the natural join of the input relations. For instance, the query over the Retailer schema sums the product of lifted values g_{X_1}(X_1) ⊗ · · · ⊗ g_{X_39}(X_39) over the natural join of its relations, where {X_i}_{i∈[39]} are all the non-join variables from the Retailer schema. We consider three scenarios: (1) we treat all variables as continuous; (2) with a mix of continuous and categorical variables; and (3) with all categorical variables. For the first, we use the continuous covariance ring of degree 39 and the lifting function for each variable X_i, as in Example 35. For the other two, we use the generalized covariance ring with relational values, as in Example 38. Similarly, the queries over Housing (Favorita) use the covariance rings of degree 26 (15).
For DBT and 1-IVM, we use queries that compute the scalar sum aggregates in the covariance matrix. When considering all variables as continuous, the covariance matrix over n variables requires 1 + n + n(n+1)/2 such aggregates: the count, one sum per variable, and one sum of products per pair of variables; we feed one query per aggregate as input to DBToaster. For Housing, Retailer, and Favorita, the number of queries with distinct group-by variables is 46, 22, and 79, respectively.
• Q-Hierarchical Queries: We use the natural joins of all relations in each dataset.For Housing, this is a star join query.For Favorita, the relation Stores violates the q-hierarchical property: Its variables form a strict subset of a root-to-leaf path in the canonical variable order.To ensure constant time for single-tuple updates, we require Stores to be non-updatable.For Retailer, the query is q-hierarchical due to (1) the functional dependency zip → location in Census and (2) requiring the relation Item be non-updatable.
• k-Path Queries: These queries join k copies R_1 to R_k of the edge relation of the input graph. Each relation R_i has schema (A_i, A_{i+1}, W_i) and can be seen as the adjacency matrix of the input graph, with rows indexed by A_i, columns indexed by A_{i+1}, and the value W_i = 1 in cell (A_i, A_{i+1}). The path query is then the k-fold multiplication of the adjacency matrix.
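The correspondence between the k-path query and matrix multiplication can be sketched with a dict-based sparse adjacency matrix (toy graph; the encoding is illustrative, not the systems' internal one):

```python
from collections import defaultdict

def sparse_matmul(M, N):
    """Multiply two sparse matrices stored as {(row, col): value} dicts."""
    by_row = defaultdict(list)
    for (i, j), v in N.items():
        by_row[i].append((j, v))
    out = defaultdict(int)
    for (i, j), v in M.items():
        for k, w in by_row[j]:
            out[(i, k)] += v * w
    return dict(out)

# Edge relation of a small graph; W_i = 1 in every occupied cell.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
A = {(u, v): 1 for u, v in edges}

# 3-path query: join three copies of the edge relation, i.e., A @ A @ A.
P = sparse_matmul(sparse_matmul(A, A), A)
# P[(u, v)] counts the paths of length 3 from u to v, e.g. 1->2->3->4.
```

Each join of two relation copies corresponds to one matrix multiplication, so a k-path query is k − 1 such multiplications.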
• Matrix Chain Multiplication: The query in standard SQL is defined over tables A_1(I, J, P_1), A_2(J, K, P_2), A_3(K, L, P_3). In our formalism, each relation maps pairs of indices to matrix values, all lifting functions map values to 1, and the query marginalizes the inner index variables. • Factorized Computation of Conjunctive Queries: We consider two full conjunctive queries joining all the relations in the Retailer and, respectively, Housing datasets.
Experimental Setup.We run the first four experiments, Sections 9.2 (except the last two paragraphs), 9.3, 9.4 and 9.5, on a machine with Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz, 188GB RAM, and Debian 10.We use DBToaster v2.3 for running DBT and 1-IVM and generating code in F-IVM.The generated C++ code is single-threaded and compiled using g++ 8.3.0 with the -O3 flag.
All experiments are run single-threaded. Unless stated otherwise, we set a one-hour timeout on query execution and report wall-clock times by averaging the best three results out of four runs. We profile memory utilization using gperftools, not counting the memory used for storing input streams.

Covariance Matrix and Linear Regression
We benchmark the performance of maintaining a covariance matrix for learning regression models over natural joins. We consider updates to all input relations. We compute the covariance matrix over all non-join variables of the join query (i.e., over all non-join attributes in the input database), which suffices to learn linear regression models over any label and set of features that is a subset of the set of variables [51]. This is achieved by specializing the convergence step in batch gradient descent to the relevant restriction of the covariance matrix. In our approach for learning linear regression models over database joins, the convergence step takes orders of magnitude less time than the data-dependent covariance matrix computation.
Figure 20 shows the number of views materialized by F-IVM, DBT, and 1-IVM for computing the covariance matrix.F-IVM computes one aggregate query with payloads from a covariance ring.For Housing, where all relations join on one variable, F-IVM materializes seven views: one view per relation to marginalize  out all non-join variables, and the root view to join these views.For Retailer, F-IVM materializes five views over the input relations, three intermediate views, and the root view; similarly, for Favorita.These views have payloads from the continuous (generalized) covariance ring if all features are continuous (continuous and categorical).
DBT and 1-IVM maintain a batch of sum aggregate queries with scalar payloads.These materialization strategies fail to effectively share the computation of covariance aggregates, materializing linearly many views in the size of the covariance matrix: for instance, when considering all variables as continuous, DBT and 1-IVM materialize 626 and respectively 384 views to maintain 378 scalar aggregates for Housing; similar reasoning holds for the other datasets and the scenarios with both continuous and categorical variables.
Throughput. Figure 21 shows the throughput of F-IVM, DBT, and 1-IVM as they process an increasing fraction of the stream of tuple inserts.Figure 20 shows their average throughput after processing the entire stream.The throughput is higher when all features are continuous than for a mix of continuous and categorical features.This is expected as the latter computes additional group-by aggregates for the categorical features; in this case, the number of computed aggregates is data-dependent.The occasional hiccups in the throughput of F-IVM are due to doubling the memory allocated to the underlying data structures used for the views.
The query for Housing joins all relations on the common variable, which is the root in our variable order; thus, the query is hierarchical.F-IVM computes the covariance matrix using the query with no free variables in both scenarios (CONT and MIXED) and can process a single-tuple update to any input relation in time linear in the size of the payload.In the continuous-only scenario, the update time is O(m 2 ), where m is the number of continuous features; in the mixed scenario, the update time depends on the size of the domain of the categorical features.DBT exploits the conditional independence in the derived deltas to materialize each input relation separately such that all non-join variables are aggregated away.In the case of all continuous features, each materialized view has O(1) maintenance cost per update tuple, but the large number of views in DBT is the main reason for its poor performance.1-IVM stores entire tuples of the input relations including non-join variables.On each update, 1-IVM recomputes a batch of aggregates on top of the join of these input relations and the update tuple.Since the update tuple binds the value of the common join variable, the hypergraph of the delta query consists of disconnected components.DBToaster first aggregates over each relation and then joins together the partial aggregates on the common variable.Even with this optimization, 1-IVM takes time linear in the size of the dataset, which explains its poor performance.
For Retailer, the inserts are mostly into Inventory. Since the variables of this relation form a root-to-leaf path in the variable order, F-IVM can process single-tuple updates to this relation in O(1) time in data complexity in the continuous-only scenario. DBT maintains up to four views per scalar aggregate and fails to process the entire stream within the one-hour limit in both scenarios. 1-IVM maintains one view per scalar group-by aggregate but recomputes the delta query on each update, resulting in 132x (58x) lower throughput than F-IVM in the continuous-only (mixed) scenario.
For Favorita, 1-IVM achieves better performance than on Retailer but is still 7.8x (4.1x) slower than F-IVM in the continuous (mixed) scenario. DBT fails to finish the entire stream within the one-hour timeout.
Memory Consumption.Figure 21 shows that F-IVM achieves lower or comparable memory utilization on the three datasets, while providing orders of magnitude better performance than its competitors.The reason behind this memory efficiency is that F-IVM uses compound aggregates and factorization structures to express the covariance matrix computation over fewer views compared to DBT and 1-IVM.
End-to-End Training.We next analyze the cost of learning linear regression models from the computed covariance matrices.We consider the scenario where all variables are continuous and the target label is house price (Housing), inventory units (Retailer), and sold units (Favorita).Using batch gradient descent and the covariance matrix, the time needed to converge on the model parameters represents 0.24%, 0.2%, and 0.001% of the time needed to process all updates in the stream for Housing, Retailer, and Favorita, respectively.
Figure 22 illustrates the performance of F-IVM for maintaining the covariance matrix in three scenarios: 1) without training the linear regression model (baseline); 2) with training after each batch update, starting from previously learned parameters (CONT); and 3) with training after every batch update, starting with value 0 for the parameters (SCRATCH).Continuously refreshing the model after every update reduces the throughput of baseline by 41%, 18%, and 2% for Housing, Retailer, and Favorita.In contrast, retraining from scratch after every update has significantly higher overheads and reduces the baseline throughput by 95%, 99%, and 42% for the three datasets.Decreasing the training frequency brings the throughput closer to the baseline.
Figure 22 (bottom plots) shows for each dataset the cumulative number of iterations of batch gradient descent in the two training scenarios.Continuously improving learned parameters yields 30x, 1160x, and 175x fewer iterations compared to retraining from scratch for Housing, Retailer, and Favorita, respectively.This reflects in the throughput of the two training scenarios.
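The effect of warm-starting can be sketched as follows (made-up sufficient statistics Q and b, hypothetical `gd` helper): the gradient of the squared loss depends on the data only through the covariance-matrix aggregates, and resuming from the previously learned parameters starts much closer to the new optimum than starting from zero.

```python
# Batch gradient descent for linear regression driven only by the
# sufficient statistics: Q holds SUM(x_i * x_j), b holds SUM(x_i * y).

def gd(Q, b, theta, lr=0.1, tol=1e-9, max_iter=100000):
    """Minimize 0.5 theta^T Q theta - b^T theta; return (theta, iterations)."""
    for it in range(max_iter):
        grad = [sum(Q[i][j] * theta[j] for j in range(len(theta))) - b[i]
                for i in range(len(theta))]
        if sum(g * g for g in grad) < tol:
            return theta, it
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta, max_iter

Q = [[2.0, 0.5], [0.5, 1.0]]    # covariance aggregates before the update batch
b = [3.0, 1.5]
theta_cold, _ = gd(Q, b, [0.0, 0.0])

# After an update batch, the aggregates change slightly; resuming from the
# previously learned parameters converges in far fewer iterations than
# retraining from scratch.
Q2 = [[2.1, 0.5], [0.5, 1.05]]
b2 = [3.1, 1.5]
theta_warm, iters_warm = gd(Q2, b2, theta_cold)
theta_scratch, iters_scratch = gd(Q2, b2, [0.0, 0.0])
assert iters_warm < iters_scratch
```

No pass over the data is needed in either case; the gap in iteration counts mirrors the gap between the CONT and SCRATCH scenarios reported above.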
Covariance Matrix Computation over the Triangle Query. We analyze the covariance matrix computation over the triangle query on the Twitter dataset and updates of size 1,000 to all the relations. In addition to the three incremental strategies from before, we now also benchmark: F-IVM-ONE, which is F-IVM but under updates to relation R only (S and T are non-updatable); DBT-RING, which is DBToaster's recursive IVM strategy with payloads from the continuous covariance ring (cf. Section 8.1) instead of scalars; and SQL-OPT, an optimized SQL encoding of covariance matrix computation. The SQL-OPT strategy arranges regression aggregates (recall there are quadratically many such aggregates in the number of query variables) into a single aggregate column indexed by the degree of each query variable. This encoding takes as input a variable order and constructs one SQL query that intertwines join and aggregate computation by pushing (partial) regression aggregates (counts, sums, and covariance matrices) past joins [52]. F-IVM uses the view tree from Figure 16 (right) without the indicator projection and materializes the join of S and T of size O(N²). Its time complexity for a single-tuple update to R is O(1), but updating the join of S and T takes O(N). F-IVM-ONE uses the same view tree. For updates to R only, F-IVM-ONE requires one lookup in the materialized join of the two non-updatable relations S and T per update, which takes O(1) time. DBT-RING uses payloads from the continuous covariance ring of degree 3 and materializes all three pairwise joins, each requiring linear-time maintenance. DBT uses scalar payloads and materializes 21 views (to maintain 6 aggregates), out of which 12 views are over two relations. Its time complexity for processing single-tuple updates to any of the three relations is also O(N). 1-IVM maintains just the input relations and recomputes the delta upon each update in linear time.
Figure 23 shows the throughputs of the three strategies on the Twitter dataset.This experiment result is from the conference version of this paper [47].The throughput rate of the strategies that materialize views of quadratic size declines sharply as the input stream progresses.DBT exhibits the highest processing and memory overheads caused by storing 12 auxiliary views of quadratic size.DBT-RING underperforms F-IVM due to maintaining two extra views of quadratic size, which contribute to a 2.3x higher peak memory utilization.1-IVM exhibits a 42% decline in performance after processing the entire trace due to its linear time maintenance.The extent of this decrease is much lower compared to the other approaches with the quadratic space complexity.F-IVM-ONE has two orders of magnitude higher throughput than 1-IVM at the cost of using 23x more memory.
Clique queries like the triangle query provide no factorization opportunities. Materializing auxiliary views to speed up incremental view maintenance increases memory and processing overheads. However, F-IVM can exploit indicator projections to bound the size of such materialized views, as described in Section 7.4.
The Effect of Batch Size on IVM. This experiment evaluates the performance of maintaining a covariance matrix for batch updates of different sizes. Figure 24 shows the throughput of batched incremental processing for batch sizes varying from 100 to 100,000 on the Retailer, Housing, and Twitter datasets for updates to all relations. We show only the best three approaches for each dataset. This experiment result is from the conference version of this paper [47].
We observe that using very large or very small batch sizes can have negative performance effects: iterating over large batches invalidates previously cached data, resulting in future cache misses, whereas small batches cannot offset the overhead associated with processing each batch. Batches of 1,000 to 10,000 tuples deliver the best performance in most cases, except when a large number of views needs to be incrementally maintained. This conclusion about covariance matrix computation is in line with similar findings on batched delta processing in decision support workloads [44].

Figure 25: Solid lines: incremental maintenance of the covariance matrix over the Housing (left), Retailer (middle), and Favorita (right) datasets under batches of 1,000 updates to all relations with a one-hour timeout. All features are either categorical in the original dataset or made categorical by discretizing their domains into 100 buckets. Dotted line: computation of the mutual information matrix and the Chow-Liu tree on top of the covariance matrix after each batch of 1,000 updates.
Batched incremental processing is also beneficial for one-off computation of the entire covariance matrix. Using medium-sized batches not only brings better performance (cf. Figure 24) but also lowers memory requirements and improves cache locality during query processing. For instance, incrementally processing the Retailer dataset in chunks of 1,000 tuples brings up to 2.45x better performance compared to processing the entire dataset at once.

Mutual Information and Chow-Liu Trees
We benchmark the performance of maintaining the matrix of pairwise mutual information (MI) for the features representing the non-join variables in our datasets, and of building Chow-Liu trees on top of the MI matrices.
As explained in Section 8.2, the MI matrix can be derived from the covariance matrix over categorical variables. We discretize the active domain of each continuous variable into 100 bins of equal size. The Housing, Retailer, and Favorita datasets have 26, 39, and 15 categorical variables, respectively. Insertions and deletions of values for a continuous variable are distributed into the appropriate bins, without changing the number of bins. Whereas the covariance matrix can be maintained incrementally under updates, the MI matrix needs to be recomputed from scratch after each update batch.
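As a concrete illustration, here is a minimal Python sketch (our own, not F-IVM's code) of the discretization scheme described above: the active domain [lo, hi] of a continuous variable is split into bins of equal width, and insertions and deletions adjust the multiplicity of the enclosing bin without changing the number of bins. The class name `BinnedDomain` and its methods are hypothetical.

```python
from collections import defaultdict

class BinnedDomain:
    """Equal-width discretization of a continuous variable's active domain."""

    def __init__(self, lo, hi, num_bins=100):
        self.lo, self.hi, self.num_bins = lo, hi, num_bins
        self.counts = defaultdict(int)  # bin index -> multiplicity

    def bin_of(self, x):
        # Clamp to the active domain, then map to one of num_bins buckets.
        x = min(max(x, self.lo), self.hi)
        width = (self.hi - self.lo) / self.num_bins
        return min(int((x - self.lo) / width), self.num_bins - 1)

    def update(self, x, delta):
        # delta = +1 for an insertion, -1 for a deletion
        self.counts[self.bin_of(x)] += delta

# Usage: two inserts into bin 5, one into bin 99, then one delete.
d = BinnedDomain(0.0, 100.0)
d.update(5.5, 1)
d.update(5.2, 1)
d.update(99.9, 1)
d.update(5.5, -1)
```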
The view construction and maintenance are as in Section 9.2, except that all variables are now categorical. Figure 25 (solid lines) shows the throughput of F-IVM, DBT, and 1-IVM for maintaining the covariance matrix as they process an increasing fraction of the stream of tuple updates. F-IVM is 74x faster than DBT and 28x faster than 1-IVM for the Retailer dataset, and 9.3x and 2.1x faster, respectively, for Favorita. For Housing, F-IVM is 2.8x faster than 1-IVM but 4.6x slower than DBT. This is because: (i) Housing is a relatively small dataset and the domain of the categorical variables is also small; (ii) DBT has specific optimizations for group-by counts over star joins such as in this case.
Computing the MI matrix from the covariance matrix takes time linear in the number of categories of the variables. Computing the Chow-Liu tree takes time O(m log m), where m is the number of variables. Figure 25 (dotted line) shows the throughput of F-IVM when the MI matrix and the Chow-Liu tree are computed after each update batch. This throughput is 46%, 86%, and 35% lower than when only the covariance matrix is maintained for Housing, Retailer, and Favorita, respectively.
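The Chow-Liu tree is the maximum spanning tree of the complete graph over the m variables, with edge weights given by the pairwise mutual information. A minimal sketch (our own, using Kruskal's algorithm with union-find; the function name `chow_liu_tree` is hypothetical):

```python
def chow_liu_tree(mi):
    """mi: m x m symmetric matrix of pairwise mutual information.
    Returns the edges (i, j) of the maximum spanning tree."""
    m = len(mi)
    # All edges of the complete graph, sorted by decreasing MI weight.
    edges = sorted(((mi[i][j], i, j) for i in range(m) for j in range(i + 1, m)),
                   reverse=True)
    parent = list(range(m))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) does not close a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Usage on a toy 3-variable MI matrix: the two heaviest edges form the tree.
tree = chow_liu_tree([[0, 3, 1],
                      [3, 0, 2],
                      [1, 2, 0]])
```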

Q-Hierarchical Queries
The q-hierarchical queries are those queries that admit the lowest possible (i.e., constant) enumeration delay and single-tuple update time (Section 7). We would like to understand how different IVM variants perform for such queries in practice. We consider one such query per dataset, as described in Section 9.1.
We construct a view tree modelled on the canonical free-top variable order for each of the three q-hierarchical queries. The query result is constructed and maintained in the payload space. We consider two dimensions. One dimension is whether we push the updates all the way to the result (eager) or we only update the input relations and only construct the query result on an enumeration request (lazy). The other dimension is whether the query result has a listing representation (one tuple after the other) or a factorized representation. This defines four variants: eager-list (which is DBT), eager-fact (F-IVM's default strategy), lazy-list (1-IVM), and lazy-fact (a hybrid of F-IVM and 1-IVM).

Figure 26: Incremental maintenance of the result of the q-hierarchical queries over the Housing (left), Retailer (middle), and Favorita (right) datasets under update batches and requests to enumerate all tuples in the query results after every INTVL update batches; #ENUM denotes the overall number of enumeration requests. The symbol * denotes the case where an IVM variant did not finish within the time limit (50 hours); the throughput is not shown in this case.

Figure 27: The number of rounds of updates to all relations in the path queries processed within four hours over the TikTok (left) and Twitter (right) graph datasets under update batches of size 1,000.
Figure 26 shows the average throughput of the four variants on the three q-hierarchical queries. We report the overall runtime for update batches as in the previous experiments, but where in addition we have requests for the enumeration of all tuples in the query result after every INTVL batches of updates. We tried INTVL values 1, 10, 100, 1000, and 10000. Each such value corresponds to a different number of enumeration requests (#ENUM), as the datasets have different sizes. The lazy-list variant has the lowest throughput among the four variants in our experiment and did not finish within the time limit of 50 hours (denoted by * in Figure 26).
The two lazy variants are the clear winners in case of no or very few enumeration requests. In this case, there is almost no difference between their throughputs, since they spend most of their time updating the input relations. In case of more enumeration requests, however, the eager variants are the winners, with eager-fact consistently outperforming eager-list.
Overall, the eager and lazy variants based on the factorized representation outperform those based on the listing representation in all but the trivial cases of no or few enumeration requests, where the representation of the query result plays no role. This is as expected, since the enumeration delay and the update time can both remain constant for our queries only if the query result is kept factorized over the views in the view tree.

Path Queries
We investigate the scalability of the maintenance approaches as we increase the number of joins in the query. We consider the path query with up to 20 self-joins of the edge relation in the TikTok and Twitter graphs. The edges are partitioned into batches of 1,000 inserts. One round of updates processes one batch of inserts for each copy of the edge relation. Figure 27 shows the number of rounds of updates processed by each approach within four hours. F-IVM outperforms all other approaches on path queries of any length. All approaches are slower on TikTok, since it is more skewed than Twitter. Flink and 1-IVM have similarly poor performance and do not scale to long path queries. Flink maintains the join result via a left-deep binary view tree and computes the aggregates at the root view. It projects away the join variable after each join. This reduces the number of columns but not the number of rows in the join result. For a delta to the bottom relation in the view tree, Flink joins it with all other k - 1 relations in the query. This triggers O(N^⌈k/2⌉) inserts to the join result, where N is the number of edges. 1-IVM computes the delta query by joining the batch of inserts with k - 1 relations and has the same complexity as Flink.
DBT and F-IVM avoid the materialization of the large join result by pushing the aggregates past the joins at each view. Both of them need O(N^2) time to update each view. Like Flink, F-IVM uses a left-deep view tree. DBT uses one view tree per delta query, where the delta relation is a child of the top view and the two subqueries to the left and right of the delta relation have left-deep view trees. F-IVM constructs fewer views than DBT: for the 20-path query, F-IVM uses 19 views, while DBT uses 190 views. This explains the better performance of F-IVM.
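The benefit of pushing aggregates past joins can be illustrated on the shortest path query, the 2-path count Q = SUM(1) over E1(a, b) JOIN E2(b, c): instead of materializing the join result, one maintains per-join-key counts V1[b] and V2[b], so a single-edge insert updates Q in constant time, whereas touching the materialized join could cost up to O(N) per insert. A hedged Python sketch (class and view names are our own, not F-IVM's):

```python
from collections import defaultdict

class TwoPathCount:
    """Maintains Q = SUM over E1(a,b) JOIN E2(b,c) via per-key count views."""

    def __init__(self):
        self.v1 = defaultdict(int)  # V1[b] = number of edges (a, b) in E1
        self.v2 = defaultdict(int)  # V2[b] = number of edges (b, c) in E2
        self.q = 0                  # Q = sum_b V1[b] * V2[b]

    def insert_e1(self, a, b):
        self.q += self.v2[b]        # delta: new 2-paths a -> b -> *
        self.v1[b] += 1

    def insert_e2(self, b, c):
        self.q += self.v1[b]        # delta: new 2-paths * -> b -> c
        self.v2[b] += 1

# Usage: four edge inserts; each delta is a single view lookup.
p = TwoPathCount()
p.insert_e1(1, 2)
p.insert_e2(2, 3)
p.insert_e2(2, 4)
p.insert_e1(0, 2)
```

Deletions work symmetrically by subtracting the same delta, which is why ring payloads with additive inverses matter.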

Maintenance of Sum Aggregates
We analyze different strategies for maintaining the sum of one variable on top of a natural join. We measure the average throughput of reevaluation and incremental maintenance under updates of size 1,000 to all the relations of Retailer and Housing. For the former dataset, we sum the inventory units for products in Inventory; for the latter, we sum over the common join variable. We also benchmark two reevaluation strategies that recompute the result from scratch on every update: F-RE denotes reevaluation using variable orders and DBT-RE denotes reevaluation using DBToaster. Figure 28 summarizes the results.
F-IVM achieves the highest average throughput in both cases. For Retailer, the maintenance cost is dominated by the updates to Inventory. DBT's recursive delta compilation materializes 13 views representing connected subqueries: five group-by aggregates over the input relations Inv, It, W, L, and C; one group-by aggregate joining L and C; six views joining Inv with subsets of the others, namely {It}, {It, W}, {It, W, L}, {W}, {W, L}, and {W, L, C}; and the final aggregate. The two views joining Inv with {W, L} and {It, W, L} require linear maintenance for a single-tuple change to Inventory. 1-IVM recomputes deltas from scratch on each update using only the input relations, with no aggregates on top of them. Updates to Inventory are efficient due to the small sizes of the other relations. F-IVM uses the given variable order to materialize 9 views, four of them over Inventory, namely {Inv}, {Inv, It}, {Inv, It, W}, and the final sum, but each with constant maintenance cost for single-tuple updates to this relation. In contrast to 1-IVM, our approach materializes precomputed views in which all non-join variables are aggregated away. In the Housing schema, both F-IVM and DBT benefit from this preaggregation, and since the query is a star join, both materialize the same views. DBT computes SUM(1) and SUM(postcode) for each postcode in the delta for Inventory, although only the count suffices. Figure 28 also shows that the reevaluation strategies significantly underperform the incremental approaches.

Matrix Chain Multiplication
We consider the problem of maintaining the product A = A_1 A_2 A_3 of three (n × n) matrices under changes to A_2. We compare F-IVM with factorized updates, 1-IVM, which recomputes the delta δA = A_1 δA_2 A_3 from scratch, and REEVAL, which recomputes the entire product from scratch on every update. DBT becomes 1-IVM in this particular setting. We consider two different implementations of these maintenance strategies: the first uses DBToaster's hash maps to store matrices, while the second uses Octave, a numerical tool that stores matrices in dense arrays and offers highly optimized BLAS routines for matrix multiplication [66].
We first consider updates to one row in A_2. For 1-IVM, the delta δA_12 = A_1 δA_2 might contain non-zero changes to all n^2 matrix entries, thus computing δA = δA_12 A_3 requires a full matrix-matrix multiplication. REEVAL first updates A_2 and then computes two matrix-matrix multiplications. F-IVM factorizes δA_2 into a product of two vectors, δA_2 = u v^T, which are used to compute δA_12 = (A_1 u) v^T = u_1 v^T and δA = u_1 (v^T A_3) = u_1 v_1^T. Both deltas involve only matrix-vector multiplications computed in O(n^2) time. Figure 29 (left) shows the average time needed to process an update to one randomly selected row in A_2 for different matrix sizes. REEVAL performs two matrix-matrix multiplications, while 1-IVM performs only one. In the hash-based implementation, the gap between F-IVM and 1-IVM grows from 28x for n = 256 to 92x for n = 4,096; similarly, in the Octave implementation, the same gap grows from 16x for n = 256 to 236x for n = 16,384. This confirms the difference in the asymptotic complexity of these strategies.
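The factorized rank-1 delta can be checked numerically. The sketch below (ours, using NumPy rather than DBToaster or Octave; the function name `rank1_delta` is hypothetical) computes δA = (A_1 u)(v^T A_3) with two matrix-vector products and verifies it against full reevaluation:

```python
import numpy as np

def rank1_delta(A1, A3, u, v):
    """Delta of A = A1 A2 A3 under dA2 = u v^T, in O(n^2) time."""
    u1 = A1 @ u                 # matrix-vector product: A1 u
    v1 = v @ A3                 # vector-matrix product: v^T A3
    return np.outer(u1, v1)     # dA = (A1 u)(v^T A3), an outer product

n = 4
rng = np.random.default_rng(0)
A1, A2, A3 = (rng.standard_normal((n, n)) for _ in range(3))
u, v = rng.standard_normal(n), rng.standard_normal(n)

A = A1 @ A2 @ A3
A_new = A1 @ (A2 + np.outer(u, v)) @ A3   # full reevaluation, O(n^3)
delta = rank1_delta(A1, A3, u, v)         # factorized delta, O(n^2)
```

A rank-r update is then processed as r such rank-1 deltas, matching the O(rn^2) bound discussed below.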
Our next experiment considers rank-r updates to A_2, which can be decomposed into a sum of r rank-1 tensors, δA_2 = Σ_{i∈[r]} u_i v_i^T. F-IVM processes δA_2 as a sequence of r rank-1 updates in O(rn^2) time, while both REEVAL and 1-IVM take as input one full matrix δA_2 and maintain the product in O(n^3) time per rank-r update. 1-IVM has the same performance as REEVAL. Figure 29 (right) shows that the average time F-IVM takes to process a rank-r update for different values of r and matrix size n = 4,096 is linear in the tensor rank r. Under both the DBToaster and the Octave implementations, incremental computation is faster than reevaluation for updates with rank r ≤ 96. With larger matrix sizes, the gap between reevaluation and incremental computation increases, which makes incremental maintenance profitable for updates of higher ranks.

Factorized Computation of Conjunctive Queries
We analyze F-IVM on queries whose results are stored as keys with integer multiplicities using the listing representation (List keys) and as relational payloads using the factorized and listing representations (Fact payloads and List payloads). Figure 30 (left) considers the natural join of Retailer under updates to the largest relation. The factorized payloads reduce the memory consumption by 4.4x, from 34GB to 7.8GB, and improve the average throughput by 2.8x and 3.7x (and the overall runtime by 3.2x and 4.2x) compared to the two listing encodings. Figure 30 (right) considers the natural join of Housing under updates to all input relations. The number of tuples in the dataset varies from 150,000 (scale 1) to 1,400,000 (scale 20), while the size of the listing (factorized) representation of the natural join grows cubically (linearly) with the scale factor. The two listing encodings blow up the memory consumption and computation time for large scales.
Storing tuples in the listing representation using payloads instead of keys avoids the need for hashing wide keys, which makes the joins slightly cheaper. For Housing and the factorized representation, the root view stores 25,000 values of the join variable regardless of the scale. The root's children map these values to relational payloads for each relation. For the largest scale, Fact payloads is 481x faster and takes 548x less memory than List payloads (410ms vs. 197s, 195MB vs. 104GB), while List keys exceeds the available memory.

Related Work
To the best of our knowledge, ours is the first approach to propose factorized IVM for a range of distinct applications.It extends non-trivially two lines of prior work: higher-order delta-based IVM and factorized computation of in-database analytics.
Our view language is modelled on functional aggregate queries over semirings [6] and generalized multiset relations over rings [33]; the latter allowed us to adapt DBToaster to factorized IVM.
IVM. IVM is a well-studied area spanning more than three decades [18,59,65]. Prior work extensively studied IVM for various query languages and showed that the time complexity of IVM is lower than that of recomputation. We go beyond prior work on higher-order IVM for queries with joins and aggregates, as realized in DBToaster [33], and propose a unified approach for factorized computation of aggregates over joins [8], factorized incremental computation of linear algebra [45], and in-database machine learning over database joins [60]. DBToaster uses one materialization hierarchy per relation in the query, whereas F-IVM uses one view tree for all relations. DBToaster can thus have much higher space requirements and update times, as we observed experimentally. It does not consider the maintenance of composite aggregates such as the covariance matrix. IVM over array data [70] targets scientific workloads but without exploiting data factorization.
F-IVM over the relational payload ring strictly subsumes prior work on factorized IVM for acyclic joins [28], as it can support arbitrary joins. F-IVM has efficient support for free-connex acyclic [28] and q-hierarchical queries [12]. Exploiting key attributes to enable succinct delta representations and accelerate maintenance [30] complements our approach. Our framework generalizes the main idea of the LINVIEW approach [45] for maintaining matrix computation over arbitrary joins. Unlike approaches that exploit the append-only nature of data streams [67], F-IVM allows for both data insertions and deletions. F-IVM can be used to improve the memory efficiency of systems that integrate IVM into compilers to speed up the search in abstract syntax trees [9]. Such systems suffer from the high storage overhead of systems such as DBToaster that maintain significantly more views than F-IVM.
Commercial DBMSs support IVM for restricted classes of queries, e.g., Oracle [3] and SQLServer [1]. LogicBlox supports higher-order IVM for Datalog meta-programs [7,26]. Trill is a streaming engine that supports incremental processing of relational-style queries but no complex aggregates like covariance matrices [16]. Differential Dataflow [39] supports incremental processing for programs with recursion. There is a distinct line of work on maintenance for recursive Datalog [41].
Static In-DB analytics. The emerging area of in-database analytics has been overviewed in two tutorials [56,35] and a recent keynote [48]. Several systems support analytics over normalized data via a tight integration of databases and machine learning [40,27,57,56,35]. Other systems integrate with R to enable in-situ data processing using domain-specialized routines [69,14]. The closest in spirit to our approach is the work on learning models over factorized joins [58,60,51,31], on pushing ML tasks past joins [23,61], and on in-database linear algebra [22,17,53], yet none of these consider incremental maintenance.
Learning.There is a wealth of work in the ML community on incremental or online learning over arbitrary relations [62].Our approach learns over joins and crucially exploits the join dependencies in the underlying training dataset to improve the runtime performance.

Conclusion and Future Work
This article introduces F-IVM, a system that unifies the maintenance of a variety of analytics over normalized data under updates. We show its applicability to learning linear regression models, building Chow-Liu trees, and query evaluation with listing/factorized result representation. F-IVM recovers the best known complexities for free-connex acyclic and q-hierarchical queries. A prior version of this work [47] also discusses the application of F-IVM to matrix chain multiplication. These tasks use the same computation paradigm that factorizes the representation and the computation of the keys, the payloads, and the updates. Their differences are confined to the definition of the sum and product operations in a suitable ring. F-IVM is publicly available; it was implemented as an extension of DBToaster [33], a state-of-the-art system for incremental maintenance, and shown to outperform competitors by orders of magnitude in both time and space.
Going forward, we would like to apply this approach to further tasks such as inference in probabilistic graphical models and more complex machine learning tasks.
F-IVM inherits the limitations of DBToaster; in particular, it is single-threaded. A promising avenue of research is to build F-IVM on top of an open-source parallel and distributed framework such as Apache Flink. Another goal is to extend F-IVM to support further SQL operators such as theta joins, nested subqueries, and NULLs, which are relevant in practice.
If You Liked It, Then You Should Put A Ring On It. -Beyoncé.

Figure 1: View tree for the query in Example 1.1, with the propagation paths for updates to S (right, red) and to T (left, blue).
δVS := SELECT A, C, SUM(E) AS SE FROM δS GROUP BY A, C;
δVST := SELECT A, C, SUM(SD * SE) AS SC FROM VT NATURAL JOIN δVS GROUP BY A, C;
δQ := SELECT A, C, SUM(SB * SC) FROM VR NATURAL JOIN δVST GROUP BY A, C;

Figure 2: Overview of the F-IVM system.

Example 2.
4. ∃ −a ∈ D: a + (−a) = (−a) + a = 0.
5. (a * b) * c = a * (b * c).
6. a * 1 = 1 * a = a.
7. a * (b + c) = a * b + a * c and (a + b) * c = a * c + b * c.
A semiring (D, +, *, 0, 1) satisfies all of the above properties except the additive inverse property (Property 4) and adds the axiom 0 * a = a * 0 = 0. A (semi)ring for which a * b = b * a is commutative. □
The number sets Z, Q, R, and C with the arithmetic operations + and · and the numbers 0 and 1 form commutative rings. The set M of (n × n) matrices forms a non-commutative ring (M, +, ·, 0_{n,n}, I_n), where 0_{n,n} and I_n are the zero matrix and the identity matrix of size (n × n). The set N of natural numbers is a commutative semiring but not a ring because it has no additive inverse. Further examples are the max-product semiring (R_+, max, ×, 0, 1), the Boolean semiring ({true, false}, ∨, ∧, false, true), and the set semiring (2^U, ∪, ∩, ∅, U) of all subsets of a given set U. □
Data.
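To make the interface concrete, here is a minimal Python sketch (our own, not part of F-IVM) of a ring as a carrier with zero, one, plus, times, and negation. The helper `check_ring_axioms` is hypothetical and simply tests the listed axioms on sample elements:

```python
class IntRing:
    """The commutative ring (Z, +, *, 0, 1)."""
    zero, one = 0, 1

    @staticmethod
    def plus(a, b):
        return a + b

    @staticmethod
    def times(a, b):
        return a * b

    @staticmethod
    def neg(a):
        return -a

def check_ring_axioms(R, samples):
    """Spot-check identity, inverse, and distributivity on sample elements."""
    for a in samples:
        assert R.plus(a, R.zero) == a                        # additive identity
        assert R.plus(a, R.neg(a)) == R.zero                 # additive inverse
        assert R.times(a, R.one) == a == R.times(R.one, a)   # multiplicative identity
    for a in samples:
        for b in samples:
            for c in samples:
                # left distributivity: a * (b + c) = a * b + a * c
                assert R.times(a, R.plus(b, c)) == R.plus(R.times(a, b), R.times(a, c))
    return True

ok = check_ring_axioms(IntRing, [-2, 0, 1, 3])
```

Swapping in a different ring (matrices, or the covariance triples of Section 8.1) changes the payloads without changing the view-tree code that sums and multiplies them.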

Example 3.
The SQL query SELECT SUM(1) FROM R NATURAL JOIN S NATURAL JOIN T over tables R(A, B), S(A, C, E), and T(C, D) can be encoded as follows in our formalism. The table R is encoded as a relation R : Dom(A) × Dom(B) → Z that maps tuples (a, b) to their multiplicity in R; similarly, we encode the tables S and T as relations S and T. We translate the SQL query into: Q[] = ⊕_A ⊕_B ⊕_C ⊕_D ⊕_E (R[A, B] * S[A, C, E] * T[C, D]), where ⊕_{A⋯E} abbreviates ⊕_A ⋯ ⊕_E. The lifting functions used for marginalization map all values to 1.
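A hedged sketch of this encoding in Python (names and data are our own): each table is a map from key tuples to Z-multiplicities, bound variables are marginalized away (their lifting functions map every value to 1), and the count query is the sum over joining keys of the products of multiplicities:

```python
from collections import defaultdict

def marginalize(rel, keep):
    """Sum payloads while keeping only the attribute positions in `keep`."""
    out = defaultdict(int)
    for key, mult in rel.items():
        out[tuple(key[i] for i in keep)] += mult
    return out

# Toy instances: R(A, B), S(A, C, E), T(C, D); payload = multiplicity in Z.
R = {(1, 10): 2, (2, 20): 1}
S = {(1, 5, 7): 1, (1, 5, 8): 1, (2, 6, 9): 3}
T = {(5, 100): 1, (6, 200): 2}

RA = marginalize(R, [0])       # marginalize B:    A -> multiplicity
SAC = marginalize(S, [0, 1])   # marginalize E:    (A, C) -> multiplicity
TC = marginalize(T, [0])       # marginalize D:    C -> multiplicity

# SUM(1) over the natural join: sum of products over joining (A, C) keys.
count = sum(m * RA[(a,)] * TC[(c,)]
            for (a, c), m in SAC.items()
            if (a,) in RA and (c,) in TC)
```

The per-key products mirror the `*` of the ring and the marginalization mirrors its `+`, which is exactly what the view trees compute hierarchically.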

Example 5.
Let us consider the SQL query from Section 1.1, which computes SUM(R.B * T.D * S.E) grouped by A, C. Assume that B, D, and E take values from Z. We model the tables R, S, and T as relations mapping tuples to their multiplicity, as in Example 3. The variables A and C are free, while B, D, and E are bound.

Figure 3: (left) Variable order ω of the natural join of the relations R[A, B], S[A, C, E], and T[C, D]; (middle) view tree over ω and F = ∅; (right) view definitions.

Example 7. Consider the query from Example 3 that joins the relations R[A, B], S[A, C, E], and T[C, D].

Figure 3 gives a variable order (top left) for the query. Variable D has ancestors A and C, yet it only depends on C, since C and D appear in the same relation T, and D does not occur in any relation together with A. Thus, dep(D) = {C}. Given C, the variables D and E are independent of each other. □

Figure 5: (left) View tree over the variable order ω in Figure 3 and F = {A, C}; (right) view definitions.

Figure 6: Contents of the views in the view tree from Figure 3 in case the relations R, S, and T are over a ring (D, +, *, 0, 1) with p_i ∈ D for i ∈ [12].

Theorem 13.
Let Q be a query over a database of size N. F-IVM can maintain Q with O(N) preprocessing, O(1) enumeration delay, and

Example 17.
Continuing Example 16, we consider a single-tuple update δT[c, d] to relation T. Figure 10 depicts the original view tree (left) and the delta view tree for updates to T (right). The difference is that along the path from T to the root, we now have delta views. The delta view δV_T^{@D} results from δT[c, d] by marginalizing D, which takes constant time since D is fixed to the constant d. To compute δH^{@C}

Figure 12: (left) Canonical free-top variable order of the query Q_h from Example 22; (right) corresponding view tree.

Figure 14: From left to right: Hypergraph of the query Q and its Σ-reduct for Σ = {B → C, C → D} from Example 27; canonical variable order ω for Q; view tree modeled on ω.

Figure 15: Delta view trees derived from the view tree in Figure 14 for single-tuple updates to R, S, and T (left to right). The values b and c functionally determine c and d, respectively.

Theorem 26.
Let Q be a query over a database of size N with a set Σ of functional dependencies. F-IVM can maintain Q with O(N) preprocessing, O(1) enumeration delay, and O(N) single-tuple updates in case the Σ-reduct of Q is free-connex acyclic. F-IVM can maintain Q with O(N) preprocessing, O(1) enumeration delay, and O(1) single-tuple updates in case the Σ-reduct of Q is q-hierarchical.
Example 27. Consider Σ = {B → C, C → D} and the free-connex acyclic but not hierarchical query Q. Figure 14 depicts the hypergraphs of Q and Q′ (left), a free-top variable order for Q that is also canonical for Q′ (middle), and the view tree for Q modeled on this variable order (right). Since Q is free-connex acyclic, we can compute the view tree in linear time and enumerate the result tuples of Q with constant delay, as explained in Section 7.1. We next describe how to achieve constant-time updates by exploiting the functional dependencies. Figure 15 shows the delta view trees obtained from the view tree for Q for single-tuple updates to R, S, and T. Consider first the update δR[a, b] to relation R. The delta view δV_R^{@A}[b] is just a projection of the update tuple. The delta view δH_RS^{@B}[b, c] requires a lookup in S[B, C] for B = b. In general, there may be many C-values paired with b. However, under the fd B → C, there is at most one C-value c paired with b. Hence,

Figure 16: (left) Hypergraph of the triangle query Q_△; (right) view tree for the variable order A − B − C with an indicator projection ∃_{A,B} R.

Figure 17: Adding indicator projections to a view tree τ. Each view in τ gets as new children the indicator projections of relations that do not occur in the subtree rooted at the view but form a cycle with those that occur. GYO* is based on the GYO reduction [10].

Definition 32.
Fix a ring (D, +, *, 0, 1) and m ∈ N. Let C denote the set of triples (D, D^m, D^{m×m}), 0_C = (0, 0_{m×1}, 0_{m×m}), and 1_C = (1, 0_{m×1}, 0_{m×m}), where 0_{m×n} is an m × n matrix with all zeros from D. For a = (c_a, s_a, Q_a) ∈ C and b = (c_b, s_b, Q_b) ∈ C, define the operations +_C and *_C over C as:
a +_C b = (c_a + c_b, s_a + s_b, Q_a + Q_b)
a *_C b = (c_a * c_b, c_b * s_a + c_a * s_b, c_b * Q_a + c_a * Q_b + s_a s_b^T + s_b s_a^T)
and the other entries in s and Q are 0s. The payload V_ST^{@C}[a_2] encodes the result of SUM(1) group by C as s_3 = Q_(3,3) = {c_2 → 2}, the result of SUM(D) group by C as Q_(3,4) = {c_2 → d_2 + d_3}, and the result of SUM(E) group by C as Q_(3,5) = {c_2 → 2e_4}. The remaining entries in the payload V_ST^{@C}
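Assuming the operations above over the reals, the covariance ring can be sketched in a few lines of Python (our own code; the lifting function `lift` is a hypothetical example mapping a value x of variable i to count 1, sum x, and square x^2). Summing the lifted and multiplied tuple payloads yields the count, the per-variable sums, and the matrix of sums of products in a single pass:

```python
import numpy as np

M = 2  # degree: number of variables tracked

def zero():
    return (0.0, np.zeros(M), np.zeros((M, M)))

def plus(a, b):
    # componentwise addition of the triples (c, s, Q)
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def times(a, b):
    ca, sa, Qa = a
    cb, sb, Qb = b
    return (ca * cb,
            cb * sa + ca * sb,
            cb * Qa + ca * Qb + np.outer(sa, sb) + np.outer(sb, sa))

def lift(i, x):
    """Lift value x of variable i: count 1, sum x, square x^2."""
    s, Q = np.zeros(M), np.zeros((M, M))
    s[i] = x
    Q[i, i] = x * x
    return (1.0, s, Q)

# Usage: two tuples (x, y); the sum of products of lifted values gives
# count, sums, and the matrix of sums of squares and cross-products.
acc = zero()
for x, y in [(1.0, 2.0), (3.0, 4.0)]:
    acc = plus(acc, times(lift(0, x), lift(1, y)))
```

From these aggregates the covariance matrix follows directly (e.g., cov(x, y) = Q[0, 1]/c − (s[0]/c)(s[1]/c)), which is why maintaining the ring payload suffices for maintaining the model.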

Figure 18: Computing the query from Example 40 over the database in Figure 6 and the relational ring, where ∀i ∈ [12]: p_i = {() → 1}. The computation uses the view tree τ in Figure 3. The red views (rightmost column) have payloads storing the listing representation of the intermediate and final query results. The blue views (top two views in the middle column) encode a factorized representation of these results distributed over their payloads. The remaining (black) views remain the same for both representations.

Figure 21: Incremental maintenance of the covariance matrix over the Housing dataset (left), Retailer dataset (middle), and Favorita dataset (right) under updates of size 1,000 to all relations with a one-hour timeout. The CONT plots consider all features as continuous, while the MIXED plots consider a mix of continuous and categorical features.

Figure 22: Maintaining linear regression models over the Housing dataset (left), Retailer dataset (middle), and Favorita dataset (right) under updates of size 1,000 to all relations using F-IVM. Batch gradient descent is invoked after every update using previously learned parameters (TRAIN CONT) and parameters set to 0 (TRAIN SCRATCH). The NO TRAINING plots show the time to compute the covariance matrices only. The bottom charts show the cumulative numbers of iterations used by batch gradient descent during the training phase.

Figure 23: Incremental maintenance of the covariance matrix on top of the triangle query on Twitter for updates of size 1,000 to all input relations.
Figure 24: Throughput of batched incremental maintenance of the covariance matrix for batch sizes from 100 to 100,000 on the Retailer, Housing, and Twitter datasets for updates to all relations.

Figure 28: The average throughput (tuples/sec) of reevaluation and incremental maintenance of a sum aggregate under updates of size 1,000 to all relations of the Retailer and Housing datasets with a one-hour timeout (denoted by the symbol *).

Figure 29: Incremental maintenance and reevaluation of the product of three (n × n) matrices A = A_1 A_2 A_3: (left) one-row updates to A_2; (right) rank-r updates to A_2 for n = 4,096 using the DBToaster and Octave runtime environments.

Figure 30: Incremental maintenance using relational and factorized payloads for the natural joins of the Retailer (left) and Housing (right) datasets under updates of size 1,000 to the largest relation (Retailer) and to all input relations (Housing).