Query Lifting

Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous queries that mix set and multiset semantics, these important capabilities are not supported by known correctness results or implementations, which assume homogeneous collections. In this paper we study language-integrated query for a heterogeneous query language NRC_λ(Set,Bag) that combines set and multiset constructs. We show how to normalize and translate queries to SQL, and develop a novel approach to querying heterogeneous nested collections, based on the insight that “local” query subexpressions that calculate nested subcollections can be “lifted” to the top level analogously to lambda-lifting for local function definitions.


Introduction
Since the rise of relational databases as important software components in the 1980s, it has been widely appreciated that database programming is hard [13]. Databases offer efficient access to flat tabular data using declarative SQL queries, a computational model very different from that of most general-purpose languages. To get the best performance from the database, programmers typically need to formulate important parts of their program's logic as queries, thus effectively programming in two languages: their usual general-purpose language (e.g. Java, Python, Scala) and SQL, with the latter query code typically constructed as unchecked, dynamic strings. Programming in two languages is more than twice as difficult as programming in one language [35]. The result is a hybrid programming model where important parts of the program's functionality are not statically checked and may lead to run-time failures, or worse, vulnerabilities such as SQL injection attacks. This undesirable state of affairs was recognized by Copeland and Maier [13], who coined the term impedance mismatch for it.
Though higher-level wrapper libraries and tools such as object-relational mappings (ORM) can help ameliorate the impedance mismatch, they often come at a price of performance and lack of transparency, as high-level operations on in-memory objects representing database data are not always mapped efficiently to queries [44]. An alternative approach, which has almost as long a history as the impedance mismatch problem itself, is to elevate queries in the host language from unchecked strings to a typed, domain-specific sublanguage, whose interactions with the rest of the program can be checked and which can be mapped to database queries safely while providing strong guarantees. This approach is nowadays typically called language-integrated query, following Microsoft's successful LINQ extensions to .NET languages such as C# and F# [36,48]. It is ultimately based on Trinder and Wadler's insight that database queries can be modeled by a form of monadic comprehension syntax [49].
Comprehension-based query languages were placed on strong foundations in the database community in the 1990s [3,4,40,54,33]. A key insight due to Paredaens and van Gucht [40] is that although comprehension-based queries can manipulate nested collections, any expression whose input and output are flat collections (i.e. tables of records without other collections nested inside field values) can always be translated to an equivalent query using only flat relations (i.e. can be expressed in an SQL-like language). Wong [54] subsequently generalized this result and gave a constructive proof, in which the translation from nested to flat queries is accomplished through a strongly normalizing rewriting system.
Wong's work has informed a number of successful implementations, such as the influential Kleisli system [55] for biomedical data integration, and the Links programming language [12]. Although the implementation of LINQ in C# and F# was not directly based on normalization, Cheney et al. [7] showed that normalization can be performed as a pre-processing step to improve both the reliability and the performance of queries, and to guarantee that a well-formed query expression evaluates to (at most) one equivalent SQL expression at run time.
Comprehension-based language-integrated query also forms the basis for libraries such as Quill for Scala [41] and Database-Supported Haskell [21]. Most recently, language-integrated query has been extended further to support efficient execution of queries that construct nested results [25,8,21,52], by translating such queries to a bounded number of flat queries. This technique, currently implemented in Links and DSH, has several benefits: for example, it can be used to implement provenance tracking efficiently in queries [17,46]. Fowler et al. [19] showed that in some cases, Links's support for nested query results decreased both the number of queries issued and the total query evaluation time by an order of magnitude or more, compared to a Java database application. Unfortunately, there is still a gap between the theory and practice of language-integrated query. Widely-used and practically important SQL features that mix set and multiset collections, such as duplicate elimination, are supported by some implementations, but without guarantees regarding correctness or reliability. So far, such results have only been proved for special cases [7,8], typically for homogeneous queries operating on one uniform collection type. For example, in Links, queries have multiset semantics and cannot use duplicate elimination or set-valued operations. To the best of our knowledge, the questions of how to correctly translate flat or nested heterogeneous queries to SQL are open problems.
In this paper, we solve both open problems. We study a heterogeneous query language NRC_λ(Set,Bag), which was introduced and studied in our recent work [42]. We have previously extended the key results on query normalization to NRC_λ(Set,Bag) [43], but unlike the homogeneous case, the resulting normal forms do not directly correspond to SQL. In this paper, we first show how flat NRC_λ(Set,Bag) queries can be translated to SQL, and we then develop a new approach for evaluating queries over nested heterogeneous collections.
The key (and, to us at least, surprising) insight is to recognize that these two subproblems are really just different facets of one problem. That is, when translating flat NRC_λ(Set,Bag) queries to SQL, the main obstacle is how to deal with query expressions that depend on local variables; when translating nested NRC_λ(Set,Bag) queries to equivalent flat ones, the main obstacle is also how to deal with query expressions that depend on local variables. We solve this problem by observing that such query subexpressions can be lifted, analogously to lambda-lifting of local function definitions in functional programming [30], by abstracting over their free variables. Unlike lambda-lifting, however, we lift such expressions by converting them to tabular functions, or graphs, which can be calculated using database query constructs.
The remainder of this paper presents our contributions as follows:
- In section 2 we review the most relevant prior work and present our approach at a high, and we hope accessible, level.
- In sections 3 and 4 we present the core languages NRC_λ(Set,Bag) and NRC_G which will be used in the rest of the paper.
- Section 5 presents our results on translation of flat NRC_λ(Set,Bag) queries to SQL, via NRC_G.
- Section 6 presents our results on translation of NRC_λ(Set,Bag) queries that construct nested results to a bounded number of flat NRC_G queries.
- Sections 7 and 8 discuss related work and conclude.

Overview
In this section we sketch our approach. We use Links syntax [12], which differs in superficial respects from the core calculus in the rest of the paper but is more readable. We rely without further comment on existing capabilities of language-integrated query in Links, which are described elsewhere [11,34,8]. Suppose, hypothetically, we are interested in certain presidential candidates and prescription drugs they may be taking. In Links, an expression querying a small database of presidential candidates and their drug prescriptions can be written as a comprehension query Q0. Some (totally fictitious and not legally actionable) example data is shown in Figure 1; note that the prescriptions table contains duplicate entries. In Links, query results from the database are mapped back to list values nondeterministically, and the result of the query Q0 will be a list containing two copies of the tuple (DJT, adderall) and one copy of each of the tuples (DJT, hydrochloroquine) and (JRB, caffeine). If we are just interested in which candidates take which drugs and not how many times each drug was taken, we want to remove these duplicates. This can be accomplished in a basic SQL query using the DISTINCT keyword after SELECT. Currently, in Links there is no way to generate queries involving DISTINCT, and this duplicate elimination can only be performed in-memory. While this is not hard to do when the duplicate elimination happens at the end of the query, it is not as clear how to handle deduplication operations correctly in arbitrary places inside queries. Furthermore, SQL has several other operations that can have either set or multiset semantics, such as UNION and EXCEPT: how should they be handled?
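The behaviour of Q0 and of duplicate elimination can be sketched in Python (a stand-in for the Links listing, which is not reproduced in this text). The three-table layout, the field names cid and did, and the data below are our own assumptions, chosen only to reproduce the result described above:

```python
# Hypothetical tables: candidates, prescriptions, and drugs.
cand = [{"cid": 1, "name": "DJT"}, {"cid": 2, "name": "JRB"}]
pres = [{"cid": 1, "did": 10}, {"cid": 1, "did": 10},   # duplicate prescription row
        {"cid": 1, "did": 11}, {"cid": 2, "did": 12}]
drug = [{"did": 10, "drug": "adderall"},
        {"did": 11, "drug": "hydrochloroquine"},
        {"did": 12, "drug": "caffeine"}]

# Q0 as a comprehension: a bag (here, a list) of (name, drug) pairs,
# with duplicates preserved (multiset semantics).
q0 = [(x["name"], z["drug"])
      for x in cand for y in pres for z in drug
      if x["cid"] == y["cid"] and y["did"] == z["did"]]

# Duplicate elimination (SQL's SELECT DISTINCT) maps the bag to a set.
q0_distinct = set(q0)
```

Running this, q0 contains (DJT, adderall) twice, matching the duplicate in the prescriptions table, while q0_distinct contains each pair once.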
To study this problem we introduced a core calculus NRC_λ(Set,Bag) [42] (reviewed in the next section) in which there are two collection types, sets and multisets (or bags); duplicate elimination maps a multiset to a set with the same elements, and promotion maps a set to the least multiset with the same elements.
We considered, but were not previously able to solve, two problems in the context of NRC_λ(Set,Bag), which are addressed in this paper. First, the fundamental results regarding normalization and translation to SQL have been studied only for homogeneous query languages with collections consisting of either sets, bags, or lists. We recently extended the normalization results to NRC_λ(Set,Bag) [43], but the resulting normal forms do not correspond directly to SQL queries if operations such as deduplication, promotion, or bag difference are present. Second, query expressions that construct nested collections cannot be translated directly to SQL and can be very expensive to execute in-memory using nested loops, leading to the N + 1 query problem (or query avalanche problem [26]), in which one query is performed for the outer loop and then another N queries are performed, one per iteration of the inner loop. Some techniques have been developed for translating nested queries to a fixed number of flat queries, but to date they either handle only homogeneous set or bag collections [53,8], or lack detailed correctness proofs [26,51].
Regarding the first problem, the closest work in this respect is by Libkin and Wong [33], who studied and related the expressiveness of comprehension-based homogeneous set and bag query languages, but did not consider their heterogeneous combination or translation to SQL. A query Q1 illustrates the fundamental obstacle: it is similar to Q0, but eliminates duplicates among the drugs for each candidate. The query contains a duplicate elimination operation (dedup) applied to another query subexpression that refers to c, which is introduced in an earlier generator. This is not directly supported in classic SQL: by default, the subqueries in FROM clauses cannot refer to tuple variables introduced by earlier parts of the FROM clause. In fact, this query is expressible in SQL:1999 using the LATERAL keyword, which does allow such sideways information-passing (without the LATERAL keyword, the corresponding query is not well-formed SQL). However, such queries have only recently become widely supported, so they are not available on legacy databases, and even when supported, they are not typically optimized effectively; for example, PostgreSQL will evaluate such a query as a nested loop, with quadratic complexity or worse.
Regarding the second problem, Van den Bussche [53] showed that any query returning nested set collections can be simulated by n flat queries, where n is the number of occurrences of the set collection type in the result. However, this translation has not been used as the basis for a practical system to our knowledge, and it does not respect multiset semantics. Cheney et al. [8] provided an analogous shredding translation for nested multiset queries, but translated to a richer target language (including SQL:1999 features such as ROW_NUMBER) and did not handle operations such as multiset difference or duplicate elimination. Thus, neither approach handles the full expressiveness of a heterogeneous query language over bags and sets. A second query, Q2, illustrates the fundamental obstacle: much like Q1, it builds a multiset of pairs (name, drugs), but here drugs is a set of all of the drugs taken by candidate name. Such a query is, of course, not even syntactically expressible in SQL, because it returns a nested collection; nor is it expressible in previous work on nested query evaluation, because the result is a multiset of records, one component of which is a set.
We will now illustrate how to translate Q1 to a plain SQL query (not using LATERAL), and how to translate Q2 to two flat queries such that the nested result can be constructed easily from their flat results. First, note that we can rewrite both queries by introducing an abbreviation F(x) for a query subexpression parameterized by x. Next, observe that the set of all possible values for x appearing in some call to F(x) is finite, and can even be computed by a query. Therefore, we can write a closed query Q_F that builds a lookup table calculating the graph of F (or at least, as much of it as is needed to evaluate the queries). Notice that the use of deduplication here is really essential to define Q_F correctly: if we did not deduplicate, then there would be repeated tuples in Q_F, leading to incorrect results later. If we inline and simplify F(x), we get the following:

    Q_F' = dedup(for (x <- Cand, y <- Pres, z <- Drug)
                 where (x.cid == y.cid && y.did == z.did)
                 [(in = x, out = z.drug)])

Finally, we may replace the call to F(x) in Q1 with a lookup to Q_F. The resulting expression may be translated directly to SQL, because the argument to dedup is now closed. Although the generated query looks a bit more complex than the one given earlier using LATERAL, it can be optimized more effectively; for example, PostgreSQL generates a query plan that uses a hash join, giving quasi-linear complexity.
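This rewriting can be sketched in Python rather than Links, using the same hypothetical three-table schema as above (Cand(cid, name), Pres(cid, did), Drug(did, drug)); here the graph is keyed on the candidate's cid for simplicity:

```python
cand = [{"cid": 1, "name": "DJT"}, {"cid": 2, "name": "JRB"}]
pres = [{"cid": 1, "did": 10}, {"cid": 1, "did": 10},
        {"cid": 1, "did": 11}, {"cid": 2, "did": 12}]
drug = [{"did": 10, "drug": "adderall"},
        {"did": 11, "drug": "hydrochloroquine"},
        {"did": 12, "drug": "caffeine"}]

# Q_F': the closed lookup table (graph of F), deduplicated by construction
# because it is a Python set. Without deduplication, the duplicate
# prescription row would leak a duplicate into the final result.
q_f = {(x["cid"], z["drug"])
       for x in cand for y in pres for z in drug
       if x["cid"] == y["cid"] and y["did"] == z["did"]}

# Q1: the dependent call F(x) is replaced by a join against the closed graph.
q1 = [(x["name"], out)
      for x in cand for (inp, out) in q_f if inp == x["cid"]]
```

The join against q_f plays the role of the delateralized SQL query: no subquery refers to an outer tuple variable.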
On the other hand, to deal with Q2, we refactor it into two closed, flat queries Q21, Q22 and a stitching expression that builds the nested result from their flat results (illustrated in Figure 2). Notice that in Q21 we replaced the call to F with the argument x, while Q22 is just Q_F again. The stitching expression builds the nested result (in the host language's memory) by traversing Q21 and computing the set value of each cs field by looking up the appropriate values from Q22. Thus, the original query result can be computed by first evaluating Q21 and Q22 on the database, and then evaluating the final stitching expression in-memory. (In practice, as discussed in Cheney et al. [8], it is important for performance to use a more sophisticated stitching algorithm than the naive nested loop, but in this paper we are primarily concerned with the correctness of the transformation.) The above examples are a bit simplistic, but illustrate the key idea of query lifting. In the rest of this paper we place this approach on a solid foundation, and (partially inspired by Gibbons et al. [20]), to help clarify the reasoning, we extend the calculus with a type of tabulated functions or graphs, σ1 × ⋯ × σn ⇒ {τ}, with graph abstraction introduction form G(−; −) and graph application M(x1, …, xn). In our running example, Q_F could be defined by such a graph abstraction, and we would use the application operation Q_F(x) to extract the set of elements corresponding to x in Q_F. We will also consider tabular functions that return multisets rather than sets, in order to deal with queries that return nested multisets.
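A Python sketch of this shredded evaluation, under the same hypothetical schema as before (Cand(cid, name), Pres(cid, did), Drug(did, drug)), with the naive nested-loop stitching described above:

```python
cand = [{"cid": 1, "name": "DJT"}, {"cid": 2, "name": "JRB"}]
pres = [{"cid": 1, "did": 10}, {"cid": 1, "did": 10},
        {"cid": 1, "did": 11}, {"cid": 2, "did": 12}]
drug = [{"did": 10, "drug": "adderall"},
        {"did": 11, "drug": "hydrochloroquine"},
        {"did": 12, "drug": "caffeine"}]

# Q21: the outer flat query, a bag of (name, key) pairs; the call to F
# is replaced by its argument (here, the key cid).
q21 = [(x["name"], x["cid"]) for x in cand]

# Q22: the inner flat query, just the graph Q_F again (a deduplicated set).
q22 = {(x["cid"], z["drug"])
       for x in cand for y in pres for z in drug
       if x["cid"] == y["cid"] and y["did"] == z["did"]}

# Stitching (naive nested loop, in host-language memory): for each outer
# row, the nested set is recovered by looking up its key in Q22.
q2 = [(name, {out for (inp, out) in q22 if inp == key})
      for (name, key) in q21]
```

Only two database queries are needed, however many candidates there are, avoiding the N + 1 query problem.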

Background
We recap the main points from [42], which introduced a calculus NRC_λ(Set,Bag) whose syntax extends a simply-typed lambda calculus with both set and multiset collections. Set operations include the empty set ∅, singleton {M}, union M ∪ N, and comprehension ⋃{M | Θ}, where Θ is a sequence of generators x ← M. Similarly, multiset operations include the empty bag, bag singleton, bag union M ⊎ N, bag difference M − N, conditional M where_bag N, and emptiness test empty_bag(M). The syntax is completed by duplicate elimination δM (converting a bag M into a set with the same object type) and promotion ιM (which produces the bag containing all the elements of the set M, with multiplicity 1).
The one-way conditional operations M where_set N and M where_bag N evaluate the Boolean test N, and return the collection M if N is true, otherwise the empty set/bag; two-way conditionals can be supported without problems. Other set operations, such as intersection, membership, subset, and equality are also definable, as are bag operations such as intersection [4,33]. Also, we may define empty_bag(M) as empty_set(δ(M)) and M where_set N as δ(ι(M) where_bag N), but we prefer to include these constructs as primitives for symmetry. Generally, we will allow ourselves to write M where N and empty(M) without subscripts if the collection kind of these operations is irrelevant or made clear by the context. We freely use syntax for unlabeled tuples ⟨M1, …, Mn⟩, projections M.i, and tuple types σ1 × ⋯ × σn, and consider them to be syntactic sugar for labeled records. The typing rules for the calculus are standard and provided in an appendix. For the purposes of this discussion, we will highlight two features of the type system. The first is that the calculus used here differs from our previous work by using constants and table names, whose types are described by a fixed signature Σ. As usual, a typing judgment Γ ⊢ M : σ states that a term M is well-typed of type σ, assuming that its free variables have the types declared in the typing context Γ = x1 : σ1, …, xk : σk. For the rules for constants and tables, note in particular that the primitive functions c can only take inputs of base type and produce results at base type, and table constants t are always multisets of records whose fields are of base type. We refer to a record type of the form ⟨ℓ1 : b1, …, ℓn : bn⟩ (all fields of base type) as flat; if σ is flat, we refer to the set and bag types over σ as flat collection types.
The second is that our type system uses an approach à la Church, meaning that variable abstractions (in lambdas/comprehensions), empty sets and empty bags are annotated with their type in order to ensure the uniqueness of typing.
In the context of a larger language implementation, most of these type annotations can be elided and inferred by type inference.We have chosen to dispense with these details in the main body of this paper to avoid unnecessary syntactic cluttering.
We will use a largely standard denotational semantics for NRC_λ(Set,Bag), in which sets and multisets are modeled as finitely-supported functions from their element types to Boolean values {0, 1} or natural numbers, respectively. This approach follows the so-called K-relation semantics for queries [23,18], as used for example in the HoTTSQL formalization [10]. The full typing rules and semantics are included in the appendix.
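A minimal Python model of this semantics, representing a bag as a finite map from elements to natural-number multiplicities and a set as a map into {0, 1}; the operation names (δ, ι, bag union, bag difference) follow the text, while the encoding details are our own:

```python
from collections import Counter

def delta(bag):
    """Deduplication δ: collapse every positive multiplicity to 1."""
    return {v: 1 for v, n in bag.items() if n > 0}

def iota(s):
    """Promotion ι: the least bag with the same elements (multiplicity 1)."""
    return {v: 1 for v, b in s.items() if b}

def bag_union(m, n):
    """Bag union: multiplicities add."""
    return dict(Counter(m) + Counter(n))

def bag_difference(m, n):
    """Bag difference: truncated (monus) subtraction of multiplicities."""
    return {v: m[v] - n.get(v, 0) for v in m if m[v] - n.get(v, 0) > 0}
```

For example, delta({"a": 2, "b": 1}) yields the set {"a": 1, "b": 1}, and ι maps it back to the bag with each element once.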
NRC_λ(Set,Bag) subsumes previous systems including NRC [4,54], BQL [33] and NRC_λ [11,8]. In this paper, we restrict our attention to queries in which collection types taking part in δ, ι or bag difference contain only flat records. There are various reasons for excluding function types from these operators: for starters, any concrete implementation that used function types in these positions would need to decide the equality of functions; secondly, our rewrite system can ensure that a term whose type does not contain function types has a normal form without lambda abstractions and applications only if any δ, ι, or bag difference used in that term is applied to first-order collections. We thus want to exclude terms such as the bag comprehension of x 1 2 for x ← ι({λy z. y} ∪ {λy z. z}), which does not have an SQL representation despite having a flat collection type.
In order to obtain simpler normal forms, in which comprehensions only reference generators with a flat collection type, we also disallow nested collections within δ, ι, and bag difference.We believe this is without loss of generality because of Libkin and Wong's results showing that allowing such operations at nested types does not add expressiveness to BQL.
We have extended Wong's normalizing rewrite rule system, so as to simplify queries to a form that is close to SQL, with no intermediate nested collections. Since our calculus is more liberal than Wong's, allowing queries to be defined by mixing sets and bags and also using bag difference, we have added non-standard rules to take care of unwanted situations. In particular, we use constrained eta-expansion rules for comprehensions. The rationale of these rules is that in order to achieve, for comprehensions, a form that can be easily translated to an SQL select query, we need to move all the syntactic forms that are blocking to most normalization rules (i.e. promotion and bag difference) from the head of the comprehension to a generator. In order for this strategy to work out, we also need to know that the type of these subexpressions is flat, as we previously mentioned.
In Figure 3 we show the grammar for the normal forms for terms of nested relational types, i.e. types built from base types using records, sets, and bags (but no function types). For ease of presentation, the grammar actually describes a "standardized" version of the normal forms in which:
- ∅ is represented as the trivial union of an empty sequence of comprehensions; the empty bag has a similar representation using a trivial disjoint union;
- comprehensions without a guard are considered to be the same as those with a trivial true guard: ⋃{{M} | Θ} = ⋃{{M} where true | Θ};
- singletons that do not appear as the head of a comprehension are represented as trivial comprehensions.
Each normal form M can be either a term of base type X, a tuple of normal forms, a set query Q, or a bag query R. The normal forms of sets and bags are rather similar, both being defined as unions of comprehensions with a singleton head. The generators for set comprehensions F include deduplicated tables and deduplicated bag differences; the generators for bag comprehensions G must be either tables, promoted set queries, or bag differences.
The non-terminals used as the arguments of emptiness tests, promotion, and bag difference have been marked with a star to emphasize the fact that they must have a flat collection type. The corresponding grammar can be obtained from the grammar for nested normal forms by replacing the rule for M with one producing only flat normal forms. Normalized queries can be translated to SQL as shown in Figure 4, as long as they have a flat collection type. The translation uses SELECT DISTINCT and UNION where set semantics is needed, and SELECT, UNION ALL and EXCEPT ALL in the case of bag semantics. Note that promotion expressions ιQ* are translated simply by translating Q*, because in SQL there is no type distinction between set and multiset queries: all query results are multisets, and sets are considered to be multisets having no duplicates.
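The set/bag distinction in the operators the translation emits can be observed directly; the following sketch runs against an in-memory SQLite database (SQLite does not implement EXCEPT ALL, so only the union and DISTINCT cases are shown, and the single-column table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# Bag semantics: UNION ALL adds multiplicities, plain SELECT keeps duplicates.
bag_union = conn.execute("SELECT x FROM t UNION ALL SELECT x FROM t").fetchall()

# Set semantics: UNION and SELECT DISTINCT deduplicate.
set_union = conn.execute("SELECT x FROM t UNION SELECT x FROM t").fetchall()
distinct = conn.execute("SELECT DISTINCT x FROM t").fetchall()
```

Here bag_union has six rows while set_union and distinct each contain one row per distinct value.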
The other main complication in this translation is in handling generators containing references to other locally-bound variables. To deal with the resulting lateral variable references, we add the LATERAL keyword to such queries. As explained earlier, the use of LATERAL can be problematic, and we will return to this issue in Section 5.
Remark 1 (Record flattening). The above translations handle queries that take flat tables as input and produce flat results (collections of flat records). It is straightforward to support queries that return nested records (i.e. records containing other records, but not collections). For example, such a query M can be handled by defining both directions of the obvious isomorphism N between nested and flattened records, normalizing the flat query N ∘ M, evaluating the corresponding SQL, and applying the inverse N⁻¹ to the results. Such record flattening is described in detail by Cheney et al. [9] and is implemented in Links, so we will use it from now on without further discussion.

A relational calculus of tabular functions
We now introduce NRC_G, an extension of the calculus NRC_λ(Set,Bag) providing a new type of finite tabular function graphs (in the remainder of this paper, also called simply "graphs"; they are similar to the finite maps and tables of Gibbons et al. [20]). Semantically, the graph type σ1 × ⋯ × σn ⇒ τ will be interpreted as the set of finite functions from sequences of values of types σ1, …, σn to values in τ: such functions can return non-trivial values only for a finite subset of their input type. In our setting, we will require the output type of graphs to be a collection type (i.e. τ shall be either a set or a bag type), and we will use the empty set or bag as the trivial value. The typing rules involving graphs are shown in Figure 5. Graphs are created using the graph abstraction operations G_set(Θ; N) and G_bag(Θ; N), where Θ is a sequence of generators of the form x_i ← L_i, and are applied to a sequence of arguments N1, …, Nn, provided that each of the N_i is in the corresponding element of the domain of the graph. The typing rule does not enforce this requirement: if any of the N_i is not an element of L_i, the graph application will evaluate to an empty set or bag (depending on τ). Graphs can also be merged by union, using the set or bag union operator depending on their output collection kind. Furthermore, graphs that return bags can be subtracted from one another using bag difference; the deduplication and promotion operations also extend to graphs in the obvious way.
Whenever M is well typed and its typing environment is made clear by the context, we will allow ourselves to write ty(M) for the type of M. Furthermore, given a sequence of generators Θ = x1 ← L1, …, xn ← Ln, such that each L_i is a collection whose elements have type σ_i, we will write ty(Θ) to denote the associated typing context x1 : σ1, …, xn : σn. The semantics of NRC_λ(Set,Bag) is extended to NRC_G by interpreting graph abstractions as collections of pairs (u, v), where the tuple u represents the input and v the corresponding output of the graph; consequently, the semantics of a set graph abstraction G_set(x ← L; M) states that the multiplicity of (u, v) is equal to the multiplicity of v in the semantics of M (where each x_i is mapped to u_i) if each u_i is in the semantics of L_i, and zero otherwise. The semantics of bag graph abstractions is similar, with × substituted for ∧ to allow multiplicities greater than one in the graph output.
For graph applications M(N1, …, Nn), the multiplicity of v is obtained as the multiplicity of the pair (u, v) in the semantics of M, where u is the tuple of values denoted by the arguments N1, …, Nn. The semantics of set and bag union, bag difference, bag deduplication, and set promotion, as defined in NRC_λ(Set,Bag), are extended to graphs and remain otherwise unchanged in NRC_G.
In fact (as noted for example by Gibbons et al. [20]), the graph constructs of NRC_G are just a notational convenience: we can translate NRC_G back to NRC_λ(Set,Bag) by translating set-valued and bag-valued graph types to types of sets (respectively, bags) of input-output pairs, and rewriting the term constructs accordingly.
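A Python sketch of this encoding for set-valued graphs, representing a graph as a set of input-output pairs; the helper names g_set and apply_graph are our own, and the single-variable case stands in for a sequence of generators:

```python
def g_set(domain, body):
    """Graph abstraction G_set(x <- domain; body): tabulate body over domain.

    The graph is encoded as a set of (input, output-element) pairs."""
    return {(x, v) for x in domain for v in body(x)}

def apply_graph(g, arg):
    """Graph application: collect the outputs paired with arg.

    Returns the trivial value (the empty set) if arg is outside the domain."""
    return {v for (x, v) in g if x == arg}

# A small graph mapping each x in {1, 2} to the set {x, x + 10}.
g = g_set({1, 2}, lambda x: {x, x + 10})
```

Applying g to 2 yields {2, 12}, while applying it to a value outside the tabulated domain yields the empty set, matching the semantics described above.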

Delateralization
As explained at the end of section 3, if a subexpression of the form ι(N) or N1 − N2 contains free variables introduced by other generators in the query (i.e. not globally-scoped table variables), such queries cannot be translated directly to SQL, unless the SQL:1999 LATERAL keyword is used.
More precisely, we can give the following definition of lateral variable occurrence.

Definition 1. Given a query containing a comprehension whose generator sequence Θ binds a variable x, we say that x occurs laterally in Θ if, and only if, there is a subsequent binding y ← N in Θ such that x ∈ FV(N).
Since LATERAL is not implemented on all databases, and is sometimes implemented inefficiently, we would still like to avoid it. In this section we show how lateral occurrences can be eliminated even in the presence of bag promotion and bag difference, by means of a process we call delateralization.
Using the NRC_G constructs, we can delateralize simple cases of deduplication or multiset difference by tabulating the dependent subquery as a graph over the deduplicated source of the outer generator. It is necessary to deduplicate N in the first two rules to ensure that the results correctly represent finite maps from the distinct elements of N to multisets of corresponding elements of P. (In any case, N needs to be deduplicated in order to be used as a set in G(x ← δN; P).) Given a query expression in normal form, these rules together with standard equivalences (such as commutativity of independent generators) can be used to delateralize it: that is, to remove all occurrences of free variables in subexpressions of the form ι(N) or N1 − N2.

Theorem 1. If M is a flat query in normal form, then there exists a query M′ equivalent to M with no lateral variable occurrences.
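The effect of these rules can be sketched in Python: a lateral generator ranging over the deduplication of a subquery P that depends on x is replaced by a join against the closed graph tabulating P over the deduplicated outer source (P, the data, and all names here are hypothetical):

```python
# Outer source N: a bag (list) with duplicates.
outer = [1, 1, 2]

def p(x):
    """Dependent subquery P(x), returning a bag."""
    return [x, x, x + 10]

# Lateral form: a generator over dedup(P(x)) depends on the outer variable x.
lateral = [(x, y) for x in outer for y in set(p(x))]

# Delateralized form: tabulate P as a closed graph over the *deduplicated*
# outer source (deduplication is essential for correct multiplicities),
# then join the original outer bag against the graph.
graph = {(x, y) for x in set(outer) for y in p(x)}
delat = [(x, y) for x in outer for (x2, y) in graph if x2 == x]
```

Both forms produce the same bag of pairs, but the graph is a closed query that an SQL backend can evaluate without LATERAL.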
The proof of correctness of the basic delateralization rules and the above correctness theorem are in the appendix.
To illustrate some subtleties of the translation, consider a trickier example in which subqueries Q and P both depend on x. We proceed from the outside in, first delateralizing the difference. Note that the result still contains a lateral subquery, namely ι(P), which depends on x. After translating back to NRC_λ(Set,Bag), and delateralizing ι(P), the query normalizes to a form with no lateral occurrences.

Query lifting and shredding

In the previous sections, we have discussed how to translate queries with flat collection input and output to SQL. The shredding technique, introduced in [8], can be used to convert queries with nested output (but flat input) to multiple flat queries that can be independently evaluated on an SQL database, then stitched together to obtain the required nested result. This section provides an improved version of shredding, extended to a more liberal setting mixing sets and bags and allowing bag difference operations, and described using the graph operations we have introduced, allowing an easier understanding of the shredding process. We introduce, in Figure 6, a shredding judgment to denote the process by which, given a normalized NRC_λ(Set,Bag) query, each of its subqueries having a nested collection type is lifted (in a manner analogous to lambda-lifting [30]) to an independent graph query. More specifically, shredding will produce a shredding environment (denoted by Φ, Ψ, …), which is a finite map associating special graph variables ϕ, ψ to NRC_G terms. In the shredding judgment, the ⇒ symbol separates the input (to the left) from the output (to the right). The normalized NRC_λ(Set,Bag) term M is the query that is being considered for shredding; M may contain free variables declared in Θ, which must be a sequence of NRC_λ(Set,Bag) set comprehension bindings. Θ is initially empty,
but during shredding it is extended with parts of the input that have already been processed. Similarly, the input shredding environment Φ is initially empty, but will grow during shredding to collect shredded queries that have already been generated. It is crucial, for our algorithm to work, that M be in the form previously described in Figure 3, as this allows us to make assumptions on its shape: in describing the judgment rules, we will use the same metavariables as are used in that grammar. The output of shredding consists of a shredded term and an output shredding environment Ψ. Ψ extends Φ with the new queries obtained by shredding M; the shredded term is an output NRC_G query obtained from M by lifting its collection-typed subqueries to independent queries defined in Ψ.
The rules for the shredding judgment operate as follows. The first rule expresses the fact that a normalized base term X does not contain subexpressions with nested collection type, so it can be shredded to itself, leaving the shredding environment Φ unchanged. In the case of tuples, we perform shredding pointwise on each field, connecting the input and output shredding environments in a pipeline, and finally combine the shredded subterms in the obvious way.
The shredding of collection terms (i.e. unions and comprehensions) is performed by means of query lifting: we turn the collection into a globally defined (graph) query, which is associated to a fresh name ϕ and instantiated to the local comprehension context by graph application. This operation is reminiscent of the lambda-lifting and closure conversion techniques used in the implementation of functional languages to convert local function definitions into global ones. Thus, when shredding a collection, besides processing its subterms recursively, we need to extend the output shredding environment with a definition for the new global graph ϕ. In the interesting case of comprehensions, ϕ is defined by graph-abstracting over the comprehension context Θ; notice that, since we only shred normalized terms, we know that they have a certain shape and, in particular, the judgment for bag comprehensions must ensure that the generators −→G are converted into sets. The shredding of set and bag unions is performed by recursion on the subterms, using the same plumbing technique we employed for tuples; additionally, we optimize the output shredding environment by removing the graph queries −→ψ resulting from the recursion, since they are absorbed into the new graph ϕ.

Fig. 7. Typing rules for shredding environments.
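The finite-map reading of query lifting can be pictured concretely. The following Python sketch is our own illustrative model, not the paper's formal definition: a "graph" over a comprehension context is a finite map from each binding of the context variables to the subcollection that the body computes under that binding, and graph application is simply lookup. All names here are hypothetical.

```python
# Toy model of query lifting: a graph G_set(x <- M; N) is a finite map
# from each binding of the context variables to the subcollection that the
# body N computes in that environment. All names are illustrative.

def graph_set(generators, body):
    """generators: list of (name, collection-or-function-of-env) pairs;
    body: function from an environment dict to an iterable.
    Returns a dict mapping tuples of generator values to frozensets."""
    result = {}

    def go(i, env):
        if i == len(generators):
            key = tuple(env[name] for name, _ in generators)
            result[key] = frozenset(body(env))
            return
        name, coll = generators[i]
        # a callable collection models a dependent (lateral) generator
        for v in (coll(env) if callable(coll) else coll):
            go(i + 1, dict(env, **{name: v}))

    go(0, {})
    return result

def apply_graph(graph, args):
    # graph application instantiates the lifted query at a context point
    return graph[tuple(args)]

# Example: lift "the employees of department x" out of its comprehension.
employees = [("alice", "cs"), ("bob", "math"), ("carol", "cs")]
g = graph_set([("x", ["cs", "math"])],
              lambda env: {e for (e, d) in employees if d == env["x"]})
```

Applying `g` at `("cs",)` returns the same subcollection the local comprehension would have computed in that environment, which is the sense in which lifting preserves meaning: the values of the context variables themselves serve as the keys.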
Notice that, since the comprehension generators of our normalized queries must have a flat collection type, they do not need to be processed recursively. Furthermore, since our normal forms ensure that promotion and bag difference terms can only appear as comprehension generators, we do not need to provide rules for these cases.
The shredding environments used by the shredding judgment must be well typed, in the sense described by the rules of Figure 7: the judgment Φ : Γ means that the graph variables of Φ are mapped to terms whose types are described by Γ. Whenever we add a mapping [ϕ → M] to Φ, we must make sure that M is well typed (of graph type) in the typing environment Γ associated with Φ.
If Φ : Γ, we write ty(Φ) for the typing environment Γ associated with Φ. The following result states that shredding preserves well-typedness.

Theorem 2. Let Θ be well-typed and ty(Θ) ⊢ M : σ. If Θ ⊢ M ⇒ M′ | Φ, then Φ is well-typed and ty(Φ), ty(Θ) ⊢ M′ : σ.

We now intend to prove the correctness of shredding. First, we state a lemma which we can use to simplify certain expressions involving the semantics of graph application.

Definition 2. Let Θ be a closed, well-typed sequence of generators. A substitution ρ is a model of Θ (notation: ρ ⊨ Θ) if, and only if, for all x ∈ dom(Θ), we have ⟦Θ(x)⟧ρ (ρ(x)) > 0.
To state the correctness of shredding, we need the following notion of shredding environment substitution.

Definition 3. For every well-typed shredding environment Φ, the substitution of Φ into an N RC G term M (notation: MΦ) is defined as the operation replacing within M every free variable ϕ ∈ dom(Φ) with (Φ(ϕ))Φ (i.e. the value assigned by Φ to ϕ, after recursively substituting Φ).
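On a toy term representation, Definition 3 can be read as the following recursion. This is a sketch under our own encoding (not the paper's): terms are nested tuples, a graph variable occurrence is the pair `("var", name)`, and Φ is a plain dictionary.

```python
# Sketch of shredding-environment substitution M Phi (Definition 3):
# replace each free graph variable by its definition, recursively
# substituting Phi into that definition as well.

def subst_env(term, phi):
    if isinstance(term, tuple) and len(term) == 2 and term[0] == "var":
        name = term[1]
        if name in phi:
            # (Phi(name)) Phi: substitute into the definition as well
            return subst_env(phi[name], phi)
        return term
    if isinstance(term, tuple):
        return tuple(subst_env(t, phi) for t in term)
    return term

# Well-typedness of Phi -- each entry referring only to graph variables
# defined earlier -- is what guarantees this recursion terminates.
phi = {"psi": ("lit", 1),
       "phi0": ("pair", ("var", "psi"), ("lit", 2))}
```

For example, substituting `phi` into the variable `phi0` first inlines its definition and then resolves the inner reference to `psi`, yielding a closed term.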
We can easily show that the above definition is well posed for well-typed Φ.
We now show that shredding preserves the semantics of the input term, in the sense that the term obtained by substituting the output shredding environment into the output term is equivalent to the input.
Theorem 3 (Correctness of shredding). Let Θ be well-typed and ty(Θ) ⊢ M : σ.

Proof. By induction on the shredding judgment. We comment on two representative cases. In the set comprehension case, we want to prove the claim for ρ ⊨ Θ. We rewrite the lhs, where x i ← F i for all i = 1, ..., n, and the u i are such that ⟦F i⟧ρ i−1 (u i) > 0. By the definition of substitution and by Lemma 3, we rewrite the rhs. We can prove that, for all −→u such that ρ n does not model Θ, −−−→x ← F, the corresponding summand is 0; therefore, we only need to consider those −→u such that ρ n ⊨ Θ, −−−→x ← F. Then, to prove the thesis, we only need to show the equality of the remaining sides. In the set union case, we want to prove the claim for ρ ⊨ Θ. We rewrite the lhs; by the definition of substitution and by Lemma 3, we rewrite the rhs. By the induction hypothesis and unfolding of definitions, we know the equality for all i, which proves the thesis.
6.1 Reflecting shredded queries into N RC λ (Set, Bag)

The output of the shredding judgment is a stratified version of the input term, where each element of the output shredding environment provides a layer of collection nesting; furthermore, the output is ordered so that each element of the shredding environment only references graph variables defined to its left, which is convenient for evaluation. Our goal is to evaluate each shredded item as an independent query; however, these items are not immediately convertible to flat queries, partly because their type is still nested, and also due to the presence of graph operations introduced during shredding. We thus need to provide a translation operation capable of converting the output of shredding into independent flat terms of N RC λ (Set, Bag). This translation uses two main ingredients: an index function to convert graph variable references to a flat type I of indices, such that ϕ and −→x are recoverable from index(ϕ, −→x); and a technique to express graphs as standard N RC λ (Set, Bag) relations.
The resulting translation, denoted by •, is shown in Figure 8. Let us remark that the translation need be defined only for term forms that can be produced as the output of shredding: this allows us, for instance, not to consider terms such as ιM or M − N, which can only appear as part of flat generators of comprehensions or graphs.
We briefly discuss the interesting cases of the definition of the flattening translation. Base expressions X are expressible in N RC λ (Set, Bag), therefore they can be mapped to themselves (this is also true for empty(M), since normalization ensures that the type of M is a flat collection). Graph applications ϕ(−→x), as we said, are translated with the help of an abstract index operation: this is where the primary purpose of the translation is accomplished, by flattening a collection type to the flat type I, making it possible for a shredded query to be converted to SQL; although we do not specify the concrete implementation of index, it is worth noting that it must store the arguments of the graph application along with the (quoted) name of the graph variable ϕ. Tuples, unions, and comprehensions only require a recursive translation of their subterms; however, the generators of comprehensions must have a flat collection type, so no recursion is needed there. Finally, we translate graphs as collections of the pairs obtained by associating elements of the domain of the graph to the corresponding output; it is simple to come up with a comprehension term building such a collection: set-valued graphs are translated using set comprehension, while bag-valued ones use bag comprehension (this also means that, in the latter case, the generators for the domain of the graph, which are set-typed, must be wrapped in a ι).
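Continuing the finite-map reading of graphs from earlier, the two flattening ingredients can be sketched as follows. The paper leaves the implementation of index abstract; the representation below is our own choice, and all names are illustrative.

```python
# Sketch of the flattening ingredients: an index packs the quoted
# graph-variable name with the application arguments (both recoverable),
# and a graph (a finite map) becomes an ordinary flat relation of
# (input, output-row) pairs.

def index(phi_name, args):
    # a flat value of index type I; phi_name and args are recoverable
    return ("ix", phi_name, tuple(args))

def graph_to_relation(graph):
    """graph: dict mapping argument tuples to flat collections.
    Returns the set of pairs associating each point of the graph's
    domain with each element of the corresponding output."""
    return {(args, row) for args, rows in graph.items() for row in rows}

g = {("cs",): frozenset({"alice", "carol"}),
     ("math",): frozenset({"bob"})}
rel = graph_to_relation(g)
```

The relation `rel` is flat (a set of pairs of flat values), so it is the kind of value an ordinary SQL query can return, while the index values stand in for the nested subcollections until stitching.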
We can prove that the flattening embedding produces flat-typed terms, as expected.
Definition 4. A well-typed set comprehension generator Θ is flat-typed if, and only if, for all x ∈ dom(Θ), there exists a flat type σ such that ty(Θ(x)) = {σ}.
A well-typed shredding environment Φ is flat-typed if, and only if, for all ϕ ∈ dom(Φ), we have that ty( Φ(ϕ) ) is a flat collection type. It is important to note that the composition of shredding and • does not produce normalized N RC λ (Set, Bag) terms: when we shred a comprehension, we add to the output shredding environment a graph returning a comprehension, and when we translate this to N RC λ (Set, Bag) we get two nested comprehensions.

Fig. 9. The stitching function.
In fact, not only is this term not in normal form, but it may even contain, within Q * , a lateral reference to x; thus, after a flattening translation, we will always require the resulting queries to be renormalized and, if needed, delateralized.
Let norm denote N RC λ (Set, Bag) normalization, and S denote the evaluation of relational normal forms: we define the shredded value set Ξ corresponding to a shredding environment Φ as follows. The evaluation S is ordinarily performed by a DBMS after converting the N RC λ (Set, Bag) query to SQL, as described in Section 5. The result of this evaluation is reflected in a programming language such as Links as a list of records.

The stitching function
Given a N RC λ (Set, Bag) term with nested collections, we have first shredded it, obtaining a shredded N RC G term M′ and a shredding environment Φ containing N RC G graphs; then we have used the flattening embedding to reflect both M′ and Φ back into the flat fragment of N RC λ (Set, Bag); next we have used normalization and DBMS evaluation to convert the shredding environment into a shredded value set Ξ. As the last step in evaluating M : τ, we need to combine M′ and Ξ to reconstruct the correct nested value by stitching together partial flat values. The stitching function is shown in Figure 9: its job is to visit all the components of tuples and collections, ignoring atomic values other than indices along the way. The real work is performed when an index(ϕ, −→V) is found: conceptually, the index should be replaced by the result of the evaluation of ϕ(−→V). Remember that Ξ contains the result of evaluating the graph function ϕ after translation to N RC λ (Set, Bag), i.e. a collection of pairs associating each input of ϕ to the corresponding output: then, to obtain the desired result, we can take Ξ(ϕ), filter the pairs p whose first component is −→V, and return the second component of each such p after recursive stitching. Finally, observe that we track the result type argument in order to disambiguate whether to construct a set or a multiset when we encounter an index.
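The traversal just described can be modelled compactly. The sketch below is our own toy version, not the function of Figure 9 verbatim: it builds sets only, whereas the real function also consults the tracked result type to decide between set and multiset.

```python
# Toy stitching: walk a shredded value; on meeting an index, look up the
# flat relation for its graph variable in the shredded value set Xi, keep
# the pairs keyed by the index's arguments, and stitch their payloads
# recursively.

def stitch(value, xi):
    if isinstance(value, tuple) and value[:1] == ("ix",):
        _tag, phi, args = value
        return frozenset(stitch(out, xi)
                         for (inp, out) in xi[phi] if inp == args)
    if isinstance(value, tuple):
        return tuple(stitch(v, xi) for v in value)
    return value  # atomic values other than indices pass through

# Shredded value set: the evaluated flat relation for graph variable phi.
xi = {"phi": {(("cs",), "alice"),
              (("cs",), "carol"),
              (("math",), "bob")}}
```

For instance, stitching the tuple `("dept", "cs", ("ix", "phi", ("cs",)))` replaces the index with the set of rows keyed by `("cs",)`, reconstructing the nested value from its flat parts.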
The full correctness result follows by combining Theorems 3 and 4.
Corollary 1. For all M such that ⊢ M : τ, suppose ⊢ M ⇒ M′ | Ψ, and let Ξ be the shredded value set obtained by evaluating the flattened queries in Ψ. Then ⟦M⟧ = M′ : τ Ξ.

Related work
Work on language-integrated query and comprehension syntax has taken place over several decades in both the database and programming language communities.We discuss the most closely related work below.

Comprehensions, normalization and language integration
The database community had already begun in the late 1980s to explore proposals for so-called non-first-normal-form relations, in which collections could be nested inside other collections [45]. Following Trinder and Wadler's initial work connecting database queries with monadic comprehensions [49], query languages based on these foundations were studied extensively, particularly by Buneman et al. [4,3]. For our purposes, Wong's work on query normalization and translation to SQL [54] is the most important landmark; this work provided the basis for practical implementations such as Kleisli and later Links. Almost as important is the later work by Libkin and Wong [33], studying questions of expressiveness of bag query languages via a language BQL that extended basic N RC with deduplication and bag difference operators. They related this language to N RC with set semantics extended with aggregation (count/sum) operations, but did not directly address the question of normalizing and translating BQL queries to SQL. Grust and Scholl [28] were early advocates of the use of comprehensions mixing set, bag and other monadic collections for query rewriting and optimization, but did not study normalization or translatability properties. Although comprehension-based queries began to be used in general-purpose programming languages with the advent of Microsoft LINQ [36] and Links [12], Cooper [11] made the next important foundational contribution by extending Wong's normalization result to queries containing higher-order functions, and by showing that an effect system could be used to safely compose queries using higher-order functions even in an ambient language with side-effects and recursive functions that cannot be used in queries. This work provided the basis for the subsequent development of language-integrated query in Links [34] and was later adapted for use in F# [7], Scala [41], and by Kiselyov et al. [47] in the OCaml library QueΛ. However, on revisiting Cooper's proof in order to extend it to heterogeneous queries, we found a subtle gap, which was corrected in a recent paper [43]; the original result itself was correct. Accordingly, in this paper we focus on first-order fragments of these languages, without loss of generality.
Giorgidze et al. [22] have shown how to support non-recursive datatypes (i.e. sums), and Grust and Ulrich [29] built on this to show how to support function types in query results using defunctionalization. We considered using sums to support a defunctionalization-style strategy for query lifting, but Giorgidze et al. [22] map sum types to nested collections, which makes their approach unsuitable in our setting. Wong's original normalization result also considered sum types, but to the best of our knowledge normalization for N RC λ (Set, Bag) extended with sum types has not yet been proved.
In recent work, Suzuki et al. [47] have outlined further extensions to language-integrated query in the QueΛ system, which is based on finally-tagless syntax [6] and employs Wong's and Cooper's rewrite rules; Katsushima and Kiselyov's subsequent short paper [31] outlined extensions handling ordering and grouping. Kiselyov and Katsushima [32] present an extension to QueΛ called Squr that handles ordering based on effect typing, and they provide an elegant translation from Squr queries to SQL based on normalization-by-evaluation. Okura and Kameyama [39] outline an extension handling SQL-style grouping and aggregation operators in QueΛ G; however, their approach potentially generates lateral variable occurrences inside grouping queries. The systems QueΛ, Squr and QueΛ G consider neither heterogeneity nor nested results.
Our adoption of tabulated functions (graphs) is inspired in part by Gibbons et al. [20], who provided an elegant rational reconstruction of relational algebra, showing how standard principles for reasoning about queries arise from adjunctions. They employed types for (finite) maps and tables to show how joins can be implemented efficiently, and observed that such structures form a graded monad. We are interested in further exploring these structures and in extending our work to cover ordering, grouping and aggregation.

Query decorrelation and delateralization
There is a large literature on query decorrelation, for example to remove aggregation operations from SELECT or WHERE clauses (see e.g. [38,5] for further discussion). Delateralization appears related to decorrelation, but we are aware of only a few works on this problem, perhaps because most DBMSs only started to support LATERAL in the last few years. (Microsoft SQL Server has supported similar functionality for much longer through the keyword APPLY.) Our delateralization technique appears most closely related to Neumann and Kemper's work on query unnesting [38]. In this context, unnesting refers to the removal of "dependent join" expressions in a relational algebraic query language; such joins appear to correspond to lateral subqueries. This approach is implemented in the HyPer database system, but is not accompanied by a proof of correctness, nor does it handle nested query results. It would be interesting to formalize this approach (or others from the decorrelation literature) and relate it to delateralization.
Querying nested collections Our approach to querying nested heterogeneous collections clearly specializes to the homogeneous cases for sets and multisets respectively, which have been studied separately. Van den Bussche's work on simulating queries on nested sets using flat ones [53] has inspired subsequent work on query shredding, flattening and (in this paper) lifting, though the simulation technique itself does not appear practical (as discussed in the extended version of Cheney et al. [9]). More recently, Benedikt and Pradic [1] presented results on representing queries on nested collections using a bounded number of interpretations (first-order logic formulas corresponding to definable flat query expressions) in the context of their work on synthesizing N RC queries from proofs. This approach considers set-valued N RC only, and its relationship to our approach should be investigated further.
Cheney et al.'s previous work on query shredding for multiset queries [8] differs in several important respects. In that work we did not consider the deduplication and bag difference operations from BQL, which Libkin and Wong showed cannot be expressed in terms of the other N RC operations. The shredding translation was given in several stages, and while each stage is individually comprehensible, the overall approach is not easy to understand. Finally, the last stages of the translation relied on SQL features not present (or expressible) in the source language, such as ordering and the SQL:1999 ROW_NUMBER construct, to synthesize uniform integer keys. Our approach, in contrast, handles set, bag, and mixed queries, and does not rely on any SQL:1999 features.
In a parallel line of work, Grust et al. [26,21,50,52,51] have developed a number of approaches to querying nested list data structures, first in the context of XML processing [24] and subsequently for N RC-like languages over lists. The earlier approach [26], named loop-lifting (not to be confused with query lifting!), made heavy use of SQL:1999 capabilities for numbering and indexing to decouple nested collections from their context, and was implemented both in Links [50] and in earlier versions of the Database Supported Haskell library [21], both of which relied on an advanced query optimizer called Pathfinder [27] to optimize these queries. The more recent approach, implemented by Ulrich in the current version of DSH and described in detail in his thesis [51], is called query flattening and is instead based on techniques from nested data parallelism [2]. Both loop-lifting and query flattening are very powerful, and do not rely on an initial normalization stage, while supporting a rich source language with list semantics, ordering, grouping, aggregation, and deduplication, which can in principle emulate set or multiset semantics. However, to the best of our knowledge no correctness proofs exist for either technique. We view finding correctness results for richer query languages as an important challenge for future work.
Another parallel line of work, started by Fegaras and Maier [15,14], considers heterogeneous query languages based on monoid comprehensions, with set, list, and bag collections as well as grouping, aggregation and ordering operations, in the setting of object-oriented databases, and forms the basis for complex object database systems such as λDB [16] and Apache MRQL [14]. However, Wong-style normalization results or translations from flat or nested queries to SQL are not known for these calculi.
Lambda-lifting and closure conversion Since Johnsson's original work [30], lambda-lifting and closure conversion have been studied extensively for functional languages, with Minamide et al.'s typed closure conversion [37] of particular interest in compilers employing typed intermediate languages. We plan to study whether known optimizations in the lambda-lifting and closure conversion literature offer advantages for query lifting. The immediate important next step is to implement our approach and compare it empirically with previous techniques such as query shredding and query flattening. By analogy with lambda-lifting and closure conversion, we expect additional optimizations to be possible through a deeper analysis of how variables and fields are used in lifted subqueries. Another problem we have not resolved is how to deal with deduplication or bag difference at nested collection types in practice. Libkin and Wong [33] showed that such nesting can be eliminated from BQL queries, but their results do not provide a constructive algorithm for eliminating the nesting.

Conclusions
Monadic comprehensions have proved to be a remarkably durable foundation for database programming and language-integrated query, and have led to language support (LINQ for .NET, Quill for Scala) with widespread adoption. Recent work has demonstrated that techniques for evaluating queries over nested collections, such as query shredding or query flattening, can offer order-of-magnitude speedups in database applications [19] without sacrificing declarativity or readability. However, query shredding lacks the ability to express common operations such as deduplication, while query flattening is more expressive but lacks a detailed proof of correctness, and both techniques are challenging to understand, implement, or extend. We provide the first provably correct approach to querying nested heterogeneous collections involving both sets and multisets.
Our most important insight is that working in a heterogeneous language, with both set and multiset collection types, actually makes the problem easier, by making it possible to calculate finite maps representing the behavior of nested query subexpressions under all of the possible environments encountered at run time. Thus, instead of having to maintain or synthesize keys linking inner and outer collections, as is done in all previous approaches, we can instead use the values of variables in the closures of nested query expressions themselves as the keys. The same approach can be used to eliminate sideways information-passing. This is analogous to lambda-lifting or closure conversion in the compilation of functional languages, but differs in that we lift local queries to (queries that compute) finite maps rather than ordinary function abstractions. We believe this idea may have broader applications, and we will next investigate its behavior in practice and its applications to other query language features.

A Semantics

The empty collection is interpreted as the constantly-zero function: ⟦∅⟧ρ = λu.0. Basic types and records are represented by the usual interpretations of such types, and the details are elided. For set types, the interpretation of a set type {A} is ⟦A⟧ →fs {0, 1}, where →fs denotes finitely-supported functions from ⟦A⟧; here the support is the set of elements mapped to a nonzero value. We consider {0, 1} equipped with the usual structure of a Boolean algebra, with operations ∧, ∨, ¬, and we consider equality and other meta-level predicates as functions returning Boolean values. Likewise, we consider bag types to be interpreted as finitely-supported functions ⟦A⟧ →fs N, where N is the set of natural numbers, equipped with the usual arithmetic operations, including truncated subtraction m ∸ n = max(m − n, 0). Finally, to be explicit about the situations where we coerce a Boolean value to a natural number or vice versa, we introduce the notation χ : {0, 1} → N for the "characteristic function" and ζ : N → {0, 1} for the "nonzero test" function x ↦ (x > 0), which maps 0 to 0 and any nonzero value to 1. Note that ζ(χ(n)) = n.
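These interpretations can be sketched executably. The Python model below is our own (not the paper's), using `Counter` for finitely-supported functions into N, where absent keys are implicitly mapped to 0, so the support stays finite.

```python
from collections import Counter

# Finitely-supported semantics in miniature: a bag over A is a map
# A -> N (Counter), with chi/zeta coercing between {0,1} and N, and the
# bag operations defined pointwise on multiplicities.

def chi(b):            # characteristic function chi : {0,1} -> N
    return 1 if b else 0

def zeta(n):           # nonzero test zeta : N -> {0,1}
    return 1 if n > 0 else 0

def bag_union(m, n):   # pointwise addition of multiplicities
    return m + n

def bag_diff(m, n):    # pointwise truncated subtraction max(m - n, 0)
    return Counter({u: m[u] - n[u] for u in m if m[u] > n[u]})

m = Counter({"a": 2, "b": 1})
n = Counter({"a": 1, "b": 3})
```

Note how `bag_diff` keeps only elements with positive residual multiplicity, mirroring the truncated subtraction above, and how `zeta(chi(b)) == b` on Boolean values.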
Since we work with finitely-supported functions f, p, we write Σu f(u) (resp. ⋁u p(u)) for the summation (resp. disjunction) over all possible u of f(u) (resp. p(u)). Although this summation or disjunction ranges over infinitely many u, the number of values of u for which f or p can be nonzero is finite, so it is a finite sum or disjunction and thus well-defined. Finally, N RC λ (Set, Bag) also includes function types, lambda abstraction, and application (but not recursion); their addition poses no difficulty, and since these features can be normalized away prior to applying the results in this paper, we do not explicitly discuss them in the semantics.
B Proofs for Section 5

Recall (for example from Buneman et al. [4]) that set membership M ∈ N is definable as ¬empty set({x | x ← N, x = M}). It is straightforward to show that ⟦M ∈ N⟧ρ = ⟦N⟧ρ(⟦M⟧ρ), that is, the result is true iff the interpretation of N returns true on the interpretation of M. We will use this as a primitive in the following proofs. First we observe that when x was introduced by a generator

Proof. If the metric is zero, then the query is fully delateralized. Combining the basic delateralization steps above with commutativity, any expression with nonzero metric can be rewritten so as to decrease the metric (though possibly increasing the query size). We can also undo the effects of commutativity steps to restore the original order of generators, to preserve the query structure as much as possible for readability.

Theorem 8. Given M with ‖M‖ > 0, there exists an equivalent M′ with ‖M′‖ < ‖M‖ that can be obtained by applying commutativity and basic rewrites. Hence, there exists an equivalent fully-delateralized M″ with ‖M″‖ = 0.
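The membership encoding recalled at the start of this appendix can be checked directly in the finitely-supported model. This is a sketch of our own, with a set over A modelled as a dict A → {0, 1}; the names are illustrative.

```python
# M in N is definable as not(empty_set({x | x <- N, x = M})): membership
# reduces to an emptiness test on a filtering comprehension.

def empty_set(s):
    # a set is empty iff no element has characteristic value 1
    return all(b == 0 for b in s.values())

def member(v, s):
    # model the comprehension {x | x <- s, x = v}
    comprehension = {x: b for x, b in s.items() if x == v}
    return 0 if empty_set(comprehension) else 1

s = {"a": 1, "b": 0}   # the set {a}; b is mapped to 0 (not in the support)
```

Elements mapped to 0 behave exactly like absent elements, which is the point of the finitely-supported representation.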

Proof. The proof requires establishing that, whenever ‖M‖ > 0, there exists at least one outermost subexpression M 0 of the form ι(N) or N − P with ‖M 0‖ > 0; that is, M 0 should not be a subexpression of any larger subexpression of M having the same property. Moreover, M 0 must occur as a generator. We need to show that M 0 therefore contains at least one free record variable bound earlier in the same comprehension; we can show this by inspection of normal forms. Since this is the case, then (if the generator is not already adjacent) we can commute it to be adjacent to M 0 and then apply one of the delateralization rules, decreasing ‖M 0‖ and hence ‖M‖.

Furthermore, let Ξ and Ξ′ be the shredding value sets corresponding to Φ and Φ′: then M Ξ = M Ξ′.
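The metric argument just sketched can be made concrete on a simplified generator list. The toy version below is of our own devising (the paper's metric is defined on full normal forms): it counts the ι(N) and N − P generators that mention a variable bound earlier in the same comprehension, which are exactly the lateral occurrences that delateralization must eliminate.

```python
# A comprehension is modelled as a list of generators
# (bound_name, free_vars, is_special), where is_special marks an
# iota(N) or N - P generator. The metric is positive exactly when some
# special generator depends laterally on an earlier binding.

def lateral_metric(generators):
    bound, count = set(), 0
    for name, free_vars, special in generators:
        if special and free_vars & bound:
            count += 1          # a lateral occurrence still to eliminate
        bound.add(name)
    return count

lateral = [("x", set(), False), ("y", {"x"}, True)]   # iota(P) uses x
flat    = [("x", set(), False), ("y", set(), True)]   # no dependency
```

Each basic rewrite removes one such dependency, so the metric strictly decreases until it reaches 0, at which point the query is fully delateralized.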

C Proofs for Section 6
In the following proof, whenever Θ = −−−→x ← F, we use the following abbreviation.

Proof. We proceed by induction on the shredding judgment. We comment on the two key cases, beginning with set comprehension.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
SELECT c.name, d.drug
FROM Cand c,
     LATERAL (SELECT DISTINCT d.drug
              FROM Pres p, Drug d
              WHERE p.cid = c.cid AND p.did = d.did) d

Fig. 4. Translation to SQL, used to construct a (finite) tabular function mapping each sequence of values R 1, ..., R n in the sets M 1, ..., M n to the set ⟦N⟧[−→R/−→x]. If each M i has type {σ i} and N has type {τ}, then the graph has type −→σ ⇒ {τ}. Similarly, if N has bag type, G bag(−−−→x ← M; N) has the corresponding bag-valued graph type. The terms M 1, ..., M n constitute the (finite) domain of this graph. When the kind of graph application (set-based or bag-based) is clear from the context or unimportant, we allow ourselves to write G(−; −) instead of G set(−; −) or G bag(−; −). A graph G of type −→σ ⇒ τ can be applied to a sequence of terms N 1, ..., N n of type σ 1, ..., σ n to obtain a term of type τ. If G = G(−−−→x ← L; M), then we want the semantics of G(−−−→x ← L; M)(−→N) to be the same as that of M

Fig. 12. Semantics of set and multiset operations of N RC λ (Set, Bag)
singleton construction {M }, union M ∪ N , one-armed conditional M where set N , emptiness test empty set (M ), and comprehension {M | Θ},