Bounded Evaluation: Querying Big Data with Bounded Resources

This work aims to reduce queries on big data to computations on small data, and hence make querying big data possible under bounded resources. A query Q is boundedly evaluable if, when posed on any big dataset $\mathcal{D}$, there exists a fraction $\mathcal{D}_Q$ of $\mathcal{D}$ such that $Q(\mathcal{D}) = Q(\mathcal{D}_Q)$, and the cost of identifying $\mathcal{D}_Q$ is independent of the size of $\mathcal{D}$.
It has been shown that with an auxiliary structure known as access schema, many queries in relational algebra (RA) are boundedly evaluable under the set semantics of RA. This paper extends the theory of bounded evaluation to RAaggr, i.e., RA extended with aggregation, under the bag semantics. (1) We extend access schema to bag access schema, to help us identify $\mathcal{D}_Q$ for RAaggr queries Q. (2) While it is undecidable to determine whether an RAaggr query is boundedly evaluable under a bag access schema, we identify special cases that are decidable and practical. (3) In addition, we develop an effective syntax for bounded RAaggr queries, i.e., a core subclass of boundedly evaluable RAaggr queries without sacrificing their expressive power. (4) Based on the effective syntax, we provide efficient algorithms to check the bounded evaluability of RAaggr queries and to generate query plans for bounded RAaggr queries. (5) As proof of concept, we extend PostgreSQL to support bounded evaluation. We experimentally verify that the extended system improves performance by orders of magnitude.


Querying big data can be prohibitively costly. As an indicator, it is NP-hard¹ to decide whether a tuple is in the answer to an SPC (select, project, Cartesian product) query Q over a dataset, and it is PSPACE-hard¹ when Q is a query in relational algebra (denoted by RA) [2]. It takes days to join two tables with 10 million tuples each [3]. One might be tempted to think that parallel computation could do the job. However, there exist computational problems for which parallel scalability is beyond reach, i.e., no matter how many machines are used, the parallel runtime of algorithms for such problems may not be reduced [4]. Worse still, small businesses typically have constrained resources and may not afford large-scale parallel computation.
Is querying big data beyond the reach of small companies, or is it just a privilege of big companies? Is it possible to extend DBMS with an immediate capacity to answer common queries over big datasets under constrained resources?
One approach to tackling the challenge has recently been studied, based on bounded evaluation [5, 6]. To answer a query Q on a dataset $\mathcal{D}$, the idea is to look at only a "bounded" fraction $\mathcal{D}_Q$ of $\mathcal{D}$ that suffices to compute $Q(\mathcal{D})$, instead of at the entire $\mathcal{D}$. This is doable by using an access schema $\mathcal{A}$, which is a combination of cardinality constraints and associated indices. Under $\mathcal{A}$, Q is boundedly evaluable if for all datasets $\mathcal{D}$ that conform to $\mathcal{A}$, one can identify $\mathcal{D}_Q \subseteq \mathcal{D}$ by reasoning about the cardinality constraints, and fetch $\mathcal{D}_Q$ by using the indices of $\mathcal{A}$, such that (a) $Q(\mathcal{D}) = Q(\mathcal{D}_Q)$ and (b) the size $|\mathcal{D}_Q|$ is determined by $\mathcal{A}$ and Q only. In other words, if Q is boundedly evaluable under $\mathcal{A}$, query Q can then be exactly answered via bounded evaluation, by accessing only $\mathcal{D}_Q$ of size bounded by the cardinalities in $\mathcal{A}$.
The theory has been tested in industry and is found to "improve the performance by orders of magnitude" [7].

Example 1. Consider query $Q_1$ from Facebook Graph Search [8]: Find all my friends who have check-ins in UK². The query is posed on dataset $\mathcal{D}_1$ with two relations: (a) friend (uid, fid), stating that person fid is a friend of uid, and (b) checkin (uid, loc, cty, date), stating that person uid checked in at location loc in country cty on date. Written as an RA query, $Q_1$ is as follows ($u_0$ denotes "me"): $Q_1(x) = \text{friend}(u_0, x) \bowtie \text{checkin}(x, \text{loc}, \text{"UK"}, \text{date})$.

Here dataset $\mathcal{D}_1$ is big, with trillions of friend links and check-ins [9]. It is costly to compute $Q_1(\mathcal{D}_1)$ directly.
Consider an access schema $\mathcal{A}_1$ with two constraints. The first, on friend, specifies a Facebook policy [10]: a limit of 5 000 friends per user; the second, on checkin, states that each user can check in at most 193 countries. Indices can be built on $\mathcal{D}_1$ based on the first constraint such that given a person, it returns the ids of all her friends by accessing at most 5 000 friend tuples; similarly for the second. Taken together, these constraints and their associated indices are called access constraints [5].

Using $\mathcal{A}_1$, we can compute $Q_1(\mathcal{D}_1)$ by accessing at most 970 000 tuples from $\mathcal{D}_1$, instead of trillions. (1) We fetch a set $T_1$ of at most 5 000 fid's of friend tuples with uid = $u_0$, by using the index on friend. (2) For each fid f in $T_1$, we fetch a set $T_2$ of at most 193 country values of checkin tuples with uid = f, by using the index on checkin. (3) We return the set of fid's in $T_1$ with country = UK in $T_2$. The plan fetches at most 5 000 + 193 × 5 000 = 970 000 tuples to compute $Q_1(\mathcal{D}_1)$, no matter how big $\mathcal{D}_1$ is. Hence, $Q_1$ is boundedly evaluable under $\mathcal{A}_1$.
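The three-step plan of Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the indices are modelled as in-memory dictionaries, and the names `build_index`, `bounded_plan_q1` and the toy data are hypothetical, not part of the system described in this paper.

```python
from collections import defaultdict

FRIEND_BOUND = 5_000    # bound of the friend constraint (friends per user)
COUNTRY_BOUND = 193     # bound of the checkin constraint (countries per user)

def build_index(pairs):
    """Map each key to its distinct values, standing in for a constraint index."""
    index = defaultdict(set)
    for key, value in pairs:
        index[key].add(value)
    return index

def bounded_plan_q1(friend_idx, country_idx, me, country="UK"):
    """Return (answer, number of index entries fetched) for Q1."""
    fetched = 0
    t1 = friend_idx.get(me, set())           # step (1): friends of `me`
    fetched += len(t1)
    answer = set()
    for f in t1:                             # step (2): countries per friend
        t2 = country_idx.get(f, set())
        fetched += len(t2)
        if country in t2:                    # step (3): keep friends seen in UK
            answer.add(f)
    return answer, fetched

# Toy (hypothetical) data: (uid, fid) and (uid, cty) projections.
friend = [("u0", "a"), ("u0", "b"), ("u1", "c")]
checkin = [("a", "UK"), ("a", "FR"), ("b", "US"), ("c", "UK")]
answer, fetched = bounded_plan_q1(build_index(friend), build_index(checkin), "u0")

# Worst-case number of fetched values, independent of the dataset size.
worst_case = FRIEND_BOUND + COUNTRY_BOUND * FRIEND_BOUND
```

The count `fetched` depends only on how many friends and countries are reachable from `u0`, never on the total number of tuples, which is the point of the example.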

As shown in Example 1, bounded evaluation answers a query Q over a big dataset $\mathcal{D}$ by accessing a set $\mathcal{D}_Q$ of data values with bounded size. It does this by retrieving values (i.e., partial tuples) using indices associated with cardinality constraints that correlate attributes. One might think that this can also be carried out by conventional index-only plans for query optimization [11]. However, the two are different problems as indicated by their complexity bounds: deciding whether an SPC query can be answered with a "bounded" query plan is EXPSPACE-hard [5], while it is in PTIME to decide whether it has an index-only plan [11].
While bounded evaluation is promising, more work has to be done, from theory to systems. Bounded evaluation has only been studied for RA queries under the set semantics [5, 6]. In the real world, queries are often expressed in RA aggr, i.e., RA extended with aggregation under the bag semantics. RA aggr can express all SQL (structured query language) queries that do not carry arithmetic expressions. This makes bounded evaluation more intriguing.

Here gpBy($Q_3$, uid, count(cty)) groups the results of $Q_3$ by attribute uid and calculates count(cty) for each group (see Section 2.1 for more details about the gpBy operator). In contrast to $Q_1$, $\mathcal{A}_1$ does not help us answer $Q_2$.
Using $\varphi_2$, we can fetch a set of distinct countries for each friend x. However, x may have multiple UK check-ins. An access schema no longer suffices for RA aggr under the bag semantics.
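A small Python sketch makes the gap concrete. It assumes a hypothetical projection of checkin onto (uid, cty) and compares a count computed from distinct values only with a count computed from values that carry multiplicities.

```python
from collections import Counter

# Hypothetical (uid, cty) projection of checkin; user "a" checked in twice in UK.
checkin = [("a", "UK"), ("a", "UK"), ("a", "FR"), ("b", "UK")]

# Set-style access: only distinct (uid, cty) pairs are visible.
distinct_pairs = set(checkin)
set_counts = Counter(uid for uid, _ in distinct_pairs)

# Bag-style access: each distinct pair also carries its multiplicity.
multiplicities = Counter(checkin)
bag_counts = Counter()
for (uid, _cty), m in multiplicities.items():
    bag_counts[uid] += m
```

For user `a`, the set-style count is 2 while the bag-style count is 3, so an aggregate such as count(cty) needs the multiplicities.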
For practical use to emerge from the study, it is necessary to extend bounded evaluation from RA to RA aggr (SQL). This gives rise to several questions. How should we extend the access schema of [5, 6] to support the bag semantics? We will see that the problem of checking whether an SQL query is boundedly evaluable is undecidable. Given the negative result, is bounded evaluation beyond reach in practice? More specifically, is there any practical and decidable special case? Is it possible to develop a systematic method that allows us to efficiently check the bounded evaluability of SQL queries? In addition, after determining that an SQL query is boundedly evaluable, how can we generate and optimize a query plan to carry out its bounded evaluation?
Contributions. This paper answers these questions by extending the study to RA aggr, from theory to practice.
(1) Bounded evaluation for SQL. We extend bounded evaluation from RA to RA aggr, i.e., SQL (without arithmetic), to support arbitrarily nested aggregate sub-queries and group-by clauses. We introduce bag access schemas, an extension of the access schema of [5, 6] to support the bag semantics. We also formulate bounded query plans for RA aggr.
(2) Complexity of bounded evaluation. Not surprisingly, bounded evaluability is undecidable for SQL since it is already undecidable for RA [5]. We identify practical conditions that cover a number of real-life queries, for which the bounded evaluability can be efficiently determined. These conditions tell us what makes queries bounded.
(3) Effective syntax. To accommodate the undecidability, we develop an effective syntax $\mathcal{L}_\mathcal{B}$ for boundedly evaluable RA aggr queries. We show that under a bag access schema $\mathcal{B}$, (a) an RA aggr query Q is boundedly evaluable if and only if it is equivalent to a query Q' in $\mathcal{L}_\mathcal{B}$; and (b) it is in PTIME (polynomial time) to check whether Q is in $\mathcal{L}_\mathcal{B}$. That is, $\mathcal{L}_\mathcal{B}$ is a core subclass of boundedly evaluable RA aggr queries that are syntactically checkable without sacrificing the expressive power. This is along the same lines as how commercial database systems (DBMS) deal with safe relational calculus queries, for which safety is undecidable [12-14].
(4) Extending DBMS with bounded evaluation. We present a framework, referred to as BEAS (bounded evaluable SQL), to provide commercial DBMS with the capability of bounded evaluation of RA aggr queries. Given an RA aggr query Q and a bag access schema $\mathcal{B}$, BEAS first checks whether Q is boundedly evaluable under $\mathcal{B}$. If so, it generates a query plan for Q to compute $Q(\mathcal{D})$ by accessing a bounded small fraction of $\mathcal{D}$ using $\mathcal{B}$. Otherwise, it leverages access schema $\mathcal{B}$ and generates a partially bounded plan, to bound sub-queries of Q. We develop al-
(3) otherwise, it computes approximate answers to Q in $\mathcal{D}$ also by accessing a bounded amount of data, and provides deterministic accuracy ratios [17].
In the entire process, it only accesses a bounded fraction of data and can be conducted under bounded resources. Hence it is feasible to provide small businesses with a capacity of querying big data despite constrained resources.
To simplify the discussion, we focus on row-oriented DBMS (a.k.a. row stores) in this paper. Nonetheless, as will be seen in Section 2, the model of bounded evaluation subsumes column stores. Moreover, bounded evaluation can be readily extended to parallel and distributed systems [18].

Organization. The remainder of the paper is organized as follows. Section 2 defines bag access constraints and formulates boundedly evaluable RA aggr queries. Section 3 studies the complexity of bounded evaluation for RA aggr queries. Section 4 proposes an effective syntax for boundedly evaluable RA aggr queries. Section 5 introduces BEAS and develops its underlying algorithms. The experimental study is presented in Section 6. We discuss related work in Section 7, and identify topics for future work in Section 8.

Bounded evaluation of SQL queries
We first define bag access schema (Section 2.1) and then formulate bounded evaluation of RA aggr queries, aggregate or not, under the bag semantics (Section 2.2).

RA aggr and bag access schema
We start with a review of RA aggr , an extension of RA with a group-by construct and nested aggregate sub-queries.
RA aggr queries. An RA aggr query is an expression defined in terms of RA operators (i.e., select $\sigma$, project $\pi$, Cartesian-product $\times$ or join $\bowtie$, renaming $\rho$, union $\cup$ and set difference $-$), and additionally a group-by aggregate operator $\text{gpBy}(Q, X, \text{agg}_1(V_1), \ldots, \text{agg}_m(V_m))$, where (a) Q is an RA aggr query, (b) X is a set of attributes for group-by, and (c) each $\text{agg}_i$ is one of the aggregate functions max, min, count, sum, avg, and each $V_i$ is an attribute in the output relation of Q. We refer to $\text{agg}_i(V_i)$ as an aggregate field on attribute $V_i$. We write the operator in abbreviated form when it is clear from the context. Since Q may include aggregate operators itself, aggregations in an RA aggr query may be arbitrarily nested.
In SQL syntax, the operator can be written as select X, $\text{agg}_1(V_1), \ldots, \text{agg}_m(V_m)$ from Q group by X. As a special case, when $X = \emptyset$, the operator does not have a group-by construct. We write it simply as agg(Q).
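The semantics of the group-by aggregate operator over a bag can be sketched as follows. This is a minimal illustration, assuming that a bag is a Python list of dictionaries; the function name `gp_by` and the output column naming are hypothetical choices made here.

```python
from collections import defaultdict

AGGREGATES = {
    "max": max,
    "min": min,
    "count": len,
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
}

def gp_by(rows, group_attrs, agg_fields):
    """Group a bag of rows by group_attrs and evaluate each (agg, attribute) field.

    rows: list of dicts (duplicates are kept, i.e., bag semantics).
    agg_fields: list of (aggregate name, attribute) pairs.
    Note: on empty input this sketch returns no groups, whereas SQL returns
    one row when there is no group-by construct.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in group_attrs)].append(row)
    result = []
    for key, members in groups.items():
        out = dict(zip(group_attrs, key))
        for agg, attr in agg_fields:
            out[f"{agg}({attr})"] = AGGREGATES[agg]([m[attr] for m in members])
        result.append(out)
    return result

rows = [{"uid": "a", "cty": "UK"}, {"uid": "a", "cty": "UK"}, {"uid": "b", "cty": "FR"}]
per_user = gp_by(rows, ["uid"], [("count", "cty")])
overall = gp_by(rows, [], [("count", "cty")])   # empty X: plain agg(Q)
```

With an empty list of group-by attributes all rows fall into a single group, which corresponds to the special case agg(Q).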
Example 3. Query Q 2 in Example 2 is an RA aggr query.
As another example, an RA aggr query with nested aggregation is Q 4 over relations R(A, B, C) and S(E, F, W):

Bag access schema. To support the bag semantics, we extend the access schema of [5, 19] to bag access schema. Over a database schema $\mathcal{R}$, a bag access schema $\mathcal{B}$ is a set of bag access constraints of the form $R(|X \to Y, N|)$, where R is a relation schema in $\mathcal{R}$, X and Y are sets of attributes of R, and N is a positive integer.

To define the semantics of bag access constraints, we use the following notations.

A bag access schema over the relations of Example 1 consists of the following bag access constraints: $\varphi_1 = \text{friend}(|\text{uid} \to \text{fid}, 5\,000|)$ and $\varphi_2 = \text{checkin}(|\text{uid} \to \text{cty}, 193|)$. Here $\varphi_2$ says that (a) for any uid $u_1$, there exist at most 193 distinct country values, and (b) there exists an index built on the checkin relation that, given any uid $u_1$, fetches all associated distinct countries c and, for each country c, the multiplicity of $(u_1, c)$ in the checkin relation; similarly for $\varphi_1$.

As another example, consider a bag access schema $\mathcal{B}_2$ for $Q_4$ of Example 3, which consists of bag access constraints:

We will see that $Q_4$ can be efficiently answered with $\mathcal{B}_2$.
A database instance $\mathcal{D}$ of $\mathcal{R}$ conforms to a bag access schema $\mathcal{B}$, denoted by $\mathcal{D} \models \mathcal{B}$, if $D \models \varphi$ for every $\varphi$ in $\mathcal{B}$, where $\varphi = R(|X \to Y, N|)$, and D is the instance of R in $\mathcal{D}$.
Intuitively, a bag access constraint $\varphi$ extends an access constraint $\psi$ of [5, 19] by incorporating multiplicity. Similar to $\psi$, given any X-value in D, $\varphi$ enforces the cardinality constraint and returns the distinct corresponding Y-values. In contrast to $\psi$, for each corresponding Y-value, $\varphi$ also returns the multiplicity of the corresponding XY-value in D. In other words, access constraints under the set semantics [5, 6] are a special case of bag access constraints when we only bound the cardinality and retrieve distinct values.
Remark 1. When it is clear from the context, we also simply refer to bag access schema as access schema.
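The behaviour of a bag access constraint $R(|X \to Y, N|)$ described above can be sketched with a small in-memory index. The class `BagConstraintIndex` and its method names are hypothetical, introduced only for illustration; a real implementation would sit on top of DBMS indices.

```python
from collections import Counter, defaultdict

class BagConstraintIndex:
    """Hypothetical in-memory index for a bag access constraint R(|X -> Y, N|)."""

    def __init__(self, rows, x_attrs, y_attrs, bound):
        self.x_attrs, self.y_attrs, self.bound = x_attrs, y_attrs, bound
        # X-value -> {Y-value: multiplicity of the XY-value in the instance}
        self.entries = defaultdict(Counter)
        for row in rows:
            x = tuple(row[a] for a in x_attrs)
            y = tuple(row[a] for a in y_attrs)
            self.entries[x][y] += 1

    def conforms(self):
        """Cardinality check: every X-value has at most N distinct Y-values."""
        return all(len(ys) <= self.bound for ys in self.entries.values())

    def fetch(self, x_value):
        """Distinct Y-values for x_value, each paired with its multiplicity."""
        return dict(self.entries.get(x_value, {}))

checkin_rows = [
    {"uid": "a", "cty": "UK"}, {"uid": "a", "cty": "UK"}, {"uid": "a", "cty": "FR"},
]
index = BagConstraintIndex(checkin_rows, ["uid"], ["cty"], bound=193)
fetched = index.fetch(("a",))
```

Dropping the multiplicities from `fetch` yields the behaviour of an access constraint under the set semantics.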

Bounded evaluation of RA aggr queries
We next define bounded evaluation for RA aggr queries.
Multiplicity relations. From an instance D of a relation schema R, we can use the index of a bag access constraint $\varphi = R(|X \to Y, N|)$ to retrieve a relation consisting of tuples $(t[X, Y], m(D, t[X, Y]))$, where t is a tuple in D. It is a set that, besides partial tuples $t[X, Y]$ in D, carries their multiplicities $m(D, t[X, Y])$, and is referred to as a multiplicity relation.
The RA aggr operators can be readily extended to multiplicity relations. Take the join operator as an example. Given two multiplicity relations $I_1$ and $I_2$, the result of $I_1 \bowtie I_2$, denoted by $I_s$, is a multiplicity relation as follows: (a) tuples in $I_s$ have the form (t, M), where t is a result tuple of $I_1 \bowtie I_2$ using the conventional join semantics (ignoring multiplicity), and (b) $M = m(I_1, t_1) \cdot m(I_2, t_2)$, where t is joined from $t_1 \in I_1$ and $t_2 \in I_2$, and $m(I_i, t_i)$ denotes the multiplicity of tuple $t_i$ in multiplicity relation $I_i$. Similarly, other RA aggr operators are defined on multiplicity relations.

Bounded RA aggr plans. A bounded RA aggr plan $\xi$ under a bag access schema $\mathcal{B}$ is an algebra tree that extends conventional RA aggr query plans with a new operator fetch(T, $\varphi$), where $\varphi = R(|X \to Y, N|)$ is a bag access constraint in $\mathcal{B}$, and T is an intermediate multiplicity relation on attributes R[X] (see Appendix for a formal definition). Over an instance D of R that conforms to $\varphi$, fetch(

Observe that the plan does not explicitly use $\varphi_3$. However, the correctness of the plan relies on the cardinality of $\varphi_3$. Moreover, the plan propagates constants of $Q_4$ via join and fetch, such that all values and their combinations that are needed for answering $Q_4$ are fetched from the dataset. In particular, in the presence of nested aggregation, answers to aggregate sub-queries can also be used by fetch, e.g., $T_3$.

(1) Scale independence. Each fetch operation in $\xi$ retrieves data with a cost that can be quantified by the bag constraint employed. Hence the cost of executing $\xi$ is determined by bag access schema $\mathcal{B}$ and query plan $\xi$ only, not by the size of dataset $\mathcal{D}$ as long as $\mathcal{D}$ conforms to $\mathcal{B}$. That is, under the bag semantics, bounded RA aggr plans preserve the scale independence of bounded evaluation for RA queries [5, 6].
(2) Late bag semantics enforcement. Plan $\xi$ fetches and operates on sets since fetch(T, $\varphi$) returns a set. It defers the processing of the bag semantics to a stage as late as possible. This reduces the performance degradation caused by duplicated values in, e.g., joins, in which duplicates get inflated rapidly.

(3) Subsuming column stores. Bounded plan $\xi$ can also express query evaluation over column stores [20] or columnstore indices [21]. Indeed, in a column store (or a columnstore index), each column (or column index) on attribute A of a relation schema R is essentially a special case of a bag access constraint of the form $R(|\emptyset \to A, N|)$. Hence, column stores and columnstore indices are a special case of bag access schema, and their evaluation plans can be expressed by a bounded plan $\xi$ under such a bag access schema.

Note that the efficiency of column stores mainly comes from their implementation-level optimization, e.g., column compression and vectorization [20]. While these optimization strategies can also be used to implement the indices of bag access schema, they are not the focus of this paper. We study query evaluation at the logical level, under generic constraints $R(|X \to Y, N|)$ when X is not necessarily empty. Hence, this paper focuses on row-oriented databases as the underlying platform for implementing bag access schema.
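To illustrate how operators work on multiplicity relations, the sketch below implements a natural join in Python. It assumes that multiplicities multiply, as in the standard bag semantics of join, and represents a multiplicity relation as a dictionary from tuples to multiplicities; the function name `join_multiplicity` and this representation are hypothetical.

```python
def join_multiplicity(i1, i2, attrs1, attrs2):
    """Natural join of two multiplicity relations.

    i1, i2: dicts mapping a tuple of values to its multiplicity.
    attrs1, attrs2: attribute names for the tuple positions of i1 and i2.
    Returns (output attribute names, joined multiplicity relation).
    """
    shared = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    out_attrs = list(attrs1) + extra
    result = {}
    for t1, m1 in i1.items():
        r1 = dict(zip(attrs1, t1))
        for t2, m2 in i2.items():
            r2 = dict(zip(attrs2, t2))
            if all(r1[a] == r2[a] for a in shared):
                t = t1 + tuple(r2[a] for a in extra)
                # Each joined tuple carries the product of the two multiplicities.
                result[t] = result.get(t, 0) + m1 * m2
    return out_attrs, result

# Sets of distinct tuples with multiplicities, as a fetch operator would return.
i1 = {("u0", "a"): 2}
i2 = {("a", "UK"): 3, ("b", "UK"): 1}
out_attrs, joined = join_multiplicity(i1, i2, ["uid", "fid"], ["fid", "cty"])
```

The join iterates over distinct tuples only, so six duplicate result tuples are represented by one entry with multiplicity 6, which reflects the idea of deferring the bag semantics.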

Complexity of bounded evaluation
In this section, we study the complexity of bounded evaluability and identify practical decidable cases.
Bounded evaluability.The problem is stated as follows.

• Input: A database schema $\mathcal{R}$, a bag access schema $\mathcal{B}$ over $\mathcal{R}$, and an RA aggr query Q over $\mathcal{R}$.
• Question: Is Q boundedly evaluable under $\mathcal{B}$?
This bounded evaluability problem is to decide whether a query can be answered by accessing a bounded amount of data, and underlies the first step of our framework for querying big data under bounded resources (Section 1).
No matter how important, the problem is hard. To see why it is intriguing, let us consider Example 6.

At first glance, neither $Q_9$ nor $Q_{10}$ seems boundedly evaluable, and hence neither does $Q_8$. Indeed, we cannot retrieve values for any of x, y or z using indices in $\mathcal{B}_3$. However, putting together and $\varphi_7$ of $\mathcal{B}_3$, one can deduce that x must be equal to either 1 or y in all tuples retrieved from the instance of T by any query plan for $Q_9$. In other words, under $\mathcal{B}_3$, $Q_9$ reduces to SPCU , where and .
Hence, since . It is easy to see that is boundedly evaluable under $\mathcal{B}_3$ and as a result, so is $Q_8$.
As shown above, it is often necessary to check query equivalence to decide whether a query is bounded. The use of union ($\cup$) allows us to convert SPC to SPCU under a bag access schema (e.g., $Q_4$ to ), which may further interact with set difference ($-$) (e.g., $Q_3$ and ). It is beyond reach in practice to check the equivalence of RA or RA aggr queries. Thus, the bounded evaluability problem is already undecidable for RA, a special case of RA aggr.
Theorem 1 was verified under access schema. As remarked in Section 2, access schema is a special case of bag access schema. Hence, the bounded evaluability problem remains undecidable for RA under a bag access schema. As an immediate corollary, the bounded evaluability problem is undecidable for RA aggr, which subsumes RA.
Decidable cases. We next identify special cases when the bounded evaluability is decidable. The reason is twofold. (1) The special cases cover a large number of RA aggr queries used in practice, e.g., all SPC sub-queries of built-in benchmark queries in TPC-H [15] and TPC-DS [22].
(2) These cases reveal insight about why queries become boundedly evaluable. In Section 4, we will deal with generic RA aggr queries, by devising an effective syntax for boundedly evaluable RA aggr queries.
Statement (1) apparently holds. Below we prove (2) by giving a PTIME sufficient and necessary condition for checking the bounded evaluability of Q under $\mathcal{B}$.
The condition needs a notion of covered SPC queries from [6]. It is characterized by a set $\mathsf{cov}(Q, \mathcal{B})$, which consists of attributes whose values can be retrieved via fetch operations along with $\mathcal{B}$, without directly accessing raw data in a database. More specifically, let be the set of attributes A in the constant selection predicates of Q, i.e., A = c for a constant c. Then $\mathsf{cov}(Q, \mathcal{B})$ is inductively defined as: and (denoting that can be deduced from selection predicates via the transitivity of equality), then ; Denote by the set of attributes that are either in the selection or join predicates of Q, or are the top-level projection attributes. Then we show the following.
($\Rightarrow$) Assume that Q is boundedly evaluable under $\mathcal{B}$. Then there exists a bounded plan $\xi$ for Q under $\mathcal{B}$. Below we first inductively construct a query $Q_\xi$ from $\xi$ such that (a) $Q_\xi \equiv_\mathcal{B} Q$, where $Q_\xi \equiv_\mathcal{B} Q$ means that $Q_\xi(\mathcal{D}) = Q(\mathcal{D})$ for any database $\mathcal{D} \models \mathcal{B}$, and (b) $Q_\xi$ satisfies the condition on Q in Lemma 3. We then show that Q also satisfies the condition when $Q_\xi$ satisfies it, and thus Lemma 3 holds.
Construction of $Q_\xi$. We construct query $Q_\xi$ by induction on the structure of $\xi$ as follows: ), then (resp. ), where $Q_{\xi'}$ is the query constructed for $\xi'$.

If , then .
By induction on the structure of $\xi$, one can readily verify that for each relation R in $Q_\xi$, there exists a bag constraint $\varphi$ in $\mathcal{B}$ such that , i.e., $Q_\xi$ satisfies the condition of Lemma 3.
Query Q satisfies the condition. We next show that Q satisfies the condition in Lemma 3 when $Q_\xi$ does. Since for each bag constraint , , from , one can verify that . Thus there exists a homomorphism $\rho$ from $Q_\xi$ to Q [2]. Moreover, since Q is self-join free, each relation schema R (i.e., relation atom) has at most one occurrence in Q. Then no relation atom in Q can be removed without changing Q. Thus, Q is minimal (an SPC query is minimal if it has no redundant relation atoms [2]). Hence for each relation R in Q, there must exist a relation R' in $Q_\xi$ such that $\rho(R') = R$, and moreover, . Hence . That is, Q also satisfies the condition of Lemma 3.

($\Leftarrow$) Assume that Q satisfies the condition of Lemma 3. We construct a 3-step bounded query plan $\xi$ under $\mathcal{B}$ for Q: (a) it has a bounded sub-plan $\xi_R$ for each relation R in Q that fetches all attribute values needed for answering Q; such that each (partial) tuple fetched and kept for R is guaranteed to draw values from the same tuple in D; and

To show such a plan $\xi$ exists for Q under the condition of Lemma 3, we only need to prove the following two properties: (1) all necessary attribute values for answering Q from each relation R in Q can be retrieved by $\xi_R$ in step (a), and

Proof of (1). Consider a sequence $\ell$ of applications of rules (a)-(c) that defines $\mathsf{cov}(Q, \mathcal{B})$, such that $\ell$ : $\mathsf{cov}_0$ where , , step i expands $\mathsf{cov}_{i-1}$ by applying rule $r_i$ from one of the rules (a)-(c) for defining $\mathsf{cov}(Q, \mathcal{B})$ given earlier. We translate $\ell$ into a bounded plan: as follows: , then , where , ···, are the bounded plans that fetch attributes in X ( ). By the construction, $\xi_i$ is a bounded plan that fetches all A attribute values for Q. Note that each sub-plan $\xi_i$ is bounded since it does not involve relation scans.
Proof of (2). Plan is constructed with for each by (i) , where 's range over all attributes in X, and is the plan generated above for fetching values; and (ii) if for , then . Since is a bounded plan for each , is bounded under $\mathcal{B}$.

Hence, when the condition of Lemma 3 holds for Q and $\mathcal{B}$, $\xi$ constructed above is a bounded plan for Q under $\mathcal{B}$.

□
(II) NP cases. One might want to lift the restriction of condition (b) on the queries in $\mathcal{C}_{P}$. This covers all SPC queries, including those with self-joins. However, the bounded evaluability analysis becomes harder unless P = NP.
Denote by $\mathcal{C}_{NP}$ the set of pairs $(\mathcal{B}, Q)$ of bag access schemas $\mathcal{B}$ and SPC queries Q such that for each bag constraint , . We have the following.
Theorem 4. For any bag access schema $\mathcal{B}$ and SPC Q,
To prove statement (b), we first give a sufficient and necessary condition for query Q to be boundedly evaluable under $\mathcal{B}$ for any $(\mathcal{B}, Q)$ in $\mathcal{C}_{NP}$. Based on the characterization, we then show that checking bounded evaluability for $\mathcal{C}_{NP}$ is NP-complete.
Let $Q_m$ be the minimal equivalent query of Q, i.e., the minimized version of Q, which can be obtained by removing all redundant relations (see [2] for details). For an SPC query Q, there exists a unique minimal equivalent query up to isomorphism [2]. Along the same lines as the proof of Lemma 3, one can verify the following for cases in $\mathcal{C}_{NP}$.
Lemma 5. For any $(\mathcal{B}, Q)$ in $\mathcal{C}_{NP}$, Q is boundedly evaluable under $\mathcal{B}$ if and only if for each relation atom R in $Q_m$, there exists $\varphi$ in $\mathcal{B}$ such that .
Based on this, we prove that deciding whether Q is boundedly evaluable under $\mathcal{B}$ is NP-complete for $(\mathcal{B}, Q)$ in $\mathcal{C}_{NP}$.
Upper bound. We give an NP algorithm as follows: (a) convert Q into its tableau representation $(T_Q, u)$ [2]; (b) guess a tableau T' , and a mapping $\rho$ from $T_Q$ to T'; (c) check (i) whether $\rho$ is a homomorphism from $(T_Q, u)$ to (T', u) and (ii) whether for each relation atom R in Q', there exists $\varphi$ in $\mathcal{B}$ such that ; return "Yes" if so.
The algorithm is correct since if conditions (i) and (ii) of step (c) hold on Q', they must also hold on the minimal equivalent query $Q_m$ of Q. Indeed, $\rho$ also determines a homomorphism from $(T_Q, u)$ to since $Q_m$ is a minimal equivalent query of Q, i.e., ; therefore, if condition (ii) holds on Q', by the homomorphism $\rho$ it must also hold on $Q_m$, i.e., the condition of Lemma 5 applies to Q and $\mathcal{B}$. Thus, by Lemma is the total length of bag constraints in $\mathcal{B}$.
Lower bound. To show that the problem is NP-hard, we consider the following problem, denoted by MINCQ.
• Input: A relation schema R and an SPC query Q over R.
• Question: Is Q minimized, i.e., is Q a minimal equivalent query of Q?
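To make the MINCQ question concrete, the sketch below gives a brute-force minimality test in Python. It is not the algorithm of this paper: it assumes set semantics, a hypothetical encoding of an SPC query as a list of relation atoms with a tuple of head variables (variables are strings starting with "?"), and relies on the classical characterization that a conjunctive query is non-minimal exactly when it has a homomorphism, fixing head variables and constants, into itself with one atom removed [2]. The search is exponential in the number of variables.

```python
from itertools import product

def is_var(term):
    # Convention assumed here: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")

def has_homomorphism(src_atoms, dst_atoms, head):
    """True if some variable mapping sends every src atom onto a dst atom,
    keeping head variables and constants fixed."""
    variables = sorted({t for _, args in src_atoms for t in args if is_var(t)})
    free = [v for v in variables if v not in head]
    targets = sorted({t for _, args in dst_atoms for t in args}, key=str)
    dst = set(dst_atoms)
    for image in product(targets, repeat=len(free)):
        h = dict(zip(free, image))
        mapped = {(rel, tuple(h.get(t, t) for t in args)) for rel, args in src_atoms}
        if mapped <= dst:
            return True
    return False

def is_minimal(atoms, head):
    """True if no single atom can be dropped while preserving equivalence."""
    for i in range(len(atoms)):
        rest = atoms[:i] + atoms[i + 1:]
        if has_homomorphism(atoms, rest, head):
            return False
    return True

# Q(x) :- R(x, y), R(x, z)   -- the second atom is redundant.
redundant = [("R", ("?x", "?y")), ("R", ("?x", "?z"))]
# Q(x) :- R(x, y), S(y)      -- no atom can be removed.
tight = [("R", ("?x", "?y")), ("S", ("?y",))]
```

Here `is_minimal(redundant, ("?x",))` is false because mapping `?y` to `?z` folds the first atom onto the second, while `is_minimal(tight, ("?x",))` is true.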
It is easy to verify that MINCQ is coNP-complete by reduction from 3-COLORABILITY, which is NP-complete [23].
Lemma 6. Problem MINCQ is coNP-complete.

We show that the bounded evaluability problem for $\mathcal{C}_{NP}$ is NP-hard by reduction from the complement of MINCQ.
Given an instance of MINCQ, i.e., a relation schema R and an SPC query Q over R, we construct a database schema $\mathcal{R}'$, an SPC query Q' over $\mathcal{R}'$ and a bag access schema $\mathcal{B}$. We show that Q is not minimal if and only if Q' is boundedly evaluable under $\mathcal{B}$. As will be shown later, together with Q', such new attributes will be used as join attributes to pairwise connect the n relation atoms of Q in Q'.
(2) Query Q' is derived from Q as follows: query Q' retains the same number of joins and relation atoms as Q, such that each relation atom $R_i$ (i.e., renaming of relation schema ) is replaced with $R_i'$ (i.e., renaming of relation schema ); and the selection (join) condition C of query contains all selection predicates of Q, and in addition, the following predicates: for each pair of relations $R_i$ and $R_j$ in Q ( ), add equality to C, where .
Intuitively, C preserves all selection conditions of Q and additionally joins each pair of the n relation atoms on a dedicated attribute : (a) for each $B_k$, there exist exactly two relation atoms $R_i'$ and $R_j'$ ; and (b) for each $R_i'$ and $R_j'$, there exists exactly one attribute
(3) The bag access schema $\mathcal{B}$ consists of constraints. Let W be the set of all attributes of $A_1, \ldots, A_m$ such that they appear in the selection/join conditions or the top-level projection attributes in Q. Then $\mathcal{B}$ consists of

We next show that query Q is not minimal if and only if Q' is boundedly evaluable under $\mathcal{B}$.
Assume that Q is not minimal. Then none of the relation atoms in Q' can be removed by minimizing Q'.

Hence all attributes , ···, , together with W, have to be contained in XY for some $\varphi$ in $\mathcal{B}$ by Lemma 5. This yields attributes. This is impossible since for any , $\varphi_i$ contains attributes only by its definition above.
Assume that Q' is boundedly evaluable under $\mathcal{B}$.

□
Remark 2. Despite its intractability, checking the bounded evaluability for $\mathcal{C}_{NP}$ is feasible in practice by Lemma 5. Indeed, there have been effective algorithms for minimizing SPC queries, i.e., computing $Q_m$ for Q [2], and

Effective syntax
In this section, we propose an effective method to check the bounded evaluability of generic RA aggr queries. We show that while the problem is undecidable (Theorem 1), there exists an effective syntax for boundedly evaluable RA aggr queries, which reduces the problem to syntactic checking (Section 4.1). In addition, we identify two practical subclasses of RA aggr queries and provide their effective syntax. In particular, we give one for RA and show that it covers more bounded queries than the one given in [6] (Section 4.2).

An effective syntax for RA aggr
Under an access schema $\mathcal{B}$, an effective syntax for boundedly evaluable queries of $\mathcal{L}$ ($\mathcal{L}$ refers to, e.g., RA or RA aggr) is a subclass $\mathcal{L}_\mathcal{B}$ of $\mathcal{L}$ such that for any Q in $\mathcal{L}$, Here $Q \equiv_\mathcal{B} Q'$ if $Q(\mathcal{D}) = Q'(\mathcal{D})$ for all databases $\mathcal{D} \models \mathcal{B}$.
Intuitively, the effective syntax reduces the problem of deciding the bounded evaluability of queries to syntactic checking of $\mathcal{L}_\mathcal{B}$. Indeed, every boundedly evaluable query can find an equivalent query in $\mathcal{L}_\mathcal{B}$ under $\mathcal{B}$. Hence, we can safely settle with queries in $\mathcal{L}_\mathcal{B}$, since $\mathcal{L}_\mathcal{B}$ can express, up to equivalence, all boundedly evaluable queries.
To some extent, the development of effective syntax is analogous to the study of range-safe queries for relational calculus. Indeed, the problem of checking the "safety" of relational calculus queries is also undecidable [2]. Despite this, range-safe queries are supported by commercial DBMS, by making use of an effective syntax of range-safe relational calculus queries. We follow the same approach to dealing with the bounded evaluability of RA aggr queries.

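Below we develop such an effective syntax, denoted by $\mathcal{L}_\mathcal{B}$, for RA aggr queries that are boundedly evaluable under $\mathcal{B}$.
The class $\mathcal{L}_\mathcal{B}$. In a nutshell, we characterize $\mathcal{L}_\mathcal{B}$ with three sets: $\mathsf{BA}(Q, \mathcal{B})$, $\mathsf{BR}(Q, \mathcal{B})$ and $\mathsf{BQ}(Q, \mathcal{B})$. Informally, under a bag access schema $\mathcal{B}$, for an RA aggr query Q,
• $\mathsf{BA}(Q, \mathcal{B})$ contains attributes (e.g., ) and aggregates (e.g., ) of Q whose values can be fetched under $\mathcal{B}$;
• $\mathsf{BR}(Q, \mathcal{B})$ consists of relations in Q whose partial tuples that are needed to answer Q can be reconstructed from the fetched values for attributes in $\mathsf{BA}(Q, \mathcal{B})$; and
• $\mathsf{BQ}(Q, \mathcal{B})$ contains boundedly evaluable sub-queries of Q.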

Intuitively, these sets characterize RA aggr queries Q for which the values of all attributes necessary for answering Q can be "deduced" from constants in Q, via joins and fetch under access schema $\mathcal{B}$. Such attributes participate in RA aggr operations of Q, and are referred to as the nontrivial attributes of query Q. The class of such queries makes an effective syntax for boundedly evaluable RA aggr queries.
More specifically, sets BA, BR and BQ are defined in a mutually recursive way using rules given in Fig. 1, with notations explained in Table 1. Intuitively, (1) rule $\gamma_1$ of Fig. 1 includes constant attributes (see Table 1) in $\mathsf{BA}(Q, \mathcal{B})$; (2) $\gamma_2$ propagates values from attributes and aggregate fields to join attributes; (3) $\gamma_3$ specifies value propagation via fetch; (4) $\gamma_4$ says that if a sub-query is boundedly evaluable, then its output attributes and aggregate fields can also be fetched; (5) $\gamma_5$ adds a relation atom R to $\mathsf{BR}(Q, \mathcal{B})$ only when the partial tuples of R can be reconstructed from combinations of the fetched values; and (6) $\gamma_6$ says that a sub-query is boundedly evaluable if all its relations can be correctly fetched.

(5) Thus, by $\gamma_5$ and $\gamma_6$ , and .

Note that . However, , and $Q_6$ is boundedly evaluable under . Similarly, for $Q_8$, and of Example 6, one can verify that but and is boundedly evaluable under $\mathcal{B}_3$. We next show that $\mathcal{L}_\mathcal{B}$ is indeed an effective syntax for boundedly evaluable RA aggr queries under $\mathcal{B}$.

Table 1 Notations and definitions
Theorem 7. Under any bag access schema $\mathcal{B}$, $\mathcal{L}_{\mathcal{B}}$ is an effective syntax for boundedly evaluable RA aggr queries.
Proof. We show below that $\mathcal{L}_{\mathcal{B}}$ has properties (a) and (b) of an effective syntax, by proving the following lemmas. We will constructively prove property (c) in Section 5.2.

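(I) For any bounded plan ξ under $\mathcal{B}$, there is a query $Q \equiv_{\mathcal{B}} \xi$ in $\mathcal{L}_{\mathcal{B}}$.
(II) For any query $Q \in \mathcal{L}_{\mathcal{B}}$, there is a plan $\xi_Q$ for Q that is bounded under $\mathcal{B}$.
These suffice. Indeed, for any Q that is bounded under $\mathcal{B}$, by definition there must exist a plan $\xi_Q$ bounded under $\mathcal{B}$; hence by (I), there exists $Q' \equiv_{\mathcal{B}} \xi_Q$ in $\mathcal{L}_{\mathcal{B}}$. On the other hand, if $Q \equiv_{\mathcal{B}} Q'$ for some $Q' \in \mathcal{L}_{\mathcal{B}}$, then by (II), Q' has a bounded plan, i.e., Q is also boundedly evaluable under $\mathcal{B}$. Hence, $\mathcal{L}_{\mathcal{B}}$ has properties (a) and (b) of an effective syntax.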
We next prove the two lemmas.
Proof of (I). We prove it by induction on the structure of ξ.
Base case. When ξ is or , by definition ξ itself is in $\mathcal{L}_{\mathcal{B}}$.
Induction. We consider the structure of ξ.
. By the induction hypothesis, there exists a query such that .

Consider
. By the definition of , all relations in are in (since Q and share the same nontrivial attributes). Hence by rule . That is, . Obviously, .
The cases for , , , , are similar and can be verified along the same lines. That is, .
Proof of (II). Since $Q \in \mathcal{L}_{\mathcal{B}}$, by the definition of $\mathcal{L}_{\mathcal{B}}$ there must exist a proof ℓ consisting of applications of rules in Fig. 1 that deduces , i.e., a sequence of the form where (a) is one of the rules in is rule that deduces . We define the length of ℓ as the number of rules applied in ℓ.
Induction hypothesis. We show that for a proof of length , , values for that are necessary for answering Q can be fetched via bounded plans under $\mathcal{B}$; , then values from R necessary for Q can be fetched via bounded plans under $\mathcal{B}$; and , then the exact answers to sub-query Q of Q can be answered via bounded plans under $\mathcal{B}$.
Note that for any , Q must have a proof ending by including . If the induction hypothesis holds, Q must have a bounded plan under $\mathcal{B}$, which proves lemma (II).

We next prove it by induction on the length $l(\ell)$ of ℓ.
Base case. When , then rule $r_1$ can only be either (i) $\gamma_1$ of Fig. 1, i.e., to include from selection A = c of Q; or (ii) $\gamma_3$ of Fig. 1,
Induction. Assume that the hypothesis holds for proofs of length at most k. Consider a proof of length k+1. We discuss the last step of .
(i) If $r_{k+1}$ is rule $\gamma_1$ with attribute A = c, then A can be fetched in exactly the same way as the base case.
(ii) If $r_{k+1}$ is rule $\gamma_2$ that includes attribute B in BA$_{i+1}$ with A = B, then attribute A must be included in BA$_{i+1}$ at some step prior to k+1. By the induction hypothesis, there must exist a bounded plan $\xi_A$ that fetches all necessary values for answering Q except B. Hence $\xi_B = \xi_A$ is also a bounded plan that fetches B for Q by the condition
(iv) If $r_{k+1}$ is $\gamma_4$ that includes in BA$_{i+1}$ from a subquery $Q_s$ of Q that is included in BQ$_j$ at step , then by the induction hypothesis, there exists a plan for $Q_s$ under $\mathcal{B}$ that exactly answers $Q_s$. Hence we can get values for its output attributes simply by . Note that since is an exact plan for $Q_s$, the values for are guaranteed correct even when $Q_s$ is an aggregate subquery.
Proof. Since it is in PTIME to check whether a query is in and every query in has a bounded plan under , to show that and are effective syntaxes for boundedly evaluable RA and queries, respectively, it suffices to show that for any boundedly evaluable RA $Q_1$ and , there exist and such that and . This is verified along the same lines as the proof of Lemma (I) for Theorem 7 above, by showing that every bounded RA (resp. ) plan has an equivalent query in (resp. ).

□
There are close connections between $\mathrm{RA}^{0}_{\mathrm{aggr}}$ and RA.
Proof. Lemma 9(1) can be verified by the definition of effective syntax. We focus on Lemma 9(2) here (the proof for Lemma 9(1) is simpler). By the definition of boundedly evaluable queries, it is easy to show the following lemma: for any RA aggr , under , Q is boundedly evaluable iff Q' is boundedly evaluable.
We next use the lemma to prove Lemma 9(2). When is an effective syntax for boundedly evaluable RA queries under , consider the associated class of . (1

□
By Lemma 9, one can easily extend an effective syntax for boundedly evaluable RA queries, e.g., covered RA in [6], to an effective syntax for boundedly evaluable $\mathrm{RA}^{0}_{\mathrm{aggr}}$ queries.
One might think that such an extension is also possible for RA aggr. However, when group-by aggregation is nested with other RA aggr operators, a convenient extension is beyond reach. It is much harder for RA aggr to characterize propagation of values from aggregate subqueries to other relations, or to cover all boundedly evaluable queries up to equivalence.

Nonetheless, $\mathcal{L}_{\mathcal{B}}[\mathrm{RA}]$ is more expressive than the class of covered RA of [6], which is also an effective syntax for RA.

Proposition 10. For any bag access schema $\mathcal{B}$, the set of RA queries covered by $\mathcal{B}$ is properly contained in $\mathcal{L}_{\mathcal{B}}[\mathrm{RA}]$.

Proof. One can verify that covered RA queries [6] can be expressed in $\mathcal{L}_{\mathcal{B}}[\mathrm{RA}]$ without rule $\gamma_4$. Hence it is a subclass of $\mathcal{L}_{\mathcal{B}}[\mathrm{RA}]$. To see it is a proper subclass, consider a query Q over relations R(A, B) and S(C, D): , where and rename R. Consider $\mathcal{B}$ consisting of , and . One can verify that Q is not covered by $\mathcal{B}$ since subquery S is not covered (see [6]). However, $Q \in \mathcal{L}_{\mathcal{B}}[\mathrm{RA}]$. □

International Journal of Automation and Computing 17(4), August 2020

BEAS for querying big data
In this section, we show how to extend DBMS with the functionality of bounded evaluation. We first present such a framework (Section 5.1). We then provide algorithms underlying the framework, for checking the bounded evaluability (Section 5.2) and generating bounded plans (Section 5.3).

The framework, referred to as BEAS, is shown in Fig. 2. Given an application that involves queries over instances of a database schema $\mathcal{R}$, BEAS works as follows. Given a query Q, BEAS first checks whether Q is boundedly evaluable under $\mathcal{B}$ (C2). If so, it generates a bounded query plan ξ for Q under $\mathcal{B}$ (C3), which is interpreted as an SQL query $Q_\xi$ and hence can be directly executed by the underlying DBMS on a bounded dataset identified by plan ξ (C4). If Q is not boundedly evaluable, it generates a query plan ξ' for Q that is partially bounded, to make maximal use of access constraints in $\mathcal{B}$ (C5). The (partially) bounded plans are optimized and executed by DBMS (C4).
Note that the BEAS framework does not need to change the underlying DBMS. Indeed, it interacts with the DBMS via SQL only. Hence, BEAS can be built on top of any existing DBMS, providing a bounded evaluation capacity. BEAS can also compute approximate answers to unbounded queries under constrained resources, and offers deterministic accuracy guarantees under access schema [17]. We focus on computing exact answers in this paper.
Below we develop algorithms for components C2 and C3 of BEAS in Section 5.2 and Section 5.3, respectively.

We next develop a practical algorithm for component C2 of BEAS. Under a bag access schema $\mathcal{B}$, given an RA aggr query Q, it decides whether Q is boundedly evaluable.

To do this, we first check whether Q and $\mathcal{B}$ fall into the two classes of special cases, i.e., or , in PTIME. If so, their bounded evaluability can be decided efficiently as shown in the proofs of Theorems 2 and 4. Otherwise, we check whether Q is in the effective syntax $\mathcal{L}_{\mathcal{B}}$ for RA aggr (Section 4). Below we give a PTIME algorithm for this.
The iteration continues until can no longer be updated (line 6). It returns "Yes" if and "No" otherwise (lines 7-8). In each iteration, steps (b) and (c) are straightforward. Below we discuss step (a) in more detail.
Input: Relational schema $\mathcal{R}$, RA aggr query Q and bag access schema $\mathcal{B}$ over $\mathcal{R}$.
, by mapping attributes and aggregate fields of Q to attributes of $U_Q$, via a mapping function ρ. For any two attributes R[A] and is in the selection condition of Q. For aggregate field agg(A) and attribute R[B], ρ(agg(A)) = ρ(R[B]) only when agg(A) is in (recall Table 1) for some . Accordingly, bag constraints in $\mathcal{B}$ are also mapped on to $U_Q$ by ρ.
(2) Computing fetch closure. We then reduce the computation of to the computation of fetch closures over $U_Q$ with w.r.t. ρ. For a set W of attributes of $U_Q$, its fetch closure, denoted by , is a set of attributes of $U_Q$ such that ; (ii) if and such that and , then ; and contains nothing else.
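The closure computation described above can be sketched as a simple fixed-point iteration. The sketch below is only an illustration under stated assumptions: constraints are modelled as hypothetical `(X, Y, N)` triples that are already mapped onto the universal relation (the mapping ρ is not modelled), and the attribute names and bounds are invented for the example rather than taken from the paper's access schemas.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical encoding of a constraint: (X, Y, N) means that any X-value
# fetches at most N distinct Y-values.
Constraint = Tuple[FrozenSet[str], FrozenSet[str], int]


def fetch_closure(seed: Iterable[str], constraints: List[Constraint]) -> Set[str]:
    """Return all attributes reachable from `seed` by repeatedly applying fetch."""
    closure = set(seed)
    changed = True
    while changed:
        changed = False
        for x_attrs, y_attrs, _bound in constraints:
            # A constraint fires once all of its X attributes are already covered.
            if x_attrs <= closure and not y_attrs <= closure:
                closure |= y_attrs
                changed = True
    return closure


# Assumed constraints for illustration only.
constraints = [
    (frozenset({"f.uid"}), frozenset({"f.fid"}), 5000),
    (frozenset({"f.fid"}), frozenset({"c.cty"}), 10),
]
closure = fetch_closure({"f.uid"}, constraints)
print(sorted(closure))
```

Starting from an empty seed, the same function returns an empty set unless some constraint has an empty X, which corresponds to values that can be fetched without any input.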
In the first iteration, BEChk starts by updating BA (line 3). To do this, it builds a universal relation via function ρ that maps c.uid to f.fid and keeps all other attributes intact (f and c stand for friend and checkin, respectively). Since , and , BEChk sets W to and computes the fetch closure of W, yielding . Hence it updates BA to {f.uid, f.fid, c.cty}. It then updates BR to {f, c} since all nontrivial attributes of f and c are already in BA (line 4). BEChk finally updates BQ to {$Q_3$, $Q_2$} (line 5), terminates in the next iteration and returns "Yes".
The process gets more involved for $Q_6$ from Example 5 and from Example 4. In the first iteration, BEChk builds a universal schema via a mapping function that keeps attributes of R and S and maps aggregate field (i.e., the output) sum(y) of $Q_7$ to F'. Note that does not map sum(y) to E although since yet. BEChk then computes fetch closure of over and sets BA to {B, C, F}. It then finds that all nontrivial attributes of R are in BA and hence updates BR to {R}. Consequently, it sets BQ to {$Q_7$} as well. In the second iteration, BEChk builds an updated universal relation since and . It continues to update BA to {B, C, E, F, W}, BR to {R, S} and BQ to {$Q_7$, $Q_6$}. It terminates after the third iteration and returns "Yes" for $Q_6$ under .
Correctness & Complexity. To see that BEChk correctly checks the effective syntax of Fig. 1, observe the following. (1) For any fixed , the corresponding decided by rules $\gamma_1, \ldots, \gamma_4$ is exactly the fetch closure (recall the definition of above). (2) The while loop propagates changes from BA to BR and to BQ, and finally to BA again, until reaching a fixed point w.r.t. the rules of Fig. 1.
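The propagation from BA to BR, then to BQ, and back to BA can be sketched as follows. This is a deliberately simplified illustration and not the BEChk algorithm itself: it assumes that a relation enters BR as soon as all of its nontrivial attributes are in BA, that a sub-query enters BQ as soon as all of its relations are in BR, and that the output of a bounded sub-query is already identified with the attribute it is joined with. The function names, the dictionary-based encoding and the example data are hypothetical.

```python
def fetch_closure(attrs, constraints):
    """Attributes reachable from `attrs` via constraints given as (X, Y, N) triples."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for x_attrs, y_attrs, _bound in constraints:
            if x_attrs <= closure and not y_attrs <= closure:
                closure |= y_attrs
                changed = True
    return closure


def be_check(constants, constraints, relations, subqueries, root):
    """Fixed-point loop over (BA, BR, BQ); returns True if `root` ends up in BQ.

    relations:  relation name -> set of its nontrivial attributes
    subqueries: sub-query name -> (set of relation names, set of output attributes)
    """
    ba, br, bq = set(constants), set(), set()
    while True:
        before = (len(ba), len(br), len(bq))
        ba = fetch_closure(ba, constraints)                          # step (a)
        br |= {r for r, attrs in relations.items() if attrs <= ba}   # step (b)
        for name, (rels, outputs) in subqueries.items():             # step (c)
            if rels <= br:
                bq.add(name)
                ba |= outputs  # outputs of a bounded sub-query become fetchable
        if (len(ba), len(br), len(bq)) == before:
            return root in bq


# Assumed toy input: Q7 aggregates over R and its output is joined with S.E.
constraints = [
    (frozenset({"B"}), frozenset({"C"}), 100),
    (frozenset({"E"}), frozenset({"W"}), 50),
]
relations = {"R": {"B", "C"}, "S": {"E", "W"}}
subqueries = {"Q7": ({"R"}, {"E"}), "Q6": ({"R", "S"}, set())}
answer = be_check({"B"}, constraints, relations, subqueries, "Q6")
print(answer)
```

In this toy input the outer query only becomes bounded in the second round, after the inner aggregate has been recognised as bounded, which mirrors the reason the loop must feed BQ back into BA.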
BEChk can be implemented in time, where $p_Q$ is the number of sub-queries in Q, ||Q|| is the number of relation atoms in Q, |Q| is the number of attributes and aggregate fields in the relation atoms and predicates of Q, and are the number and total length of bag constraints in $\mathcal{B}$, respectively (see Table 1). Indeed, computing the fetch closure can be implemented in -time, and hence each while iteration is in time; there are at most $p_Q$ iterations.
Algorithm BEChk provides a constructive proof for property (c) of the effective syntax in Theorem 7, i.e., it is in PTIME to check whether $Q \in \mathcal{L}_{\mathcal{B}}$ for an RA aggr query Q.
This also completes the proof of Theorem 7.

Generating bounded plans
We next provide an algorithm underlying component C3 of BEAS, denoted by BPlan. Given a bag access schema $\mathcal{B}$ and an RA aggr query Q that is determined to be boundedly evaluable under $\mathcal{B}$ by BEChk of Section 5.2, BPlan generates a bounded RA aggr query plan for Q under $\mathcal{B}$.

Algorithm BPlan. Given a boundedly evaluable RA aggr query $Q \in \mathcal{L}_{\mathcal{B}}$ (see Section 4), BPlan generates a bounded plan $\xi_Q$ for Q under $\mathcal{B}$ as follows: (1) fetch a bounded amount of data for each relation R that appears in Q, and (2) carry out operations of Q over fetched data. While step (2) is straightforward, step (1) is rather involved.

To carry out step (1), BPlan generates bounded logical access paths (bLAPs). A bLAP $\xi_R$ for a relation R in Q fetches all values (partial tuples) of R that are necessary for evaluating Q with $\mathcal{B}$; moreover, $\xi_R$ is a bounded RA aggr plan under $\mathcal{B}$. Intuitively, bLAPs play the same role as conventional DBMS access paths. But instead of accessing complete tuples by scan or index, bLAPs fetch values (partial tuples) using $\mathcal{B}$ such that the amount of data accessed is bounded.

More specifically, we give an algorithm, denoted by BAP, as a sub-procedure of BPlan to find a bLAP $\xi_R$ for R under $\mathcal{B}$. While there may exist exponentially many such bLAPs, BAP aims at computing those with minimum cost.

After BAP computes bLAP $\xi_R$ for every relation R in Q, algorithm BPlan generates a bounded plan $\xi_Q$ for Q under $\mathcal{B}$, by replacing each R in Q with its bLAP $\xi_R$, and by carrying out RA aggr operations of Q on the data retrieved by $\xi_R$.
In the rest of the section, we focus on algorithm BAP.
Parametric cost measures. To evaluate the quality of bLAPs found by BAP, we start with a generic class of cost functions. Conventional access path measures assess the cost of physical table-access methods, e.g., sequential scan and index scan [11]. These metrics do not apply to bLAPs, which involve, e.g., fetch and joins. Hence, BAP employs a generic cost function $c(\xi_R)$ that takes user specified functions as parameters, to express various cost measures over $\xi_R$ as bLAPs, e.g., output size, data access, etc.
The cost of a bLAP $\xi_R$ for R under $\mathcal{B}$, denoted by $c(\xi_R)$, is inductively defined in Table 2, with five user configurable parameter functions , , , and .

By these user configurable functions, we can support various measures for bLAPs. For example, to estimate the worst-case output size of ξ, we simply set (i) to , (ii) to , (iii) to , (iv) to $c_1$, and (v) to c' if and to 1 otherwise (assume when ).
Algorithm BAP. Algorithm BAP works in two steps: (1) it reduces bLAPs to proofs of and encodes all proofs with a directed graph in PTIME; and (2) it searches to find proofs with minimum cost, where a proof corresponds to a subgraph in the search trace.
Here a proof of is a sequence of applications of the rules given in Fig. 1. Each step of the proof corresponds to one or several operations in a bLAP $\xi_R$ for R.
Below we outline BAP (see Appendix for its pseudo code).
(1) Reduction. It reduces the problem of generating bLAPs for R of Q under $\mathcal{B}$ to finding proofs of . It encodes all proofs of (hence all bLAPs for R) in a weighted directed graph , where nodes encode (a) attributes R[X] and in constraints , and (b) relations and sub-queries of Q. Edges encode value propagation among them. It ensures that each proof of is encoded by a traversal from a dummy node to node $u_R$ encoding R in . Graph has at most nodes and edges.
We illustrate reduced graphs with the following example, and defer the construction details to Appendix.
Recall RA aggr queries $Q_2$ of Example 2 and $Q_6$ of Example 5, and bag access schemas and from Example 4. Graphs and are shown in Fig. 3. Here is a dummy node connected to all constant attributes in Q. Edges with numeric weights encode deduction steps with rule $\gamma_3$ of Fig. 1, where the weights are the cardinalities N of the corresponding access constraints.
As will be shown below, proofs of a relation can be encoded as traversals from to $u_R$ in .
Conditional Dijkstra search. Algorithm BAP then adopts a Dijkstra-like search over , from the dummy node to the relation node $u_R$ encoding R, such that the trace of the search encodes a bLAP (i.e., proof) for R under $\mathcal{B}$.
It extends Dijkstra's algorithm [24] as follows. BAP can be implemented in -time (ignoring the complexity of parameter functions of $c(\xi_R)$). One can verify that BAP restarts at most N times, where N is the number of nodes in .
Optimality. Algorithm BAP is able to find optimal bLAPs for a large class of parameter functions for $c(\xi_R)$. We defer detailed proofs of this optimality to Appendix.
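For orientation, the search that BAP builds on can be sketched with the textbook version of Dijkstra's algorithm. The sketch below does not implement the conditional extension or the restarts of BAP, and it assumes an additive cost in which edge weights are simply summed, which is only one possible choice of cost function. The graph, the node names (`u0` for the dummy node, `u_R` for the relation node) and the weights are hypothetical.

```python
import heapq


def cheapest_path(graph, source, target):
    """Plain Dijkstra search; graph maps a node to a list of (neighbor, weight)."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk the predecessor links back to the source to recover the trace.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Assumed toy graph: weights on fetch edges stand for cardinalities N.
graph = {
    "u0": [("f.uid", 0)],
    "f.uid": [("f.fid", 5000), ("c.cty", 200)],
    "f.fid": [("c.cty", 10)],
    "c.cty": [("u_R", 0)],
}
result = cheapest_path(graph, "u0", "u_R")
print(result)
```

The returned path plays the role of the search trace: each fetch edge on it corresponds to one deduction step, so the path can be read as a candidate access path for R under the assumed cost.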

Experimental study
We have developed BEAS@PG by extending PostgreSQL with bounded evaluation. Using a benchmark and two real-life datasets, we conducted three sets of experiments to evaluate (1) the overall performance of BEAS@PG vs PostgreSQL; and the effectiveness of bounded evaluation for (2) bounded queries and (3) unbounded queries.
Experimental setting. We start with the setting.
Benchmark. We used the TPCH benchmark [15]. It uses TPCH dbgen to generate 8 relations with 61 attributes of different scales. It contains 22 built-in benchmark queries.
Real-life datasets.We also used two real-life datasets.(a) US Air carriers (AIRCA) records flight and statistic data of US air carriers.It consists of Flight On-Time Performance Data [25] for departure and arrival information, and Carrier Statistic data [26] for airline market and segment data of the air carriers.It has 3 tables, 200 attributes, and about 16 GB of data with records from 1990 to 1997.
(b) UK MOT data (UKMOT) integrates the anonymised data [27] that records MOT tests and outcomes, and the roadside survey of vehicle observations [28] that includes vehicles passing observation points in the UK. It has 3 tables with 42 attributes, about 16 GB of data from 2007 to 2011.
Queries. To test the impact of query structures on the effectiveness of bounded evaluation, we designed a generator to generate queries with different structures over the two real-life datasets. More specifically, we manually created 30 query templates for each of the two datasets (Q1-Q15 are boundedly evaluable and Q16-Q30 are unbounded), with 0 to 4 joins. The generator populates these templates by randomly instantiating parameters in the templates with values from the datasets, yielding 150 queries for each real-life dataset.
Access schema. We built access schemas with 59, 18 and 14 access constraints over TPCH, AIRCA and UKMOT, respectively. We extended TANE [29], an algorithm for discovering functional dependencies, to first find candidate constraints on small sample datasets of 100 MB, and ranked them by their cardinalities N. We then checked whether their N's are insensitive to the size of datasets D, by varying the size of D, e.g., 200 MB and 500 MB. We picked those access constraints with small and size-insensitive N's, such that the total size of the indices is at most 3 times the size of D.
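The check on candidate constraints can be illustrated with a small sketch. It is not the extended TANE procedure; it only shows, under stated assumptions, how the cardinality N of one candidate X → Y could be measured on samples of growing size and compared against a fixed threshold. The function names, the threshold and the sample rows are hypothetical.

```python
from collections import defaultdict


def constraint_bound(rows, x_attrs, y_attrs):
    """Smallest N such that every X-value has at most N distinct Y-values."""
    y_values = defaultdict(set)
    for row in rows:
        x = tuple(row[a] for a in x_attrs)
        y = tuple(row[a] for a in y_attrs)
        y_values[x].add(y)
    return max((len(ys) for ys in y_values.values()), default=0)


def stays_within(samples, x_attrs, y_attrs, max_n):
    """True if the measured N stays below max_n on every sample."""
    return all(constraint_bound(s, x_attrs, y_attrs) <= max_n for s in samples)


# Assumed toy samples standing in for datasets of increasing size.
small = [
    {"carrier": "AA", "origin": "JFK"},
    {"carrier": "AA", "origin": "LAX"},
    {"carrier": "UA", "origin": "ORD"},
]
large = small + [
    {"carrier": "AA", "origin": "JFK"},
    {"carrier": "UA", "origin": "SFO"},
]
bounds = [constraint_bound(s, ["carrier"], ["origin"]) for s in (small, large)]
print(bounds, stays_within([small, large], ["carrier"], ["origin"], max_n=2))
```

A candidate whose measured N grows with the sample size would fail such a check for any fixed threshold, which is the behaviour reported later for most TPCH candidates.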

Configuration. For DBMS, we used PostgreSQL 9.6 with all optimization enabled (BEAS@PG is built with PostgreSQL 9.6). In favor of PostgreSQL, besides indices for access constraints, we also built the following extra indices for PostgreSQL: (1) for each access constraint R(|X → Y, N|), we built a B-tree index on attributes X over R as well; (2) we built all primary key and foreign key indices; and (3) we also built B-tree indices on numerical attributes. Note that these were only for PostgreSQL, not built for BEAS@PG. We set the cost measure parameters of BEAS@PG as the worst-case output size estimation (recall Section 5.3).
The experiments were conducted on an Amazon EC2 Dense-storage instance m4.xlarge, with 16 GB of memory, 4 Intel Xeon E5-2676 vCPUs, and 500 GB of EBS SSD storage. Both the plan generation time and the execution time of the generated plans are included in evaluation time. All the experiments were run 3 times. The average is reported here.
Experimental results. We next report our findings.
Exp-1: Overall performance. We first report the evaluation time of 22 TPCH queries over 16 GB of TPCH data, and the 60 query templates over the entire AIRCA and UKMOT datasets, where evaluation time of a query template is the average of the evaluation time of its 5 instantiated queries.
(1) Index size. The indices of all the access constraints over TPCH, AIRCA and UKMOT account for 2.98, 0.01 and 0.25 times the size of the datasets, respectively; the additional indices built only for PostgreSQL (in favor of the conventional DBMS) are of size 2.21, 0.87 and 1.5 times that of TPCH, AIRCA and UKMOT, respectively.

(2) Query overview. None of the TPCH queries is boundedly evaluable under the access constraints selected. This is because the TPCH data generator scales cardinalities N of almost all candidate access constraints due to its simple scaling up strategy. This rules out most of the candidate constraints when we scale up to larger datasets while using a fixed threshold for N. For the 60 query templates over AIRCA and UKMOT, 30 of them are boundedly evaluable under the access constraints used, 15 for each dataset. Note that one could build more access constraints to allow more bounded queries. We will evaluate the performance of BEAS@PG for bounded and unbounded queries in more detail in Exp-2 and Exp-3, respectively.
(3) Performance.The results for TPCH, AIRCA and UKMOT are reported in Tables 3-5, respectively.For bounded queries, BEAS@PG is 1.79 and 3.66 times faster than PostgreSQL on AIRCA and UKMOT, respectively, up to 3.44 and 2.52 times.
The results show that with a modest number (and size) of access constraints, BEAS@PG can speed up PostgreSQL on both bounded queries and unbounded queries, when all relevant indices are enabled for PostgreSQL, including those of access constraints and additional indices tailored for PostgreSQL. This verifies the effectiveness of bounded evaluation for generic queries, bounded or not, while the speedup is much larger for bounded queries, as expected.
Below we report more in-depth evaluation results for BEAS@PG versus PostgreSQL (with additional indices) for bounded queries (Exp-2) and unbounded queries (Exp-3).
Exp-2: Effectiveness for bounded queries. We next evaluated the impact of datasets D and queries Q on the evaluation time of BEAS@PG and PostgreSQL (with indices enabled), when queries Q are boundedly evaluable.
Varying |D|. To evaluate the impact of |D|, we partitioned AIRCA and UKMOT datasets by their date attributes (year and month), yielding subsets of sizes from 1 GB to 16 GB, consistent with how we scale up TPCH datasets when testing unbounded queries below in Exp-3. We did not use TPCH here since it has no boundedly evaluable queries. ms for all queries over all subsets of AIRCA and UKMOT, respectively, no matter how large the datasets were. In contrast, even on the subsets of AIRCA and UKMOT of size 8 GB, PostgreSQL took 8.45 ms and 3.88 ms, respectively, up to 1.58 ms and 7.80 ms over the full datasets. That is, PostgreSQL is 1.35 and 1.98 times slower than BEAS@PG on AIRCA and UKMOT, respectively, even with all relevant indices built. The larger the dataset is, the bigger the gap between PostgreSQL and BEAS@PG is for bounded queries.
Varying Q. To evaluate the impact of queries Q, we varied the complexity of Q, measured as the number #Q of joins in the query templates Q, from 0 to 4, while using the entire AIRCA and UKMOT datasets. Note that for each query template, we instantiated 5 queries by setting its parameters with different values (hence these queries share the same query structure and #Q). The evaluation time of each query template is the average of all its instantiated queries. The results are reported in Figs. 4(c) and 4(d). We find the following. (a) The complexity of Q has impacts on the performance of both BEAS@PG and PostgreSQL, as expected. They both take longer for queries with more joins (i.e., larger #Q). However, (b) BEAS@PG scales much better with the number #Q of joins in Q than PostgreSQL (with indices). For instance, on average BEAS@PG found answers for all queries with #Q = 4 within 11.67 ms on full-sized AIRCA, while PostgreSQL takes 1.56 ms; that is, PostgreSQL is 1.34 times slower than BEAS@PG for large queries.

Y. Cao et al. / Bounded Evaluation: Querying Big Data with Bounded Resources

Remark. We find that when queries Q incur joins on keys only, PostgreSQL with extra key/foreign key indices built is almost as fast as BEAS@PG (e.g., TPCH Q4). However, as long as Q involves non-key attributes, e.g., many of the AIRCA and UKMOT queries, PostgreSQL performs poorly on big tables, even provided with all indices. Indeed, on average BEAS@PG outperforms PostgreSQL by 8.98 times and 1.76 times for all bounded queries over subsets of AIRCA and UKMOT, respectively. The gap gets larger when the number of non-key attributes increases.

By looking into PostgreSQL's plan and its EXPLAIN output, we find that this is partially due to the following reason. Given an access constraint R(|X → Y, N|), BEAS@PG fetches only distinct values of the relevant XY attributes, but PostgreSQL fetches entire tuples with irrelevant attributes of R, although those attributes are not needed for answering Q at all, no matter what indices are provided. This led to duplicated (X, Y) values when X is not a key, and the duplication got inflated rapidly by joins, e.g., EXPLAIN output shows that PostgreSQL consistently accesses entire tables when there are non-key attributes.
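The difference between fetching full tuples and fetching distinct (X, Y) values can be made concrete with a small sketch. It assumes that the index behind a bag access constraint R(|X → Y, N|) maps each X-value to its distinct Y-values together with their multiplicities; the dictionary-of-counters layout, the function names and the sample rows are hypothetical and are not the index structure used by BEAS@PG.

```python
from collections import Counter, defaultdict


def build_bag_index(rows, x_attrs, y_attrs):
    """Map each X-value to a Counter of its distinct Y-values and multiplicities."""
    index = defaultdict(Counter)
    for row in rows:
        x = tuple(row[a] for a in x_attrs)
        y = tuple(row[a] for a in y_attrs)
        index[x][y] += 1
    return index


def conforms(index, n):
    """True if every X-value has at most n distinct Y-values."""
    return all(len(ys) <= n for ys in index.values())


# Assumed toy relation: 'delay' is irrelevant to a query on (carrier, origin).
rows = [
    {"carrier": "AA", "origin": "JFK", "delay": 5},
    {"carrier": "AA", "origin": "JFK", "delay": 12},
    {"carrier": "AA", "origin": "LAX", "delay": 0},
    {"carrier": "UA", "origin": "ORD", "delay": 7},
]
index = build_bag_index(rows, ["carrier"], ["origin"])
fetched = index[("AA",)]
print(dict(fetched), conforms(index, 2))
```

For the X-value `AA`, the lookup returns two distinct Y-values with their multiplicities instead of three full tuples, so a bag-semantics aggregate can still be computed correctly while the irrelevant attribute is never read.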
Exp-3: Effectiveness for unbounded queries. In the same setting as in Exp-2, we evaluated the impact of D and Q on the performance of unbounded queries by BEAS@PG and PostgreSQL with indices enabled for PostgreSQL.
Varying |D|. The results on AIRCA, UKMOT and TPCH are in Figs. 4(e), 4(f) and 4(g), respectively. Observe the following. (a) BEAS@PG is able to speed up PostgreSQL even for queries that are not bounded under the available access constraints. On average, BEAS@PG is 7.22, 2.29 and 3.43 times faster than PostgreSQL for unbounded queries on AIRCA, UKMOT and TPCH, respectively. This is because while not all relations in these queries are bounded, bounded evaluation can still speed up their "bounded" subqueries, and hence remains faster than PostgreSQL. (b) As opposed to evaluating bounded queries, both BEAS@PG and PostgreSQL are sensitive to the size of the datasets when evaluating unbounded queries. However, BEAS@PG scales much better than PostgreSQL, and their performance gap becomes larger when the dataset size increases. For example, when the dataset increases from 1 GB to 16 GB, the average processing time of BEAS@PG increases from 9.

Note that the speedup for unbounded TPCH queries is not as good as for AIRCA and UKMOT queries. This is because (i) the N's of access constraints over TPCH scale linearly as the dataset gets larger, while those on AIRCA and UKMOT are more stable and independent of the dataset size; and (ii) joins in TPCH queries are mostly key/foreign key joins, and thus the extra key indices built for PostgreSQL can mimic bounded query plans used by BEAS@PG to some extent, reducing their performance gaps.
Varying Q. Varying the number #Q of joins in the queries, the evaluation time of unbounded queries over AIRCA and UKMOT is reported in Figs. 4(h) and 4(i), respectively. The results tell us the following. (a) The processing time of BEAS@PG and PostgreSQL increases when the number of joins increases. However, (b) the gap between BEAS@PG and PostgreSQL becomes larger when #Q increases from 0 to 4. For instance, over AIRCA, on average BEAS@PG and PostgreSQL take 17.37 ms and 4.43 ms, respectively, to answer queries with #Q = 0; and the two take 2.78 ms and 1.64 ms, respectively, when #Q = 4; the results over UKMOT are similar. Note that for bounded queries, the gap between the two is even larger (Exp-2).
Summary. We find the following. (1) BEAS@PG (PostgreSQL with BEAS built on top) does better than PostgreSQL for each and every query in all cases, even with extra indices built for the latter. On average BEAS@PG improves PostgreSQL by 7.32, 9.58 and 2.06 times for TPCH benchmark of 16 GB, AIRCA and UKMOT, respectively, up to 40.46, 3.44, and 2.52 times in the best case. (2) For queries that are boundedly evaluable, BEAS@PG outperforms PostgreSQL by 1.9 and 3.6 times on AIRCA and UKMOT, respectively. For queries with complicated joins, e.g., joins on non-key attributes (AIRCA and UKMOT queries), BEAS@PG is particularly effective, even for unbounded queries. For example, on average BEAS@PG improves PostgreSQL by 5.97 and 1.90 times for queries that are not boundedly evaluable over AIRCA and UKMOT, respectively. For cases where conventional DBMS does its best, e.g., table scan/aggregation and key-foreign key joins (most TPCH queries), BEAS@PG still does better than PostgreSQL. (4) The storage cost for indices of access schema is modest, accounting for 2.98, 0.01 and 0.25 times the size of 16 GB TPCH, AIRCA and UKMOT, respectively.

Related work
The related work is categorized as follows.
Bounded evaluation. The notion of bounded evaluation was introduced in [5], as an effort to formalize scale independence [19, 30, 31]. The latter aims to guarantee that a bounded amount of work is required to execute all queries in an application, regardless of the size of the underlying data. Under the access schema proposed in [19], Fan et al. [5] define boundedly evaluable RA queries. They establish the complexity of deciding whether a query is boundedly evaluable, for queries in various fragments of RA, ranging from EXPSPACE-hard to undecidable. Bounded evaluation using views was studied in [32], focusing on its complexity bounds.
To cope with the undecidability of the bounded evaluability problem, an effective syntax was given for RA in [6] under the set semantics. Based on the syntax, algorithms were developed [6] for checking the bounded evaluability of RA queries Q, and if affirmative, generating a bounded query plan for Q. These issues were also studied in [33] for SPC, using a restricted form of query plans.
This work extends the prior work in the following ways. (1) We define bag access schema, an extension of access schema of [5, 19] to support the bag semantics (Section 2). (2) We identify decidable special cases of the bounded evaluability problem that cover a variety of SQL queries commonly used in practice. (3) We develop an effective syntax for boundedly evaluable RA aggr queries under bag access schema, supporting nested aggregations (Section 4). Moreover, the syntax allows us to make a larger class of RA queries bounded, improving the result of [6] for RA. (4) We extend BEAS [34] from RA to RA aggr, by seamlessly integrating bounded evaluation with DBMS query optimizers, which is quite different from [6, 34]. These extend DBMS with bounded evaluation, which was not studied in [5, 6, 33, 34].
Query answering with constrained resources. The objective of this work is to make big data analytics accessible to small companies under constrained resources. For queries that are not boundedly evaluable, an approach is to compute approximate answers under available resources. Approximation techniques have been extensively studied, based on synopsis (e.g., [35-39]) or dynamic sampling (e.g., [40-42]). We have proposed a data-driven approximation scheme [17] that computes approximate answers to an RA aggr query Q in a dataset , by identifying a fraction of under an extension of the access schema of [5]. It ensures a deterministic accuracy bound η: (a) for each tuple , there exists an exact answer t that is within distance at most η from S, and (b) for each exact answer , there exists within distance η from t. This work differs from [17] in that we focus on computing exact answers instead of approximation. The techniques are hence quite different. In particular, a bag access schema carries the multiplicities of tuples to deal with the bag semantics, as opposed to distance bounds in access templates of [17]. This said, this work and [17] are complementary to each other. On one hand, the methods of [17] can be used to compute approximate answers to unbounded queries under constrained resources. On the other hand, the techniques developed in this work can be incorporated into the methods of [17], to improve the accuracy of approximate answers by making use of DBMS optimizers and bounded sub-plans.

Indices. Hash-based or tree-based, DBMS indices are typically defined at the tuple level [11], to retrieve tuple IDs and fetch full tuples. In contrast, a bag constraint R(|X → Y, N|) offers a value-based index. Bounded plans fetch distinct partial tuples (Y-values) for each input X-value, and thus reduce duplicated and unnecessary attributes in tuples fetched by DBMS, i.e., reduce data access and intermediate relations. The redundancies get inflated rapidly by joins. Moreover, the cardinality constraints in a bag access schema allow us to determine whether data access is bounded.
Related to bag access schema is a notion of access patterns, which require a relation to be accessed only by providing certain combinations of attributes, e.g., [43-45].As opposed to access patterns, a bag access schema offers cardinality constraints, tuple multiplicity and indices.Moreover, it is not required to cover all the attributes of a relation and hence, allows us to fetch partial tuples and reduce redundancy.Further, this work studies bounded evaluation of RA aggr queries and its integration with DBMS, which were not considered in the prior work on query answering under access patterns.

Conclusions
We have presented an approach to extending DBMS with bounded evaluation of SQL queries. The novelty of the work consists of (a) a notion of bag access schema to support the bag semantics of nested aggregations; (b) decidable special cases of the bounded evaluability of RAaggr queries; (c) an effective syntax to characterize boundedly evaluable RAaggr queries; and (d) a framework and its underlying algorithms for integrating bounded evaluation with DBMS. Our experimental study has verified that the approach is promising. Together with the approximation scheme of [17], we hope that this work provides small businesses with a capacity for querying big data under constrained resources.
One topic for future work is to develop algorithms for discovering bag access schemas by incorporating machine learning techniques. Another topic is to extend bounded evaluation of SQL queries to column-oriented DBMS.

(1) Denote by $\mathcal{D}$ a database of the database schema, and by D an instance of relation schema R in that schema. (2) For an instance D of R and a given X-value, we consider the set of values in D corresponding to that X-value. (3) For any XY-value t in D, m(D, t) denotes the cardinality of the bag (multiset) of its occurrences, i.e., the number of occurrences of t as XY attributes in D; we refer to m(D, t) as the multiplicity of t in D.

We say that D conforms to a bag constraint φ = R(|X → Y, N|), denoted by D ⊨ φ, if there exists an index for φ on D such that given any X-value, by accessing at most N tuples, it retrieves (a) all associated distinct Y-values in D, and (b) for each such Y-value, its multiplicity.

most N distinct corresponding Y-values in D. Moreover, these Y-values (partial tuples) and their multiplicities in D are indexed by ψ and can be efficiently retrieved via the index.

A1 B1 Example 4. Extending of Example 1, a bag access schema

From Lemma 3, Theorem 2(2) follows since one can simply check the condition of Lemma 3 in PTIME in the sizes of Q and $\mathcal{B}$. Below we prove Lemma 3.

⇒ We prove (1) by constructing such a bounded plan $\xi_{R[A]}$ for each attribute . Note that only attributes in are needed for answering Q. The plan $\xi_{R[A]}$ is constructed by translating the proof that witnesses . More specifically, since the condition of Lemma 3 holds for Q and $\mathcal{B}$, for any attribute A of such that , there must exist a se-

Y. Cao et al. / Bounded Evaluation: Querying Big Data with Bounded Resources

cov(Q, $\mathcal{B}$)

1) Database schema consists of a single relation schema R', where n is the number of relation atoms that appear in query Q. Intuitively, R' extends R with additional attributes $B_1, \cdots, B_{n(n-1)}$.

International Journal of Automation and Computing 17(4), August 2020

most attributes for each relation atom $R'_i$ in Q', where $Q'_m$ is the minimal equivalent query of Q'. Since contains attributes, query Q' is not minimal.

(b) ; (c) for each step , only one of , and is changed in , and , respectively; and (d) $r_{k+1}$ is rule $\gamma_3$ that includes R[Y] in $BA_{i+1}$ with and constraint , then there exist steps prior to k+1 that include $R[X_1], \cdots,$ in BA such that . Hence by the induction hypothesis, has bounded plan . Let be , then $\xi_{R[X]}$ fetches all values for R[X] that are necessary for answering Q. Hence, further by the semantics of fetch, R[Y] has a plan fetch($\xi_{R[X]}$, φ) that retrieves all R[Y]-values needed for answering Q.
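The inductive step above has the flavor of a fixpoint computation: an attribute becomes covered once it is bound to a constant, or once some constraint X → Y applies whose X attributes are all covered already, in which case a fetch plan is attached to each new Y attribute. The Python sketch below illustrates this idea under strong simplifying assumptions (a single relation, attributes named by strings, constraints given as (X, Y, N) triples). The function name `covered_attributes` and the plan strings are hypothetical; this is not the algorithm BEChk, whose rules are not reproduced here.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A constraint X -> Y with cardinality bound N, over attribute names.
Constraint = Tuple[FrozenSet[str], FrozenSet[str], int]


def covered_attributes(constants: Iterable[str],
                       constraints: List[Constraint]) -> Dict[str, str]:
    """Map each covered attribute to a textual description of its plan."""
    # Base case: attributes bound to constants in the query are covered.
    plans: Dict[str, str] = {a: "const" for a in constants}
    changed = True
    while changed:  # iterate until no constraint adds a new attribute
        changed = False
        for x_attrs, y_attrs, bound in constraints:
            if not all(a in plans for a in x_attrs):
                continue  # some X attribute has no plan yet
            for attr in sorted(y_attrs):
                if attr not in plans:
                    x_list = ",".join(sorted(x_attrs))
                    plans[attr] = f"fetch({x_list} -> {attr}, N={bound})"
                    changed = True
    return plans


# Toy input, invented for illustration only.
toy_constraints: List[Constraint] = [
    (frozenset({"B"}), frozenset({"C"}), 10),
    (frozenset({"A"}), frozenset({"B"}), 5),
    (frozenset({"D"}), frozenset({"E"}), 3),
]
toy_plans = covered_attributes({"A"}, toy_constraints)
```

In the toy input, A is bound to a constant, so B and then C become covered, whereas E stays uncovered because D never receives a plan.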

Fig. 3 Reduced graphs for Example 9

(a) Conditional expansion. Denote by $U_u$ the attribute, relation or sub-query encoded by a node u in . Note that $U_u$ may deduced from attributes or are coefficients that can be estimated from database statistics as a priori.

and (2) a proof of encodes a bLAP for R under .

(a) BEAS@PG outperforms PostgreSQL on each and every query on all the three datasets, when all indices are enabled for PostgreSQL. It is $1.11 \times 10^4$ times faster on average.

As shown in Figs. 4(a) and 4(b), (a) the evaluation time of BEAS@PG is independent of the size of D, as expected for boundedly evaluable queries. (b) Bounded query plans work well with large D. Indeed, BEAS@PG took less than 11.67 ms and 3.94

Fig. 4 Effectiveness of bounded evaluation for bounded and unbounded queries

T, φ) retrieves an intermediate relation S by using the index of φ on D, where each tuple t in S is annotated with multiplicity m(D, t) (also retrieved by φ).
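Since each fetched tuple t is annotated with its multiplicity m(D, t), aggregates under the bag semantics can be computed from the distinct partial tuples alone. The Python sketch below illustrates this with COUNT, SUM and AVG, assuming the fetched relation is represented as a list of (value, multiplicity) pairs. The function names and the toy data are hypothetical and serve only as an illustration.

```python
from typing import List, Tuple

# A fetched relation: distinct values paired with their multiplicities.
Fetched = List[Tuple[float, int]]


def bag_count(fetched: Fetched) -> int:
    """COUNT over the underlying bag: the multiplicities add up."""
    return sum(m for _, m in fetched)


def bag_sum(fetched: Fetched) -> float:
    """SUM over the underlying bag: each value weighted by its multiplicity."""
    return sum(v * m for v, m in fetched)


def bag_avg(fetched: Fetched) -> float:
    """AVG over the underlying bag; undefined for an empty bag."""
    n = bag_count(fetched)
    if n == 0:
        raise ValueError("average of an empty bag is undefined")
    return bag_sum(fetched) / n


# Toy bag [5, 5, 7, 0] and its duplicate-free annotated form.
original_bag = [5, 5, 7, 0]
fetched_relation: Fetched = [(0, 1), (5, 2), (7, 1)]
```

The three aggregates agree with those computed directly on the original bag, although only three annotated tuples are accessed instead of four.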
Boundedly evaluable queries. Under access schema $\mathcal{B}$, an RAaggr query Q is boundedly evaluable if it has a
the size of Q is typically small. Taking one of these algorithms as an oracle for computing $Q_m$, one can still efficiently check the bounded evaluability of generic SPC queries: first minimize Q, yielding $Q_m$, and then check whether $Q_m$ and $\mathcal{B}$ satisfy the condition of Lemma 5 in PTIME in $|Q_m|$ and $|\mathcal{B}|$.
i.e., to include R[Y] in $BA_1$ with . For (i), simply let $\xi_A$ = {c}. Then $\xi_A$ is a bounded plan that fetches all necessary values for A. For (ii), let . Then by the semantics of fetch, all values for R[Y] that are necessary for answering Q are fetched by $\xi_{R[Y]}$ (here we rename Q beforehand such that there exist no duplicated attribute names).
been included in BR in prior steps. Hence by the induction hypothesis, there exist bounded plans that fetch all values from $R_1, \cdots, R_p$, respectively, which are necessary for Q. Now construct plan $\xi_{Q_s}$ by replacing each relation in $Q_s$ with its plan. Then $\xi_{Q_s}$ must be a query plan for $Q_s$ of Q since all necessary value combinations can be retrieved from D via these plans, and $Q_s$ then filters and combines values exactly the same as on D.
Example 8. Recall $Q_2$ from Example 2 and the bag access schema from Example 4. Algorithm BEChk iteratively updates BA, BR and BQ for $Q_2$ and this schema, which are all initially empty.

Table 2
(b) Even though all TPCH queries are unbounded, over 16 GB of TPCH data, BEAS@PG is up to 40.46 times faster than PostgreSQL, and is on average 7.32 times faster. For unbounded queries over AIRCA and UKMOT, BEAS@PG is on average 1.32 and 4.61 times faster than PostgreSQL, respectively, and up to 1.48 and 6.10 times faster.