A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries
Abstract
We outline the processes of intelligent creation and utilization of granulated data summaries in an engine aimed at fast approximate execution of analytical SQL statements. We discuss how to use the introduced engine for the purposes of ad-hoc data exploration over large and quickly growing data collected in a heterogeneous or distributed fashion. We focus on mechanisms that transform input data summaries into result sets representing query outcomes. We also illustrate how our computational principles can be put together with other paradigms of scaling and harnessing data analytics.
Keywords
Big data processing, Approximate query processing, Data summarization, Granular computing, Machine intelligence

1 Introduction
There is a growing need to explore big data sets. Most companies address this challenge by scaling out resources. However, this strategy is increasingly cost-prohibitive and inefficient for large and distributed data sources. On the other hand, people are realizing that the tasks of data exploration can be successfully performed in an at least partially approximate fashion. This way of thinking opens new opportunities to seek a balance between the speed, resource consumption and accuracy of computations. In the case of the approach introduced in this paper – an engine that produces high-value approximate answers to SQL statements by utilizing granulated summaries of input data – such a new kind of balance possesses very attractive characteristics.
The engine allows its users to achieve approximate – yet sufficiently accurate – analytical insights 100-1000 times faster than traditional solutions. From an architectural perspective, it comprises two layers: 1) the knowledge capture layer responsible for software-agent-style acquisition of data summaries and 2) the knowledge transformation layer responsible for utilizing already-stored summaries to produce fast approximate answers to ad-hoc queries. The first layer looks through potentially distributed and heterogeneous data sources, leaving the actual data in place. The second layer operates on the output of the first layer, removing the need to access the original data, which would be far too costly and is simply not needed for analytical purposes.
Properties highlighted in Figs. 1 and 2 can also be considered from the perspectives of business intelligence or cognitive computing, where there is an ongoing discussion whether in all use cases query answers must be exact and whether one can afford to wait for them. Furthermore, in many scenarios, the analysis needs to be conducted over data collected in a lake or cloud environment. Traditional solutions would then require moving / reorganizing large amounts of the data to make them efficiently accessible. The data sets might be queried in their original locations, but this makes it hard to jointly analyze diverse data sources in single operations. Thus, the ability to work with easily manageable summaries brings a great advantage.
The paper is organized as follows. In Section 2, we discuss our previous developments and some state-of-the-art approaches in the area of approximate query processing. In Section 3, we discuss the details of the captured summaries and provide some means for their analysis. In Section 4, we describe examples of the knowledge transformation operations that produce new summaries reflecting intermediate stages of query execution. In Section 5, we discuss the final stage of assembling approximate answers, namely, generating query results from the produced summaries. In Section 6, we report our empirical experiences gathered to date. In Section 7, we outline some other aspects still in development. Section 8 concludes the paper.
Although our main aim is to discuss the engine from a practical viewpoint, let us also emphasize that it relies on some significant scientific novelties. This is particularly visible in Section 4, where we adapt the mechanism of tree-based belief propagation to propagate the WHERE-related changes through data summaries, by introducing new methods of dynamic derivation of optimal trees from input summaries, extending the standard propagation model to let it work with more complex filter conditions, using partial knowledge about data-derived joint probabilities to efficiently run recursive calculations, and producing output summaries that represent the filtered data for the purposes of further transformations reflecting execution of a given SELECT statement.
2 Related works
In this section, we first refer to our own previous experiences and then proceed with a brief discussion related to other methodologies, as well as a more general scope of approximate querying in data analytics.
2.1 Historical background
To some extent, the approach introduced in this paper originated from the architecture of our earlier database engine, which clustered the incoming rows into so-called packrows, further decomposed into data packs gathering values of particular columns (Ślęzak et al. 2013). In that earlier framework, packrows were described by simple summaries accessible independently from the underlying data. It combined ideas taken from other database technologies (Abadi et al. 2013) and the theory of rough sets (Pawlak and Skowron 2007), by means of using summaries to classify data packs as relevant, irrelevant and partially relevant for particular SELECT statements – by analogy to deriving rough set positive, negative and boundary regions of the considered concepts, respectively. Such higher-level classifications were useful to limit the amount of compressed data packs that needed to be accessed to finish calculations. Actually, one could think about it as a production-ready example of a granular data analytics mechanism switching efficiently between detailed and summary-based data representations – a mechanism envisioned in a number of studies on the principles of granular computing (Yao 2016).
Our previous engine included rough query functionality developed for the purposes of both external usage and internal query execution acceleration (Kowalski et al. 2013). The idea was to quickly deliver bounds for actual query results. However, it was hard to configure other tools to work with such a new kind of output syntax. External tools would rather expect the engine to use its rough query capabilities to generate approximate results in a standard format. Somewhat in parallel, we encouraged the academic community to design summary-based machine learning and knowledge discovery methods (Ganter and Meschke 2011). However, once the original data access was disallowed, summaries stored within our previous framework could not provide the new versions of machine learning algorithms with sufficient information to make their results truly meaningful.
The above lessons – related to the non-standard format of rough query results and the limited usefulness of the aforementioned engine’s summaries – had a significant impact on the design presented in this paper. We decided to continue our work in this area because of the rapidly growing challenges faced by standard database system users, related not only to analyzing but also to simply managing their data. We followed the same strategy of grouping rows into packrows as previously. However, to make it really efficient, we needed to solve two problems: 1) how to build compact summaries that contain enough knowledge about the original packrows and 2) how to accurately perform operations on those summaries, thereby removing the need for access to the actual data.
2.2 Literature overview
Approximate query processing is already a popular trend in data analytics. Exact results of database computations are not always a must, e.g., for the purposes of reporting, visualization, trend analysis, event detection, or decision making in general. By analogy to modern image processing, approximate outcomes of data operations are acceptable if they enable the users to validly perceive and deploy data-derived knowledge. One may look at our algorithms also from the perspective of information granulation and granular computing, wherein the major rationales are that: 1) crisp, fine-grained information is often not available, 2) precise information is costly, 3) fine-grained information is not necessary and 4) coarse-grained information reduces cost (Zadeh 1997). Furthermore, approximate query processing corresponds to the area of approximate computing, whereby the primary goal is to determine what aspects and degrees of approximation are feasible so that the produced results are acceptable (Agrawal et al. 2016). Actually, many components of the approach considered in this paper, such as reducing the length of the heaviest algorithmic loops, quantizing input information, as well as loosening the assumptions about completeness and consistency of the analyzed data sources, can be compared to some of the most fundamental principles of approximate computing.
There are several ways to develop approximate SQL solutions. In most approaches, the results are estimated by executing queries on collections of intelligently derived data samples (Mozafari and Niu 2015). One of the advantages of such approaches is their ability to adapt the statistical apparatus to deliver confidence intervals for approximate outcomes. For that reason, we also considered introducing sampling into our earlier developments (Kowalski et al. 2013). However, for truly big data sets, good-quality samples need to be quite large too. This surely limits query acceleration possibilities, not to mention other sampling challenges related to multi-table joins, handling important outlying values, etc.
The second category of approximate query approaches is based on summaries (histograms, sketches, etc.) (Cormode et al. 2012). These two types of data synopses – samples and summaries – are to some extent combinable (Ramnarayan et al. 2016). However, the solutions developed so far build summaries for predefined query configurations or, e.g., OLAP-specific scenarios (Cuzzocrea and Saccà 2013). This limits their usefulness for exploratory analytics, where – by default – it is hard to anticipate the queries that will be executed (Nguyen and Nguyen 2005). Because of this, it is important to emphasize that the way we construct and utilize data summaries makes the engine introduced in this paper fully applicable to ad-hoc analytical SQL statements.
Our summaries are expressed by means of histograms, and there is a long tradition of their use within standard relational database optimizers (Jagadish et al. 1998). This is a great source of inspiration when developing new engines such as ours. In the literature, a lot of effort has been spent on addressing the need to update histogram structures while loading new data (Gibbons et al. 2002). However, this is one aspect where our approach is different, because we build separate summaries for each subsequently ingested data chunk. Hence, the newly buffered packrows do not interfere with the previously captured knowledge structures.
There is also a significant amount of research related to multidimensional summaries, although their derivation and utilization are far more complex than for single columns. Histograms reflecting intensities of co-occurrences of values of different columns are a good starting point in this field (Bruno et al. 2001). One can also refer to examples of utilizing more complex multidimensional summaries, e.g., tree-based structures that can be applied in a progressive / iterative framework for approximate querying (Lazaridis and Mehrotra 2001). Compared to the aforementioned approaches, we operate with relatively simple multi-column representations that are easy to embed into our knowledge transformation mechanisms, yet contain sufficient information about co-occurrences of values in the original data. Let us re-emphasize that our transformations work in a loop over the captured packrow summaries. This gives us a truly valuable opportunity to model a potential drift of occurrence and co-occurrence characteristics with respect to time and origin of the continually ingested data sources.
It is also important to realize that approximate and traditional styles of querying can be used within the same application frameworks. Let us refer again to the tasks of data exploration, where it is quite common to begin an analytical process with approximations and finish with their thorough validation. In this respect, it is worth referring to methods that support incremental and interactive computations, such as paradigms and tools related to the evolution of complex SQL execution processes (Hu et al. 2007) and interactive visualization / representation of analytical query results (Fisher et al. 2012).
In addition to savings in execution time and resource consumption, there are also other motivations for approximate querying. Some of them are related to the fact that in dynamic environments the data may evolve too fast to care about exact answers or – in other cases – some data fragments may be temporarily unavailable (Narayanan et al. 2008). Yet another challenge – analogous to the one encountered by search engines – is that available data items and query conditions often do not match each other perfectly. In such situations, one may wish to retrieve approximately fitting items (Naouali and Missaoui 2005). However, such approaches usually require modifications of query syntax. This can be problematic in practice, as in the case of changing the standard format of query outcomes.
3 Knowledge capture settings
In this section, we describe the methods designed to fill the engine with meaningful knowledge about the original chunks of data, i.e., packrows – collections of rows acquired from the original data tables.
3.1 Histograms, special values and domain representations
The knowledge capture layer needs to assess which aspects of the ingested data will be most useful later for approximate query algorithms. This may turn out to be even more important than the ability to store and process the original data, especially given the fact that perfect data access is questionable in many environments (Narayanan et al. 2008). For the knowledge representation purposes, we utilize enhanced histogram structures (Cormode et al. 2012). For each original data pack, its histogram contains information about dynamically derived range-based bars and special values that differ from neighboring values of the given column by means of their frequencies in the given packrow. We also store information about significant gaps, i.e., areas where no values occur. This is further used as a foundation for multi-column summaries capturing packrow-specific co-occurrences of special values or values belonging to histogram ranges of different data columns. The developed methods decide what is worth storing based on a significance ranking of the detected co-occurrences.
Special values are chosen to minimize the average error related to estimating frequencies of the original values. This resembles classical methods of histogram variance minimization (Jagadish et al. 1998), although some modifications were required for columns with irregular domains. A similar ranking is created for gaps. Any value occurring in a given pack is a potential special value. Any interval between consecutive (along an ordering defined for a given domain) values occurring in a data pack is a candidate for a gap. However, only the most significant gaps and values can be explicitly represented. Further discussion of heuristic ranking functions worth considering during the creation of single-dimensional summaries can be found, e.g., in Chądzyńska-Krasowska et al. (2017).
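To make the idea of such a ranking concrete, a simplified greedy variant can be sketched as follows: values whose frequencies deviate most from the mean frequency of the remaining "non-special" values are extracted first, which reduces the average error of estimating the frequencies left behind. This is only an illustrative heuristic in the spirit of the ranking functions mentioned above; the function and variable names are ours, not the engine's.

```python
def rank_special_values(freqs, k):
    # freqs: value -> frequency within one data pack.
    # Greedily extract up to k values whose frequencies deviate most
    # from the mean frequency of the remaining values; each extraction
    # makes the residual bar more uniform, lowering estimation error.
    special = {}
    rest = dict(freqs)
    for _ in range(min(k, len(rest) - 1)):
        mean = sum(rest.values()) / len(rest)
        v = max(rest, key=lambda x: abs(rest[x] - mean))
        special[v] = rest.pop(v)
    return special
```

For instance, with frequencies `{1: 100, 2: 5, 3: 6, 4: 5}`, the outlying value 1 is extracted first, leaving a nearly uniform remainder.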
Gaps play an important role in estimating local value distributions within histogram ranges. As will become clearly visible in Section 4, such estimations are crucial for the proper behavior of data summary transformations and final query result generation. For instance, referring to Fig. 3, we know that in a given packrow, over a given data column, there were 3580 rows with values dropping into the range between 0 and 100. Let us assume that the greatest common divisor stored for the corresponding data pack (denoted as the field gcd in Fig. 5 described in Section 3.3) equals 2. Thus, if a given query execution requires the estimation of the number of distinct values occurring in the considered range, then knowledge about a gap between 40 and 60 may improve that estimation by avoiding counting 9 potential values that did not occur in the original data.^{1} Knowledge components related to the occurrence of distinct values will be further referred to as the domain representation.
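The example above can be sketched directly in code. Under the stated assumptions (range endpoints aligned with multiples of gcd, gaps treated as open intervals), candidate distinct values are the multiples of gcd in the range, minus those falling strictly inside a known gap; the function name is ours:

```python
def estimate_distinct(lo, hi, gcd, gaps):
    # Candidates are multiples of gcd within [lo, hi]; candidates that
    # fall strictly inside a known gap are excluded, since gaps mark
    # areas where no value occurred in the original data pack.
    count = 0
    v = lo
    while v <= hi:
        if not any(g_lo < v < g_hi for g_lo, g_hi in gaps):
            count += 1
        v += gcd
    return count
```

Here `estimate_distinct(0, 100, 2, [(40, 60)])` excludes exactly the 9 multiples of 2 strictly between 40 and 60 (42, 44, ..., 58), yielding 42 instead of 51 candidate distinct values.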
3.2 Co-occurrences of histogram bars and special values
The remaining task is to describe co-occurrences between (ranges of) values of different columns. This is quite a significant differentiator of our approach when compared to other methodologies of utilizing summaries in approximate query processing. The key objective is to set up a multi-column representation that is simple enough to operate with at the levels of both knowledge capture and query-specific knowledge transformations. This drove our decision to restrict ourselves to modeling co-occurrences only for pairs of columns and to represent such pairwise relationships only in a partial way. Although one might say that such a model cannot be sufficiently informative, remember that it is applied locally, for each separate packrow. This allows our ranking algorithms to focus on different pairs of columns and their corresponding values for different fragments of the ingested data, leading to the ability to express far more complex trends and patterns than one might expect.
The ratio (3) minimizes a weighted error comprising quantities of the form \( \left| \tau _{t}\left (a,b\right ) - \frac {p_{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )}{p_{t}\left ({r_{t}^{a}}[i]\right )p_{t}\left ({r_{t}^{b}}[j]\right )} \right| \), for pairs of ranges whose co-occurrence ratios are not stored. The formulation of τ _{ t }(a,b) enabled us to redesign and adapt some classical probability estimation and revision methods (which would have – in their original versions – quadratic complexity with respect to the number of bars / values), so that they work linearly for the knowledge transformation purposes (see Section 4). In Chądzyńska-Krasowska et al. (2017), we also reported some interesting mathematical properties of τ _{ t }(a,b) with regard to machine learning methods.
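A minimal sketch of building such a partial pairwise representation – keeping only the k pairwise ratios deviating most from independence and summarizing the remaining pairs by one frequency-weighted default ratio – could look as follows. The paper's actual ranking and error weighting are more elaborate; all names here are ours:

```python
def build_ratios(joint, k):
    # joint[i][j]: co-occurrence count of bar i of column a with bar j
    # of column b within one packrow. Returns (stored, default):
    # stored maps the k pairs deviating most from independence to their
    # ratios; default is the frequency-weighted mean ratio of the rest.
    total = sum(sum(row) for row in joint)
    pa = [sum(row) / total for row in joint]
    pb = [sum(joint[i][j] for i in range(len(joint))) / total
          for j in range(len(joint[0]))]
    ratios = {}
    for i, row in enumerate(joint):
        for j, c in enumerate(row):
            denom = pa[i] * pb[j]
            ratios[(i, j)] = (c / total) / denom if denom else 0.0
    stored = dict(sorted(ratios.items(),
                         key=lambda kv: abs(kv[1] - 1.0), reverse=True)[:k])
    rest = [(pa[i] * pb[j], r)
            for (i, j), r in ratios.items() if (i, j) not in stored]
    w = sum(wt for wt, _ in rest)
    default = sum(wt * r for wt, r in rest) / w if w else 1.0
    return stored, default
```

With `joint = [[30, 10], [10, 50]]` and k = 1, only the strongest deviation (bar pair (0, 0), ratio 1.875) is stored explicitly, while the other three pairs share one default ratio.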
3.3 Analytical testing environment
Our engine stores data summaries in binary files accessible by approximate query execution methods via internal interfaces. From a logical viewpoint, the contents of those files can be represented as a collection of data summary tables displayed in Fig. 5. For diagnostic purposes, we implemented a converter that transforms the contents of the binary files into an explicit relational form. As a result, one can use standard PostgreSQL to access and analyze the data summary tables that store information about histogram frequencies and ranges, special values, gaps, as well as bartobar and valuetovalue cooccurrences, per each database, table, column and packrow, independently from the implemented knowledge transformation mechanisms.
This kind of alternative access to the outcomes of our knowledge capture layer proved helpful when developing our quality testing environment. Moreover, the schema presented in Fig. 5 is a great means of elucidation for our customers. With the use of simple SQL, one can check which pairs of columns are most correlated by means of co-occurrences of their histogram ranges and special values (by querying bar_bar and special_special, respectively), what the degree of repeatability of special values in different data packs of the same data column is (by querying special), whether values of particular data columns evolve from packrow to packrow (by querying pack and gap), etc.
This kind of extra knowledge can help the users better understand the quality and performance characteristics of our query execution processes. Moreover, they may wish to explore data summary tables directly to do basic analytics, e.g., to approximately visualize demographics of particular columns. We also work with data summaries in their relational form visible in Fig. 5 while prototyping new algorithms. For instance, in Chądzyńska-Krasowska et al. (2017) we reported some examples of feature selection methods performed entirely on our data summary tables. The obtained empirical results prove that the one-dimensional and two-dimensional structures described in the previous subsections can provide completely new means to scale machine learning algorithms (Peng et al. 2005).
4 Knowledge transformation operations
For each SELECT statement, the engine transforms available data summaries iteratively to build a summary of the query result. Different transformation mechanisms are dedicated to different operations such as filtering, joining, aggregating, etc. Each subsequent operation in a query execution chain takes as its input relevant summaries produced by previous operations. Once the summaries corresponding to the final stage of query execution are assembled, we use them to produce the outcome interpretable as a standard SQL result. This idea becomes particularly interesting when compared to approximate query techniques based on data sampling, where – after all – the whole computational process remains at the original level of atomic data.
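The iterative character of this layer can be pictured as a minimal pipeline sketch: each operation maps per-packrow summaries to transformed summaries (or drops packrows evaluated as irrelevant), and only the final step materializes a standard-looking result. The dictionary-based summary format below is ours and purely illustrative:

```python
def execute_query(summaries, operations, finalize):
    # Each op: summary -> transformed summary, or None when the packrow
    # is evaluated as irrelevant for the query and can be skipped.
    # finalize: degranulates the final summaries into a SQL-like result.
    for op in operations:
        summaries = [s2 for s in summaries if (s2 := op(s)) is not None]
    return finalize(summaries)
```

For example, a filtering op that halves estimated row counts (and skips empty packrows), followed by a summing finalizer, turns `[{"cnt": 10}, {"cnt": 0}]` into the approximate result 5.0.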
4.1 Filteringrelated transformations
Let us consider the operation of filtering, i.e., taking into account SQL clauses such as WHERE (in the case of summaries of the original tables or dynamically derived summaries of nested SELECT results) or HAVING (in the case of dynamically derived summaries of earlier GROUP BY operations). For the engine introduced in this paper, this means revising frequencies of histograms and special values, as well as co-occurrence ratios, for the columns relevant for further query execution stages, based on filters specified over other columns.
The filtering process adapts tree-based belief propagation (Neapolitan 2003), which is a popular tool in decision making and modeling (Weiss and Pearl 2010). Over the years, many improvements and extensions of the original belief propagation algorithms have been proposed, including their SQL-based implementations (Gatterbauer et al. 2015). Nevertheless, this paper presents the first production-ready attempt to embed this idea inside an approximate query engine, where – in a loop over packrows – the most interesting co-occurrences of bars and special values of different columns are used as partial knowledge about data-derived conditional probability distributions. For more examples of probabilistic information flow models that operate with partial knowledge about value co-occurrences, let us refer to Pawlak (2005).
Let us consider a simple query of the form SELECT SUM(a) FROM T WHERE b > x, which involves columns a and b of data table T. From the perspective of our knowledge transformation layer, the process of executing this query is split into two separate phases: 1) the filtering transformation: calculating new counts and frequencies of a’s histogram bars (and special values) subject to the condition b > x and 2) the result generation: assembling SUM(a) based on the outcome of filtering, in combination with information about values occurring inside particular ranges of a. Herein, we concentrate on the first of the above phases. The second one will be discussed in Section 5.
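The two phases can be sketched end to end for this query under strong simplifying assumptions: values are taken as uniform within each histogram bar, and a and b are treated as independent within each packrow (i.e., the default case when no co-occurrence ratio is stored, which is precisely what the filtering transformation would refine). All data-structure and function names are ours:

```python
def approx_sum_where(packrows, x):
    # Sketch of SELECT SUM(a) FROM T WHERE b > x:
    # phase 1 - estimate the selectivity of b > x from b's histogram
    #           bars (uniform-within-bar assumption);
    # phase 2 - scale a's bar frequencies by that selectivity and
    #           assemble SUM(a) from bar midpoints.
    total = 0.0
    for pr in packrows:
        sel = 0.0
        for (lo, hi), cnt in pr["b_bars"]:
            if x <= lo:
                sel += cnt                          # whole bar passes
            elif x < hi:
                sel += cnt * (hi - x) / (hi - lo)   # partial overlap
        sel /= pr["rows"]
        for (lo, hi), cnt in pr["a_bars"]:
            total += sel * cnt * (lo + hi) / 2.0
    return total
```

For a single packrow of 100 rows where half of b's values exceed x, the estimate is simply half of a's midpoint-based sum.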
4.2 Dynamic construction of propagation trees
Let us now explain how the idea of belief propagation can be adapted to serve the WHERE-related knowledge transformations. The most basic propagation algorithm operates on a directed tree spanned over variables assigned finite domains of values. A directed edge in the tree expresses a causal relation, whereby a parent makes its child independent from its remaining non-descendants. Such a relation is encoded in the form of a conditional probability distribution of the child node subject to its parent. For each packrow t, our task is to construct and use a tree spanned over the columns involved in a given query, with probabilities defined by means of the available histograms and co-occurrence ratios. In the general case, distributions are defined over special values and histogram ranges embracing the remaining “non-special” values. For the sake of clarity, we limit ourselves to a simplified scenario where one-dimensional column representations correspond only to histogram ranges.
Let us refer again to the example in Fig. 6. In that case, the corresponding tree can be envisioned as a←b, that is, a is the child of b. Single-variable probability distributions are defined over ranges \({r_{t}^{a}}[1]\), \({r_{t}^{a}}[2]\) and \({r_{t}^{a}}[3]\), as well as \({r_{t}^{b}}[1]\), \({r_{t}^{b}}[2]\) and \({r_{t}^{b}}[3]\). (Actually, one can see that this is one of the differences when compared to the classical model of belief propagation, where the single-variable distribution is stored only for the root of a tree.) Furthermore, the conditional distribution \(p_{t}\left ({r_{t}^{a}}[i]\mid {r_{t}^{b}}[j]\right )\) can be retrieved as either \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )p_{t}\left ({r_{t}^{a}}[i]\right )\) in the case of \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right ) \in \tilde {t}\), or \(\tau _{t}\left (a,b\right )p_{t}\left ({r_{t}^{a}}[i]\right )\) otherwise.
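The retrieval rule just described is a one-liner once the stored ratios are kept in a map keyed by bar pairs (names are ours). Note that, with only partial ratio knowledge, the recovered conditionals for a fixed parent bar need not sum exactly to one, which downstream transformations have to tolerate or renormalize:

```python
def cond_prob(p_a, stored, default, i, j):
    # p_t(r_a[i] | r_b[j]) recovered from partial co-occurrence
    # knowledge: the stored ratio when pair (i, j) was kept by the
    # ranking, the packrow-level default ratio tau_t(a, b) otherwise,
    # in both cases multiplied by the marginal p_t(r_a[i]).
    return stored.get((i, j), default) * p_a[i]
```

For instance, with marginals `p_a = [0.4, 0.6]`, a stored ratio 1.875 for the pair (0, 0), and a default ratio of 1.0, the conditional for bar 0 of a given bar 0 of b becomes 0.75.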
As another example, let us consider the statement SELECT b, SUM(a) FROM T WHERE b > x GROUP BY b. In this case, the propagation tree a←b looks the same as before. However, in addition to single-column distributions, we also need to pass on revised knowledge about co-occurrences involving a’s and b’s histogram ranges. This is because of the specifics of the GROUP BY operation, which – as briefly outlined at the beginning of Section 7 – transforms summaries reflecting the filtered data into summaries representing packrows of tuples indexed by the grouping values and their corresponding aggregation coefficients. Thus, if appropriately recalculated ratios between a and b were not provided, the whole mechanism would produce incorrect final summaries. We return to this topic in Section 4.3, where it is discussed how to use propagation trees to revise knowledge about co-occurrence ratios.
If the undirected graph G _{ t } constructed as a result of lines 2-3 has multiple connected components – further denoted by \({G^{X}_{t}}\) for some column subsets X ⊆ B ∪ C – then, as above, we can assume that those subsets are approximately independent from each other. This means that further calculations can be conducted for each of such components separately and then merged together. Let us denote by p _{ t }(Q ^{ ↓ X }) the estimated ratio of rows in packrow t that satisfy the filters specified in query Q on columns in C ∩ X (line 16). Then, the estimated ratio of rows in packrow t that satisfy all conditions of Q – denoted as p _{ t }(Q) – can be calculated as the product of coefficients p _{ t }(Q ^{ ↓ X }) (line 17). Similarly, we can run belief-propagation-based mechanisms aimed at the revision of histograms and co-occurrence ratios over particular components \({G^{X}_{t}}\) and then readjust them to obtain final summaries representing the filtered data.
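Combining the per-component estimates amounts to a product under the approximate-independence assumption, with an early exit when any component yields zero (the packrow is then irrelevant and can be skipped entirely). A minimal sketch, with a function name of ours:

```python
def packrow_selectivity(component_sels):
    # p_t(Q) as the product of per-component selectivities p_t(Q|X),
    # justified by treating the connected components of G_t as
    # approximately independent. A zero for any component marks the
    # whole packrow as irrelevant, so we short-circuit immediately.
    p = 1.0
    for s in component_sels:
        if s == 0.0:
            return 0.0
        p *= s
    return p
```

E.g., components with selectivities 0.5 and 0.4 combine to 0.2, while any zero component zeroes out the packrow.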
In Section 4.3, we describe how to calculate the revised probabilities and ratios within each X ⊆ B ∪ C. To do this, we need to transform an undirected tree \({S^{X}_{t}}\) into its directed version \(\overrightarrow {S}^{X}_{t}\). Algorithm 2 addresses this task in lines 7-8. In order to minimize the amount of further computations, we choose as the tree’s root a column that is on average closest to the elements of C ∩ X. Later, during the belief propagation process, it is also worth considering line 18, which checks whether complete calculations are really necessary. Namely, if the first phase of propagation provides us with p _{ t }(Q ^{ ↓ X }) = 0 for any of the connected components of G _{ t }, then packrow t can be skipped as irrelevant.
Surely, we realize that the whole introduced approach raises a number of questions with regard to the accuracy of estimations represented by transformed data summaries. First, one needs to take into account the quantization of the original column domains, as discussed in Section 3.1. Second, as elaborated in Section 3.2, we can afford to store only a fraction of the co-occurrence ratios representing relationships between quantized / discretized domains of different columns, which leads, e.g., to potentially inaccurate judgements with respect to data-derived probabilistic independencies. Third, in order to adapt the tree-based belief propagation mechanisms to model the process of filtering, we partially neglect some of the two-column relationships that are evaluated as weaker than the others. Still, despite all those potential inaccuracy factors, the approximate results of the analytical queries that we tested on large real-world data sets are usually close enough to their exact counterparts. Thanks to the diagnostic analysis performed on the data summary tables described in Section 3.3, we can conclude that this is because our histogram generation, as well as our special value / domain gap / co-occurrence ratio ranking functions, are indeed able to store the most meaningful aspects of the original data.
4.3 Adaptation of belief propagation algorithm
In this subsection, we focus on calculations over a single tree \(\overrightarrow {S}^{X}_{t}\), for a column subset X ⊆ B ∪ C corresponding to one of the connected components of graph G _{ t }. As before, we refer to Algorithm 2. The discussed procedures are also visualized in Fig. 6, which actually concerns the case of a connected graph, i.e., X = B ∪ C, where B = {a,b,c} and C = {b,c}.
The idea of adapting tree-based belief propagation is motivated by the need for an efficient method to model conjunctions of conditions specified over multiple columns. In this paper, we do not consider other logical operations (such as, e.g., disjunctions). We do not consider conditions defined over derived columns either (such as, e.g., arithmetic or CASE WHEN expressions – see Section 7 for further discussion). Such enhancements should be quite straightforward to develop in the near future. Moreover, the current engine already contains some significant extensions when compared to the classical characteristics of propagation mechanisms (Weiss and Pearl 2010). In particular, it enables us to work with conjunctions of more complex single-column conditions than the “variable = value” filters that would be supported by the standard model (Neapolitan 2003).
Let us denote by c _{ Q } a two-valued variable characterizing whether a given original row in packrow t satisfies the condition specified by query Q over column c ∈ C ∩ X. Let us represent the values of c _{ Q } as \({r^{c}_{Q}}[1]\) and \({r^{c}_{Q}}[2]\), corresponding to rows satisfying and not satisfying the considered condition, respectively. (Note that we do not index those values with t, as they have the same meaning for all packrows of a given table.) Using this new notation, we can rewrite the probabilities discussed in Section 4.2 as \(p_{t}(Q^{\downarrow X})=p_{t}(\bigwedge _{c\in C\cap X}{r^{c}_{Q}}[1])\) and \(p_{t}(Q)=p_{t}(\bigwedge _{c\in C}{r^{c}_{Q}}[1])\). Moreover, the main task of this subsection can now be formulated as the estimation of new frequencies and co-occurrence ratios for (pairs of) columns in B as conditionals subject to \(\bigwedge _{c\in C}{r^{c}_{Q}}[1]\).

For each column a ∈ X, let:

– \(p_{t}^{\downarrow }\left ({r_{t}^{a}}[i]\right )\) be the probability that the conjunction \(\bigwedge _{c\in C\cap {D_{t}^{a}}}{r^{c}_{Q}}[1]\) holds in packrow t subject to observing values within range \({r_{t}^{a}}[i]\) on column a,
– \(p_{t}^{\uparrow }\left ({r_{t}^{a}}[i]\right )\) be the probability that values within range \({r_{t}^{a}}[i]\) are observed on column a subject to satisfaction of the conjunction \(\bigwedge _{c\in C\cap {N_{t}^{a}}}{r^{c}_{Q}}[1]\).

The λ-parameters are initialized and propagated as follows:

– for each node c _{ Q }, we put \(\lambda _{t}\left ({r^{c}_{Q}}[1]\right ) = 1\) and \(\lambda _{t}\left ({r^{c}_{Q}}[2]\right ) = 0\),
– for all other leaves in \(\overrightarrow {S}_{t}^{X}\), we put λ _{ t } ≡ 1,
– for each b ∈ X which is not a leaf in \(\overrightarrow {S}_{t}^{X}\), we put:
$$ \lambda_{t}\left( {r_{t}^{b}}[j]\right) = {\prod}_{a \in Ch_{t}^{b}} {\lambda_{t}^{a}}\left( {r_{t}^{b}}[j]\right) \qquad (10) $$
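The initialization rules and recursion (10) amount to a bottom-up pass over the summary tree. The snippet below is purely illustrative; the data structures and names are our own assumptions, not the engine's API:

```python
# Illustrative bottom-up lambda pass over a summary tree. Each node ranges
# over its histogram bars; each edge holds a conditional table
# cond[i][j] = p(child = bar i | node = bar j). Query-condition leaves c_Q
# carry evidence [1.0, 0.0]; returning None encodes "lambda is identically 1".
def lambda_pass(tree, node, evidence):
    """tree: {node: [(child, cond_table)]}; evidence: {leaf: lambda vector}."""
    if node in evidence:                       # c_Q leaf: lambda = [1, 0]
        return evidence[node]
    children = tree.get(node, [])
    if not children:                           # non-evidence leaf: lambda == 1
        return None
    lam = None
    for child, cond in children:
        child_lam = lambda_pass(tree, child, evidence)
        if child_lam is None:                  # child contributes a factor of 1
            continue
        # message: lambda^child(j) = sum_i p(child=i | node=j) * lambda_child(i)
        msg = [sum(cond[i][j] * child_lam[i] for i in range(len(cond)))
               for j in range(len(cond[0]))]
        # recursion (10): multiply the messages from all children
        lam = msg if lam is None else [a * b for a, b in zip(lam, msg)]
    return lam
```

For a single parent node with one condition leaf, the pass simply weights the leaf evidence by the conditional table, as in standard tree belief propagation.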
Another aspect of performance acceleration relates to the parameters \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )\) and \(\pi _{t}\left ({r_{t}^{a}}[i]\right )\). Let max_no_of_bars denote the maximum number of histogram bars, as in Algorithm 1. In their original form, the formulas \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )={\sum }_{i} p_{t}\left ({r_{t}^{a}}[i] \mid {r_{t}^{b}}[j]\right ) \lambda _{t}\left ({r_{t}^{a}}[i]\right )\) and \(\pi _{t}\left ({r_{t}^{a}}[i]\right )={\sum }_{j} p_{t}\left ({r_{t}^{a}}[i] \mid r_{t}^{\hat {a}}[j]\right ){\pi _{t}^{a}}\left (r_{t}^{\hat {a}}[j]\right )\) would lead toward the quadratic cost \(\mathcal {O}\left (|X|\cdot max\_no\_of\_bars^{2}\right )\). In contrast, (11) and (13) provide us with computational complexity at the level of \(\mathcal {O}\left (\max \left (|X|\cdot max\_no\_of\_bars, \tilde {t}\right )\right )\), where \(\tilde {t}\) denotes the number of co-occurrence ratios stored by the considered engine for packrow t.
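The acceleration idea can be illustrated as follows. Assuming co-occurrence ratios are defined as joint probability over the product of marginals, \(\tau (i,j) = p_{t}({r_{t}^{a}}[i] \wedge {r_{t}^{b}}[j]) / (p_{t}({r_{t}^{a}}[i])\, p_{t}({r_{t}^{b}}[j]))\), and are stored only for selected pairs of bars while all other pairs fall back to a default ratio, the message sum splits into one shared term plus sparse corrections. This is our own hedged sketch of the trick, not the engine's implementation:

```python
# Sketch of a linear-time lambda message under a sparse tau representation.
# p(a_i | b_j) = p(a_i) * tau(i, j), with tau defaulting for unstored pairs:
#   sum_i p(a_i|b_j) lam(a_i)
#     = default_tau * sum_i p(a_i) lam(a_i)            # shared for all j
#     + sum_{stored (i,j)} p(a_i) (tau - default) lam(a_i)   # sparse fixups
def fast_lambda_message(p_a, lam_a, stored_tau, default_tau, n_b):
    """p_a: marginal bar probabilities of a; lam_a: lambda vector of a;
    stored_tau: {(i, j): tau}; n_b: number of bars of b."""
    base = default_tau * sum(p * l for p, l in zip(p_a, lam_a))
    msg = [base] * n_b
    for (i, j), tau in stored_tau.items():
        msg[j] += p_a[i] * (tau - default_tau) * lam_a[i]
    return msg
```

The cost is proportional to the number of bars plus the number of stored ratios, matching the linear complexity bound above.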
5 Generating final query results
Alternatively, if the knowledge capture layer is regarded as responsible for the aforementioned information granulation (Zadeh 1997), then the translation of query result summaries into final approximate results can be treated as information degranulation. This way, the overall design of our engine fits the idea of calculations on information granules, with a special emphasis on their transformation and reorganization. More inspirations – related in particular to switching between granular and detailed data representations – can be found in Liu et al. (2013).
Query result generation can also be understood as the transition from a result summary to a result comprising crisply-valued tuples. For simple queries, such as aggregations optionally grouped with respect to low-cardinality data columns, this stage is quite straightforward. As an example, let us go back to the SELECT statement considered in Section 4.1. Its execution consists of transforming information about column a subject to the clause b > x and then producing the final outcome of SUM(a). In this case, the result can be computed as a total sum of outputs produced for each of the packrows. For a given packrow, we take into account the transformed frequencies of a’s special values and add their value ⋅ frequency scores to the constructed sum. The remaining part is to calculate the contributions of histogram bars after subtracting the frequencies of their special values. For this purpose, we construct an estimate of an average “non-special” value within each bar. Such an estimate can be obtained based on the domain representation discussed at the end of Section 3.1.
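A minimal sketch of this per-packrow summation follows. The data layout is hypothetical, and the bar-midpoint average is only a crude stand-in for the domain-representation-based estimate mentioned above:

```python
# Illustrative estimate of SUM(a) contributed by one packrow after filtering.
def packrow_sum(specials, bars):
    """specials: list of (value, freq) for a's special values;
    bars: list of (lo, hi, freq, special_freq), where special_freq is the
    total frequency of special values falling inside the bar."""
    total = sum(v * f for v, f in specials)       # value * frequency scores
    for lo, hi, freq, special_freq in bars:
        avg_nonspecial = (lo + hi) / 2.0          # crude stand-in estimate
        # remaining "non-special" rows contribute the estimated average
        total += (freq - special_freq) * avg_nonspecial
    return total
```

The full query result would then be the sum of such contributions over all non-zeroed packrows.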
The situation changes if high-cardinality columns are involved. Then, it is especially beneficial to model all data operations at the level of summaries and switch to detailed values just prior to shipping the query result outside. Let us consider the query SELECT a, b, c FROM T WHERE b > x and c < y highlighted in Fig. 7. In this case, after belief propagation, we need to generate collections of the resulting a-b-c tuples corresponding to each of the (non-zeroed) packrow summaries. First, we use those summaries (both single- and multi-dimensional ones) to create collections of tuples labeled with the codes of special values and histogram ranges (symbolizing occurrences of “non-special” values). Then, we replace the codes with final values generated using – again – the domain information available for the involved columns.
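The two-stage generation can be sketched as follows. The structures are entirely illustrative: `code_probs` and `decoders` are hypothetical stand-ins for the transformed summaries and the per-column domain information, respectively:

```python
import random

# Illustrative two-stage tuple generation for one packrow summary:
# first draw tuples of codes (special-value codes or histogram-bar codes),
# then decode the codes into concrete values using per-column domain info.
def generate_tuples(n, code_probs, decoders, seed=0):
    """code_probs: {column: [(code, probability)]};
    decoders: {column: function(code) -> concrete value}."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # stage 1: draw a code per column according to the summary
        coded = {col: rng.choices([k for k, _ in ps], [p for _, p in ps])[0]
                 for col, ps in code_probs.items()}
        # stage 2: decode codes into final values (e.g. respecting gaps, GCDs)
        out.append({col: decoders[col](code) for col, code in coded.items()})
    return out
```

A real engine would additionally draw codes jointly across columns using the two-dimensional summaries, rather than per column as in this simplified sketch.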
In conclusion, query result generation relies both on the operations presented in previous subsections and on the domain representation used at the last stage. Let us recall that, in addition to knowledge summarized in terms of (co-)occurrences of special values and histogram bars, we store the most significant gaps and the greatest common divisors of values observed in the original data packs. Referring to the theory of rough sets (Pawlak and Skowron 2007), we can say that special values whose frequencies were not pushed down to zero during query execution constitute a kind of the domain’s positive region, i.e., values that should contribute to the query result. On the other hand, gaps, greatest common divisors, dictionaries (if available) and zeroed frequencies can be used together to define the domain’s negative region, i.e., values that should not contribute to the result. Utilization of such regions during result generation is one of the areas of our ongoing improvement, as it strongly relates to the elimination of non-existing values in approximate query outcomes (Chądzyńska-Krasowska et al. 2017).
6 Empirical evaluation
Referring to Fig. 10 again, one might claim that comparing ourselves to Hive / Spark SQL is not fair, because these are products with quite different aims. Indeed, such solutions are expected to cooperate rather than compete with each other, at the level of both external characteristics and the underlying computational paradigms. Actually, we have already conducted experiments related to re-implementation of our knowledge transformation operations in the native Spark environment (Zaharia et al. 2016). Preliminary results are highly encouraging, as such an implementation allows us to take advantage of query acceleration factors corresponding to both summary-based computations and the resource-related scalability used, e.g., by Spark SQL (Armbrust et al. 2015). On the other hand, it is certainly important to empirically compare our approach with other approximate query processing solutions. However, as mentioned in Section 2.2, this is a difficult challenge, because those solutions still require far more development work to scale and to provide sufficient accuracy for complex queries.
As shown in Fig. 11, one can also run some machine learning methods directly against data summaries, instead of using the classical SQL-based interface. Any steps in this direction would fit the harnessing knowledge dimension illustrated in Fig. 9. (We will discuss it further in Section 7.) Moreover, from the perspective of evaluating the engine components, it lets us assess the knowledge capture layer decoupled from the SQL-specific knowledge transformation mechanisms. That was actually our motivation to develop the already-mentioned summary-based feature selection approach reported in Chądzyńska-Krasowska et al. (2017). Namely, we considered one of the standard minimum redundancy maximum relevance (mRMR) feature selection techniques (Peng et al. 2005) and executed it for some real-world data sets using: 1) mutual information scores computed over the original data, 2) mutual information scores computed over 15% samples and 3) approximations of mutual information scores derived directly from the stored data summaries. The outputs – interpreted as rankings of columns produced by each of the three runs of the mRMR algorithm – turned out to be encouragingly similar to each other, while the summary-based approximate calculations were incomparably faster than in both other cases.
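What makes the three runs interchangeable is that the greedy mRMR scheme of Peng et al. (2005) can be parameterized by the mutual information estimator. The sketch below is a generic textbook version of that scheme, not our engine code:

```python
# Greedy mRMR ranking, parameterized by a mutual-information estimator `mi`,
# which may be computed over exact data, samples, or granulated summaries.
def mrmr_ranking(features, target, mi):
    """mi(x, y) -> mutual information estimate; returns all features ranked
    by relevance to `target` minus average redundancy w.r.t. already-selected
    features."""
    selected, remaining = [], list(features)
    while remaining:
        def score(f):
            relevance = mi(f, target)
            redundancy = (sum(mi(f, s) for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)      # greedy step
        selected.append(best)
        remaining.remove(best)
    return selected
```

Comparing the rankings produced under different `mi` back-ends is exactly the similarity evaluation described above.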
7 Further discussion and development directions
There are a number of important topics that were not addressed in detail in this paper. We summarize them briefly below.
Coverage of SQL operations
Besides data filtering, the engine also supports other operations, such as JOIN, GROUP BY, etc. While joining two tables, we need to create summaries of the collections of tuples that would belong to the standard output of the JOIN operation. This is done by amalgamating pairs of summarized packrows of the input tables. Such amalgamation is analogous to the filtering operations, though it can be preceded by a single-table process of merging similar packrow summaries (the outcomes of Algorithm 2, if any filters were applied prior to joining). As for GROUP BY, its standard output would consist of vectors labeled by aggregate results and the corresponding values of the columns used for grouping. Herein, the goal is to produce summaries of such vectors without generating them explicitly. More examples of our knowledge transformation operations will be described in the near future.
Data types and derived columns
In general, some components of the above-mentioned operations are agnostic to column semantics, while others depend on the semantics of particular columns. For example, in Algorithm 2, almost all calculations are conducted on probabilities of abstracted partition blocks, except line 12. This is worth remembering while developing new knowledge capture techniques. For instance, for some alphanumeric columns one could consider histogram bars labeled by prefixes instead of ranges. For such new types of histograms, some fundamental knowledge transformation functions – such as (7) – would require re-implementation. There is also a requirement for dynamic creation of both one-dimensional and two-dimensional summaries reflecting new columns derived as complex expressions. Certainly, computations at this level need to rely on the types of the original columns as well.
Disk and memory management
Many database mechanisms can be adapted to work with granulated summaries. Their contents can be clustered into even bigger chunks – “mega-packs” – and labeled with higher-level descriptions. For the considered engine, this would be a step toward the paradigms of multi-level granular data analytics (Yao 2016). Yet another aspect refers to vertical data organization (Abadi et al. 2013), which – in our case – means independent access to collections of histograms, special values, gaps, co-occurrence ratios, etc. This way, for every query, we can fetch only those components of the stored data summaries that are required to execute its particular operations. Such components can be cached in memory for the purposes of future queries. Moreover, summaries representing intermediate outputs of query execution stages can be managed in memory in a pipeline style, as in standard database engines.
Knowledge capture improvements
Accuracy tradeoffs and hybrid scenarios
In Section 3, we pointed out that our engine should store only a limited fraction of knowledge, specified by means of the maximum numbers of histogram bars, special values, gaps and co-occurrence ratios. Although all those numbers are constrained by some defaults, it is possible to vary them with respect to particular columns or whole data tables. For instance, it may be desirable to optionally access some tables in their classical form and join them with the summarized contents of other tables within hybrid query execution schemes. Such flexibility fits real-world scenarios of incremental data analytics (Fisher et al. 2012), but it requires careful maintenance of information that allows the algorithms to extract the original data from remote sources. From the perspective of Fig. 12, this kind of information needs to be managed by knowledge capture agents. Parameters enabling the engine to trace the original data should constitute a kind of analytical knowledge grid, by analogy to the concept of the knowledge grid considered in the semantic web (Cannataro and Talia 2003).
Accuracy measures and confidence intervals
In Section 6, we referred to accuracy as a measure of similarity between exact and approximate query results, as well as of similarity between the outcomes of machine learning algorithms working with detailed and summarized data sets. One can conduct a deeper discussion on the need to design quality measures for data summaries and to investigate relationships between the accuracies of input summaries and the expected accuracies of query results. Finally, special attention should be devoted to the development of confidence interval calculations, which would assist the users in their everyday work, as is the case for some sampling-based approximate query engines. To accomplish this, our plans are twofold. First, we intend to enhance our framework with a kind of query explain functionality, enabling the users to trace potential inaccuracies emerging at particular stages of query execution. Second, we intend to create a tool that maintains a sample of original packrows and uses it – in combination with the core engine – to produce confidence intervals characterizing similarities between exact and approximate outcomes of SQL statements.
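As a rough illustration of the second plan (not an implemented feature), a maintained sample of original rows could be bootstrapped to bracket an aggregate result; all names below are our own assumptions:

```python
import random

# Illustrative bootstrap confidence interval for an aggregate computed over
# a maintained sample of original rows; a sketch of how a sample-assisted
# tool could bracket the summary-based approximate answer.
def bootstrap_ci(sample_values, agg, n_boot=1000, alpha=0.05, seed=0):
    """sample_values: retained original values; agg: aggregate function
    (e.g. mean); returns an approximate (1 - alpha) interval for agg."""
    rng = random.Random(seed)
    stats = sorted(agg(rng.choices(sample_values, k=len(sample_values)))
                   for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]            # lower percentile
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]  # upper percentile
    return lo, hi
```

In the envisioned tool, the resulting interval would be reported alongside the approximate answer, letting the user judge whether the accuracy suffices for the task at hand.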
Ability to expand beyond SQL
Figure 12 illustrates our plans with respect to extending the engine with library-style support for machine learning and knowledge discovery (KDD) processes. As discussed in Section 6, it is indeed possible to use our data summaries to conduct processes like mRMR feature selection (Peng et al. 2005). Some other feature selection approaches seem easy to consider as well (Ślęzak 2009). The next steps can include implementations of some elements of deep learning (Hinton 2007) and the redesign of some SQL-based decision tree induction algorithms (Nguyen and Nguyen 2005). Let us note that most of such algorithms rely on iterative data and model transformations. This constitutes a kind of conceptual equivalence between the work that we have already done within our framework with respect to approximate SQL and the work that is still ahead in the field of intelligent data processing. By analogy to Section 4, we need to develop efficient and accurate methods of expressing atomic machine learning operations by means of transforming summaries of their inputs into summaries of their outputs. Abstracting such operations can lead towards a powerful environment for interactive analytics, wherein the users could work with summaries of intermediate inputs / outputs via their favorite interfaces.
Integrating with other computational paradigms
Figure 12 also summarizes opportunities related to putting our architectural strengths together with other trends in the areas of scalable and distributed data processing. SQL Layer denotes the currently existing layer of relational-query-focused transformations that – as outlined in Section 6 – can be successfully embedded into the Spark environment (Zaharia et al. 2016) (see also the harnessing resources dimension in Fig. 9). ML Layer represents the methods discussed above (see also Chądzyńska-Krasowska et al. 2017). Capture Agent refers to the distributed knowledge capture layer that can save constructed histograms, etc., in a format uniformly accessible by SQL Layer, ML Layer and Spark’s generic tools enabling the users to work directly with data summaries. This last aspect is especially interesting when combined with the previously mentioned idea of interactive analytics, which – in addition to intermediate inputs / outputs of machine learning algorithms – can embrace access to summaries of simply filtered, joined and grouped data sets.
8 Conclusions
We presented a novel approximate query processing engine, which works by means of SQL-related transformations of granulated data summaries. The engine does not assume access to the original data. Instead, it processes quantized histograms and a low percentage of co-occurrence ratios reflecting multi-column interdependencies detected in the buffered chunks of ingested data tables. Switching from the level of atomic data rows to low-footprint descriptions of their bigger clusters enables us to significantly decrease the computational cost of operations corresponding to filtering, joining, aggregating, etc. On the other hand, by dealing with summaries at the level of data chunks – unlike other database solutions that do it for larger partition blocks or whole tables – we are able to better control their quality and address complex trends occurring in the original data.
The discussed engine is available in its first production version and is being heavily tested by the first customers. The empirical observations confirm that it can be suitable for data owners and data-based service providers who currently cannot cope with exploring permanently growing data, or who simply want to lower the cost of resources required for data maintenance and analytics. Some of the already validated use cases include network traffic analytics and intrusion detection, digital advertising, as well as monitoring of industrial processes. More generally, the users of similar engines can come from the fields of online applications, the internet of things, sensor-based risk management systems and other tools related to machine-generated data.
We envision a number of interesting directions for further research and development. First of all, we see a huge opportunity in strengthening the connections of our overall approach with the theories of rough sets and granular computing, e.g., with respect to designing multi-level models of approximate summary-based computations. Moreover, although the existing approximate querying approaches based on data sampling do not seem to scale sufficiently in real-world scenarios, they provide the users with very helpful functionalities, such as confidence interval derivations. Thus, one of our goals is to introduce such functionalities into our own framework, potentially by enriching our granulated data summaries with small subsets of original rows. However – as with any other extension – this will need to be done very carefully, so as to maintain the current performance characteristics, especially from the viewpoint of working with distributed and remote data sets.
Footnotes
 1.
Gaps are interpreted as open intervals.
 2.
In Fig. 5, one can find two types of default ratios – default_special_special_tau and default_not_covered_special_special_tau – corresponding to the cases of “parents” of special values whose ratios are stored and are not stored, respectively.
 3.
Let us recall that we deal with a dynamically derived approximate probability model, with all its consequences discussed at the end of Section 4.2. The investigation of how accurately such models reflect the data is one of our main analytical testing scopes.
References
 Abadi, D., Boncz, P., Harizopoulos, S., Idreos, S., & Madden, S. (2013). The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 197–280.
 Agrawal, A., Choi, J., Gopalakrishnan, K., Gupta, S., Nair, R., Oh, J., Prener, D.A., Shukla, S., Srinivasan, V., & Sura, Z. (2016). Approximate computing: challenges and opportunities. Proceedings of ICRC (pp. 1–8).
 Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., & Zaharia, M. (2015). Spark SQL: relational data processing in Spark. Proceedings of SIGMOD (pp. 1383–1394).
 Bruno, N., Chaudhuri, S., & Gravano, L. (2001). STHoles: a multidimensional workload-aware histogram. Proceedings of SIGMOD (pp. 211–222).
 Cannataro, M., & Talia, D. (2003). The knowledge grid. Communications of the ACM, 46(1), 89–93.
 Chądzyńska-Krasowska, A., Betliński, P., & Ślęzak, D. (2017). Scalable machine learning with granulated data summaries: a case of feature selection. Proceedings of ISMIS (pp. 519–529).
 Chądzyńska-Krasowska, A., Stawicki, S., & Ślęzak, D. (2017). A metadata diagnostic framework for a new approximate query engine working with granulated data summaries. Proceedings of IJCRS (1) (pp. 623–643).
 Cormode, G., Garofalakis, M.N., Haas, P.J., & Jermaine, C. (2012). Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1–3), 1–294.
 Cuzzocrea, A., & Saccà, D. (2013). Exploiting compression and approximation paradigms for effective and efficient online analytical processing over sensor network readings in data grid environments. Concurrency and Computation: Practice and Experience, 25(14), 2016–2035.
 Fisher, D., Popov, I.O., Drucker, S.M., & Schraefel, M.C. (2012). Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster. Proceedings of CHI (pp. 1673–1682).
 Ganter, B., & Meschke, C. (2011). A formal concept analysis approach to rough data tables. In Transactions on Rough Sets XIV (pp. 37–61). Springer.
 Gatterbauer, W., Günnemann, S., Koutra, D., & Faloutsos, C. (2015). Linearized and single-pass belief propagation. Proceedings of the VLDB Endowment, 8(5), 581–592.
 Gibbons, P.B., Matias, Y., & Poosala, V. (2002). Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 27(3), 261–298.
 Glick, R. (2015). Current trends in analytic databases – keynote talk at FedCSIS. https://fedcsis.org/2015/keynotes/rick_glick.
 Hinton, G.E. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11, 428–434.
 Hu, Y., Sundara, S., & Srinivasan, J. (2007). Supporting time-constrained SQL queries in Oracle. Proceedings of VLDB (pp. 1207–1218).
 Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., & Suel, T. (1998). Optimal histograms with quality guarantees. Proceedings of VLDB (pp. 275–286).
 Kowalski, M., Ślęzak, D., & Synak, P. (2013). Enhanced rough SQL for correlated subqueries. Proceedings of FedCSIS (pp. 1455–1462).
 Lazaridis, I., & Mehrotra, S. (2001). Progressive approximate aggregate queries with a multi-resolution tree structure. Proceedings of SIGMOD (pp. 401–412).
 Liu, Y., Li, D., He, W., & Wang, G. (2013). Granular computing based on Gaussian cloud transformation. Fundamenta Informaticae, 127(1–4), 385–398.
 Meila, M. (1999). An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data. Proceedings of ICML (pp. 249–257).
 Mozafari, B., & Niu, N. (2015). A handbook for building an approximate query engine. IEEE Data Engineering Bulletin, 38(3), 3–29.
 Naouali, S., & Missaoui, R. (2005). Flexible query answering in data cubes. Proceedings of DaWaK (pp. 221–232).
 Narayanan, D., Donnelly, A., Mortier, R., & Rowstron, A. (2008). Delay aware querying with Seaweed. The VLDB Journal, 17(2), 315–331.
 Neapolitan, R.E. (2003). Learning Bayesian networks. Prentice Hall.
 Nguyen, H.S., & Nguyen, S.H. (2005). Fast split selection method and its application in decision tree construction from large databases. International Journal of Hybrid Intelligent Systems, 2(2), 149–160.
 Ordonez, C., & Cereghini, P. (2000). SQLEM: fast clustering in SQL using the EM algorithm. Proceedings of SIGMOD (pp. 559–570).
 Pawlak, Z. (2005). Flow graphs and intelligent data analysis. Fundamenta Informaticae, 64(1–4), 369–377.
 Pawlak, Z., & Skowron, A. (2007). Rudiments of rough sets. Information Sciences, 177(1), 3–27.
 Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
 Ramnarayan, J., Mozafari, B., Wale, S., Menon, S., Kumar, N., Bhanawat, H., Chakraborty, S., Mahajan, Y., Mishra, R., & Bachhav, K. (2016). SnappyData: a hybrid transactional analytical store built on Spark. Proceedings of SIGMOD (pp. 2153–2156).
 Ślęzak, D. (2009). Degrees of conditional (in)dependence: a framework for approximate Bayesian networks and examples related to the rough set-based feature selection. Information Sciences, 179(3), 197–209.
 Ślęzak, D., Synak, P., Wojna, A., & Wróblewski, J. (2013). Two database related interpretations of rough approximations: data organization and query execution. Fundamenta Informaticae, 127(1–4), 445–459.
 Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., & Murthy, R. (2009). Hive – a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2), 1626–1629.
 Weiss, Y., & Pearl, J. (2010). Belief propagation: technical perspective. Communications of the ACM, 53(10), 94.
 Yao, Y. (2016). A triarchic theory of granular computing. Granular Computing, 1(2), 145–157.
 Zadeh, L.A. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2), 111–127.
 Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache Spark: a unified engine for big data processing. Communications of the ACM, 59(11), 56–65.
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.