Abstract
We outline the processes of intelligent creation and utilization of granulated data summaries in an engine aimed at fast approximate execution of analytical SQL statements. We discuss how to use the introduced engine for the purposes of ad-hoc data exploration over large and quickly growing data collected in a heterogeneous or distributed fashion. We focus on mechanisms that transform input data summaries into result sets representing query outcomes. We also illustrate how our computational principles can be combined with other paradigms of scaling and harnessing data analytics.
Introduction
There is a growing need to explore big data sets. Most companies address this challenge by scaling out resources. However, this strategy is increasingly cost-prohibitive and inefficient for large and distributed data sources. On the other hand, people are realizing that the tasks of data exploration can be successfully performed in an at least partially approximate fashion. This way of thinking opens new opportunities to seek a balance between the speed, resource consumption and accuracy of computations. In the case of the approach introduced in this paper – an engine that produces high-value approximate answers to SQL statements by utilizing granulated summaries of input data – such a new kind of balance possesses very attractive characteristics.
The presented engine captures knowledge in the form of single- and multi-column data summaries. It collects chunks of newly gathered data and builds summaries for each chunk separately. Unlike in standard databases, the considered query execution mechanisms do not assume any access to the original chunks. Those chunks can be available in a broader application framework, but the primary goal of our engine is to work with summaries, as illustrated by Fig. 1. Namely, for a given query received from an external tool, each consecutive data operation (such as filtering, joining, grouping, etc.) scheduled within the query execution plan is performed as a transformation of summaries representing its input into summaries representing its output.
The engine allows its users to achieve approximate – yet sufficiently accurate – analytical insights 100–1000 times faster than traditional solutions. From an architectural perspective, it comprises two layers: 1) the knowledge capture layer responsible for software-agent-style acquisition of data summaries and 2) the knowledge transformation layer responsible for utilizing already-stored summaries to produce fast approximate answers to ad-hoc queries. The first layer looks through potentially distributed and heterogeneous data sources, leaving the actual data in place. The second layer operates on the output of the first layer, removing the need to access the original data, which would be far too costly and simply not needed for analytical purposes.
Several organizations have already been working towards embedding our solution into their projects. Figure 2 illustrates its benefits in particular areas, where the emphasis is put on the speed of query-based analytical processes and the ability to integrate broader aspects of the data. Thanks to the informative richness of summaries, the approximate results can be used for insightful decisions. Moreover, our analytical testing environment allows the users to investigate similarities of approximate answers compared to query outputs that would be extracted in a standard way from the atomic data.
Properties highlighted in Figs. 1 and 2 can also be considered from the perspectives of business intelligence or cognitive computing, where there is an ongoing discussion whether in all use cases query answers must be exact and whether one can afford to wait for them. Furthermore, in many scenarios, the analysis needs to be conducted over data collected in a lake or cloud environment. Traditional solutions would then require moving / reorganizing large amounts of the data to make them efficiently accessible. The data sets might be queried in their original locations, but this makes it hard to jointly analyze diverse data sources in single operations. Thus, the ability to work with easily manageable summaries brings a great advantage.
The paper is organized as follows. In Section 2, we discuss our previous developments and some state-of-the-art approaches in the area of approximate query processing. In Section 3, we discuss the details of the captured summaries and provide some means for their analysis. In Section 4, we describe examples of the knowledge transformation operations that produce new summaries reflecting intermediate stages of query execution. In Section 5, we discuss the final stage of assembling approximate answers, namely, generating query results from the produced summaries. In Section 6, we report our empirical experiences gathered to date. In Section 7, we outline some other aspects still in development. Section 8 concludes the paper.
Although our main aim is to discuss the engine from a practical viewpoint, let us also emphasize that it relies on some significant scientific novelties. This is particularly visible in Section 4, where we adapt the mechanism of tree-based belief propagation to propagate the WHERE-related changes in data summaries, by introducing new methods of dynamic derivation of optimal trees from input summaries, extending the standard propagation model to let it work with more complex filter conditions, using partial knowledge about data-derived joint probabilities to efficiently run recursive calculations, and producing output summaries representing the filtered data for the purposes of further transformations reflecting execution of a given SELECT statement.
Related works
In this section, we first refer to our own previous experiences and then proceed with a brief discussion of other methodologies, as well as the more general scope of approximate querying in data analytics.
Historical background
To some extent, the approach introduced in this paper originated from the architecture of our earlier database engine, which clustered the incoming rows into so-called packrows, further decomposed into data packs gathering values of particular columns (Ślęzak et al. 2013). In that earlier framework, packrows were described by simple summaries accessible independently from the underlying data. It combined ideas taken from other database technologies (Abadi et al. 2013) and the theory of rough sets (Pawlak and Skowron 2007), by means of using summaries to classify data packs as relevant, irrelevant and partially relevant for particular SELECT statements – by analogy to deriving rough set positive, negative and boundary regions of the considered concepts, respectively. Such higher-level classifications were useful to limit the number of compressed data packs that needed to be accessed to finish calculations. Actually, one could think about it as a production-ready example of a granular data analytics mechanism switching efficiently between the detailed and summary-based data representations – a mechanism envisioned in a number of studies on the principles of granular computing (Yao 2016).
Our previous engine included rough query functionality developed for the purposes of both external usage and internal query execution acceleration (Kowalski et al. 2013). The idea was to quickly deliver some bounds for actual query results. However, it was hard to configure other tools to work with such a new kind of output syntax. External tools would expect the engine to use its rough query capabilities to generate approximate results in a standard format. Somewhat in parallel, we encouraged the academic community to design summary-based machine learning and knowledge discovery methods (Ganter and Meschke 2011). However, once the original data access was disallowed, summaries stored within our previous framework could not provide the new versions of machine learning algorithms with sufficient information to make their results truly meaningful.
The above lessons – related to the non-standard format of rough query results and the limited usefulness of the aforementioned engine’s summaries – had a significant impact on the design presented in this paper. We decided to continue our work in this area because of the rapidly growing challenges faced by standard database system users, not only with analyzing but also with simply managing their data. We followed the same strategy of grouping rows into packrows as previously. However, to make it really efficient, we needed to solve two problems: 1) how to build compact summaries that contain enough knowledge about the original packrows and 2) how to accurately perform operations on those summaries, thereby removing the need for access to the actual data.
Literature overview
Approximate query processing is already a popular trend in data analytics. Exact results of database computations are not always a must, e.g., for the purposes of reporting, visualization, trend analysis, event detection, or decision making in general. By analogy to modern image processing, approximate outcomes of data operations are acceptable if they enable the users to validly perceive and deploy data-derived knowledge. One may also look at our algorithms from the perspective of information granulation and granular computing, wherein the major rationales are that: 1) crisp, fine-grained information is often not available, 2) precise information is costly, 3) fine-grained information is not necessary and 4) coarse-grained information reduces cost (Zadeh 1997). Furthermore, approximate query processing corresponds to the area of approximate computing, whereby the primary goal is to determine what aspects and degrees of approximation are feasible so that the produced results are acceptable (Agrawal et al. 2016). Actually, quite a few components of the approach considered in this paper, such as reducing the length of the heaviest algorithmic loops, quantizing input information, as well as loosening the assumptions about completeness and consistency of the analyzed data sources, can be compared to some of the most fundamental principles of approximate computing.
There are several ways to develop approximate SQL solutions. In most approaches, the results are estimated by executing queries on collections of intelligently derived data samples (Mozafari and Niu 2015). One of the advantages of such approaches is their ability to adapt the statistical apparatus to deliver confidence intervals for approximate outcomes. For that reason, we also considered introducing sampling into our earlier developments (Kowalski et al. 2013). However, for truly big data sets, good-quality samples need to be quite large too. This surely limits query acceleration possibilities, not to mention other sampling challenges related to multi-table joins, handling important outlying values, etc.
The second category of approximate query approaches is based on summaries (histograms, sketches, etc.) (Cormode et al. 2012). These two types of data synopses – samples and summaries – are to some extent combinable (Ramnarayan et al. 2016). However, the solutions developed so far build summaries for predefined query configurations or, e.g., OLAP-specific scenarios (Cuzzocrea and Saccà 2013). This limits their usefulness for exploratory analytics, where – by default – it is hard to anticipate the queries that will be executed (Nguyen and Nguyen 2005). Because of this, it is important to emphasize that the way we construct and utilize data summaries makes the engine introduced in this paper fully applicable to ad-hoc analytical SQL statements.
Our summaries are expressed by means of histograms, and there is a long tradition of their use within standard relational database optimizers (Jagadish et al. 1998). This is a great source of inspiration when developing new engines such as ours. In the literature, a lot of effort has been spent on addressing the need to update histogram structures while loading new data (Gibbons et al. 2002). However, this is one aspect where our approach is different, because we build separate summaries for each subsequently ingested data chunk. Hence, the newly buffered packrows do not interfere with the previously captured knowledge structures.
There is also a significant amount of research related to multi-dimensional summaries, although their derivation and utilization are far more complex than for single columns. Histograms reflecting intensities of co-occurrences of values of different columns are a good starting point in this field (Bruno et al. 2001). One can also refer to some examples of utilizing more complex multi-dimensional summaries, e.g., tree-based structures that can be applied in a progressive / iterative framework for approximate querying (Lazaridis and Mehrotra 2001). Compared to the aforementioned approaches, we operate with relatively simple multi-column representations that are easy to embed into our knowledge transformation mechanisms, yet contain sufficient information about co-occurrences of values in the original data. Let us re-emphasize that our transformations work in a loop over the captured packrow summaries. This gives us a truly valuable opportunity to model a potential drift of occurrence and co-occurrence characteristics with respect to time and origin of the continually ingested data sources.
It is also important to realize that approximate and traditional styles of querying can be used within the same application frameworks. Let us refer again to the tasks of data exploration, where it is quite common to begin an analytical process with approximations and finish with their thorough validation. In this respect, it is worth referring to methods that support incremental and interactive computations, e.g., paradigms and tools related to the evolution of complex SQL execution processes (Hu et al. 2007) and interactive visualization / representation of analytical query results (Fisher et al. 2012).
In addition to savings in execution time and resource consumption, there are also other motivations for approximate querying. Some of them are related to the fact that in dynamic environments the data may evolve too fast to care about exact answers or – in other cases – some data fragments may be temporarily unavailable (Narayanan et al. 2008). Yet another challenge – analogous to the one encountered by search engines – is that available data items and query conditions often do not perfectly match each other. In such situations, one may wish to retrieve approximately fitting items (Naouali and Missaoui 2005). However, such approaches usually require modifications of query syntax. This can be problematic in practice, as in the case of changing the standard format of query outcomes.
Knowledge capture settings
In this section, we describe the methods designed to fill the engine with meaningful knowledge about the original chunks of data, i.e., packrows, which are collections of rows acquired from the original data tables.
Histograms, special values and domain representations
The knowledge capture layer needs to assess which aspects of the ingested data will be most useful later for approximate query algorithms. This may turn out to be even more important than the ability to store and process the original data, especially given the fact that perfect data access is questionable in many environments (Narayanan et al. 2008). For the knowledge representation purposes, we utilize enhanced histogram structures (Cormode et al. 2012). For each original data pack, its histogram contains information about dynamically derived range-based bars and special values that differ from neighboring values of the given column by means of their frequencies in the given packrow. We also store information about significant gaps, i.e., the areas where no values occur. This is further used as a foundation for multi-column summaries capturing packrow-specific co-occurrences of special values or values belonging to histogram ranges of different data columns. The developed methods decide what is worth storing based on a significance ranking of the detected co-occurrences.
Algorithm 1 outlines how we derive cut-points between consecutive histogram bars. This is a simple combination of two standard domain quantization techniques. However, out of all procedures that we have examined so far, it yields the highest accuracies of approximate queries over data sets with complex column characteristics. First, the domain of a data column within a given packrow is decomposed into a fixed number of intervals of (almost) equal length (lines 5–7). Then, the algorithm allocates an (almost) equal number of the desired histogram bars per interval and – for each interval separately – produces the final bars, attempting to equalize the frequencies of occurrence of rows with values dropping into particular ranges (lines 16–18).
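A minimal sketch of this two-stage quantization is given below; the function and parameter names (`derive_cut_points`, `n_intervals`, `n_bars`) are illustrative placeholders rather than the engine’s actual identifiers, and the exact tie-breaking rules of Algorithm 1 may differ.

```python
def derive_cut_points(values, n_intervals=4, n_bars=16):
    """Two-stage quantization sketch: split the domain into (almost)
    equal-length intervals, then spend an (almost) equal share of the
    bar budget inside each interval on equal-frequency cuts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    cut_points, bars_left = [], n_bars
    for k in range(n_intervals):
        i_lo = lo + k * width
        i_hi = hi if k == n_intervals - 1 else i_lo + width
        inside = sorted(v for v in values
                        if i_lo <= v < i_hi or (k == n_intervals - 1 and v == hi))
        share = round(bars_left / (n_intervals - k))  # (almost) equal bar allocation
        bars_left -= share
        if not inside or share <= 0:
            continue
        step = len(inside) / share
        for b in range(1, share):  # equal-frequency cuts within the interval
            idx = min(len(inside) - 1, max(0, int(b * step) - 1))
            cut_points.append(inside[idx])
    return sorted(set(cut_points))
```

The first stage guards against a few extreme values collapsing all equal-frequency bars into one corner of the domain, which is why it precedes the frequency-based stage.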
Figure 3 shows the representation of a data pack corresponding to the “intersection” of a packrow and a column. Once the cut-points between histogram bars are decided, the knowledge capture algorithms prepare heuristic rankings of candidates for special values and gaps. Only a limited number of the highest-ranked values and gaps per data pack can be stored, to address the need for a reasonable trade-off between summary footprint (translatable to the expected query execution speed, by means of the runtime of algorithmic hot loops that transform data summaries) and completeness of the gathered knowledge (translatable to the expected accuracy of approximate query outputs).
Special values are chosen to minimize the average error related to estimating frequencies of the original values. This resembles classical methods of histogram variance minimization (Jagadish et al. 1998), although some modifications were required for columns with irregular domains. A similar ranking is created for gaps. Any value occurring in a given pack is a potential special value. Any interval between consecutive (along an ordering defined for a given domain) values occurring in a data pack is a candidate for a gap. However, only the most significant gaps and values can be explicitly represented. Further discussion of heuristic ranking functions that are worth considering during the creation of single-dimensional summaries can be found, e.g., in Chądzyńska-Krasowska et al. (2017).
Gaps play an important role in estimating local value distributions within histogram ranges. As will become clear in Section 4, such estimations are crucial for the proper behavior of data summary transformations and final query result generation. For instance, referring to Fig. 3, we know that in a given packrow, over a given data column, there were 3580 rows with values dropping into the range between 0 and 100. Let us assume that the greatest common divisor stored for the corresponding data pack (denoted as the field gcd in Fig. 5 described in Section 3.3) equals 2. Thus, if a given query execution requires the estimation of the number of distinct values occurring in the considered range, then knowledge about a gap between 40 and 60 may improve that estimation by avoiding counting the 9 potential values that did not occur in the original data.^{Footnote 1} Knowledge components related to the occurrence of distinct values will be further referred to as the domain representation.
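Continuing the numbers from this example, the arithmetic can be sketched as follows (`estimate_distinct` is a hypothetical helper, not the engine’s routine): with gcd 2, the range 0–100 admits 51 candidate values, and the gap between 40 and 60 excludes the 9 candidates 42, 44, ..., 58.

```python
def estimate_distinct(lo, hi, gcd, gaps=()):
    """Upper bound on the number of distinct values in [lo, hi]:
    all multiples of gcd starting at lo, minus the candidates that
    fall strictly inside a stored gap.  Assumes, for simplicity,
    that lo and hi themselves are attainable values."""
    candidates = (hi - lo) // gcd + 1  # lo, lo + gcd, ..., hi
    for g_lo, g_hi in gaps:
        candidates -= sum(1 for v in range(lo, hi + 1, gcd) if g_lo < v < g_hi)
    return candidates
```

For the Fig. 3 numbers, `estimate_distinct(0, 100, 2)` yields 51, while adding the gap, `estimate_distinct(0, 100, 2, [(40, 60)])`, tightens the bound to 42.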
Co-occurrences of histogram bars and special values
The remaining task is to describe co-occurrences between (ranges of) values of different columns. This is quite a significant differentiator of our approach when compared to other methodologies of utilizing summaries in approximate query processing. The key objective is to set up a multi-column representation that is simple enough to operate with at the levels of both knowledge capture and query-specific knowledge transformations. This drove our decision to restrict ourselves to modeling co-occurrences only for pairs of columns and to represent such pairwise relationships only in a partial way. Although one might say that such a model cannot be sufficiently informative, let us remember that it is applied locally for each of the separate packrows. This allows our ranking algorithms to focus on different pairs of columns and their corresponding values for different fragments of the ingested data, leading to the ability to express far more complex trends and patterns than one might expect.
To keep a low footprint of summaries, we store co-occurrence-related information only for a limited number of important pairs of bars and special values. For packrow t and columns a and b, let us refer to a’s and b’s histogram bars using iterators i and j, respectively. Let us denote by \(p_{t}\left ({r_{t}^{a}}[i]\right )\), \(p_{t}\left ({r_{t}^{b}}[j]\right )\) and \(p_{t}\left ({r_{t}^{a}}[i], {r_{t}^{b}}[j]\right )\) the data-derived probabilities of occurrence of a’s values within its i-th bar, b’s values within its j-th bar, and pairs of a’s values within its i-th bar and b’s values within its j-th bar, respectively. We use the following ranking function to express the importance of pairs of histogram bars:
Formula (1) evaluates how much accuracy could be lost by a query execution process relying on the product \(p_{t}\left ({r_{t}^{a}}[i]\right )p_{t}\left ({r_{t}^{b}}[j]\right )\) instead of the actual quantity \(p_{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )\). For a given packrow t, we use the function \(bars_{t}\) to jointly evaluate all pairs of bars for all pairs of columns. This results in devoting relatively more footprint to pairs of columns that are more interdependent than others. After choosing a fixed number of the most important pairs of ranges, for given columns a and b, we store two types of information. For the chosen pairs \(\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )\), we store the following co-occurrence ratios:
For the pairs that are not chosen, we need an approximate average ratio for the purposes of, e.g., our WHERE-related version of the belief propagation process (Neapolitan 2003) (see Section 4). For pairs of histogram bars, such a default ratio (denoted by default_bar_bar_tau in Fig. 5) can be defined as follows:
where by writing ‘\(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )\in \tilde {t}\)’ we mean that the ratio \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )\) is chosen to be stored in the summarized representation of packrow t. Let us refer to Fig. 4, which shows how the ratios are calculated.
The ratio (3) minimizes a weighted error comprising quantities of the form \(\left | \tau _{t}\left (a,b\right ) - \frac {p_{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )}{p_{t}\left ({r_{t}^{a}}[i]\right )p_{t}\left ({r_{t}^{b}}[j]\right )} \right |\), for pairs of ranges whose co-occurrence ratios are not stored. The formulation of \(\tau _{t}(a,b)\) enabled us to redesign and adapt some classical probability estimation and revision methods (which would have – in their original versions – quadratic complexity with respect to the number of bars / values), so that they work linearly for the knowledge transformation purposes (see Section 4). In Chądzyńska-Krasowska et al. (2017), we also reported some interesting mathematical properties of \(\tau _{t}(a,b)\) with regard to machine learning methods.
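For illustration, the ratios can be computed directly from data-derived probabilities. The sketch below assumes a natural reading of formulas (2) and (3) – \(\tau\) as the joint-to-product ratio, and the default as a product-weighted average over the unstored pairs, consistent with the minimization property stated above – so it should be checked against the exact formulas rather than treated as authoritative.

```python
def tau(p_joint, p_a, p_b):
    """Co-occurrence ratio of one pair of bars (assumed form of (2)):
    how far the joint probability deviates from independence."""
    return p_joint / (p_a * p_b)

def default_tau(pairs, stored):
    """Default ratio for unstored pairs (assumed form of (3)):
    the sum of unstored joint probabilities divided by the sum of the
    corresponding products of marginals, i.e., a product-weighted
    average of the unstored ratios."""
    num = sum(p_ab for key, (p_ab, p_a, p_b) in pairs.items() if key not in stored)
    den = sum(p_a * p_b for key, (p_ab, p_a, p_b) in pairs.items() if key not in stored)
    return num / den if den else 1.0
```

Here `pairs` maps a bar pair (i, j) to the triple of its joint and marginal probabilities, and `stored` is the set of pairs chosen for explicit storage; both names are illustrative.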
An analogous approach can be introduced for pairs of special values. Let us denote by \({s^{a}_{t}}[k]\) and \({s^{b}_{t}}[l]\) the k-th and l-th special values for columns a and b within packrow t, respectively. Let us denote the data-derived probabilities of their occurrence and co-occurrence as \(p_{t}\left ({s^{a}_{t}}[k]\right )\), \(p_{t}\left ({s^{b}_{t}}[l]\right )\) and \(p_{t}\left ({s^{a}_{t}}[k],{s^{b}_{t}}[l]\right )\). The ranking used in the knowledge capture layer compares co-occurrence ratios of special values to the ratios for their “parents”, i.e., the histogram ranges that \({s^{a}_{t}}[k]\) and \({s^{b}_{t}}[l]\) drop into – denoted by \({s^{a}_{t}}[k]^{\uparrow }\) and \({s^{b}_{t}}[l]^{\uparrow }\) – respectively:
Default co-occurrence ratios analogous to formula (3) are stored for special values as well.^{Footnote 2} They are utilized by the engine during the knowledge transformation processes whenever there is a need to refer to pairs of special values that were not chosen to be stored. For the chosen pairs of special values, we store their corresponding ratios \(\tau _{t}\left ({s^{a}_{t}}[k],{s^{b}_{t}}[l]\right )\) defined similarly to (2).
Analytical testing environment
Our engine stores data summaries in binary files accessible by approximate query execution methods via internal interfaces. From a logical viewpoint, the contents of those files can be represented as a collection of data summary tables, displayed in Fig. 5. For diagnostic purposes, we implemented a converter that transforms the contents of the binary files into an explicit relational form. As a result, one can use standard PostgreSQL to access and analyze the data summary tables that store information about histogram frequencies and ranges, special values, gaps, as well as bar-to-bar and value-to-value co-occurrences, per each database, table, column and packrow, independently from the implemented knowledge transformation mechanisms.
This kind of alternative access to the outcomes of our knowledge capture layer proved helpful when developing our quality testing environment. Moreover, the schema presented in Fig. 5 is a great means of explaining the solution to our customers. With the use of simple SQL, one can check which pairs of columns are most correlated by means of co-occurrences of their histogram ranges and special values (by querying bar_bar and special_special, respectively), what is the degree of repeatability of special values in different data packs of the same data column (by querying special), whether values of particular data columns evolve from packrow to packrow (by querying pack and gap), etc.
This kind of extra knowledge can help the users to better understand the quality and performance characteristics of our query execution processes. Moreover, they may wish to explore data summary tables directly to do basic analytics, e.g., to approximately visualize the demographics of particular columns. We also work with data summaries in their relational form visible in Fig. 5 while prototyping new algorithms. For instance, in Chądzyńska-Krasowska et al. (2017) we reported some examples of feature selection methods performed entirely on our data summary tables. The obtained empirical results show that our one-dimensional and two-dimensional structures described in the previous subsections can provide completely new means to scale machine learning algorithms (Peng et al. 2005).
Knowledge transformation operations
For each SELECT statement, the engine transforms the available data summaries iteratively to build a summary of the query result. Different transformation mechanisms are dedicated to different operations such as filtering, joining, aggregating, etc. Each subsequent operation in a query execution chain takes as its input the relevant summaries produced by previous operations. Once the summaries corresponding to the final stage of query execution are assembled, we use them to produce an outcome interpretable as a standard SQL result. This idea becomes particularly interesting when compared to approximate query techniques based on data sampling, where – after all – the whole computational process remains at the original level of atomic data.
Filteringrelated transformations
Let us consider the operation of filtering, i.e., taking into account SQL clauses such as WHERE (in the case of summaries of the original tables or dynamically derived summaries of nested SELECT results) or HAVING (in the case of dynamically derived summaries of earlier GROUP BY operations). For the engine introduced in this paper, this means revising the frequencies of histogram bars and special values, as well as the co-occurrence ratios, for the columns relevant for further query execution stages, based on the filters specified over other columns.
The filtering process adapts tree-based belief propagation (Neapolitan 2003), which is a popular tool in decision making and modeling (Weiss and Pearl 2010). Over the years, many improvements and extensions of the original belief propagation algorithms have been proposed, including their SQL-based implementations (Gatterbauer et al. 2015). Nevertheless, this paper presents the first production-ready attempt to embed this idea inside an approximate query engine, where – in a loop over packrows – the most interesting co-occurrences of bars and special values of different columns are used as partial knowledge about data-derived conditional probability distributions. For more examples of probabilistic information flow models that operate with partial knowledge about value co-occurrences, let us refer to Pawlak (2005).
Let us consider a simple query of the form SELECT SUM(a) FROM T WHERE b > x, which involves columns a and b in data table T. From the perspective of our knowledge transformation layer, the process of executing this query is split into two separate phases: 1) the filtering transformation: calculating new counts and frequencies of a’s histogram bars (and special values) subject to the condition b > x and 2) the result generation: assembling SUM(a) based on the outcome of filtering, in combination with information about values occurring inside particular ranges of a. Herein, we concentrate on the first of the above phases. The second one will be discussed in Section 5.
Figure 6 illustrates how b > x influences b’s and a’s representations. For a simpler query, i.e., SELECT COUNT(*) FROM T WHERE b > x, the whole procedure would be finished by revising the heights of b’s bars – the contribution of an exemplary packrow t to the estimation of the number of original rows satisfying b > x would then be roughly 43333. However, to estimate SUM(a), we need to propagate the information about b > x onto a. Namely, we need to replace the probabilities \(p_{t}\left ({r_{t}^{a}}[1]\right )\), \(p_{t}\left ({r_{t}^{a}}[2]\right )\) and \(p_{t}\left ({r_{t}^{a}}[3]\right )\) that reflect the input data with the revised probabilities \(p^{\prime }_{t}\left ({r_{t}^{a}}[1]\right )\), \(p^{\prime }_{t}\left ({r_{t}^{a}}[2]\right )\) and \(p^{\prime }_{t}\left ({r_{t}^{a}}[3]\right )\) that reflect the filtered data. The contribution of packrow t to the overall calculation of SUM(a) can then be derived using the estimated number of its rows holding b > x and the revised probabilities of a’s bars.
The equation \(p^{\prime }_{t}\left ({r_{t}^{a}}[1]\right ) = p_{t}\left ({r_{t}^{a}}[1]\mid {r_{t}^{b}}[1]\right ) p^{\prime }_{t}\left ({r_{t}^{b}}[1]\right ) +...+p_{t}\left ({r_{t}^{a}}[1]\mid {r_{t}^{b}}[3]\right ) p^{\prime }_{t}\left ({r_{t}^{b}}[3]\right )\) in Fig. 6 illustrates the general idea of belief propagation. The revised probability \(p^{\prime }_{t}\left ({r_{t}^{a}}[1]\right )\) is calculated using the classical total probability formula that combines the previously revised marginal probabilities over b with fixed conditional probabilities of a subject to b. In our approach, the conditionals \(p_{t}\left ({r_{t}^{a}}[i]\mid {r_{t}^{b}}[j]\right )\) can be retrieved as \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )p_{t}\left ({r_{t}^{a}}[i]\right )\) or, for the co-occurrences that are not stored, approximated by \(\tau _{t}\left (a,b\right )p_{t}\left ({r_{t}^{a}}[i]\right )\). This leads to the following:
The above proportion is the foundation for the WHERE-related knowledge transformations implemented in the engine. In further subsections, we show how to extend it towards multi-column scenarios. For now, let us just note that it can easily be generalized to the case of operating with both histogram bars and special values. Moreover, let us emphasize that the mechanism of revising the whole vector of probabilities over a using (5) can be implemented in such a way that its computational cost remains linear with respect to the explicitly stored co-occurrences. As briefly outlined in Section 3.2, this is a great performance advantage when compared to a non-optimized quadratic style of working with conditional probability distributions.
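A minimal sketch of this revision step is given below, under the assumptions stated earlier; the names are illustrative, with `stored_tau` mapping the stored bar pairs (i, j) to their ratios, in the role of \(\tilde {t}\), and `default_tau` standing for \(\tau _{t}(a,b)\).

```python
def revise_a(p_a, p_b_revised, stored_tau, default_tau):
    """Revise a's bar probabilities after filtering on b, following
    p'(a_i) = sum_j p(a_i | b_j) p'(b_j), with p(a_i | b_j) taken as
    tau(i, j) * p(a_i), or default_tau * p(a_i) for unstored pairs.
    The sum is rewritten as one default term plus corrections for the
    stored pairs, so the loop is linear in the stored co-occurrences."""
    total_b = sum(p_b_revised)  # equals 1.0 when p'(b) is normalized
    factor = [default_tau * total_b] * len(p_a)
    for (i, j), t in stored_tau.items():
        factor[i] += (t - default_tau) * p_b_revised[j]
    revised = [p * f for p, f in zip(p_a, factor)]
    norm = sum(revised)
    return [r / norm for r in revised] if norm else revised
```

Since (5) is stated as a proportion, the sketch normalizes the revised vector at the end; the stored ratios only ever add corrections on top of the default term, which is what keeps the cost independent of the full number of bar pairs.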
Dynamic construction of propagation trees
Let us now explain how the idea of belief propagation can be adapted to serve the WHERE-related knowledge transformations. The most basic propagation algorithm operates on a directed tree spanned over variables with finite domains of values. A directed edge in the tree expresses a causal relation, whereby a parent makes its child independent of its remaining non-descendants. Such a relation is encoded in the form of a conditional probability distribution of the child node subject to its parent. For each packrow t, our task is to construct and use a tree spanned over the columns involved in a given query, with probabilities defined by means of available histograms and co-occurrence ratios. In the general case, distributions are defined over special values and histogram ranges embracing the remaining “non-special” values. For the sake of clarity, we limit ourselves to a simplified scenario where one-dimensional column representations correspond only to histogram ranges.
Let us refer again to the example in Fig. 6. In that case, the corresponding tree can be envisioned as a←b, that is, a is the child of b. Single-variable probability distributions are defined over ranges \({r_{t}^{a}}[1]\), \({r_{t}^{a}}[2]\) and \({r_{t}^{a}}[3]\), as well as \({r_{t}^{b}}[1]\), \({r_{t}^{b}}[2]\) and \({r_{t}^{b}}[3]\). (Actually, this is one of the differences when compared to the classical model of belief propagation, where the single-variable distribution is stored only for the root of a tree.) Furthermore, the conditional distribution \(p_{t}\left ({r_{t}^{a}}[i]\mid {r_{t}^{b}}[j]\right )\) can be retrieved as either \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right )\cdot p_{t}\left ({r_{t}^{a}}[i]\right )\) in the case of \(\tau _{t}\left ({r_{t}^{a}}[i],{r_{t}^{b}}[j]\right ) \in \tilde {t}\), or \(\tau _{t}\left (a,b\right )\cdot p_{t}\left ({r_{t}^{a}}[i]\right )\) otherwise.
As another example, let us consider the statement SELECT b, SUM(a) FROM T WHERE b > x GROUP BY b. In this case, the propagation tree a←b looks like before. However, in addition to single-column distributions, we also need to pass on revised knowledge about co-occurrences involving a’s and b’s histogram ranges. This is because of the specifics of the GROUP BY operation, which – as briefly outlined at the beginning of Section 7 – transforms summaries reflecting the filtered data into summaries representing packrows of tuples indexed by the grouping values and their corresponding aggregation coefficients. Thus, if appropriately recalculated ratios between a and b were not provided, the whole mechanism would produce incorrect final summaries. We return to this topic in Section 4.3, where we discuss how to use propagation trees to revise knowledge about co-occurrence ratios.
As yet another example, let us consider the query in Fig. 7, where data summaries need to be revised with respect to a conjunction of two conditions – b > x and c < y. In this case, in order to follow the idea of tree-based belief propagation, we first need to construct a tree over nodes a, b and c. Initially, we actually derived such trees during the phase of knowledge capture. The tree structures could be quite different for different packrows but, on the other hand, we assumed that – for a given packrow t – it would be reasonable to have a single fixed tree ready for use with arbitrary queries. That approach turned out not to be flexible enough, as different queries – involving potentially different subsets of columns – may require different trees optimized with regard to their needs. To better address this issue, we developed a query-adaptive method of dynamic tree derivation.
Algorithm 2 addresses the task of tree construction in lines 2-3 and 6. The first stage is to connect pairs of columns for which there are corresponding co-occurrence ratios stored in the summary of packrow t. This process is limited to the set of columns B ∪ C, where B gathers columns that will be required at further query stages and C denotes columns with query conditions specified. The idea of not connecting pairs of columns with no co-occurrence representation is based on the assumption that such pairs are relatively weakly correlated with each other. Indeed, according to function \(bars_{t}\) defined by (1), such a complete lack of two-dimensional information means that the given two columns – when treated as quantized random variables with data-derived probabilities \(p_{t}\) – seem to be approximately independent of each other.
If the undirected graph \(G_{t}\) constructed as a result of lines 2-3 has multiple connected components – further denoted by \({G^{X}_{t}}\) for some column subsets X ⊆ B ∪ C – then, as above, we can assume that those subsets are approximately independent of each other. This means that further calculations can be conducted for each such component separately and then merged together. Let us denote by \(p_{t}(Q^{\downarrow X})\) the estimated ratio of rows in packrow t that satisfy the filters specified in query Q on columns in C ∩ X (line 16). Then, the estimated ratio of rows in packrow t that satisfy all conditions of Q – denoted as \(p_{t}(Q)\) – can be calculated as the product of coefficients \(p_{t}(Q^{\downarrow X})\) (line 17). Similarly, we can run belief-propagation-based mechanisms aimed at the revision of histograms and co-occurrence ratios over particular components \({G^{X}_{t}}\) and then readjust them to obtain final summaries representing the filtered data.
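The decomposition described above can be sketched as follows. This is a minimal illustration with names of our own choosing, not the engine's code: connected components of the co-occurrence graph are found by a simple traversal, and per-component selectivities are multiplied as in line 17.

```python
from collections import defaultdict

def components(columns, stored_pairs):
    """Connected components of the undirected graph G_t whose edges are
    pairs of columns with at least one stored co-occurrence ratio."""
    adj = defaultdict(set)
    for a, b in stored_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for col in columns:
        if col in seen:
            continue
        comp, stack = set(), [col]   # depth-first traversal from col
        while stack:
            c = stack.pop()
            if c in comp:
                continue
            comp.add(c)
            stack.extend(adj[c] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def combined_selectivity(per_component):
    """p_t(Q) as the product of p_t(Q^{downward X}) over components X."""
    result = 1.0
    for p in per_component:
        result *= p
    return result
```

Treating components independently is exactly the approximate-independence assumption from the text: no stored ratio between two columns is read as (near) statistical independence.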
In order to perform calculations with respect to a given X ⊆ B ∪ C, we need to construct a tree based on \({G^{X}_{t}}\). This is initiated in line 6 of Algorithm 2. We employ the measure of mutual information (Peng et al. 2005) computed for pairs of columns to span an undirected tree \({S^{X}_{t}}\) over \({G^{X}_{t}}\). Precisely, we use a summary-based approximation of that measure, as investigated in Chądzyńska-Krasowska et al. (2017):
The usage of mutual information is justified by relationships between the data-derived information entropy of a graphical model and the accuracy of the conditional independence assumptions that it represents (Ślęzak 2009). The usage of spanning trees that maximize mutual information is actually related to the foundations of second-order product approximations of joint probability distributions, which were investigated in depth in the literature (Meila 1999).
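A maximum spanning tree under mutual-information weights can be sketched with a Kruskal-style procedure. This is an illustrative implementation under our own naming conventions; the engine would use the summary-based approximation of \(I_t\) as the edge weights, which we simply take as given here.

```python
def max_spanning_tree(nodes, mi):
    """Kruskal-style maximum spanning tree over one graph component.
    `mi` maps frozenset({a, b}) -> (approximated) mutual information weight.
    Returns the chosen edges; a sketch, not the engine's actual code."""
    parent = {n: n for n in nodes}

    def find(n):                      # union-find with path compression
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    # Greedily take the highest-MI edges that do not close a cycle.
    for edge in sorted(mi, key=mi.get, reverse=True):
        a, b = tuple(edge)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append(edge)
    return tree
```

Maximizing total mutual information along the tree is what ties this step to second-order product approximations: the kept edges are the pairwise dependencies the tree model will preserve.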
In Section 4.3, we describe how to calculate the revised probabilities and ratios within each X ⊆ B ∪ C. To do this, we need to transform an undirected tree \({S^{X}_{t}}\) into its directed version \(\overrightarrow {S}^{X}_{t}\). Algorithm 2 addresses this task in lines 7-8. In order to minimize the amount of further computations, we choose as the tree’s root a column that is on average closest to the elements of C ∩ X. Later, during the belief propagation process, it is also worth considering line 18, which checks whether complete calculations are really necessary. Namely, if the first phase of propagation yields \(p_{t}(Q^{\downarrow X}) = 0\) for any of the connected components of \(G_{t}\), then packrow t can be skipped as irrelevant.
Surely, we realize that the whole introduced approach raises a number of questions with regard to the accuracy of estimations represented by transformed data summaries. First, one needs to take into account the quantization of original column domains as discussed in Section 3.1. Second, as elaborated in Section 3.2, we can afford to store only a fraction of the co-occurrence ratios representing relationships between quantized / discretized domains of different columns, which leads, e.g., to potentially inaccurate judgements with respect to data-derived probabilistic independencies. Third, in order to adapt the tree-based belief propagation mechanisms to model the process of filtering, we partially neglect some two-column relationships that are evaluated as weaker than the others. Still, despite all those potential inaccuracy factors, the approximate results of the analytical queries that we tested on large real-world data sets are usually close enough to their exact counterparts. Thanks to the diagnostic analysis performed on data summary tables described in Section 3.3, we can conclude that this is because our histogram generation, as well as the special value / domain gap / co-occurrence ratio ranking functions, are indeed able to store the most meaningful aspects of the original data.
Adaptation of the belief propagation algorithm
In this subsection, we focus on calculations over a single tree \(\overrightarrow {S}^{X}_{t}\), for a column subset X ⊆ B ∪ C corresponding to one of the connected components of graph \(G_{t}\). As before, we refer to Algorithm 2. The discussed procedures are also visualized by Fig. 7, which actually concerns the case of a connected graph, i.e., X = B ∪ C, where B = {a,b,c} and C = {b,c}.
The idea of adapting tree-based belief propagation is motivated by the need for an efficient method of modeling conjunctions of conditions specified over multiple columns. In this paper, we do not consider other logical operations (such as, e.g., disjunctions). We do not consider conditions defined over derived columns either (such as, e.g., arithmetic or CASE WHEN expressions – see Section 7 for further discussion). We expect such enhancements to be quite straightforward to develop in the near future. Moreover, the current engine already contains some significant extensions when compared to the classical characteristics of propagation mechanisms (Weiss and Pearl 2010). In particular, it enables us to work with conjunctions of more complex single-column conditions than the “variable = value” filters that would be supported by the standard model (Neapolitan 2003).
Let us denote by \(c_{Q}\) a two-valued variable characterizing whether a given original row in packrow t satisfies the condition specified by query Q over column c ∈ C ∩ X. Let us represent the values of \(c_{Q}\) as \({r^{c}_{Q}}[1]\) and \({r^{c}_{Q}}[2]\), corresponding to rows satisfying and not satisfying the considered condition, respectively. (Note that we do not index those values with t, as they have the same meaning for all packrows of a given table.) Using this new notation, we can rewrite the probabilities discussed in Section 4.2 as \(p_{t}(Q^{\downarrow X})=p_{t}(\bigwedge _{c\in C\cap X}{r^{c}_{Q}}[1])\) and \(p_{t}(Q)=p_{t}(\bigwedge _{c\in C}{r^{c}_{Q}}[1])\). Moreover, the main task of this subsection can now be formulated as the estimation of new frequencies and co-occurrence ratios for (pairs of) columns in B as conditionals subject to \(\bigwedge _{c\in C}{r^{c}_{Q}}[1]\).
As shown in Fig. 7 (see also line 11 in Algorithm 2), we extend tree \(\overrightarrow {S}_{t}^{X}\) with nodes corresponding to variables \(c_{Q}\) and specify their parents as the corresponding columns c ∈ C ∩ X. Such new causal relations are equipped with conditional probabilities of the following form (line 12):
where the numerator estimates the number of rows in packrow t whose values on c fall into its ith range and at the same time satisfy the considered WHERE clause over c, while the denominator simply denotes the height of the ith bar for c. Let us refer to Fig. 6, where – for \(b_{Q}\) representing filter b > x – we obtain \(p_{t}({r^{b}_{Q}}[1]\mid {r_{t}^{b}}[1])=0\), \(p_{t}({r^{b}_{Q}}[1]\mid {r_{t}^{b}}[2])=1/3\) and \(p_{t}({r^{b}_{Q}}[1]\mid {r_{t}^{b}}[3])=1\). For this particular example, Algorithm 2 would proceed with the tree a←b→\(b_{Q}\). Let us also note that the computation of \(p_{t}({r^{b}_{Q}}[1]\mid {r_{t}^{b}}[2])\) is the only place referring to more detailed knowledge about the domain of b. Namely, the analysis of gaps influences the estimation of the proportion of values observed in \({r_{t}^{b}}[2]\) that are greater than x. At all the remaining stages, Algorithm 2 operates entirely on histograms and co-occurrence ratios.
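The leaf conditional for a range filter can be sketched as follows. We assume, purely for illustration, that values are spread uniformly over the part of a bar not excluded by stored gaps; the engine's actual domain representation is richer, and all names here are hypothetical.

```python
def leaf_condition_prob(bar_lo, bar_hi, x, gaps=()):
    """Estimate p_t(b_Q[1] | bar) for the filter b > x on a bar covering
    (bar_lo, bar_hi].  Uniformity over the gap-free part of the bar is an
    assumption made for this sketch, not the engine's exact rule."""
    def measure(lo, hi):
        covered = max(0.0, hi - lo)
        for g_lo, g_hi in gaps:  # subtract overlap with each stored gap
            covered -= max(0.0, min(hi, g_hi) - max(lo, g_lo))
        return covered

    if x >= bar_hi:
        return 0.0               # the whole bar fails b > x
    if x <= bar_lo:
        return 1.0               # the whole bar satisfies b > x
    total = measure(bar_lo, bar_hi)
    return measure(x, bar_hi) / total if total > 0 else 0.0
```

For a bar covering (0, 30] and x = 20 with no gaps, this yields 1/3, matching the partially overlapping bar in the Fig. 6 example; a known gap over (20, 30] would push the estimate down to 0.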
After the extensions described in lines 11-12, the given tree is ready for belief propagation, where query conditions are modeled by setting appropriate coefficients at the leaves corresponding to variables \(c_{Q}\), c ∈ C ∩ X. For a ∈ X and tree \(\overrightarrow {S}_{t}^{X}\), let us denote by \({D_{t}^{a}}\) and \({N_{t}^{a}}\) the sets of all descendants and non-descendants of a (including a itself), respectively. Let:

\(p_{t}^{\downarrow }\left ({r_{t}^{a}}[i]\right )\) be the probability that conjunction \(\bigwedge _{c\in C\cap {D_{t}^{a}}}{r^{c}_{Q}}[1]\) holds in packrow t subject to observing values within range \({r_{t}^{a}}[i]\) on column a,

\(p_{t}^{\uparrow }\left ({r_{t}^{a}}[i]\right )\) be the probability that values within range \({r_{t}^{a}}[i]\) are observed on column a subject to satisfaction of conjunction \(\bigwedge _{c\in C\cap {N_{t}^{a}}}{r^{c}_{Q}}[1]\).
Given the conditional independence of \({N_{t}^{a}}\) and \({D_{t}^{a}}\) subject to a in the considered tree-based probability model, we shall assume the following proportion:
The left-hand side of (8) represents the revised probability distribution on the quantized domain of a, so it sums to 1 over all considered indexes i. This means that it is sufficient to find any value proportional to the above right-hand side. The major advantage of tree-based belief propagation is its ability to recursively determine parameters \(\lambda _{t}\left ({r_{t}^{a}}[i]\right ) \propto p_{t}^{\downarrow }\left ({r_{t}^{a}}[i]\right )\) and \(\pi _{t}\left ({r_{t}^{a}}[i]\right ) \propto p_{t}^{\uparrow }\left ({r_{t}^{a}}[i]\right )\), so as to replace (8) with the following one:
Let us denote by \(Ch_{t}^{a}\subseteq {D_{t}^{a}}\) the set of all children of column a in \(\overrightarrow {S}_{t}^{X}\). The standard way of calculating parameters \(\lambda _{t}\) and \(\pi _{t}\) is as follows:

for each node c _{ Q }, we put \(\lambda _{t}\left ({r^{c}_{Q}}[1]\right ) = 1\), \(\lambda _{t}\left ({r^{c}_{Q}}[2]\right ) = 0\),

for all other leaves in \(\overrightarrow {S}_{t}^{X}\), we put \(\lambda _{t} \equiv 1\),

for each b ∈ X which is not a leaf in \(\overrightarrow {S}_{t}^{X}\), we put:
$$ \lambda_{t}\left( {r_{t}^{b}}[j]\right) = {\prod}_{a \in Ch_{t}^{b}} {\lambda_{t}^{a}}\left( {r_{t}^{b}}[j]\right) \qquad (10) $$
where coefficients \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )\) should be calculated as \({\sum }_{i} p_{t}\left ({r_{t}^{a}}[i]\mid {r_{t}^{b}}[j]\right ) \lambda _{t}\left ({r_{t}^{a}}[i]\right )\). However, as the engine stores only partial knowledge about data-derived probabilities, the classical computation of \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )\) is replaced by:
where \({\alpha _{t}^{a}} = {\sum }_{i}p_{t}\left ({r_{t}^{a}}[i]\right )\lambda _{t}\left ({r_{t}^{a}}[i]\right )\). Formulas (10)–(11) are the basis for the λ-downward phase of the developed WHERE-related belief propagation. Formulas (12)–(13), considered below, enable us to run the π-upward phase. Let us note that – when compared to the general layout – we do not need to initiate \(\pi _{t}\) for nodes \(c_{Q}\), as those are leaves in the extended version of \(\overrightarrow {S}_{t}^{X}\).
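The λ-message with the default-ratio shortcut can be sketched as below. Classically, \({\lambda_{t}^{a}}(b[j])\) sums \(p_t(a[i]\mid b[j])\,\lambda_t(a[i])\) over all bars i; here the non-stored pairs collapse into a single α term, in the spirit of (11). Names and data layout are our assumptions.

```python
def lambda_message(p_a, lam_a, bars_b, tau_pairs, tau_default):
    """Compute lambda_t^a(b[j]) for every bar j of parent b in linear time.

    p_a:         dict bar i -> p_t(a[i])
    lam_a:       dict bar i -> lambda_t(a[i]), already computed for child a
    bars_b:      iterable of b's bar indexes
    tau_pairs:   dict (i, j) -> stored tau_t(a[i], b[j])
    tau_default: tau_t(a, b) for non-stored pairs
    """
    # alpha_t^a gathers the contribution of all non-stored pairs at once.
    alpha = sum(p_a[i] * lam_a[i] for i in p_a)
    msg = {j: tau_default * alpha for j in bars_b}
    # Only the stored co-occurrences need individual corrections.
    for (i, j), tau in tau_pairs.items():
        msg[j] += (tau - tau_default) * p_a[i] * lam_a[i]
    return msg
```

Mathematically this equals the classical message whenever the conditionals are reconstructed as \(\tau\cdot p_t\), but the loop cost is proportional to the number of stored ratios rather than to the product of the two bar counts.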
First, for the root of \(\overrightarrow {S}_{t}^{X}\), we put \(\pi _{t}\left (r_{t}^{root}[i]\right ) = p_{t}\left (r_{t}^{root}[i]\right )\). Then, for each a ∈ X which is not the root, one would normally specify \(\pi _{t}\left ({r_{t}^{a}}[i]\right )\) as equal to \({\sum }_{j} p_{t}\left ({r_{t}^{a}}[i]\mid r_{t}^{\hat {a}}[j]\right ){\pi _{t}^{a}}\left (r_{t}^{\hat {a}}[j]\right )\), where \(\hat {a}\) denotes the parent of a and:
However, for the same reason as above, we change the classical way of deriving \(\pi _{t}\left ({r_{t}^{a}}[i]\right )\) based on coefficients \({\pi _{t}^{a}}\left (r_{t}^{\hat {a}}[j]\right )\) to the following one:
where \({\beta _{t}^{a}} = {\sum }_{j}{\pi _{t}^{a}}\left (r_{t}^{\hat {a}}[j]\right )\). Due to the specifics of our tree construction, there is a straightforward way to estimate the ratio of rows in packrow t that satisfy the conditions of query Q defined on columns belonging to C ∩ X:
We can retrieve \(p_{t}(Q^{\downarrow X})\) right after the λ-downward phase of belief propagation and, in the case of multiple connected components of graph \(G_{t}\), derive the final \(p_{t}(Q)\) according to line 16 in Algorithm 2. This leads to the already-discussed potential performance benefits. Namely, if \(p_{t}(Q) = 0\), we can skip further calculations for packrow t and jump directly to line 34.
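A sketch of this retrieval, under our reading of the tree model: at the root, λ aggregates the (reconstructed) conditional probabilities of all leaf conditions in the component, so weighting it by the root's marginal yields the component's selectivity. Treat both the formula and the names as assumptions, not the engine's verbatim definition.

```python
def selectivity_after_lambda(p_root, lam_root):
    """Estimate p_t(Q^{downward X}) once the lambda-downward phase has
    reached the root: sum_i p_t(root[i]) * lambda_t(root[i]).
    An illustrative reading of the text, not the engine's exact code."""
    return sum(p_root[i] * lam_root[i] for i in p_root)
```

A zero result here is precisely the early-exit signal the text mentions: the packrow cannot contribute any qualifying rows and can be skipped.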
Another aspect of performance acceleration relates to parameters \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )\) and \(\pi _{t}\left ({r_{t}^{a}}[i]\right )\). Let max_no_of_bars denote the maximum number of histogram bars, as in Algorithm 1. In their original form, formulas \({\lambda _{t}^{a}}\left ({r_{t}^{b}}[j]\right )={\sum }_{i} p_{t}\left ({r_{t}^{a}}[i]\mid {r_{t}^{b}}[j]\right ) \lambda _{t}\left ({r_{t}^{a}}[i]\right )\) and \(\pi _{t}\left ({r_{t}^{a}}[i]\right )={\sum }_{j} p_{t}\left ({r_{t}^{a}}[i]\mid r_{t}^{\hat {a}}[j]\right ){\pi _{t}^{a}}\left (r_{t}^{\hat {a}}[j]\right )\) would lead to the quadratic cost \(\mathcal {O}\left (|X|\cdot max\_no\_of\_bars^{2}\right )\). In contrast, (11) and (13) provide us with computational complexity at the level of \(\mathcal {O}\left (\max \left (|X|\cdot max\_no\_of\_bars, |\tilde {t}|\right )\right )\), where \(|\tilde {t}|\) denotes the number of co-occurrence ratios stored by the considered engine for packrow t.
The last aspect of Algorithm 2 is the use of λ/π coefficients to deliver revised single-column distributions and co-occurrence ratios (denoted in general as \(\tilde {t}^{\prime }\)) as an input to subsequent query execution steps or the final phase of query result generation (discussed in Section 5). With regard to histograms, one can refer to proportion (9). As for co-occurrence ratios involving connected columns, it is enough to note that – for a given a and its parent \(\hat {a}\) – the revised probability \(p_{t}^{\prime }\left ({r_{t}^{a}}[i]\mid r_{t}^{\hat {a}}[j]\right )\) could be resolved by the belief propagation process performed with Q’s conditions extended by \(\lambda _{t}(r_{t}^{\hat {a}}[j])\leftarrow 1\) (and 0 otherwise). Simple recalculations then lead to the following formula for the revised co-occurrence ratio, derived as \(p_{t}^{\prime }\left ({r_{t}^{a}}[i]\mid r_{t}^{\hat {a}}[j]\right )\) divided by \(p_{t}^{\prime }\left ({r_{t}^{a}}[i]\right )\):
The above equation can be used to recalculate all co-occurrence ratios stored in \(\tilde {t}\). For pairs of columns that are not connected, we can use the same formula for a slightly modified tree, as described in lines 23-29 of Algorithm 2. This part of the process is also highlighted as the last step in Fig. 7. The idea is to temporarily replace one of the existing edges with an edge directly connecting the considered columns. The whole operation is performed in such a way that the loss of summary-based approximation of mutual information \(I_{t}\) – and therefore also the loss of the tree entropy – is minimized. The revised default ratios \(\tau _{t}^{\prime }(a, b)\) are then obtained by using a formula analogous to (3).
Generating final query results
Once the summary of a query output is calculated, the engine translates it into the standard SQL result format. Prior to this stage, as shown in Fig. 8, the knowledge transformed throughout subsequent query execution stages is highly condensed, and therefore requires only a fraction of the resources of a traditional database engine to produce the results. However, at the end, we need to deliver a result in an externally interpretable format. (Unless this is the case of a nested query, where the result of a subquery can remain in a summarized form, like we did with rough queries in Kowalski et al. 2013.) This phase could be referred to as materialization, though it should not be confused with the standard meaning of materialization in columnar databases (Abadi et al. 2013).
Alternatively, if the knowledge capture layer is regarded as responsible for the aforementioned information granulation (Zadeh 1997), then the translation of query result summaries into final approximate results can be treated as information degranulation. This way, the overall design of our engine fits the idea of calculations on information granules, with a special emphasis on their transformation and reorganization. More inspirations – related in particular to switching between granular and detailed data representations – can be found in Liu et al. (2013).
Query result generation can also be understood as a transition from a result summary to a result comprising crisply-valued tuples. For simple queries, such as aggregations optionally grouped with respect to low-cardinality data columns, this stage is quite straightforward. As an example, let us go back to the SELECT statement considered in Section 4.1. Its execution consists of transforming information about a subject to the clause b > x and then producing the final outcome SUM(a). In this case, the result can be computed as the total sum of outputs produced by each of the packrows. For a given packrow, we take into account the transformed frequencies of a’s special values and add value ⋅ frequency scores to the constructed sum. The remaining part is to calculate the contributions of histogram bars after subtracting the frequencies of their special values. For this purpose, we construct an estimate of an average “non-special” value within each bar. Such an estimate can be obtained based on the domain representation discussed at the end of Section 3.1.
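A single packrow's contribution to SUM(a) can be sketched as follows. The bookkeeping (how revised frequencies and bar averages are packaged) is our assumption for illustration; the text only fixes the two ingredients: value ⋅ frequency for special values, plus the remaining bar mass times an estimated average non-special value.

```python
def packrow_sum_contribution(row_count, special, bars):
    """Contribution of one packrow to SUM(a) after filtering (illustrative).

    row_count: estimated number of the packrow's rows satisfying WHERE
    special:   list of (value, revised_probability) for a's special values
    bars:      list of (revised_bar_probability, special_prob_within_bar,
                        estimated_avg_nonspecial_value) per histogram bar
    """
    total = 0.0
    for value, prob in special:
        total += value * prob * row_count          # value * frequency
    for bar_prob, special_prob, avg_nonspecial in bars:
        # Remaining "non-special" mass of the bar times its estimated average.
        total += (bar_prob - special_prob) * row_count * avg_nonspecial
    return total
```

The overall SUM(a) is then the sum of such contributions over all non-skipped packrows, mirroring the per-packrow structure of the engine's summaries.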
The situation changes if high-cardinality columns are involved. Then, it is especially beneficial to model all data operations at the level of summaries and switch to the detailed values just prior to shipping the query result outside. Let us consider the query SELECT a, b, c FROM T WHERE b > x AND c < y highlighted in Fig. 7. In this case, after belief propagation, we need to generate collections of the resulting abc tuples corresponding to each of the (non-zeroed) packrow summaries. First, we use those summaries (both single- and multi-dimensional ones) to create collections of tuples labeled with codes of special values and histogram ranges (symbolizing occurrences of “non-special” values). At the end, we replace codes with final values generated using – again – the domain information available for the involved columns.
In conclusion, query result generation relies on both the operations presented in previous subsections and the domain representation used at the last stage. Let us recall that, in addition to knowledge summarized in terms of (co-)occurrences of special values and histogram bars, we store the most significant gaps and the greatest common divisors of the values observed in the original data packs. Referring to the theory of rough sets (Pawlak and Skowron 2007), we can say that special values whose frequencies were not pushed down to zero during query execution constitute a kind of domain’s positive region, i.e., these are values that should contribute to the query result. On the other hand, gaps, greatest common divisors, dictionaries (if available) and zeroed frequencies can be used together to define the domain’s negative region, i.e., values that should not contribute to the result. Utilization of such regions during result generation is one of the areas of our ongoing improvement, as it strongly relates to the elimination of non-existing values in the approximate query outcomes (Chądzyńska-Krasowska et al. 2017).
Empirical evaluation
Let us now discuss how we approach the topic of engine evaluation. From the user perspective, the most important aspect is the speed versus accuracy of deriving approximate results, in comparison to both exact and sample-based styles of calculations. From an architectural roadmap standpoint, we should also consider the ability of our approach to scale with respect to available computational resources and to perform sufficiently well for increasingly complex operations. This is in line with Fig. 9, which shows our ultimate goal of combining summary-based computations with other paradigms.
Figure 10 illustrates a typical scenario of comparing our solution with the others. Surely, one can expect an increase in performance at the cost of losing accuracy. The point is to find the right balance between these two aspects for particular case studies involving expected queries, data characteristics and data owner requirements (Glick 2015). The initial task is to assist the users in evaluating the accuracy. As mentioned in Section 2.2, several approximate query engines enrich results with confidence intervals that reflect the output’s credibility when compared to a hypothetical actual result (Mozafari and Niu 2015). However, first we need to establish the means for expressing similarities between approximate and explicitly computed standard results, so as to confirm user expectations with respect to the speed-versus-accuracy tradeoffs.
A reasonable measure of similarity between exact and approximate outcomes of queries should certainly satisfy some basic mathematical properties. Moreover, it should be coherent with human perception of the meaning of both results in practice. In our framework, if an outcome of query Q takes the form of a single numeric value, then – denoting by \(out_{Q}\) / \(\widetilde {out}_{Q}\) the exact / approximate results, respectively – we use the following:
Queries with GROUP BY deliver tuples labeled by grouping columns and aggregate scores. Formula (16) is then averaged over all groups that occur in either the exact or the approximate outcome. In the case of the query in Fig. 10, the groups are induced by column tm_day. Fortunately, the same tm_day values occur for both exact and approximate results. Moreover, for each tm_day, its corresponding exact and approximate aggregates are very close to each other. This is why the overall similarity is close to the maximum. However, if an approximate result included a large number of grouping values that should actually not occur at all, then its overall similarity to the exact result would be significantly lower. This is one more reason to pay special attention to final query result generation, as discussed in Section 5.
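The group-level averaging can be sketched as below. We deliberately leave the single-value measure (16) as a pluggable function rather than reproduce it, and we adopt the convention that a group occurring in only one of the two results scores 0 – this convention is our assumption, though it is consistent with the remark that spurious groups lower the overall similarity.

```python
def grouped_similarity(exact, approx, sim):
    """Average a single-value similarity over GROUP BY results.

    exact, approx: dicts mapping group labels -> aggregate values
    sim:           similarity on value pairs (a stand-in for formula (16))
    Groups present in only one result contribute 0 to the average.
    """
    groups = set(exact) | set(approx)
    if not groups:
        return 1.0
    score = sum(sim(exact[g], approx[g]) if g in exact and g in approx else 0.0
                for g in groups)
    return score / len(groups)
```

For instance, with a min/max ratio on positive aggregates as `sim` (our choice, not necessarily (16)), near-identical per-group aggregates yield a score near 1, while each spurious group drags the average toward 0.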
Referring to Fig. 10 again, one might claim that comparing ourselves to Hive / Spark SQL is not fair, because these are products with quite different aims. Indeed, such solutions are expected to cooperate rather than compete with each other, at the level of both external characteristics and the underlying computational paradigms. Actually, we have already conducted experiments related to the reimplementation of our knowledge transformation operations in the native Spark environment (Zaharia et al. 2016). Preliminary results are highly encouraging, as such an implementation allows us to take advantage of query acceleration factors corresponding to both summary-based computations and the resource-related scalability used, e.g., by Spark SQL (Armbrust et al. 2015). On the other hand, it is certainly important to empirically compare our approach with other approximate query processing solutions. However, as mentioned in Section 2.2, this is a difficult challenge because those solutions still require far more development work to scale and provide sufficient accuracy for complex queries.
The engine should also be evaluated from the viewpoint of complex analytical pipelines. Figure 11 illustrates the idea of using query results within machine learning algorithms. Indeed, there are a number of scalable implementations of decision tree induction (Nguyen and Nguyen 2005) or data clustering (Ordonez and Cereghini 2000) that are based on iterative execution of ad-hoc SQL statements. This gives us an opportunity to evaluate approximate results not only by means of similarity measures but also by comparing the final outcomes of SQL-based machine learning algorithms fed with approximate versus exact calculations. This refers to the aforementioned analogy between approximate querying and perceptual image processing methods, whereby the main point is to focus on functionally important aspects of the managed information.
As shown in Fig. 11, one can also run some machine learning methods directly against data summaries, instead of using the classical SQL-based interface. Any steps in this direction would fit the harnessing knowledge dimension illustrated in Fig. 9. (We will discuss it further in Section 7.) Moreover, from the perspective of evaluating the engine components, it would let us evaluate the knowledge capture layer decoupled from the SQL-specific knowledge transformation mechanisms. That was actually our motivation to develop the already-mentioned summary-based feature selection approach reported in Chądzyńska-Krasowska et al. (2017). Namely, we considered one of the standard minimum redundancy maximum relevance (mRMR) feature selection techniques (Peng et al. 2005) and executed it for some real-world data sets using: 1) mutual information scores computed over the original data, 2) mutual information scores computed over 15% samples and 3) approximations of mutual information scores derived directly from the stored data summaries. The outputs – interpreted as rankings of columns produced by each of the three above runs of the mRMR algorithm – turned out to be encouragingly similar to each other, while summary-based approximate calculations were incomparably faster than in both other cases.
Further discussion and development directions
There are a number of important topics that were not addressed in this paper in a detailed way. We summarize them briefly below.
Coverage of SQL operations
Besides data filtering, the engine also supports other operations, such as JOIN, GROUP BY, etc. While joining two tables, we need to create summaries of the collections of tuples that would belong to a standard output of the JOIN operation. This is done by amalgamating pairs of summarized packrows of the input tables. Such amalgamation is analogous to the filtering operations, though it can be preceded by a single-table process of merging similar packrow summaries (the outcomes of Algorithm 2, if any filters were applied prior to joining). As for GROUP BY, its standard output would consist of vectors labeled by aggregate results and the corresponding values of the columns used for grouping. Herein, the goal is to produce summaries of such vectors without generating them explicitly. More examples of our knowledge transformation operations will be described in the near future.
Data types and derived columns
In general, some components of the above-mentioned operations are agnostic to column semantics, while others depend on the semantics of particular columns. For example, in Algorithm 2, almost all calculations are conducted on probabilities of abstracted partition blocks, except line 12. This is worth remembering while developing new knowledge capture techniques. For instance, for some alphanumeric columns one could consider histogram bars labeled by prefixes instead of ranges. For such new types of histograms, some fundamental knowledge transformation functions – such as (7) – would require reimplementation. There is also a requirement for dynamic creation of both one-dimensional and two-dimensional summaries reflecting new columns derived as complex expressions. Certainly, computations at this level need to rely on the types of the original columns as well.
Disk and memory management
Many database mechanisms can be adapted to work with granulated summaries. Their contents can be clustered into even bigger chunks – “megapacks” – and labeled with higher-level descriptions. For the considered engine, this would be a step toward the paradigms of multi-level granular data analytics (Yao 2016). Yet another aspect refers to vertical data organization (Abadi et al. 2013), which – in our case – means independent access to collections of histograms, special values, gaps, co-occurrence ratios, etc. This way, for every query, we can grasp those components of stored data summaries that are required to execute particular operations. Such components can be cached in memory for the purposes of future queries. Moreover, summaries representing intermediate outputs of query execution stages can be managed in memory in a pipeline style, as in the case of standard database engines.
Knowledge capture improvements
It is still possible to enhance our criteria for choosing ranges, special values, etc. Improvements can also be achieved by putting more intelligence into the process of assigning the original data rows to buffered packrows, based on our earlier experiences (Ślęzak et al. 2013). Moreover, we have started integrating our knowledge capture layer with some components of the Apache ecosystem (https://kafka.apache.org/). We also work on the synchronization of distributed knowledge capture processes with respect to the maintenance of global structures, such as dictionaries for low-cardinality columns (see table dict in Fig. 5). Our goal is to build a modular architecture (see Fig. 12), where a layer of reconfigurable knowledge capture agents serves as the means of connecting to different types of data locations and preparing summaries available within different analytical platforms (Zaharia et al. 2016).
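One possible way to put "more intelligence" into row assignment is sketched below; this is a toy illustration under our own assumptions (a single numeric column and a running-mean criterion), not the method of Ślęzak et al. (2013):

```python
# Hypothetical sketch: routing incoming rows to one of several buffered
# packrows so that each buffer stays internally homogeneous, which tends
# to yield tighter histograms. Here a buffer is summarized only by the
# running mean of one numeric column; a real capture agent would use
# richer multi-column summaries as the routing criterion.

class CaptureBuffers:
    def __init__(self, seeds):
        self.buffers = [[s] for s in seeds]

    def route(self, value):
        # Pick the buffer whose running mean is closest to the new value.
        means = [sum(b) / len(b) for b in self.buffers]
        best = min(range(len(means)), key=lambda i: abs(means[i] - value))
        self.buffers[best].append(value)
        return best

cap = CaptureBuffers(seeds=[10.0, 1000.0])
assignments = [cap.route(v) for v in [12.0, 990.0, 8.0, 1005.0]]
# assignments == [0, 1, 0, 1]: small values cluster apart from large ones.
```

Compared with assigning rows to packrows in arrival order, such routing trades a little buffering logic for summaries with fewer gaps and sharper bars.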
Accuracy trade-offs and hybrid scenarios
In Section 3, we pointed out that our engine should store only a limited fraction of knowledge, specified by means of maximum amounts of histogram bars, special values, gaps and co-occurrence ratios. Although all those amounts are constrained by some defaults, it is possible to vary them with respect to particular columns or whole data tables. For instance, it may be desirable to optionally access some tables in a classical form and join them with summarized contents of other tables within hybrid query execution schemes. Such flexibility fits real-world scenarios of incremental data analytics (Fisher et al. 2012), but it requires careful maintenance of information that allows the algorithms to extract the original data from remote sources. From the perspective of Fig. 12, this kind of information needs to be managed by knowledge capture agents. Parameters enabling the engine to trace the original data should constitute a kind of analytical knowledge grid, by analogy to the concept of the knowledge grid considered in the semantic web (Cannataro and Talia 2003).
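The per-column configurability of summary budgets could take a shape like the following; the field names and default values are our own illustrative assumptions:

```python
# Hypothetical sketch of per-table/per-column summary budgets: engine-wide
# defaults for the maximum numbers of histogram bars, special values, gaps
# and co-occurrence ratios, with optional overrides for selected columns.
from dataclasses import dataclass, field

@dataclass
class SummaryBudget:
    max_bars: int = 64
    max_special_values: int = 32
    max_gaps: int = 16
    max_ratios: int = 128

@dataclass
class EngineConfig:
    default: SummaryBudget = field(default_factory=SummaryBudget)
    overrides: dict = field(default_factory=dict)  # "table.column" -> budget

    def budget_for(self, table, column):
        return self.overrides.get(f"{table}.{column}", self.default)

cfg = EngineConfig()
cfg.overrides["sales.amount"] = SummaryBudget(max_bars=256)
cfg.budget_for("sales", "amount").max_bars   # 256 (override applies)
cfg.budget_for("sales", "region").max_bars   # 64 (engine-wide default)
```

Raising a budget for an analytically important column buys accuracy at the cost of footprint, which is exactly the trade-off this subsection is about.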
Accuracy measures and confidence intervals
In Section 6, we referred to accuracy as a measure of similarity between exact and approximate query results, as well as similarity between the outcomes of machine learning algorithms working with detailed and summarized data sets. One can conduct a deeper discussion on the need to design quality measures for data summaries and to investigate relationships between the accuracies of input summaries and the expected accuracies of query results. Finally, special attention should be devoted to the development of confidence interval calculations, which would assist the users in their everyday work, as is the case for some sampling-based approximate query engines. To accomplish this, our plans are twofold. First, we intend to enhance our framework with a kind of query explain functionality enabling the users to trace potential inaccuracies emerging at particular stages of query execution. Second, we intend to create a tool that maintains a sample of original packrows and uses it – in combination with the core engine – to produce confidence intervals that characterize similarities between exact and approximate outcomes of SQL statements.
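As a rough illustration of how a retained sample could yield such intervals, the sketch below applies a plain percentile bootstrap to the mean of a sample; this is a textbook technique shown under our own assumptions, not the planned tool itself:

```python
# Hypothetical sketch: attaching a confidence interval to an aggregate by
# bootstrapping over a retained sample of original rows. A production tool
# would combine such intervals with the summary-based estimate itself.
import random

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=7):
    """Percentile bootstrap interval for the mean of a retained sample."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(sample, k=len(sample))) / len(sample)
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

data_rng = random.Random(0)
sample = [data_rng.gauss(100, 15) for _ in range(200)]
lo, hi = bootstrap_mean_ci(sample)
# (lo, hi) brackets the sample mean with roughly 95% coverage.
```

The harder, open part is propagating such uncertainty through multi-stage query plans, which is what the envisioned query explain functionality would expose.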
Ability to expand beyond SQL
Figure 12 illustrates our plans with respect to extending the engine with library-style support for machine learning and knowledge discovery (KDD) processes. As discussed in Section 6, it is indeed possible to use our data summaries to conduct processes like mRMR feature selection (Peng et al. 2005). Some other feature selection approaches seem straightforward to consider as well (Ślęzak 2009). The next steps can include implementations of some elements of deep learning (Hinton 2007) and the redesign of some SQL-based decision tree induction algorithms (Nguyen and Nguyen 2005). Let us note that most such algorithms rely on iterative data and model transformations. This constitutes a kind of conceptual equivalence between the work that we have already done within our framework with respect to approximate SQL and the work that is still ahead in the field of intelligent data processing. By analogy to Section 4, we need to develop efficient and accurate methods of expressing atomic machine learning operations by means of transforming summaries of their inputs into summaries of their outputs. Abstracting such operations can lead towards a powerful environment for interactive analytics, wherein the users could work with summaries of intermediate inputs/outputs via their favorite interfaces.
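The basic quantity behind mRMR – mutual information between two columns – can indeed be computed directly from a two-column co-occurrence summary, as the following sketch shows (the joint-probability representation is a simplifying assumption; the engine stores only a fraction of such ratios):

```python
# Hypothetical sketch: mutual information computed from a two-column
# co-occurrence summary (joint probabilities of histogram bars) rather
# than from raw rows - the core ingredient of mRMR-style feature
# selection run on summaries.
import math

def mutual_information(joint):
    """joint: dict mapping (bar_x, bar_y) -> co-occurrence probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0
    )

independent = {("a", "c"): 0.25, ("a", "d"): 0.25,
               ("b", "c"): 0.25, ("b", "d"): 0.25}
dependent = {("a", "c"): 0.5, ("b", "d"): 0.5}
mutual_information(independent)   # 0.0: bars carry no shared information
mutual_information(dependent)     # log(2): fully dependent columns
```

Iterating such summary-level computations over candidate features mirrors the input-summary-to-output-summary pattern used for SQL operations.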
Integrating with other computational paradigms
Figure 12 also summarizes opportunities related to putting our architectural strengths together with other trends in the areas of scalable and distributed data processing. SQL Layer denotes the currently existing layer of relational-query-focused transformations that – as outlined in Section 6 – can be successfully embedded into the Spark environment (Zaharia et al. 2016) (see also the harnessing resources dimension in Fig. 9). ML Layer represents the methods discussed above (see also Chądzyńska-Krasowska et al. 2017). Capture Agent refers to the distributed knowledge capture layer that can save constructed histograms, etc., in a format uniformly accessible by SQL Layer, ML Layer and Spark’s generic tools enabling the users to work directly with data summaries. This last aspect is especially interesting when combined with the previously mentioned idea of interactive analytics, which – in addition to intermediate inputs/outputs of machine learning algorithms – can embrace access to summaries of simply filtered, joined and grouped data sets.
Conclusions
We presented a novel approximate query processing engine, which works by means of SQL-related transformations of granulated data summaries. The engine does not assume access to the original data. Instead, it processes quantized histograms and a low percentage of co-occurrence ratios reflecting multi-column interdependencies detected in the buffered chunks of ingested data tables. Switching from the level of atomic data rows to low-footprint descriptions of their bigger clusters enables us to significantly decrease the computational cost of operations corresponding to filtering, joining, aggregating, etc. On the other hand, by dealing with summaries at the level of data chunks – unlike other database solutions that do it for larger partition blocks or whole tables – we are able to better control their quality and address complex trends occurring in the original data.
The discussed engine is available in its first production version and is being heavily tested by the first customers. The empirical observations confirm that it can be suitable for data owners and data-based service providers who cannot currently cope with exploring permanently growing data, or who simply want to lower the cost of resources required for data maintenance and analytics. Some of the already validated use cases include network traffic analytics and intrusion detection, digital advertising, as well as monitoring of industrial processes. More generally, the users of similar engines can come from the fields of online applications, the internet of things, sensor-based risk management systems and other tools related to machine-generated data.
We envision a number of interesting directions for further research and development. First of all, we see a huge opportunity in strengthening the connections of our overall approach with the theories of rough sets and granular computing, e.g., with respect to designing multi-level models of approximate summary-based computations. Moreover, although the existing approximate querying approaches based on data sampling do not seem to scale sufficiently in real-world scenarios, they provide the users with very helpful functionalities such as confidence interval derivations. Thus, one of our goals is to introduce such functionalities into our own framework, potentially by enriching our granulated data summaries with small subsets of original rows. However – as in the case of any other extension – this will need to be done very carefully, so as to maintain the current performance characteristics, especially from the viewpoint of working with distributed and remote data sets.
Notes
 1.
Gaps are interpreted as open intervals.
 2.
In Fig. 5, one can find two types of default ratios – default_special_special_tau and default_not_covered_special_special_tau – corresponding to the cases of “parents” of special values whose ratios are stored and are not stored, respectively.
 3.
Let us recall that we deal with a dynamically derived approximate probability model, with all its consequences discussed at the end of Section 4.2. The investigation of how accurately such models reflect the data is one of our main analytical testing scopes.
References
Abadi, D., Boncz, P., Harizopoulos, S., Idreos, S., & Madden, S. (2013). The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 197–280.
Agrawal, A., Choi, J., Gopalakrishnan, K., Gupta, S., Nair, R., Oh, J., Prener, D.A., Shukla, S., Srinivasan, V., & Sura, Z. (2016). Approximate computing: challenges and opportunities, Proceedings of ICRC (pp. 1–8).
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., & Zaharia, M. (2015). Spark SQL: relational data processing in spark, Proceedings of SIGMOD (pp. 1383–1394).
Bruno, N., Chaudhuri, S., & Gravano, L. (2001). STHoles: a multidimensional workload-aware histogram, Proceedings of SIGMOD (pp. 211–222).
Cannataro, M., & Talia, D. (2003). The knowledge grid. Communications of the ACM, 46(1), 89–93.
Chądzyńska-Krasowska, A., Betliński, P., & Ślęzak, D. (2017). Scalable machine learning with granulated data summaries: a case of feature selection, Proceedings of ISMIS (pp. 519–529).
Chądzyńska-Krasowska, A., Stawicki, S., & Ślęzak, D. (2017). A metadata diagnostic framework for a new approximate query engine working with granulated data summaries, Proceedings of IJCRS (1) (pp. 623–643).
Cormode, G., Garofalakis, M.N., Haas, P.J., & Jermaine, C. (2012). Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1–3), 1–294.
Cuzzocrea, A., & Saccà, D. (2013). Exploiting compression and approximation paradigms for effective and efficient online analytical processing over sensor network readings in data grid environments. Concurrency and Computation: Practice and Experience, 25(14), 2016–2035.
Fisher, D., Popov, I.O., Drucker, S.M., & Schraefel, M.C. (2012). Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster, Proceedings of CHI (pp. 1673–1682).
Ganter, B., & Meschke, C. (2011). A formal concept analysis approach to rough data tables. In Transactions on Rough Sets XIV (pp. 37–61). Springer.
Gatterbauer, W., Günnemann, S., Koutra, D., & Faloutsos, C. (2015). Linearized and single-pass belief propagation. Proceedings of the VLDB Endowment, 8(5), 581–592.
Gibbons, P.B., Matias, Y., & Poosala, V. (2002). Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 27(3), 261–298.
Glick, R. (2015). Current trends in analytic databases – Keynote Talk at FedCSIS. https://fedcsis.org/2015/keynotes/rick_glick.
Hinton, G.E. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11, 428–434.
Hu, Y., Sundara, S., & Srinivasan, J. (2007). Supporting timeconstrained SQL queries in oracle, Proceedings of VLDB (pp. 1207–1218).
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., & Suel, T. (1998). Optimal histograms with quality guarantees, Proceedings of VLDB (pp. 275–286).
Kowalski, M., Ślęzak, D., & Synak, P. (2013). Enhanced rough SQL for correlated subqueries, Proceedings of FedCSIS (pp. 1455–1462).
Lazaridis, I., & Mehrotra, S. (2001). Progressive approximate aggregate queries with a multiresolution tree structure, Proceedings of SIGMOD (pp. 401–412).
Liu, Y., Li, D., He, W., & Wang, G. (2013). Granular computing based on Gaussian cloud transformation. Fundamenta Informaticae, 127(1–4), 385–398.
Meila, M. (1999). An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data, Proceedings of ICML (pp. 249–257).
Mozafari, B., & Niu, N. (2015). A handbook for building an approximate query engine. IEEE Data Engineering Bulletin, 38(3), 3–29.
Naouali, S., & Missaoui, R. (2005). Flexible query answering in data cubes, Proceedings of DaWaK (pp. 221–232).
Narayanan, D., Donnelly, A., Mortier, R., & Rowstron, A. (2008). Delay aware querying with seaweed. The VLDB Journal, 17(2), 315–331.
Neapolitan, R.E. (2003). Learning Bayesian networks. Prentice Hall.
Nguyen, H.S., & Nguyen, S.H. (2005). Fast split selection method and its application in decision tree construction from large databases. International Journal of Hybrid Intelligent Systems, 2(2), 149–160.
Ordonez, C., & Cereghini, P. (2000). SQLEM: fast clustering in SQL using the EM algorithm, Proceedings of SIGMOD (pp. 559–570).
Pawlak, Z. (2005). Flow graphs and intelligent data analysis. Fundamenta Informaticae, 64(1–4), 369–377.
Pawlak, Z., & Skowron, A. (2007). Rudiments of rough sets. Information Sciences, 177(1), 3–27.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Ramnarayan, J., Mozafari, B., Wale, S., Menon, S., Kumar, N., Bhanawat, H., Chakraborty, S., Mahajan, Y., Mishra, R., & Bachhav, K. (2016). SnappyData: a hybrid transactional analytical store built on spark, Proceedings of SIGMOD (pp. 2153–2156).
Ślęzak, D. (2009). Degrees of conditional (in)dependence: a framework for approximate Bayesian networks and examples related to the rough set-based feature selection. Information Sciences, 179(3), 197–209.
Ślęzak, D., Synak, P., Wojna, A., & Wróblewski, J. (2013). Two database related interpretations of rough approximations: data organization and query execution. Fundamenta Informaticae, 127(1–4), 445–459.
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., & Murthy, R. (2009). Hive – a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2), 1626–1629.
Weiss, Y., & Pearl, J. (2010). Belief propagation: technical perspective. Communications of the ACM, 53(10), 94.
Yao, Y. (2016). A triarchic theory of granular computing. Granular Computing, 1(2), 145–157.
Zadeh, L.A. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2), 111–127.
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache spark: a unified engine for big data processing. Communications of the ACM, 59(11), 56–65.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Ślęzak, D., Glick, R., Betliński, P. et al. A new approximate query engine based on intelligent capture and fast transformations of granulated data summaries. J Intell Inf Syst 50, 385–414 (2018). https://doi.org/10.1007/s10844-017-0471-6
Keywords
 Big data processing
 Approximate query processing
 Data summarization
 Granular computing
 Machine intelligence