Approximate Calculation of Window Aggregate Functions via Global Random Sample

Window functions have been a part of the SQL standard since 2003 and have been studied extensively during the past decade. They are widely used in data analysis; almost all the current mainstream commercial databases support window functions. However, in recent years the size of datasets is growing steeply; the existing window function implementations are not efficient enough. Recently, some sampling-based algorithms (e.g., online aggregation) are proposed to deal with large and complex data in relational databases, which offer us a flexible trade-off between accuracy and efficiency. However, few sampling techniques has been considered for window functions in databases. In this paper, we extend our previous work (Song et al. in Asia-Pacific web and web-age information management joint conference on web and big data, Springer, pp 229–244, 2017) and proposed two new algorithms: range-based global sampling algorithm and row-labeled sampling algorithm. The proposed algorithms use global sampling rather than local sampling and are more efficient than other existing algorithms. And we find our proposed algorithms out performed the baseline method over the TPC-H benchmark dataset.


Introduction
Window functions, as a feature of relational database, were introduced in SQL:1999 as extended documents and formally specified in SQL:2003. Window functions allow aggregation over a set of tuples within a particular range instead of a single tuple. Unlike aggregation with the group-by clause which only outputs one tuple for each group, the OVER() clause associates each tuple in a window with the aggregated value over the window. As a result, window functions greatly facilitate the formulation of business intelligence queries by using the aggregation functions such as ranking, percentiles, moving averages and cumulative totals. Besides, comparing with grouped queries, correlated subqueries and self-joins [2,3], window functions also improve the readability of the scripts. The syntax of a window function is so user-friendly that database users without professional programming skills can write business intelligence queries effortlessly. Due to these desirable features, window functions are widely used by current mainstream commercial databases in recent years, such as Oracle, DB2, SQL Server, Teradata, (Pivotal's) Greenplum [4][5][6].
However, the existing implementation strategies of window functions are too expensive to process big data. While some pioneering works are focusing on improving the time efficiency of executing window functions [7][8][9][10][11], the speed improvements cannot keep up with the evergrowing data size generated by modern applications, and none of them can meet the requirement of real-time processing which is vital to online services.
Researchers have been using various sampling techniques for query processing to accommodate the growing data size. Sampling are effective as analytical queries do not always require precise results. It is often more desirable to return an approximate result with an acceptable margin of error quickly than waiting for the exact calculation for a long time. Database users can specify confidence intervals in advance and stop the query processing as soon as desired accuracy is reached.
Online aggregation is a classic sampling technique first proposed in 1997 [12]. It enables users to terminate the processing of a query as soon as the result is acceptable. This is particularly useful for expensive analytical queries. For example, if the final answer is 1000, after k seconds, the user gets the estimates in form of a confidence interval like [990, 1020] with 95% probability. This confidence interval keeps on shrinking as the system gets more and more samples. Online aggregation has been used in standard SQL queries, such as join [13] and group by [14]. However, few systems fully support online aggregation. And best to our knowledge we are the first to study online aggregation for window functions.
This work develops several different sampling approaches, based on the sampling techniques in online aggregation, for executing window functions. The proposed sampling algorithms speed up query processing time by reducing the number of tuples used in window aggregation functions. And this work is an extension of our work [1], in which we proposed naive random sampling (NRS) and incremental random sampling (IRS) algorithms.
The main contributions of this paper are summarized as follows: -We designed the global random sampling for fast execution of window functions with ROW mode and RANGE mode: Range-based global sampling algorithm (RGSA) and row-labeled sampling algorithm (RLSA). To improve time efficiency, our methods sample a subset from the original global dataset and then estimate the results with quality guarantee based on the subset. -We proved that our algorithms generate unbiased estimators for various aggregate functions and provide an approach to calculate the confidence intervals. We also specified how to adjust parameters (e.g., sampling rate) according to the unique characteristics of the window functions, to satisfy various application scenarios. -Our algorithms outperformed the native algorithm in PostgreSQL based on the TPC-H benchmark.
The rest of this paper is organized as follows. Section 2 discusses the background of window functions and sampling and introduces the related work. Section 3 describes details of native algorithm in PostgreSQL and IRS and then presents the formulae to calculate confidence intervals. Section 4 explains the differences between RANGE and ROW mode, and then we show the details of RGSA and RLSA. We show the experimental results in Sect. 5. Finally, this paper is concluded in Sect. 6 with a vision toward the future work.

Window Functions
Window functions are part of the SQL:2003 standard in order to accommodate complex analytical queries. The general form of a window aggregate function is as follows: where Agg can be any of the standard aggregation functions such as SUM, AVG, COUNT and expression is the attribute to be aggregated upon. The OVER clause defines the aggregation window, inside which PARTITION BY, ORDER BY and frame clause determine the properties of the window jointly(In fact, a frame is just a window). Examples include MAX(velocity) OVER (PARTITION BY roadid,carid ORDER BY time t ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING), which means we monitor the maximum speed of the car within 20 min of each time in a car monitoring application. Figure 1 illustrates the process of determining the range of a window. Firstly, the PARTITION BY clause divides the tuples into disjoint groups according to the attribute columns in partition list. Nextly, attributes in order list are used to sort tuples in each partition. Finally, we confirm the bounds of the frame, i.e., the range of tuples to aggregate upon, for the current tuple by frame clause and invoke the related aggregate function to compute the aggregation result. The first two processes are called reordering, and the last one is called sequencing.
For each row in a partition, a window frame identifies two bounds in the window and only the rows between these two bounds are included for aggregate calculation. There are two frame models, ROWS and RANGE. In the ROWS model, the boundaries of a frame can be expressed as how many rows before or after the current row belong to the frame; in the RANGE model, the boundaries are specified as a range of values with respect to the value in the current row [10]. Like traditional aggregate functions, a window aggregate function is composed of a state transition function transfun and an optional final calculation function finalfun.
To compute the final result of an aggregate function, a transition value is maintained for the current internal state of aggregation, which is continuously updated by transfun for each new tuple seen. After all input tuples in the window have been processed; the final aggregate result is computed using finalfun. Then we determine the frame boundaries for the next tuple in the partition and do the same work again.

Sampling
Sampling has been widely used in various fields of computer science, such as statistics, model training, query processing. With the advent of Big Data era, sampling is gaining unprecedented popularity in various applications, since it is prohibitively expensive to compute an exact answer for each query, especially in an interactive system. Sampling techniques provide an efficient alternative in applications that can tolerate small errors, in which case exact evaluation becomes an overkill.
Online aggregation [12] a classic approach based on sampling provides users with continuous updates of aggregation results as the query evaluation proceeds. It is attractive since estimates can be timely returned to users, and the accuracy improves with time as the evaluation proceeds.
There are four components in a typical online aggregation system, as shown in Fig. 2. The first one is the approximate result created by the system. The second and third parts consist of reliability and confidence interval which reflect the accuracy and feasibility of the result. The progress bar is the last part that reports the schedule information for the current task. All values improve over time until they meet a user's needs.
Since the results generated by sampling are approximate, it is important to calculate an associated confidence interval. For a simple aggregate query on one table, like SELECT op (expression) FROM table WHERE predicate, it's easy to achieve online aggregation by continuously sampling from the target table. Standard statistical formulas can help us get unbiased estimators and estimate the confidence interval. A lot of previous work [13][14][15][16] have made great contributions on this problem.

Related Work
A window function is evaluated in two phases: resorting and sequencing. Accordingly, existing approaches for evaluating window functions mainly fall into two categories.
Cao et al. [7] improved the full sort technique for evaluating window functions by proposing more competitive hash-sort and segment-sort algorithms. Then Cao et al. [8] proposed a novel method for the multi-sorting operations in one query. They found that most computation time is spent in sorting and partitioning when window size is small. And they proved that it is an NP-hard problem to find the optimal sequence to avoid repeated sorting for multiple window functions in one query and provided a heuristic method. Other earlier studies, such as [9,17,18] and [19], made a lot of contributions with regard to the ORDER BY clause and GROUP BY clause and they proposed optimization algorithms based on either the function dependencies or the reuse of the intermediate results.
Besides, there are several recent works about window operators and their functions. Leis et al. [10] proposed a general execution framework of window functions which is more suitable for memory-based systems. An novel data structure named segment tree is proposed to store aggregates for sub-ranges of an entire group, which helps reduce redundant calculations. However, the approach is only suitable for distributive and algebraic aggregates. Wesley et al. [11] proposed an incremental method for three holistic windowed aggregates. And Song et al. [20] provided an general optimization method based on shared computing, which uses temporary windows to store the shared results and only need a small amount of memory. However, all the above works are toward exact evaluation and they face performance bottlenecks when processing massive data.
Sampling has been used in approximate query evaluation [21,22], data stream processing [23], statistical data analysis [24] and other fields [25,26]. Moreover, as the classic application of sampling, Hellerstein et al. [12] first proposed the concept of online aggregation. Since then, research on online aggregation has been actively pursued. Xu et al. [14] studied online aggregation with group by clause and Wu et al. [16] proposed a continuous sampling algorithm for online aggregation over multiple queries. Qin and Rusu [27] extended online aggregate to distributed and parallel environments. Li et al. [13] studied online aggregation on the queries with join clauses. These techniques based on sampling give us an opportunity to calculate the aggregate easily. In general, sampling is required to be uniform and independent in order to get reasonable results, but both [13] and [28] proved that a non-uniform or a nonindependent samples can be used to estimate the result of an aggregate.
There are many other valuable works [29][30][31][32] on query optimization and sampling, but none of them deals with the problem of window functions. In this paper, we study how to evaluate online aggregate for queries involving and extend the existing work [1] for new application scenarios. What's more, work [33] extended the definition of window functions for graph computing and proposed three different types of graph window: unified window, K-hop window and topological window. This provides a way for us to achieve cross-domain development of the window in the future.

Window Function Execution
Definition 1 (window W) A window consists of a set of rows and is denoted by a triple W i (h, t, V), where i indicates that it is the i-th window, h is the window head and t is the window tail, and V is the transition value for the window.

Native Algorithm
As mentioned in Sect. 2, window functions in commercial database systems are generally evaluated over a windowed table in a two-phase manner. In the first phase, the windowed table is reordered into a set of physical window partitions, each of which has a distinct value of the PAR-TITION BY key and is sorted on the ORDER BY key. The generated window partitions are then pipelined into the second phase, where the window function is sequentially invoked for each row over its window frame within each window partition.
Firstly, for each window W i in the partition, we initialize the parameters of W i and place the read pointer at the start of the window (Line 3). Then determine whether the start position of the current window is equal to that of the previous one. If the condition is true, we can reuse the previous result and assign it to W i :V (Line 5). Nextly, we place the read pointer at the end of the previous window and traverse all tuples from W iÀ1 :t to W i :t. If it is false, we just traverse all tuples from W i :h to W i :t. Finally, return the result and prepare for the next window.
An example of the processing of window functions is shown in Fig. 3, where the aggregate function is summation and the frame model adopted is ROW. The entire table is partitioned and sorted according to attr 2 and makes summation on attr 1. Red brackets in Fig. 3 represent windows. Then we find that W2 begins with the same tuple as W1 and the result of W2 is incrementally updated. However, the start tuple is different between W2 and W3, we have to traverse all the tuples to evaluate a result.
For a table with N tuples, if we use a general window function query, like SUM(attr 1) OVER ( PARTITION BY attr_2 ORDER BY attr_2 ROWS BETWEEN S / 2 PRE-CEDING AND S / 2 FOLLOWING). In most cases, we must use S tuples to get one result for each tuple. And the total execution cost is HðNSÞ without considering partitioning and sorting.

Incremental Sampling Algorithm
The evaluation cost of the window functions is proportional to the number of tuples in the table and in each window. Thus, the execution time increases sharply as the data size grows. Since the table size N is a constant, we propose to reduce the number of tuples checked in each window, S, by sampling.
Definition 2 (Sampling window SW) The sampling window contains a subset of rows in window W i and is defined by a triple SW i (sh, st, indexarr), where i represents the i-th sampling window, indexarr is an array which contains all the indexes of sampled tuples, sh is the head of sampling window, i.e., the first sampled tuple's position in indexarr and st is the tail of sampling window.
NRS is the base algorithm in paper [1], which using native random sampling in each window. Here, we only introduce the improved version named ISR in the paper. Since a large number of common tuples are shared by adjacent windows, ISR reuses the sampling results of the previous window.
Algorithm 2 shows the procedure of incremental sampling. Firstly, we reuse the previous structure SW iÀ1 (Line 4). Then tuples, which are beyond the range of current window, are removed from SW i (Line 6). Function remove head is used to remove the first item in SW i :indexarr and update SW i :sh depending on how many tuples are removed. Then we use a temporary structure T SW to store the indexes of newly sampled tuples in the range from SW i :indexarr½SW i :st to W i :t (Line 9-11). Next, we append the indexes in T SW:indexarr to SW i and update SW i :st (Line 12). Finally, we aggregate the tuples in SW i to estimate the final result (Line 13-14).

Global Sampling Algorithms
All the algorithms we have described in above sections maintain a complete result set, which means we have to reorder all the tuples and calculate results for each one. When facing large data processing problems, these algorithms seem powerless. However, complete results are not always needed in statistics. On the one hand, there is a similarity between adjacent data results, due to the large number of overlaps between adjacent windows. For example, as shown in Fig. 3, W 2 and W 3 are adjacent windows and there is only one data that is different. If the two windows are big enough, the gap between them is minimal. On the other hand, the user only needs to obtain a general description of the whole and pays no attention to details in most statistical requirements scenes. For example, if you want to know the distribution of price of all the orders in an online sales website, you don't need to know the details of each order and a sample can reflect the overall situation.
Hence, this provides us with a wider range of optimization. We can generate a subset of original data and all operations are based on the subsets, which reduces the cost of sort and calculation. According to this idea, we propose two optimization algorithms for ROW mode and RANGE mode: row-labeled sampling algorithm (RLSA) and rangebased global sampling algorithm (RGSA).
The reason why we build different algorithms for ROW and RANGE is that the adaptability of them is different in sampling environment. As is shown in Fig. 4, suppose the current tuple is 7 (i.e., the red number), there are two windows which are defined in the same form (i.e., BETWEEN 2 PRECEDING AND 2 FOLLOWING), but they are in different modes. The green bracket represents the window in ROW mode, while the blue one represents the window in RANGE mode. Therefore, when the windows are used on original data, there are 5 tuples in ROW window (i.e., 6,6.5,7,7.8,8) and 7 tuples in RANGE window (i.e., 5.6,6,6.5,7,7.8, 8,8.5). However, when we apply the same windows on the subset which is generated from original data by using random sampling, there are still 5 tuples in ROW window (i.e., 4,6,7,7.8,8.5) and only 4 tuples in RANGE window (i.e., 6,7,7.8,8.5). And we find ROW window on subset contains 2 dirty tuples that don't belong to original ROW window (i.e., the yellow tuples 4 and 8.5), but it doesn't happen in RANGE mode.

Range-Based Global Sampling Algorithm
The first improved algorithm is designed for window in RANGE mode, named range-based global sampling algorithm (RGSA). As we discuss above, range mode is able to ensure that the data is clean, which is friendly in statistical calculations. And we don't need to adjust the window boundaries, because the nature of the range window ensures that the exact boundary can be found. So all operations are the same as dealing with normal windows. For more information, refer to algorithm 3. Here we just give the basic version.
In algorithm 3, we firstly generate a sub-dataset ST by using the specified sample rate (Line 1-2). Then the subdataset is processed in the same way as algorithm 1: partition, order, calculate. However, we have to use adjust_transfunc to replace the normal one (Line 9 and Line 13). And adjust_transfunc accepts three parameters (i.e., current value, sampling rate and current tuple) to adjust the temporary window value correctly.
Besides, because RGSA has the same characteristics as native algorithm, we can use some improved methods to optimize RGSA. Almost all the optimized algorithms in work [10,11,20] can be used to RGSA with simple change.

Row-Labeled Sampling Algorithm
Another improved algorithm is designed for window in ROW mode. As we discuss above, there exist dirty tuples in ROW window. And these tuples influence the accuracy of estimated results. Hence, in order to reduce or eliminate adverse effects, an intuitive idea is that we can dynamically adjust the window boundaries according to the sampling rate. For example, suppose the sampling rate is 0.5, we can reduce the window size by half. Sometimes this simple method can work, but the fatal flaw is that the dirty data can't be completely avoided. As shown in Fig. 5, although we have changed the size of window from 2 to 1, there exists a dirty tuple (tuple 8.5).
In order to eliminate dirty data completely, we design row-labeled sampling algorithm (RLSA). This algorithm doesn't sample from the origin data directly, but sample from the sorted data after first phase, when each tuple is in the final correct position. Unlike the RGSA algorithm, a new column of attributes is added to each sampling data to record the subscript position. Then the boundaries of the window can be determined accurately according to the subscript value. In other words, a window function in row mode is converted to range mode based on subscript.
Algorithm 4 shows the details of RLSA. Firstly, the data table T is partitioned and sorted according to the normal process (Line 1). After that, the position of each data in T will not be changed and we generate a sub-table ST by random sampling with rate s (Line 2). It should be explained that not only the source data are sampled, but also the index of the sampled data in table S is recorded. Then we calculate the boundaries of each window based on each row in ST and get the subscript of r (Line 3-6). Next, traverse data from the current row to the both sides until the subscript exceeds window range and invoke adjust transfunc for each row that is traversed (Line 7-15). In fact, it's a range style computing process. Finally, return the final results (Line 16). Figure 6 shows the new diagram of previous example by using RLSA. To make it easy to understand, we mark the Fig. 4 Differences between ROW and RANGE subscript of each data in dashed frame. Then we can find that a new subscript attribute is added to the subset( i.e., the sample set) to identify the original position of each tuple. In this example, the current window is based on tuple 7 whose subscript is 4 and the size of window is 2, so the subscript of low bound and high bound is 2 and 6. Next we just need to find the tuples whose subscript satisfy boundary constrains (i.e., greater than or equal to 2, less than or equal to 6). Finally, 4 tuples (i.e., 6,7,7.8,8) are found in the subset and all of them belong to the original window. Hence, the dirty data can be completely avoided by this method.

Estimator and Confidence Interval
Both RGSA and RLSA sample randomly from the original space, and we can use statistical analysis method to calculate errors. Confidence interval is an important component of approximation algorithms and indicates the proximity of each approximate result to extra result. Here we give the formulae related to the estimator and confidence interval of SUM.
Consider a window W containing m tuples, denoted by t 1 ; t 2 ; . . .; t m . Suppose v(i) is an auxiliary function whose result is equal to m times the value of t i when applied to t i . Then we define a new function F f ð Þ, and the extra result a of sum function is equal to F v ð Þ.
Now we sample tuples from window W and get a sequence of tuples, denoted as T 1 ; T 2 ; . . .; T n , after n tuples are selected. Then estimated result Y n can be calculated by using Eq. (2). Y n is the average value of the samples and is equal to T n v ð Þ.
The formula of sample variance is shown as Eq. (3). T n ðf Þ is the average value of the samples. So the sample variance f r 2 is equal to T n;2 ðvÞ.
According to central limit theorem, we can conclude that the left-hand side follows normal distribution with mean 0 and variance 1. ffiffi ffi n p ðY n À aÞ ffiffiffiffiffiffiffiffiffiffiffiffiffi ffi T n;2 ðvÞ p ) Nð0; 1Þ ð4Þ After we get Y n and f r 2 , we have where W is the cumulative distribution function of the N ð0; 1Þ random variable. Finally, we define z p as the pþ1 2quantile of the normal distribution function and set Eq. (5) equal to p. Then a confidence interval is computed as Other aggregation functions, such as COUNT, AVG, VAR, have the similar proof process. And [13,15] give more details about the proofs of a few common aggregation functions. In addition, Haas and Hellerstein [28] proved that we could estimate the aggregate by using a non-independent and uniform sample method.

Experiments
In this section, we experimentally evaluate our implementations. We attempted to compare our implementations with commercial and open-source database systems, but unfortunately, most systems do not support sampling. Hence, we compared them with PostgreSQL's native window function implementation over the TPC-H benchmark and explore the impact of sampling rate on the algorithms.

Experimental Setup
We run PostgreSQL on a computer of Lenovo with an Intel (R) Core (TM) i5-4460M CPU at 3.20 Hz, whose system has 24 GB 1600 MHz DDR3 of RAM. Then we implemented IRS and NRS on it. Since PostgreSQL does not fully support window functions in RANGE mode, we achieve RGSA with C?? and use gcc 5.3 as a compiler. Besides, RLSA can be regarded as special range calculation, so we also implemented it with C??.

Dataset
We use the TPC-H DBGEN instruction to generate two ''order'' tables, with 1.5 million rows and 15 million rows, respectively. There are 9 attribute columns in table ''order,'' and it covers a wide variety of data types, such as INTEGER, DOUBLE, VARCHAR, DATE. You can access to TPC's official website for more information.

Experimental Results
In this section, we use the following query template in our experiments to demonstrate our implementation: where aggf are aggregation functions, such as sum, avg, count, frame model is ROW, and the bounds of window are limited by pre and fel.

Comparison with PostgreSQL
In this part we compare the time efficiency of each algorithm. Figure 7 shows the average execution time of different algorithms. Figure 7a shows the result of default algorithm in PostgreSQL, NRS and IRS. The sampling rates of NRS and IRS are set to 0.3. As expected, our algorithms are far better than the default algorithm in PostgreSQL, especially dealing with large windows. In other words, NRS and IRS are more insensitive to window size. Besides, IRS is slightly better than NRS in terms of efficiency, because IRS avoids repeated sampling when calculating the window functions. Figure 7b shows the results of RLSA with different sampling rate (i.e., red line is 0.5 and blue line is 0.3). Obviously, the execution speed of RLSA is much higher than that of default algorithm and the effect is obvious when the sampling rate is low, because lower sampling rate means smaller window size and less tuples needed to be calculated. The similar conclusion can be drawn from Fig. 7c. What's more, Fig. 7d shows the sample time of RLSA. In addition to finding that the time increases with the increase of sampling rate, we also find that the sampling time is very short and has little effect on the overall execution time.

Effect of Sampling Rate
In this part we compare the error levels and the influence of the parameters of each algorithm. Sampling rate is a very important parameter in our methods. Since naive random sampling (NRS) and incremental random sampling (IRS) are similar in this respect, all the pictures in Fig. 8 show the results of IRS algorithm based on the row mode window. Figure 8 shows the results of IRS on the table with 1.5M tuples. The results show that execution time is negatively correlated with sampling rate, and error is positively correlated with it in any window size condition. What's more, Fig. 8a also shows that the slopes of the three curves are not similar and the black curve is the smoothest. It means the growth trend of time consumption with various rates is different as the size of window increases. In more details, IRS with low sampling rate is more insensitive to window size than that with high rate. This is the result of sample data management and the influence becomes larger as the sampling tuples become more.
Then from Fig. 8b, we can find that with fixed sample rate, the error becomes smaller when window size becomes bigger. It can be explained by the law of large numbers. The larger the volume of sample data is, the closer we are to the real value; the smaller the volume of sample data is, the higher the randomness is. And Fig. 8b also shows that the error difference among various sampling rates becomes smaller as the window becomes larger.
Next, Fig. 9a shows the results of RLSA with different sampling rate on 15M tuples table. And there are two obvious conclusions: the error decreases as the sampling rate increases; the bigger the window is, the smaller the difference in error is, which is consistent with ISR. Hence we can guide users to select lower sampling rates when   using large window size. Besides, we can find that the variation of the error is relatively stable when the sampling rate is high. The reason is that there are more tuples in each window to be sampled, which makes the estimation results closer to the true values. What's more, Fig. 9b shows the influences of sampling rate and window size on RGRS. And we can find error of big window is more stable and is more insensitive to sampling rate than that of small window. We can also use the law of large numbers to explain these conclusions. In addition, the difference of error between the different windows is small, when the sampling rate is large.

Conclusion and Future Work
In this paper, we draw on some of the algorithms used in online aggregation to deal with problems of window functions. And we propose two new algorithms: rangebased global sampling algorithm (RGSA) and row-labeled sampling algorithm(RLSA) and implement it in the latest version PostgreSQL. Despite the good experimental results, there are a lot of potentials that need to be exploited in the future: -Not only RLSA and RGSA reduce the computational consumption in each window, but decrease the number of windows over the whole table. However, Post-greSQL does not fully support window functions in RANGE mode and we just implemented them with C??. So it's very useful if we could achieve RLSA and RGSA in PostgreSQL. -The system we have implemented is not an online system and it needs users to give the sampling rate. In the future, we want our system to be more intelligent and become a real online system.