Algorithms for Windowed Aggregations and Joins on Distributed Stream Processing Systems

Window aggregations and windowed joins are central operators of modern real-time analytic workloads and significantly impact the performance of stream processing systems. This paper gives an overview of state-of-the-art research in this area conducted by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and the Technische Universität Berlin. To this end, we present different algorithms for efficiently processing windowed operators and discuss techniques for distributed stream processing. Recently, several approaches have leveraged modern hardware for windowed stream processing, which we will also include in this overview. Additionally, we describe the integration of windowed operators into various stream processing systems and diverse applications that use specialized window operations.


Introduction
Windowing is a fundamental building block of any stream processing system. Data streams are divided into windows that capture a finite portion of tuples to which stateful operators can be applied. As a result, windowing is a prerequisite for performing aggregations or joins and enables stream processing systems to produce timely responses to long-running streaming queries.
Modern real-time analytics require complex queries, including joins, complex window types, different window measures, and diverse aggregation functions. Concurrent queries and high-velocity data streams generate increased workloads for the systems. The algorithms also have to take into account characteristics of the data streams, such as out-of-order tuples or concept drift. Consequently, stream processing systems need efficient approaches for windowed operators. Centralized computation solutions limit the scalability of applications. Thus, the efficient analysis of ever-increasing data streams requires processing with multiple nodes. However, distributed approaches need to be adapted to the characteristics of stream processing and windowed operators.
Complex query workloads in combination with data-intensive streams lead to a significant overhead in stream processing systems. Low latency and high throughput are requirements of today's real-time applications. As a result, the efficiency of window aggregation and windowed joins is critical for the performance of stream processing systems.
In this paper, we particularly provide an overview of the research conducted at TU Berlin and BIFOLD (Table 1). We present different approaches for the efficient computation of windowed aggregations and joins on stream processing systems. The rest of this paper is structured as follows: We first discuss related work in Sect. 2. Sect. 3 presents operators that utilize stream slicing, which enables efficient aggregation for overlapping windows and concurrent queries. The approaches presented in Sect. 4 exploit advantages of modern hardware, such as multi-core processors and high-speed networks, to accelerate the performance of stream processing systems. Sect. 5 discusses operators that utilize parallelism and distributed processing. In Sect. 6, we describe the integration of the techniques and algorithms in various stream processing systems. Various applications are provided in Sect. 7. We summarize the evaluation results of the presented work in Sect. 8 and point out directions for future research in Sect. 9.

Related Work
While this paper focuses on presenting research conducted at BIFOLD and TU Berlin, there exists a wide body of work dealing with related challenges. This section reviews related work from different research areas, grouped according to the topics of the remainder of the paper.
Window Aggregation Techniques. Li et al. [40][41][42] contribute to the research area of window aggregation by introducing buckets which store window aggregates that can be computed incrementally to achieve low latency. Sharing aggregates among overlapping windows is not possible with buckets, resulting in redundant computations. To overcome this issue, several techniques use partial window aggregation (e.g., Arasu and Widom [5], Theodorakis et al. [57], Zhang et al. [75]), where intermediate results of an aggregate are calculated and then combined to obtain the final result. For storing partial aggregates, Tangwongsan et al. [53][54][55] utilize various data structures (e.g., arrays, trees, or stacks) in different partial aggregation techniques including FlatFAT, TwoStacks, DABA, and FIBA. Slicing techniques, such as Pairs presented by Krishnamurthy et al. [39], increase throughput by assigning each tuple to exactly one non-overlapping slice, requiring only one aggregation operation. Furthermore, sharing aggregates is possible among multiple concurrent window queries, as shown, for instance, by Theodorakis et al. [58,59]. Aggregation functions can be differentiated into commutative or non-commutative, and invertible or non-invertible. The optimization techniques of Shein et al. [51,52] address these different types of aggregation functions. Bou et al. [10][11][12] present multiple techniques for out-of-order incremental sliding window aggregation. To give an overview of the different techniques, Hirzel et al. [30] survey several sliding window aggregation algorithms and summarize that the choice of the best algorithm depends on the aggregation operation, latency requirements, window type, sharing requirements, and out-of-order processing.
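To make the partial-aggregation data structures mentioned above concrete, the following is a minimal Python sketch of the two-stacks idea (an illustration, not the authors' implementations): each stack entry carries a value together with a running aggregate, so a FIFO sliding window supports push, pop, and query in amortized constant time for any associative combine function, without requiring an inverse.

```python
class TwoStacks:
    """Sketch of FIFO sliding-window aggregation via two stacks.

    Works for any associative combine function (e.g., max); no inverse needed.
    Each stack entry is (value, aggregate of this value and everything below).
    """

    def __init__(self, combine, identity):
        self.combine = combine
        self.identity = identity
        self.front = []  # popped from; aggregates accumulate toward the top
        self.back = []   # pushed onto; aggregates accumulate toward the top

    def push(self, value):
        agg = self.combine(self.back[-1][1], value) if self.back else value
        self.back.append((value, agg))

    def pop(self):
        if not self.front:  # flip: move the back stack onto the front stack
            while self.back:
                value, _ = self.back.pop()
                agg = self.combine(value, self.front[-1][1]) if self.front else value
                self.front.append((value, agg))
        return self.front.pop()[0]

    def query(self):
        # Window aggregate = combine(front top aggregate, back top aggregate).
        f = self.front[-1][1] if self.front else self.identity
        b = self.back[-1][1] if self.back else self.identity
        return self.combine(f, b)
```

Every element is moved at most once from the back to the front stack, which yields the amortized O(1) bound.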
Stream Join Algorithms. Kang et al. [33] examine different algorithms for stream joins over sliding windows and propose a cost model for analyzing their expected performance. Teubner and Müller [56] introduce the handshake join as a method for inter-window joins (i.e., joining overlapping windows such as sliding windows) to utilize highly parallel architectures. In a follow-up work, Roy et al. [49] demonstrate that fast-forwarding tuples between CPU cores reduces the high latency of the handshake join. Karnagel et al. [36] utilize the GPU of tightly-coupled processors with an integrated GPU for computationally intensive parts of the stream join. As another technique for inter-window joins, the SplitJoin proposed by Najafi et al. [46] introduces a top-down dataflow model that utilizes modern multicores. Elseidy et al. [22] process intra-window joins (i.e., joining two streams over a single window) on parallel threads while adapting to data dynamics by state repartitioning and dataflow routing. The concurrent stream join of Shahvarani and Jacobsen [50] uses a shared index data structure for state materialization on multi-core processors.

Stream Slicing
Window aggregation has a high impact on the performance of stream processing systems due to complex window types (e.g., tumbling, sliding, or session windows), aggregation functions (e.g., sum, avg, or median), concurrent queries, and out-of-order events. The following work focuses on optimizing the window aggregation process for such complex workloads.
Cutty [14] combines the approaches of stream slicing and partial aggregation to support a wide range of different window types. Stream slicing (see Fig. 1) decomposes windows into non-overlapping subsets called slices. The tuples contained in the slices are used to compute partial aggregates, which can be combined to generate further intermediate results and final aggregates. Carbone et al. introduce the concept of user-defined windows (UDWs) that contain custom logic of window types defined by the user. The differentiation into deterministic and non-deterministic window types eliminates the need for knowing exact window semantics by exploiting the properties of these two classes. For deterministic window types (e.g., tumbling, sliding, session, punctuation), the operator can decide while processing a tuple whether it represents the beginning or the end of a window. In contrast, for non-deterministic window types (e.g., delta-based [23], multi-type [14,23,66], adaptive windowing [9]), this cannot be determined from the currently processed tuple. The aggregator also enables sharing aggregates between concurrent windows of multiple queries.
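The slicing idea itself can be sketched in a few lines. The following illustrative Python example (a simplification of the techniques above, assuming a time-based sliding window whose size is a multiple of the slide, and sum as the aggregation) assigns each tuple to exactly one slice and combines the slice partials into per-window results.

```python
# Sketch: slicing a time-based sliding window (size, slide) into
# non-overlapping slices of length `slide`. Each tuple is aggregated into
# exactly one slice; each window result combines the partials of the slices
# it covers. Assumes size % slide == 0 (the general case uses variable-length
# slices, as in Pairs/Cutty).

def aggregate_windows(tuples, size, slide, end):
    slices = {}  # slice index -> partial sum
    for ts, value in tuples:
        i = ts // slide                      # exactly one slice per tuple
        slices[i] = slices.get(i, 0) + value

    results = {}
    w_start = 0
    while w_start + size <= end:
        first = w_start // slide             # first covered slice
        last = (w_start + size) // slide     # exclusive
        results[(w_start, w_start + size)] = sum(
            slices.get(i, 0) for i in range(first, last))
        w_start += slide
    return results
```

Each tuple is touched once during slicing, and overlapping windows share the slice partials instead of re-aggregating the raw tuples.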
Based on the Cutty technique, the open-source operator Scotty [65,66,68] introduces general and efficient window aggregation for out-of-order streams. It achieves general applicability by implementing the general stream slicing technique [66], which adapts to the different workload characteristics of aggregation queries, i.e., window types, aggregation functions (e.g., invertible, associative), window measures (e.g., time-based, count-based), and stream order. Scotty extends the slicing technique with out-of-order processing on complex window types such as session windows. To this end, the framework supports multiple window types of varying complexity (e.g., tumbling, sliding, punctuation, slide-by-tuple windows). Furthermore, Scotty can be extended with user-defined window types as well as aggregation functions.

Algorithms for Optimizations on Modern Hardware
The following approaches aim to optimize window aggregations and joins on stream processing systems by exploiting modern hardware, e.g., multi-core CPUs, GPUs, and high-speed networks.
Zeuch et al. [72] propose a windowing mechanism using a double-buffer and lock-free data structures that allow writing to a window buffer in parallel to minimize synchronization overhead. Multiple non-active buffers store previous window results, output the final aggregation result, and reinitialize the buffer memory. One buffer is always active to collect incoming tuples, which avoids a delay in processing the input stream and increases the throughput. To reduce the latency, tuples are incrementally aggregated in the active buffer whenever possible.
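As an illustration of this buffering scheme, consider the following simplified Python sketch (not the authors' implementation, which relies on lock-free data structures rather than a lock): an active and an inactive buffer are swapped when a window closes, so that finalizing and resetting the closed buffer happens off the ingestion path.

```python
import threading

class DoubleBufferWindow:
    """Illustrative double-buffer windowing: one buffer is always active for
    incoming tuples; the other holds the previous window for finalization."""

    def __init__(self, emit):
        self.buffers = [[], []]
        self.active = 0               # index of the buffer collecting tuples
        self.emit = emit              # callback receiving the final aggregate
        self.lock = threading.Lock()  # stands in for the lock-free CAS swap

    def insert(self, value):
        # In the original design, the aggregate is updated incrementally here
        # whenever possible, to reduce latency at window close.
        self.buffers[self.active].append(value)

    def close_window(self):
        with self.lock:               # the swap is the only synchronized step
            closed, self.active = self.active, 1 - self.active
        # Finalization and reinitialization run outside the ingestion path,
        # so the stream keeps flowing into the new active buffer.
        buf = self.buffers[closed]
        self.emit(sum(buf))
        buf.clear()
```

The key property is that ingestion never waits for window finalization: after the swap, the previous window is aggregated and its buffer reset while new tuples already land in the other buffer.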
Grizzly [28] introduces adaptive query compilation to increase the efficiency of streaming queries on modern hardware. Depending on the specific window types, window measures, and window functions, Grizzly selects the physical operators and generates specialized code. This enables Grizzly to specialize the executed code with regard to the user-provided workload. Furthermore, Grizzly follows a task-based parallelization to fully utilize modern multi-core CPUs. Similar to Zeuch et al., it leverages lock-free operations to compute window aggregates. To this end, Grizzly introduces a lightweight coordination scheme to coordinate the finalization of windows across multiple threads while avoiding coordination overhead. As streaming queries are inherently long-running, Grizzly continuously monitors the execution, collects profiling information, and applies adaptive optimizations to improve execution efficiency. For example, it specializes the data structure for keyed aggregations if it detects a specific key distribution. In combination with stream slicing, this technique could further improve performance.
Edge devices in the Internet of Things (IoT) perform computations of intermediate results closer to the data sources to avoid network congestion and computation overhead in the cloud. Since these devices are battery-powered, they have a limited energy budget, which becomes more strained when processing complex workloads. Michalke et al. propose ecoJoin [44], a stream join operator that exploits the modern hardware of these edge devices to reduce energy consumption. In particular, ecoJoin focuses on devices that combine embedded CPUs and GPUs on a single system. To this end, it provides a new stream join algorithm, which uses the GPU to accelerate processing. The algorithm adapts the size of tuple batches based on stream ingestion rates and latency tolerances. These batches are distributed over the cores to parallelize the join phases. As a result, efficiency increases even for fast input rates on large windows.
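The batch-size adaptation can be illustrated with a hypothetical heuristic: larger batches amortize GPU launch overhead (saving energy), but the batch must be capped so that filling it at the observed ingestion rate never exceeds the latency budget. The function below, including its bounds, is an assumption for illustration and not the actual ecoJoin policy.

```python
def adapt_batch_size(rate_tuples_per_s, latency_budget_s,
                     min_batch=64, max_batch=65536):
    """Hypothetical batch-size heuristic: the largest batch that can fill up
    within the latency budget, clamped to illustrative hardware bounds."""
    target = int(rate_tuples_per_s * latency_budget_s)
    return max(min_batch, min(max_batch, target))
```

At low rates the minimum batch keeps GPU launches worthwhile; at high rates the maximum batch bounds memory use, matching the power/latency trade-off reported in the evaluation.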
Remote Direct Memory Access (RDMA) hardware allows data transfer with high throughput and low latency. This has the potential to mitigate the bottleneck that networks pose in distributed settings while fulfilling the real-time requirements of stream processing. With a novel processing model designed for RDMA, Slash [21] accelerates distributed stream processing computations. The stateful query executor applies multiple instances of the same operator in parallel to scale the processing of streaming queries across many nodes. A special protocol enables the data exchange among nodes via RDMA channels, leveraging the full speed of the RDMA network. Multiple Slash executor instances store their partial state (e.g., partial aggregates of windows) in the Slash State Backend. Distributed operator states (e.g., of windows) are merged in a lazy approach. The technique also provides a windowed join based on a hash join. Slash operates on a windowing approach that relies on general stream slicing [66]. The shared mutable state allows the technique to omit expensive re-partitioning operations, which increases the throughput in contrast to scale-out stream processing systems.

Parallel and Distributed Stream Processing
This section presents parallel and distributed techniques that address the challenge of handling high-velocity data streams while delivering real-time results with low latency and high throughput.
Since the volume of data streams that need to be processed in modern real-time analytics increases continuously, scalability is an important factor of stream processing systems. Distributed stream processing approaches enable such scaling, but have to deal with challenges arising from the distribution of streaming queries and operators, such as windows, window aggregations, and windowed joins. Disco [8] performs complex window aggregation in a distributed manner by aggregating incoming tuples on multiple independent nodes and merging them into the final result. Merging strategies ensure correct aggregation semantics for different window types (i.e., context-free or context-aware) and aggregation functions (i.e., decomposable or holistic). In contrast to the centralized data collection that stream processing systems generally perform, streams can thus be processed closer to their sources.
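For decomposable aggregation functions, the node-side partials and their merge can be sketched as follows (illustrative Python using the average, represented as (sum, count) pairs; not Disco's actual implementation). Shipping one partial per node and window replaces shipping the individual tuples.

```python
def local_partial(tuples):
    """Per-node partial aggregate for the (decomposable) average: the sum and
    count of the node's window contents."""
    return (sum(tuples), len(tuples))

def merge_partials(partials):
    """Merge node partials into the final window average."""
    total, count = 0, 0
    for s, c in partials:
        total += s
        count += c
    return total / count if count else None
```

Holistic functions (e.g., median) cannot be decomposed this way, which is why Disco instead batches tuples into slices before sending them.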
Streaming queries have the property of being continuous and long-running, but their operators may need to be modified at some time, for instance, to adapt to varying data rates. Bartnik et al. [6] present generic protocols that allow the modification of operators and of the data flow of running queries. Many stream processing systems enable such a reconfiguration only by restarting the execution of the modified query, which leads to a redistribution of the query state and affects other systems relying on the output. The protocols enable changing the operator function or introducing new operators as well as migrating operators to different nodes for distributed processing. Running queries with very large distributed operator state can be reconfigured on the fly with the library Rhino [20]. A handover protocol migrates the processing and state of the running operator among workers, and a state migration protocol asynchronously replicates the local check-pointed state on workers. During the configuration, Rhino guarantees exactly-once processing and does not stop the query execution.
In real-world applications, it is often necessary to handle many short-term ad-hoc queries in addition to processing continuous long-running queries. The framework AStream [35] extends distributed stream processing systems to support ad-hoc query workloads while sharing computation and resources. When operators have common upstream operators and common partitioning keys, they can be shared among queries. Using a distributed pipeline-parallel architecture, AJoin [34] supports ad-hoc stream processing joins. The technique shares data and computation between multiple queries and utilizes late materialization to pass down a reduced number of intermediate results to subsequent operators. A periodic re-optimization of the query execution plan at runtime ensures efficient execution.
The IoT consists of distributed and heterogeneous sensor nodes, which brings new challenges for processing data streams. Clock offsets occur among the diverse sensor nodes due to different time synchronization techniques. Consequently, joins are affected by incoherent timestamps of tuples produced by multiple devices, resulting in incorrect predictions and false correlations. SENSE [67] provides time coherence for data acquired from distributed sensors without requiring reliable clock synchronization among all nodes. Traub combines the techniques presented in SENSE with windowed joins as well as temporal and spatial aggregation techniques [62,63]. SENSE, Rhino, and Scotty are also incorporated in the IoT platform NebulaStream (Sect. 6).
Over time, data streams are subject to changes, for instance, changing user preferences or economic changes. These so-called concept drifts lead to incorrect predictions of the trained machine learning model because it is not appropriately fitted to the current data. Adaptive windowing (ADWIN) [9] detects concept drift and dynamically adapts the model to changes. Grulich et al. [26] identify bottlenecks of the ADWIN algorithm and modify it to run in parallel on multiple threads. As a result, the scalability of adapting to concept drift increases.
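The core idea of ADWIN can be conveyed with a heavily simplified sketch: keep a window of recent values and drop its oldest part whenever some split yields two sub-windows with significantly different means. The real algorithm uses a Hoeffding-style statistical bound and an exponential bucket structure for efficiency; the fixed `threshold` below is a stand-in assumption for illustration only.

```python
def adwin_step(window, value, threshold=1.0):
    """Simplified ADWIN step: append the new value, then check every split of
    the window. If two sub-windows differ significantly in mean, drop the
    older part (drift detected). Returns (new_window, drift_detected)."""
    window.append(value)
    for split in range(1, len(window)):
        old, new = window[:split], window[split:]
        if abs(sum(old) / len(old) - sum(new) / len(new)) > threshold:
            return window[split:], True  # keep only the recent sub-window
    return window, False
```

The quadratic split scan here is exactly the cost that the bucket structure of the original algorithm avoids, and the split checks are also where the parallelization of Grulich et al. applies.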
Zhang et al. [76] have studied different algorithms of parallel intra-window joins (i.e., joining two streams over a single window) on multi-cores. They differentiate existing approaches in lazy execution and eager execution methods. The lazy approach buffers input tuples of windows from two streams before joining them. Eager execution immediately joins tuples as they arrive, producing partial matches. Zhang et al. conclude that the choice of an appropriate algorithm depends on workload characteristics (e.g., tuple arrival rate, window length), application requirements (e.g., latency, throughput), and hardware architectures (e.g., number of cores and vector extensions).
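The two execution strategies can be contrasted with a minimal equi-join sketch (illustrative Python, keyed on the first tuple component; not the implementations evaluated by Zhang et al.): the lazy variant buffers both window contents and joins once at window close, while the eager variant probes the opposite side's state on every arriving tuple and emits partial matches immediately.

```python
def lazy_join(left, right, key=lambda t: t[0]):
    """Lazy: both windows are fully buffered, then joined via a hash index."""
    index = {}
    for r in right:
        index.setdefault(key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(key(l), [])]

def eager_join(arrivals):
    """Eager: arrivals is an interleaved stream of ("L" or "R", tuple); each
    tuple probes the other side's state, then is inserted into its own."""
    state = {"L": {}, "R": {}}
    matches = []
    for side, t in arrivals:
        other = "R" if side == "L" else "L"
        for m in state[other].get(t[0], []):
            matches.append((t, m) if side == "L" else (m, t))
        state[side].setdefault(t[0], []).append(t)
    return matches
```

Both produce the same match set; they differ in when matches become visible (latency) and how much per-tuple work is done (throughput), which is the trade-off the study examines.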

Systems Integration
As the proposed techniques provide efficient and state-of-the-art window aggregation, they have been integrated into various systems.
The NebulaStream [73,74] platform offers an end-to-end data-management system for the IoT. It provides a unified environment for a sensor-fog-cloud infrastructure that handles heterogeneous hardware, unreliable nodes, and elastic network topology. To ensure a high performance across these heterogeneous devices, NebulaStream utilizes adaptive query compilation [28] and generates specialized code depending on the device capabilities. Furthermore, the system follows the design principle of maximized sharing. To achieve this on the query level, the integration of AStream [35] enables sharing data among multiple streaming queries. The general stream slicing technique [66] reuses partial aggregation results among overlapping windows and is therefore integrated into NebulaStream to share data on the operator level. Additionally, NebulaStream integrates Babelfish [29] for the acceleration of UDF-based operators, e.g., to enable the definition of user-provided window functions and aggregations.
Agora [69] provides a unified ecosystem for assets of the entire data value chain, i.e., algorithms, data, and physical infrastructure components. In marketplaces, different stakeholders can offer their assets as well as modify and remove them. Agora enables participants not only to exchange assets but also to combine them into novel applications, along with the resources to execute these applications. Fair payment requires tracking asset usage. Tracking functions called from asset source code and operators, as well as tracking the amount of processed data, result in many function calls. The aggregation of these usage counters leverages the techniques Scotty [65,66,68] and Disco [8].
Darwin [7] introduces a scale-in stream processing system that attempts to maximize hardware utilization on diverse hardware setups to reduce the overall infrastructure costs. To this end, it leverages query compilation and tailors execution towards a specific hardware setup. Furthermore, it provides fault-tolerance by supporting larger-than-memory window states.

Applications
In this section, we present diverse applications that utilize windowed stream operations.
The framework Condor [48] allows users to write synopses-based streaming jobs. Synopses enable the approximate computation of quantities that are otherwise expensive or impossible to compute precisely. The work models synopses as stateful window aggregation functions due to their similar concept of combining several values into one total value. Condor supports parallel synopses computation and evaluation and implements all synopses types (i.e., sketches, histograms, wavelets, samplers). It uses Scotty as an underlying slicing technique for computing approximate aggregates with windowed synopses.
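A synopsis expressed as a stateful aggregate can be illustrated with a minimal count-min sketch offering update (per tuple) and merge (cell-wise combination of partial sketches), mirroring how Condor models synopses as window aggregation functions. The sizes and hash choice below are illustrative assumptions, not Condor's implementation.

```python
class CountMin:
    """Minimal count-min sketch as an initialize/update/merge aggregate.
    Estimates are overestimates only (hash collisions inflate counts)."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def update(self, item):
        for row in range(self.depth):
            self.table[row][hash((row, item)) % self.width] += 1

    def merge(self, other):
        # Partial sketches over window slices combine cell-wise, which is
        # what makes the synopsis usable as a decomposable aggregate.
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

    def estimate(self, item):
        return min(self.table[r][hash((r, item)) % self.width]
                   for r in range(self.depth))
```

Because merge is cell-wise addition, per-slice sketches can be combined like any other partial aggregate, which is what lets Scotty's slicing drive windowed synopsis computation.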
The open-source stream generator proposed by Grulich et al. [27] enables the evaluation of modern stream processing systems. It produces deterministic data streams from arbitrary input data sets with different data rates, distributions, and characteristics such as the fraction of out-of-order tuples and their delay. Besides providing realistic workloads by these configurations, it is able to generate the same experiment data ensuring reproducibility.
To visualize streaming data in real-time, the interactive development environment I2 [64] has been proposed. The running cluster applications dynamically adapt to changes in the visualization without restarting. The algorithm ensures that only the depicted data points are transferred, which reduces workload in the front-end and backend.
The aggregation technique M4 [32] reduces the dimensionality of time-series data by rewriting queries for RDBMS-based visualization systems. Additional operators for data reduction are incorporated in the queries that determine four values (i.e., min, max, first, last) per pixel column. This reduces the computational load for visualization while still providing loss-free plots in the form of linecharts.
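The M4 reduction itself is compact enough to sketch directly. The following illustrative Python function (a simplification; the original work rewrites SQL queries for RDBMS-based visualization systems) keeps only the min, max, first, and last value per pixel column.

```python
def m4(points, t_min, t_max, width):
    """Reduce a time series to at most four (timestamp, value) points per
    pixel column: min, max, first, and last. A line chart drawn from these
    points matches one drawn from the full series."""
    cols = {}
    for t, v in points:
        c = min(int((t - t_min) / (t_max - t_min) * width), width - 1)
        cols.setdefault(c, []).append((t, v))
    reduced = []
    for c in sorted(cols):
        pts = cols[c]
        keep = {min(pts, key=lambda p: p[1]),   # min value
                max(pts, key=lambda p: p[1]),   # max value
                pts[0], pts[-1]}                # first and last
        reduced.extend(sorted(keep))
    return reduced
```

The output size is bounded by 4 × width regardless of the input length, which is the source of the data-volume reduction reported in the evaluation.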

Evaluation Summary
This section summarizes the main results of the evaluations of each of the techniques presented. More detailed descriptions of experiments and results can be found in the original publications.

Stream Slicing
The evaluations of Cutty [14] and Scotty [65,66,68] show that slicing techniques outperform other approaches such as tuple buffers, buckets, and aggregate trees in terms of throughput, and that they scale to a large number of concurrent windows. In addition, Scotty maintains this throughput advantage and scalability for workloads that include out-of-order tuples and context-aware windows (e.g., session windows).

Algorithms for Optimizations on Modern Hardware
The experiments of Zeuch et al. [72] show that current stream processing systems underutilize modern hardware in terms of full computational power and memory bandwidth. The proposed streaming optimizations for modern hardware are compared to Apache Flink [13], Spark Streaming [71], Storm [61], Saber [37], and StreamBox [43] on three streaming benchmarks and enable a throughput up to two orders of magnitude higher than these systems. The evaluation shows that the lock-free windowing approach provides a high throughput for stream processing systems on modern hardware. The optimizations of Zeuch et al. are also used in the experiments of Grulich et al. [28], where Grizzly is compared to two hand-optimized implementations [72], Apache Flink [13], StreamBox [43], and Saber [37] on the Yahoo! Streaming Benchmark [19] (Fig. 2). Its utilization of modern hardware allows Grizzly to outperform current stream processing systems by up to one order of magnitude in throughput. The query compilation approach additionally shows this performance improvement under different complex query workloads (i.e., window types and aggregation functions).
Michalke et al. [44] demonstrate that ecoJoin significantly enhances performance with regard to throughput and power consumption compared to state-of-the-art stream join algorithms. Adjusting batch sizes and scaling the clock frequency leads to improved energy efficiency. The experiments show that large batch sizes consume less power but result in higher latencies, leading to a trade-off between the two requirements.
In the evaluation presented by Del Monte et al. [21], Slash is compared against state-of-the-art stream processing systems such as LightSaber [58] and Apache Flink [13]. Native RDMA acceleration allows the system to scale with the number of nodes and enables higher throughput for common stream workloads than partitioning-based approaches. For window aggregations and windowed joins, Slash achieves a significantly higher throughput compared to the strongest scale-out baseline.

Parallel and Distributed Stream Processing
Benson et al. [8] compare Disco to centralized data collection and show that the distributed technique exhibits higher performance by scaling linearly with the number of nodes for decomposable and holistic aggregation functions. Disco significantly reduces network traffic by completely avoiding to send individual tuples between nodes for decomposable functions. For holistic functions, it combines tuples and sends them in slices to reduce TCP-overhead.
The protocols of Bartnik et al. [6] enable migrating operators with small state as fast as Apache Flink's [13] savepoint mechanism and outperform it for migrating operators of jobs with large state. The advantage over Apache Flink is that the migration mechanism prevents the data loss during job restarts that occurs when data is not consumed from a persistent source. Rhino [20] reduces the latency compared to Flink [13] by up to three orders of magnitude for large state. The state migration protocol allows Rhino to reconfigure a query 50x faster than Flink and 15x faster than Megaphone [31].
As shown by Karimov et al. [35], the framework AStream supports a thousand concurrent queries and a throughput of 70 million tuples per second. For a fluctuating workload of starting and deleting concurrent short-running queries, the framework creates and deletes 50 queries every 10 seconds within 1 second of event-time latency. AStream deploys ad-hoc queries in the order of milliseconds, which is a much lower query deployment latency than that of Apache Flink [13]. Along with Apache Flink and Spark, AStream is compared to AJoin [34]. AJoin outperforms all of these systems for single-query workloads and performs better than Flink when all queries are submitted simultaneously at compile time. For queries with multiple join operators, AJoin also performs better than these other systems.
Experiments conducted by Traub et al. show that SENSE [67] maintains a guaranteed coherence below a user-defined upper bound while optimizing the coherence estimate of tuples. Furthermore, the system is scalable to thousands of sensor nodes, robust to node failures, and able to reintegrate recovering nodes.
Grulich et al. [26] demonstrate that the parallel Optimistic ADWIN algorithm outperforms the original implementation and an optimized sequential reimplementation by an increased throughput of two orders of magnitude and a reduced latency of at least 50%.
In their experiments for evaluating stream joins, Zhang et al. [76] compare different parallel intra-window join algorithms with regard to throughput, latency, and progressiveness on several real-world and synthetic workloads. An algorithm that outperforms the other algorithms in all cases does not exist. Their results show that lazy approaches perform better than specifically designed eager algorithms on most workloads. They summarize their findings in a decision tree that should guide the choice of an appropriate algorithm based on different factors such as arrival rate, key duplication, and number of cores.

Systems Integration
Stream slicing can conceptually be supported by every dataflow system. Currently, Scotty is integrated into the various open-source systems listed before (Sect. 6). For evaluation, Scotty was used as a window operator within those systems and compared to their built-in operators (Fig. 3). The results show that it provides higher throughput than all of the tested systems' built-in operators.
In an experiment performed by Zeuch et al. [73], the throughput of the Yahoo! Streaming Benchmark [70] was evaluated on a Raspberry Pi 3B+ using NebulaStream, Python, Flink, and a hand-optimized Java program. NebulaStream achieves a throughput of more than 10 million tuples per second and a higher performance than the other approaches. Additionally, the system reduces energy requirements while achieving the same performance. In the evaluations of Benson et al. [7], Darwin is compared to the scale-up engine Grizzly [28] and the scale-out engine Apache Flink [13]. Darwin shows the same performance as Grizzly for in-memory processing; Grizzly's additional optimizations could be utilized orthogonally to Darwin. In contrast, Darwin outperforms Flink by over an order of magnitude.

Applications
Poepsel et al. [48] evaluate Condor on several representative jobs using one real and four synthetic datasets, comparing it to one-off implementations of the count-min sketch and to Yahoo! DataSketches [70]. Condor performs better than the one-off custom implementations and maintains high performance even for a large number of concurrent windows. Moreover, Condor's sketch implementations are designed for highly parallel applications, and the results show that they scale linearly with the number of cores in the system, which is not the case for Yahoo! DataSketches. This also holds for the other provided synopses and evaluation operators. In summary, the framework allows high-throughput parallel synopsis maintenance while keeping the same accuracy as centralized techniques.
Traub et al. [64] show that the environment I2 significantly reduces the number of processed and transferred data points. It supports visualizing high-bandwidth data streams without a reduction of the quality.
In experiments of Jugel et al. [32], the M4 aggregation technique is compared with line simplification techniques and common naive approaches (e.g., averaging, sampling, rounding) on real-world data sets. Measuring the visualization quality shows that M4 achieves error-free visualizations. Furthermore, it provides a reduction of data volume by two orders of magnitude and a decrease of latency by one order of magnitude.

Future Work
There are several directions for future research on efficient algorithms for window aggregations and windowed stream joins on distributed stream processing systems. Window concepts become more complex since new window measures have been introduced (e.g., delta-based [23], multi-measure [14,23,66]) and novel window types have been proposed (e.g., Frames [25], window policies [23], snapshot windows [4,24]). Window aggregation and windowed join algorithms have to adapt to such sophisticated window schemes. Intra-window join algorithms have to consider and dynamically adjust to various factors such as workload, metrics, and hardware [76].
The presented optimization techniques are orthogonal and can be combined to enhance multiple aspects of a system. Future stream processing systems have to include multiple optimization techniques at the operator, query, hardware, and application levels to meet the high-throughput and low-latency requirements of modern streaming applications. We are working towards this goal at BIFOLD by developing the NebulaStream system [73,74].

Conclusion
Researchers at TU Berlin and BIFOLD have conducted various research related to windowed operations on streaming systems. We surveyed these works covering efficient window aggregations, modern hardware optimizations, distributed stream processing approaches, and applications.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.