A Survey on the Evolution of Stream Processing Systems

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'22) streaming systems, and discuss recent trends and open problems.


Introduction
Applications of stream processing technology have gone through a resurgence, penetrating multiple and very diverse industries. Nowadays, virtually all Cloud vendors offer first-class support for deploying managed stream processing pipelines, while streaming systems are used in a variety of use-cases that go beyond the classic streaming analytics (windows, aggregates, joins, etc.). For instance, web companies are using stream processing for dynamic car-trip pricing, banks apply it for credit card fraud detection, while traditional industries apply streaming technology for real-time harvesting analytics. At the moment of writing we are witnessing a trend towards using stream processors to build more general event-driven architectures [85], large-scale continuous ETL and analytics, and microservices [81].
During the last 20 years, streaming technology has evolved significantly, under the influence of database and other research communities. Figure 1 presents a schematic categorization of influential streaming systems into three generations and highlights each era's domains of focus. Although the foundations of stream processing have remained largely unchanged over the years, stream processing systems have transformed into sophisticated and scalable engines, producing correct results in the presence of failures. Early systems and languages were designed as extensions of relational execution engines, with the addition of windows. Modern streaming systems have evolved in the way they reason about completeness and ordering (e.g., out-of-order computation) and have witnessed architectural paradigm shifts that constituted the foundations of processing guarantees, reconfiguration, and state management. At the moment of writing, we observe yet another paradigm shift towards general event-driven architectures, actor-like programming models and microservices [9,29], and a growing use of modern hardware [86,126,141,142].
This survey is the first to focus on the evolution of streaming systems rather than the state of the field at a particular point in time. To the best of our knowledge, this is also the first attempt at understanding the underlying reasons why certain early techniques and designs prevailed in modern systems while others were abandoned. Further, by examining how ideas survived, evolved, and were often reinvented, we reconcile the terminology used by the different generations of streaming systems.

Contributions
With this survey paper, we make the following contributions:
- We summarize existing approaches to streaming systems design and categorize early and modern stream processors in terms of underlying assumptions and mechanisms.
- We compare early and modern stream processing systems with regard to out-of-order data management, state management, fault-tolerance, high availability, load management, elasticity, and reconfiguration.
- We highlight important but overlooked works that have influenced today's streaming systems design.
- We establish a common nomenclature for fundamental streaming concepts, often described by inconsistent terms in different systems and communities.

Related surveys and research collections
We view the following surveys as complementary to ours and recommend them to readers interested in diving deeper into a particular aspect of stream processing or those who seek a comparison between streaming technology and advances from adjacent research communities. Cugola and Margara [50] provide a view of stream processing with regard to related technologies, such as active databases and complex event processing systems, and discuss their relationship with data streaming systems. Further, they provide a categorization of streaming languages and streaming operator semantics. The language aspect is further covered in another recent survey [71], which focuses on the languages developed to address the challenges in very large data streams. It characterizes streaming languages in terms of data model, execution model, domain, and intended user audience. Röger and Mayer [114] present an overview of recent work on parallelization and elasticity approaches of streaming systems. They define a general system model which they use to introduce operator parallelization strategies and parallelism adaptation methods. Their analysis also aims at comparing elasticity approaches originating in different research communities. Hirzel et al. [72] present an extensive list of logical and physical optimizations for streaming query plans. They present a categorization of streaming optimizations in terms of their assumptions, semantics, applicability scenarios, and trade-offs. They also present experimental evidence to reason about profitability and guide system implementers in selecting appropriate optimizations. To, Soto, and Markl [127] survey the concept of state and its applications in big data management systems, covering also aspects of streaming state. Finally, Dayarathna and Perera [52] present a survey of the advances of the last decade with a focus on system architectures, use-cases, and hot research topics. They summarize recent systems in terms of their features, such as what types of operations they support, their fault-tolerance capabilities, their use of programming languages, and their best reported performance.
Theoretical foundations of streaming data management and streaming algorithms are out of the scope of this survey. A comprehensive collection of influential works on these topics can be found in Garofalakis et al. [60]. The collection focuses on major contributions of the first generation of streaming systems. It reviews basic algorithms and synopses, fundamental results in stream data mining, streaming languages and operator semantics, and a set of representative applications from different domains.

Survey organization
We begin by presenting the essential elements of the domain in Section 2. Then we elaborate on each of the important functionalities offered by stream processing systems: out-of-order data management (Section 3), state management (Section 4), fault tolerance and high availability (Section 5), and load management, elasticity, and reconfiguration (Section 6). Each one of these sections contains a Vintage vs. Modern discussion that compares early to contemporary approaches and a summary of open problems. We summarize our major findings, discuss prospects, and conclude in Table 1 and Section 7.

Preliminaries
In this section, we provide necessary background and explain fundamental stream processing concepts the rest of this survey relies on. We discuss the key requirements of a streaming system, introduce the basic streaming data models, and give a high-level overview of the architecture of early and modern streaming systems.

Requirements of streaming systems
A data stream is a data set that is produced incrementally over time, rather than being available in full before its processing begins [60]. Data streams are high-volume, real-time data that might be unbounded. Therefore, stream processing systems can neither store the entire stream in an accessible way nor can they control the data arrival rate or order. In contrast to traditional data management infrastructure, streaming systems have to process elements on-the-fly using limited memory. Stream elements arrive continuously and either bear a timestamp or are assigned one on arrival.
Correspondingly, a streaming query ingests events and produces results in a continuous manner, using a single pass or a limited number of passes over the data. Streaming query processing is challenging for multiple reasons. First, continuously producing updated results might require storing historical information about the stream seen so far in a compact representation that can be queried and updated efficiently. Such summary representations are known as sketches or synopses. Second, in order to handle high input rates, certain queries cannot afford to continuously update indexes and materialized views. Third, stream processors cannot rely on the assumption that state can be reconstructed from associated inputs. To achieve acceptable performance, streaming operators need to leverage incremental computation.
The aforementioned characteristics of data streams and continuous queries provide a set of unique requirements for streaming systems, other than the evident performance ones of low latency and high throughput. Given the lack of control over the input order, a streaming system needs to produce correct results when receiving out-of-order and delayed data (cf. Section 3). It needs to implement mechanisms that estimate a stream's progress and reason about result completeness. Further, the long-running nature of streaming queries demands that streaming systems manage accumulated state (cf. Section 4) and guard it against failures (cf. Section 5). Finally, having no control over the data input rate requires stream processors to be adaptive so that they can handle workload variations without sacrificing performance (cf. Section 6).

Streaming data models
There exist many theoretical streaming data models, mainly serving the purpose of studying the space requirements and computational complexity of streaming algorithms and understanding which streaming computations are practical. For instance, a stream can be modeled as a dynamic one-dimensional vector [60]. The model defines how this dynamic vector is updated when a new element of the stream becomes available. While theoretical streaming data models are useful for algorithm design, early stream processing systems instead adopted extensions of the relational data model. Recent streaming dataflow systems, especially those influenced by the MapReduce philosophy, place the responsibility of data stream modeling on the application developer.

Relational Streaming Model
In the relational streaming model as implemented by first-generation systems [7,15,43,49], a stream is interpreted as describing a changing relation over a common schema. Streams are either produced by external sources and update relation tables or are produced by continuous queries and update materialized views. An operator outputs event streams that describe the changing view computed over the input stream according to the relational semantics of the operator. Thus, the semantics and schema of the relation are imposed by the system. STREAM [15] defines streams as bags of tuple-timestamp pairs and relations as time-varying bags of tuples. The implementation unifies both types as sequences of timestamped tuples, where each tuple also carries a flag that denotes whether it is an insertion or a deletion. Input streams consist of insertions only, while relations may also contain deletions. TelegraphCQ [43] uses a similar data model. Aurora [7] models streams as append-only sequences of tuples, where a set of attributes denote the key and the rest of the attributes denote values. Borealis [6] generalizes this model to support insertion, deletion, and replacement messages. Messages may also contain additional fields related to QoS metrics. Gigascope [49] extends the sequence database model. It assumes that stream elements bear one or more timestamps or sequence numbers, which generally increase (or decrease) with the ordinal position of a tuple in a stream. Ordering attributes can be (strictly) monotonically increasing or decreasing, monotone non-repeating, or increasing within a group of records. In CEDR [25], stream elements bear a valid timestamp, $V_s$, after which they are considered valid and can contribute to the result. Alternatively, events can have validity intervals. The contents of the relation at time $t$ are all events with $V_s \le t$.
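The unified representation described for STREAM, a sequence of timestamped tuples flagged as insertions or deletions, can be illustrated with a minimal sketch. The function name `relation_at`, the +1/-1 encoding of the flag, and the sample tuples are our own assumptions for illustration and are not taken from any of the cited systems.

```python
from collections import Counter


def relation_at(stream, t):
    """Rebuild the bag of tuples that make up the relation at time t.

    `stream` is an iterable of (timestamp, tuple, flag) triples, where
    flag is +1 for an insertion and -1 for a deletion (our encoding).
    """
    bag = Counter()
    for timestamp, tup, flag in stream:
        if timestamp <= t:
            bag[tup] += flag
    return +bag  # unary plus drops entries whose count is not positive


# Hypothetical update stream: two insertions followed by one deletion.
updates = [
    (1, ("sensor-a", 20), +1),
    (2, ("sensor-b", 31), +1),
    (3, ("sensor-a", 20), -1),
]

print(relation_at(updates, 2))  # both tuples are present at time 2
print(relation_at(updates, 3))  # only the sensor-b tuple remains at time 3
```

The same stream therefore describes a different relation at each point in time, which is the sense in which a relation is a time-varying bag of tuples.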

Dataflow Streaming Model
The dataflow streaming model, as implemented by systems of the second generation [11,32,140], does not impose any strict schema or semantics on the input stream elements, other than the presence of a timestamp. While some systems, like Naiad [105], require that all stream elements bear a logical timestamp, other systems, such as Flink [32] and Dataflow [11], expect the declaration of a time domain. Applications can operate in one of three modes: (i) event (or application) time is the time when events are generated at the sources, (ii) processing time is the time when events are processed in the streaming system, and (iii) ingestion time is the time when events arrive at the system. Modern dataflow streaming systems can ingest any type of input stream, irrespective of whether its elements represent additions, deletions, replacements or deltas. The application developer is responsible for imposing the semantics and writing the operator logic to update state accordingly and produce correct results. Designating keys and values is also usually not required at ingestion time; however, keys must be defined when using certain data-parallel operators, such as windows.

Architectures of Streaming Systems
The general architecture of streaming systems has evolved significantly over the last two decades. Before we delve into the specific approaches to out-of-order management, state, fault tolerance, and load management, we outline some fundamental differences between early (1st generation) and modern (2nd generation) streaming systems. Table 1 summarizes our findings.
The architecture of a first-generation DSMS follows closely that of a database management system (DBMS), with the addition of certain components designated to address the requirements of streaming data (cf. Section 2.1). In particular, the input manager is responsible for ingesting streams and possibly buffering and ordering input elements. The scheduler determines the order of operator execution, as well as the number of tuples to process and push to the outputs. Two important additional components are the quality monitor and load shedder, which monitor stream input rates and query performance and selectively drop input records to meet target latency requirements. Queries are compiled into a shared query plan which is optimized and submitted to the query execution engine. In the common case, a DSMS supports both ad-hoc and continuous queries. Early architectures are designed with the goal of providing fast, but possibly approximate, results to queries.
The next generation of distributed dataflow systems is usually deployed on shared-nothing clusters of machines. Dataflow systems employ task and data parallelism, have explicit state management support, and implement advanced fault-tolerance capabilities to provide result guarantees. Distributed workers execute parallel instances of one or more operators (tasks) on disjoint stream partitions. In contrast to DSMSs, queries are independent of each other, maintain their own state, and are assigned dedicated resources. Every query is configured individually and submitted for execution as a separate job. Input sources are typically assumed to be replayable and state is persisted to embedded or external stores. Modern architectures prioritize high throughput, robustness, and result correctness over low latency.
Despite the evident differences between early and modern streaming systems' architectures, many fundamental aspects have remained unchanged in the past two decades. The following sections examine in detail how streaming systems have evolved in terms of out-of-order processing, state capabilities, fault-tolerance, and load management.

Managing Event Order and Timeliness
A streaming system receives data continuously from one or more input sources. Typically the order of data in a stream is part of the stream's semantics [98]. Depending on the computations to perform, a streaming system may have to process stream tuples in a certain order to provide semantically correct results [119]. However, in the general case, a stream's data tuples arrive out of order [91,132] for reasons explained in Section 3.1.
Out-of-order data tuples [119,130] arrive in a streaming system after tuples with later event time timestamps.
In the rest of the paper we use the terms disorder [98] and out-of-order [10,94] to refer to the disturbance of order in a stream's data tuples.Reasoning about order and managing disorder are fundamental considerations for the operation of streaming systems.
In the following, we highlight the causes of disorder in Section 3.1, clarify the relationship between disorder in a stream's tuples and processing progress in Section 3.2, and outline the two key system architectures for managing out-of-order data in Section 3.3. Then, we describe the consequences of disorder in Section 3.4 and present the mechanisms for managing disorder in Section 3.5. Finally, in Section 3.6, we discuss the differences of out-of-order data management in early and modern systems and we present open problems in Section 3.7.

Causes of Disorder
Disorder in data streams may be due to stochastic factors that are external to a streaming system or to the operations taking place inside the system.
The most common external factor that introduces disorder to streams is the network [87,119]. Depending on the network's reliability, bandwidth, and load, the routing of some stream tuples can take longer to complete compared to the routing of others, leading to a different arrival order in a streaming system. Even if the order of tuples in an individual stream is preserved, ingestion from multiple sources, such as sensors, typically results in a disordered collection of tuples, unless the sources are carefully coordinated, which is rare.
External factors aside, specific operations on streams break tuple order. First, join processing takes two streams and produces a shuffled combination of the two, since a parallel join operator repartitions the data according to the join attribute [133] and outputs join results by order of match [67,80]. Second, windowing based on an attribute different to the ordering attribute reorders the stream [49]. Third, data prioritization [112,134] by using an attribute different to the ordering one also changes the stream's order. Finally, the union operation on two unsynchronized streams yields a stream with all tuples of the two input streams interleaving each other in random order [7].

Disorder and Processing Progress
In order to manage disorder, streaming systems need to detect processing progress. We discuss how disorder management and progress tracking are intertwined in Sections 3.3 and 3.4.
Progress regards how much the processing of a stream's tuples has advanced over time. Processing progress can be defined and quantified with the aid of an attribute A of a stream's tuples that orders the stream. The processing of the stream progresses when the smallest value of A among the unprocessed tuples increases over time [94]. A is then a progressing attribute, and the oldest value of A is itself a measure of progress because it denotes how far the system has advanced in processing tuples since the beginning. Beyond this definition, streaming systems often make their own interpretation of progress, which may involve more than one attribute.
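The definition above can be made concrete with a minimal sketch. The helper name `progress_point` and the dictionary-based tuples with a `ts` attribute are our own illustrative assumptions.

```python
def progress_point(unprocessed, attribute):
    """Smallest value of the progressing attribute among unprocessed tuples.

    Returns None when nothing is pending.
    """
    return min((tup[attribute] for tup in unprocessed), default=None)


# Hypothetical pending tuples, ordered by arrival rather than by "ts".
pending = [{"ts": 3}, {"ts": 5}, {"ts": 4}]

before = progress_point(pending, "ts")  # 3
pending.remove({"ts": 3})               # the oldest tuple gets processed
after = progress_point(pending, "ts")   # 4, so processing has progressed
```

Progress is observed only when the oldest pending value moves forward; processing the tuple with `ts` equal to 5 first would leave the progress point at 3.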

System Architectures for Managing Disorder
Two main architectural archetypes have influenced the design of streaming systems with respect to managing disorder: (i) in-order processing systems [7,16,49,119], and (ii) out-of-order processing systems [10,32,94,105].
In-order processing systems manage disorder by fixing a stream's order. As a result, they essentially track progress by monitoring how far the processing of a data stream has advanced. In-order systems buffer and reorder tuples up to a lateness bound. Then, they forward the reordered tuples for processing and clear the corresponding buffers.
In out-of-order processing systems, operators or a global authority produce progress information using any of the metrics detailed in Section 3.5.1, and propagate it to the dataflow graph. The information typically reflects the oldest unprocessed tuple in the system and establishes a lateness bound for admitting out-of-order tuples. In contrast to in-order systems, tuples are processed without delay in their arrival order, as long as they do not exceed the lateness bound.
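The buffering and reordering step of an in-order system can be sketched as follows. This is a minimal illustration under our own assumptions: the class name `ReorderBuffer` is hypothetical, tuples are reduced to their timestamps, and the lateness bound is expressed in event-time units relative to the largest timestamp seen so far.

```python
import heapq


class ReorderBuffer:
    """Buffers timestamps and releases them in order up to a lateness bound."""

    def __init__(self, lateness_bound):
        self.bound = lateness_bound
        self.heap = []                        # min-heap of buffered timestamps
        self.max_seen = float("-inf")
        self.released_up_to = float("-inf")   # last timestamp forwarded

    def insert(self, ts):
        """Buffer ts and return the timestamps that can now be forwarded."""
        if ts < self.released_up_to:
            return []                         # beyond the lateness bound: dropped
        heapq.heappush(self.heap, ts)
        self.max_seen = max(self.max_seen, ts)
        released = []
        # Anything at least `bound` older than the newest tuple is safe to emit.
        while self.heap and self.heap[0] <= self.max_seen - self.bound:
            released.append(heapq.heappop(self.heap))
        if released:
            self.released_up_to = released[-1]
        return released


buffer = ReorderBuffer(lateness_bound=2)
forwarded = []
for ts in [1, 2, 4, 3, 6, 5, 7]:
    forwarded.extend(buffer.insert(ts))
# forwarded is now [1, 2, 3, 4, 5]; 6 and 7 are still buffered.
```

An out-of-order system would instead hand each tuple to the operator on arrival and use the bound only to decide which late tuples are still admitted, which removes the buffering latency at the cost of keeping operator state open for longer.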

Effects of Disorder
In unbounded data processing, disorder can impede progress [94] or lead to wrong results if ignored [119].
Disorder affects processing progress when the operators that comprise the topology of the computation require ordered input. Various implementations of join and aggregate rely on ordered input to produce correct results [7,119]. When operators in in-order systems receive out-of-order tuples, they have to reorder them prior to including them in the window they belong to. Reordering, however, imposes processing overhead, memory space overhead, and latency. Out-of-order systems, on the other hand, track progress and process data in whatever order they arrive, up to the lateness bound. To include late tuples in results, they additionally need to store the processing state up to the lateness bound. As a sidenote, order-insensitive operators [7,94,119], such as apply, project, select, dupelim, and union, are agnostic to disorder in a stream and produce correct results even when presented with disordered input.
Ignoring out-of-order data might lead to incorrect results if the output is computed on partial input only. Thus, a streaming system needs to be capable of processing out-of-order data and incorporate their effect into the computation. However, without knowledge of how late data can be, waiting indefinitely can block output and accumulate large computation state. This concern manifests on all architectures and we discuss next how it can be countered with disorder management mechanisms.

Mechanisms for Managing Disorder
In this section, we elaborate on influential mechanisms for managing disorder in unbounded data, namely slack [7], heartbeats [119], low-watermarks [94], pointstamps [105], and triggers [11]. Heartbeats, low-watermarks, and pointstamps track processing progress and quantify a lateness bound using a metric, such as time. In contrast, slack merely quantifies the lateness bound. If tuples arrive after the lateness bound expires, triggers can be used to update computation results in revision processing [6]. We also discuss punctuations [132], a generic mechanism for communicating information across the dataflow graph, that has been heavily used as a vehicle in managing disorder.

Tracking processing progress
We present the four most notable progress tracking mechanisms: slack, heartbeats, low-watermarks, and pointstamps. In addition, we accompany the analysis of each mechanism with a figure. Figure 2 showcases the differences between slack, heartbeats, and low-watermarks. The figure depicts a simple aggregation operator that counts tuples in 4-second event time tumbling windows. The operator waits for some indication that event time has advanced past the end timestamp of a window so that it computes and outputs an aggregate per window. The indication varies according to the progress-tracking mechanism. The input to this operator consists of seven tuples, each containing only a timestamp, from t=1 to t=7. The timestamp signifies the event time in seconds at which the tuple was produced in the input source. Each tuple contains a different timestamp and all tuples are dispatched from a source in ascending order of timestamp. Due to network latency, the tuples may arrive at the streaming system out of order.

3.5.1.1 Slack is a simple mechanism that involves waiting for out-of-order data for a fixed amount of a certain metric. Slack originally denoted the number of tuples intervening between the actual occurrence of an out-of-order tuple and the position it would have in the input stream if it arrived on time. However, it can also be quantified in terms of elapsed time. Essentially, slack marks a fixed grace period for late tuples.
Figure 2a presents the slack mechanism. In order to accommodate disorder, the operator admits out-of-order tuples up to slack=1. Thus, the operator, having admitted tuples with t=1 and t=2 (not depicted in the figure), will receive the tuple with t=4. The timestamp of the tuple coincides with the max timestamp of the first window for interval [0, 4). Normally, this tuple would cause the operator to close the window and compute and output the aggregate, but because of the slack value the operator will wait to receive one more tuple. The next tuple t=3 belongs to the first window and is included there. At this point, slack also expires and this event finally triggers the window computation, which outputs C=3 for t=[1,2,3]. On the contrary, the operator will not accept t=5 at the tail of the input because it arrives two tuples after its natural order and is not covered by the slack value.
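The walkthrough above can be replayed with a short simulation. This is one possible reading of the example rather than Aurora's implementation: the function name `count_with_slack`, the way lateness is measured (the number of earlier arrivals with a larger timestamp), and the arrival order 1, 2, 4, 3, 6, 7, 5 are our assumptions, chosen to be consistent with the description.

```python
def count_with_slack(arrivals, window=4, slack=1):
    """Count tuples per tumbling window, tolerating `slack` tuples of disorder.

    Returns (results, rejected), where results holds (window_start, count)
    pairs in the order in which windows close.
    """
    counts, waiting, closed = {}, {}, set()
    seen, results, rejected = [], [], []

    def close(k):
        results.append((k * window, counts.pop(k)))
        waiting.pop(k, None)
        closed.add(k)

    for ts in arrivals:
        # Lateness: how many tuples with a larger timestamp arrived earlier.
        lateness = sum(1 for s in seen if s > ts)
        seen.append(ts)
        w = ts // window
        admitted = lateness <= slack and w not in closed
        if admitted:
            counts[w] = counts.get(w, 0) + 1
        else:
            rejected.append(ts)
        # Every arrival uses up one unit of slack of the windows that wait.
        for k in sorted(waiting):
            waiting[k] -= 1
            if waiting[k] == 0:
                close(k)
        # A tuple past a window's end starts that window's grace period.
        if admitted:
            for k in sorted(counts):
                if k < w and k not in waiting:
                    if slack == 0:
                        close(k)
                    else:
                        waiting[k] = slack
    return results, rejected


results, rejected = count_with_slack([1, 2, 4, 3, 6, 7, 5])
# results == [(0, 3)]: window [0, 4) closes with C=3 once t=3 has arrived.
# rejected == [5]: t=5 arrives two tuples late, which exceeds slack=1.
```

With slack=0 the same input closes the first window as soon as t=4 arrives, so t=3 is rejected and the window reports a count of 2.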

3.5.1.2 A heartbeat is an alternative to slack that consists of an external signal carrying progress information about a data stream. It contains a timestamp indicating that all succeeding stream tuples will have a timestamp larger than the heartbeat's timestamp. Heartbeats can either be generated by an input source or deduced by the system by observing environment parameters, such as network latency bound, application clock skew between input sources, and out-of-order data generation [119].
Figure 2b depicts the heartbeat mechanism. An input manager buffers and orders the incoming tuples by timestamp. The number of tuples buffered, two in this example (t=5, t=6), is of no importance. The source periodically sends a heartbeat to the input manager, i.e., a signal with a timestamp. Then the input manager dispatches to the operator all tuples with timestamp less than or equal to the timestamp of the heartbeat in ascending order. For instance, when the heartbeat with timestamp t=2 arrives in the input manager (not shown in the figure), the input manager dispatches the tuples with timestamp t=1 and t=2 to the operator. The input manager then receives tuples with t=4, t=6, and t=5 in this order and puts them in the right order. When the heartbeat with timestamp t=4 arrives, the input manager dispatches the tuple with timestamp t=4 to the operator. This tuple triggers the computation of the first window for interval [0, 4). The operator outputs C=2, counting the two tuples with t=[1,2] (not depicted in the figure). The input manager ignores the incoming tuple with timestamp t=3 as it is older than the latest heartbeat with timestamp t=4.
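The interplay between the input manager and the in-order operator in this example can be sketched as follows. The class names `WindowCounter` and `InputManager` and their methods are hypothetical, and tuples are again reduced to their timestamps; the event sequence replays the walkthrough.

```python
import heapq


class WindowCounter:
    """In-order operator: counts tuples per 4-second tumbling window."""

    def __init__(self, window=4):
        self.window = window
        self.current = None   # index of the window being filled
        self.count = 0
        self.results = []     # (window_start, count) pairs

    def process(self, ts):
        w = ts // self.window
        if self.current is not None and w > self.current:
            # A tuple of a later window closes the current one.
            self.results.append((self.current * self.window, self.count))
            self.count = 0
        self.current = w
        self.count += 1


class InputManager:
    """Buffers tuples and dispatches them in order when a heartbeat arrives."""

    def __init__(self, operator):
        self.operator = operator
        self.buffer = []                      # min-heap of timestamps
        self.last_heartbeat = float("-inf")
        self.ignored = []

    def on_tuple(self, ts):
        if ts <= self.last_heartbeat:
            self.ignored.append(ts)           # older than the latest heartbeat
        else:
            heapq.heappush(self.buffer, ts)

    def on_heartbeat(self, hb):
        self.last_heartbeat = max(self.last_heartbeat, hb)
        while self.buffer and self.buffer[0] <= self.last_heartbeat:
            self.operator.process(heapq.heappop(self.buffer))


counter = WindowCounter()
manager = InputManager(counter)
manager.on_tuple(1)
manager.on_tuple(2)
manager.on_heartbeat(2)      # dispatches t=1 and t=2
for ts in (4, 6, 5):
    manager.on_tuple(ts)
manager.on_heartbeat(4)      # dispatches t=4, which closes window [0, 4)
manager.on_tuple(3)          # ignored: not newer than heartbeat t=4
# counter.results == [(0, 2)], manager.ignored == [3],
# and t=5, t=6 remain buffered.
```

Compared with slack, the lateness bound here is dictated by the heartbeat timestamps rather than by a fixed number of tuples, so the operator itself never sees disordered input.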

3.5.1.3 The low-watermark for an attribute A of a stream is the lowest value of A within a certain subset of the stream. Thus, future tuples will probabilistically bear a higher value than the current low-watermark for the same attribute. Often, A is a tuple's event time timestamp. The mechanism is used by a streaming system to track processing progress via the low-watermark for A and to admit out-of-order data whose attribute A's value is not smaller than the low-watermark. Further, it can be used to remove state that is maintained for A, such as the corresponding hash table entries of a streaming join computation.
Figure 2c presents the low-watermark mechanism, which signifies the oldest pending work in the system. Here punctuations carrying the low-watermark timestamp decide when windows will be closed and computed. After receiving two tuples with t=1 and t=2, the corresponding low-watermark for t=2 (which is propagated downstream), and tuple t=3, the operator receives tuple t=5. Since this tuple carries an event time timestamp greater than or equal to 4, which is the end timestamp of the first window, it could be the one to cause the window to fire or close. However, this approach would not account for out-of-order data. Instead, the window closes when the operator receives the low-watermark with t=4. At this point, the operator computes C=3 for t=[1,2,3] and assigns tuples with t=[5,6] to the second window with interval [4, 8). The operator will not admit tuple t=4 because it is not greater (more recent) than the current low-watermark value t=4.
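The low-watermark walkthrough can be replayed with a minimal sketch of an out-of-order window operator. The class `WatermarkWindowCounter` and its method names are hypothetical, and the admission rule follows the Figure 2c example, in which a tuple is admitted only if its timestamp is strictly greater than the current low-watermark.

```python
class WatermarkWindowCounter:
    """Out-of-order operator: windows close when the low-watermark passes them."""

    def __init__(self, window=4):
        self.window = window
        self.watermark = float("-inf")
        self.open = {}        # window index -> timestamps assigned so far
        self.results = []     # (window_start, count) pairs
        self.rejected = []

    def on_tuple(self, ts):
        if ts <= self.watermark:
            self.rejected.append(ts)   # not more recent than the low-watermark
            return
        # Tuples are added to their window immediately, in arrival order.
        self.open.setdefault(ts // self.window, []).append(ts)

    def on_watermark(self, wm):
        self.watermark = max(self.watermark, wm)
        for w in sorted(self.open):
            if (w + 1) * self.window <= self.watermark:
                self.results.append((w * self.window, len(self.open.pop(w))))


counter = WatermarkWindowCounter()
counter.on_tuple(1)
counter.on_tuple(2)
counter.on_watermark(2)
counter.on_tuple(3)
counter.on_tuple(5)          # does not close window [0, 4) by itself
counter.on_tuple(6)
counter.on_watermark(4)      # closes window [0, 4) with C=3
counter.on_tuple(4)          # rejected: not greater than low-watermark t=4
# counter.results == [(0, 3)], counter.open == {1: [5, 6]},
# counter.rejected == [4]
```

In contrast to the heartbeat example, no reordering buffer sits in front of the operator: tuples update window state on arrival and the low-watermark only decides when that state is final.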
3.5.1.4 Comparison between heartbeats, slack, and punctuations. Heartbeats and slack are both external to a data stream. Heartbeats are signals communicated from an input source to a streaming system's ingestion point. Unlike heartbeats, which are an internal mechanism of a streaming system hidden from users, slack is part of the query specification provided by users [7].
Heartbeats and low-watermarks are similar in terms of progress-tracking logic. However, two important differences set them apart. First, while heartbeats expose the progress of stream tuple generation at the input sources, the low-watermark extends this to the processing progress of computations within the streaming system by reflecting their oldest pending work. Second, the low-watermark generalizes the concept of the oldest value, which signifies the current progress point, to any progressing attribute of a stream tuple besides timestamps.
In contrast to heartbeats and slack, punctuations are metadata annotations embedded in data streams. A punctuation is itself a stream tuple, which consists of a set of patterns each identifying an attribute of a stream data tuple. A punctuation is a generic mechanism that communicates information across the dataflow graph. Regarding progress tracking, it provides a channel for communicating progress information such as a tuple attribute's low-watermark produced by an operator [94], event time skew [119], or slack [7]. Thus, punctuations can convey which data cease to appear in an input stream; for instance, the data tuples with a smaller timestamp than a specific value. Punctuations are useful in other functional areas of a streaming system as well, such as state management, monitoring, and flow control.
3.5.1.5 Pointstamps, like punctuations, are embedded in data streams, but a pointstamp is attached to each stream data tuple as opposed to a punctuation, which forms a separate tuple. Pointstamps are pairs of timestamp and location that position data tuples on a vertex or edge of the dataflow graph at a specific point in time. An unprocessed tuple p at a specific location could-result-in another unprocessed tuple p' with timestamp t' at another location when p can arrive at p' before or at timestamp t'. Unprocessed tuples p with timestamp t are in the frontier of processing progress when no other unprocessed tuples could-result-in p. Thus, tuples bearing t or an earlier timestamp are processed and the frontier moves on. The system enforces that future tuples will bear a greater timestamp than the tuples that generated them. This modeling of processing progress traces the course of data tuples on the dataflow graph with timestamps and tracks the dependencies between unprocessed events in order to compute the current frontier. The concept of a frontier is similar to a low-watermark.
The example shown in Figure 3 showcases how pointstamps and frontiers work. The example in Figure 3a includes three active pointstamps. Pointstamps are active when they correspond to one or more unprocessed events. Pointstamp (1, OP1) is in the frontier of active pointstamps, because its precursor count is 0. The precursor count specifies the number of active pointstamps that could-result-in that pointstamp. In the frontier, notifications for unprocessed events can be delivered. Thus, unprocessed events e1 and e2 can be delivered to OP2 and OP3 respectively. The occurrence count is 2 because both events e1 and e2 bear the same pointstamp. Looking at this snapshot of the dataflow graph, it is easy to see that pointstamp (1, OP1) could-result-in pointstamps (2, OP2) and (2, OP3). Therefore, the precursor count of the latter two pointstamps is 1. A bit later, as Figure 3b depicts, after events e1 and e2 are delivered to OP2 and OP3 respectively, their processing results in the generation of new events e5 and e6, which bear the same pointstamp as unprocessed events e3 and e4 respectively. Since there are no more unprocessed events with timestamp 1, and the precursor count of pointstamps (2, OP2) and (2, OP3) is 0, the frontier moves on to these active pointstamps. Consequently, all four event notifications can be delivered. Obsolete pointstamps (1, OP1), (2, OP2), and (2, OP3) are removed from their location, since they correspond to no unprocessed events. Although this example is made simple for educational purposes, the progress tracking mechanism has the power to track the progress of arbitrary iterative and nested computations.
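The bookkeeping in this example can be reproduced with a small sketch. It is a deliberate simplification and not Naiad's algorithm: we assume an acyclic graph with edges OP1 to OP2 and OP1 to OP3, integer timestamps, and a could-result-in relation that holds when the source timestamp is not larger than the target timestamp and the target location is reachable from the source location. All function names are hypothetical.

```python
from collections import Counter

# Assumed topology of the example: OP1 feeds both OP2 and OP3.
EDGES = {"OP1": ["OP2", "OP3"], "OP2": [], "OP3": []}


def reachable(src, dst, edges=EDGES):
    """True if dst can be reached from src (a location reaches itself)."""
    stack, visited = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in visited:
            visited.add(node)
            stack.extend(edges[node])
    return False


def could_result_in(p, q):
    """Simplified relation between pointstamps (timestamp, location)."""
    (t1, loc1), (t2, loc2) = p, q
    return p != q and t1 <= t2 and reachable(loc1, loc2)


def frontier(occurrence):
    """Return (frontier, precursor counts) for the active pointstamps."""
    active = [p for p, n in occurrence.items() if n > 0]
    precursor = {q: sum(could_result_in(p, q) for p in active) for q in active}
    return sorted(q for q, count in precursor.items() if count == 0), precursor


# Snapshot of Figure 3a: e1, e2 at (1, OP1); e3 at (2, OP2); e4 at (2, OP3).
occurrence = Counter({(1, "OP1"): 2, (2, "OP2"): 1, (2, "OP3"): 1})
front_a, precursors_a = frontier(occurrence)
# front_a == [(1, "OP1")]; the other two pointstamps have precursor count 1.

# Snapshot of Figure 3b: e1 and e2 are processed and produce e5 and e6.
occurrence[(1, "OP1")] -= 2
occurrence[(2, "OP2")] += 1
occurrence[(2, "OP3")] += 1
front_b, _ = frontier(occurrence)
# front_b == [(2, "OP2"), (2, "OP3")]
```

Recomputing precursor counts from scratch keeps the sketch short; an actual system would maintain occurrence and precursor counts incrementally as events are produced and retired.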
Pointstamps/frontiers track processing progress regardless of the notion of event time. However, it is possible for users to capture out-of-order data with pointstamps/frontiers by establishing a two-dimensional frontier of event time and processing time that is flexibly open on the side of event time.

Tracking progress of out-of-order data in cyclic queries
Cyclic queries require special treatment for tracking progress. A cyclic query always contains a binary operator, such as a join or a union. The output produced by the binary operator meets a loop further in the dataflow graph that connects back to one of the binary operator's input channels. In a progress model that uses punctuations, for instance, the binary operator forwards a punctuation only when it appears in both of its input channels; otherwise it blocks, waiting for both to arrive. Since one of the binary operator's input channels depends on its own output channel, a deadlock is inevitable.
Chandramouli et al. [41] propose an operator for detecting progress in cyclic streaming queries on the fly. The operator introduces a speculative punctuation in the loop that is derived from the passing events' timestamp. While the punctuation flows in the loop, the operator observes the stream's tuples to validate its guess. Once the guess is validated and the speculative punctuation re-enters the operator, it becomes a regular punctuation that carries progress information downstream. Then a new speculative punctuation is generated and fed into the loop. By combining a dedicated operator, speculative output, and punctuations, this work manages to track progress and tolerate disorder in cyclic streaming queries. The approach works for strongly convergent queries and can be utilized in systems that provide speculative output.
Fig. 3: High-level workflow of pointstamps and frontier

In Naiad [105,106], the general progress-tracking model features logical multidimensional timestamps attached to events. Each timestamp consists of the input batch to which an event belongs and an iteration counter for each loop the event traverses. Like in Chandramouli et al. [41], Naiad supports cyclic queries by utilizing a special operator. However, the operator is used to increment the iteration counter of events entering a loop. To ensure progress, the system allows event handlers to dispatch only messages with a larger timestamp than the timestamp of the events currently being processed. This restriction imposes a partial order over all pending events. The order is used to compute the earliest logical time of events' processing completion in order to deliver notifications for producing output. Naiad's progress-tracking mechanism is external to the dataflow. This design accepts the associated implementation complexity in favor of a) efficient delivery of notifications that is proportional to dataflow nodes instead of edges and b) incremental computation that avoids redundant work. Although not directly incorporated, the notion of event time can be encapsulated in multidimensional timestamps to account for out-of-order data.
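To make the structure of such multidimensional timestamps concrete, the following Python sketch models a timestamp as an input batch (epoch) plus one counter per enclosing loop. The function names `ingress`, `feedback`, and `egress`, and the comparison rule (epoch order combined with lexicographic order on the counters) are assumptions of this sketch rather than a definitive rendering of Naiad's interface.

```python
from typing import NamedTuple, Tuple

class Timestamp(NamedTuple):
    epoch: int                      # input batch the event belongs to
    counters: Tuple[int, ...] = ()  # one iteration counter per enclosing loop

def ingress(t: Timestamp) -> Timestamp:
    """Entering a loop adds a fresh iteration counter."""
    return Timestamp(t.epoch, t.counters + (0,))

def feedback(t: Timestamp) -> Timestamp:
    """Going around the loop once increments the innermost counter."""
    return Timestamp(t.epoch, t.counters[:-1] + (t.counters[-1] + 1,))

def egress(t: Timestamp) -> Timestamp:
    """Leaving a loop drops its iteration counter."""
    return Timestamp(t.epoch, t.counters[:-1])

def precedes_or_equals(a: Timestamp, b: Timestamp) -> bool:
    """Partial order: both the epoch and the loop counters must not decrease."""
    return a.epoch <= b.epoch and a.counters <= b.counters

entered = ingress(Timestamp(epoch=3))     # first iteration of the loop
iterated = feedback(feedback(entered))    # two trips around the loop
left = egress(iterated)                   # back outside the loop
```

Two timestamps such as epoch 4 in iteration 0 and epoch 3 in iteration 2 are incomparable under this rule, which is why the order over pending events is partial rather than total.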

Revision processing
Revision processing is the update of computations in the face of late, updated, or retracted data, which require the modification of previous outputs in order to provide correct results. Revision processing made its debut in Borealis [6]. From there on, it has been combined with in-order processing architectures [40,107], as well as out-of-order processing architectures [11,12,25,87]. In some approaches, revision processing works by storing incoming data and revising computations in the face of late, updated, or retracted data [11,12,25]. Other approaches replay affected data, revise computations, and propagate the revision messages to update all affected results until the present [6,107,115]. Finally, a third line of approaches maintains multiple partitions that capture events with different levels of lateness and consolidates partial results [40,87].

Store and revise.
Microsoft's CEDR [25] and StreamInsight [12], and Google's Dataflow [11] buffer or store stream data and process late events, updates, and deletions incrementally by revising the captured values and updating the computations. The Dataflow model [11] divides the concerns for out-of-order data into three dimensions: the event time when late data are processed, the processing time when corresponding results are materialized, and how later updates relate to earlier results. The mechanism that decides the emission of updated results and how the refinement will happen is called a trigger. Triggers are signals that cause a computation to be repeated or updated when a set of specified rules fire.
One important rule regards the arrival of late input data. Triggers ensure output correctness by incorporating the effects of late input into the computation results. Triggers can be defined based on watermarks, processing time, data arrival metrics, and combinations of those; they can also be user-defined. Triggers support three refinement policies: accumulating, where new results overwrite older ones; discarding, where new results complement older ones; and accumulating and retracting, where new results overwrite older ones and older results are retracted. Retractions, or compensations, are also supported in StreamInsight [12].

Replay and revise.
Dynamic revision [6] and speculative processing [107] replay an affected past data subset when a revision tuple is received. An optimization of this scheme relies on two revision processing mechanisms, upstream processing and downstream processing [115]. Both are based on a special-purpose operator, called connection point, that intervenes between two regular operators and stores tuples output by the upstream operator. According to upstream revision processing, an operator downstream from a connection point can ask for a set of tuples to be replayed so that it can calculate revisions based on old and new results. Alternatively, the operator can ask the downstream connection point to retrieve a set of output tuples related to a received revision tuple. Under certain circumstances, the operator can calculate correct revisions by incorporating the net effect of the difference between the original tuple and its revised one into the old result.
Dynamic revision emits delta revision messages, which contain the difference of the output between the original and the revised value. It keeps the input message history of an operator in the connection point of its input queue. Since keeping all messages is infeasible, there is a bound on the history of messages kept. Messages that go further back than this bound cannot be replayed and, thus, revised. Dynamic revision differentiates between stateless and stateful operators. A stateless operator evaluates both the original message and the revised message and emits the delta of their outputs. For instance, if the operator is a filter and the original message satisfies the predicate while the revised one does not, then the operator emits a deletion message for the original. A stateful operator, on the other hand, has to process many messages in order to emit an output. Thus, an aggregation operator has to re-process the whole window for both a revised message and the original message contained in that window in order to emit revision messages. Dynamic revision is implemented in Borealis.
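The difference between stateless and stateful revision can be illustrated with a short Python sketch. The message encodings and function names below are hypothetical; only the deletion case for the filter is described in the text above, while the insertion and replacement cases are natural extensions assumed for the sake of a complete example.

```python
def revise_filter(predicate, original, revised):
    """Stateless case: compare the filter's output for both message versions."""
    passed_before, passes_now = predicate(original), predicate(revised)
    if passed_before and not passes_now:
        return [("delete", original)]             # case described in the text
    if not passed_before and passes_now:
        return [("insert", revised)]              # assumed symmetric case
    if passed_before and passes_now and original != revised:
        return [("replace", original, revised)]   # assumed encoding
    return []                                     # output unaffected

def revise_window_sum(window, index, revised):
    """Stateful case: re-process the whole window with and without the revision."""
    old_result = sum(window)
    new_result = sum(window[:index] + [revised] + window[index + 1:])
    return new_result - old_result                # delta carried by the revision

filter_delta = revise_filter(lambda value: value > 10, original=12, revised=8)
sum_delta = revise_window_sum([1, 2, 3], index=1, revised=5)
```

The filter needs only the two versions of the message, whereas the aggregate needs the stored window, which is why the bounded message history limits how far back a stateful operator can revise.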
Speculative processing, on the other hand, applies snapshot recovery if no output has been produced for a disordered input stream. Otherwise, it retracts all produced output in a recursive manner. Because revision processing is opportunistic in speculative processing, no history bound is set.

Partition and consolidate.
Both order-independent processing [87] and impatience sort [40] are based on partial processing of independent partitions in parallel and consolidation of partial results. In order-independent processing, when a tuple is received after its corresponding progress indicator, a new partition is opened and a new query plan instance processes this partition using standard out-of-order processing techniques. On the contrary, in impatience sort, the latest episode of the vision of CEDR [25], an online sorting operator incrementally orders the input arriving at each partition so that it is emitted in order. The approach uses punctuations to bound the disorder, as opposed to order-independent processing, which can handle events arriving arbitrarily late.
In order-independent processing, partitioning is left for the system to decide, while in impatience sort it is specified by the users. In order-independent processing, tuples that are too old to be considered in their original partition are included in the partition which has the tuple with the closest data. When no new data enter an ad-hoc partition for a long time, the partition is closed and destroyed by means of a heartbeat. Ad-hoc partitions are window-based; when an out-of-order tuple is received that does not belong to one of the ad-hoc partitions, a new ad-hoc partition is introduced. An out-of-order tuple with a more recent timestamp than the window of an ad-hoc partition causes that partition to flush results and close. Order-independent processing is implemented in Truviso.
On the contrary, in impatience sort, users specify a set of reorder latencies that define the buffering time for ingesting and sorting out-of-order input tuples. According to the specified reorder latencies, the system creates different partitions of in-order input streams. After sorting, a union operator merges and synchronizes the output of each partition with the output of a partition that features a lower reorder latency. Thus, the merged output incorporates the partial results provided by the lower-latency partition together with the later updates that the higher-latency partition contains. This way, applications that require fast but partial results can subscribe to a partition with small reorder latency and vice versa. By letting applications choose the desired extent of reorder latency, this design provides for different trade-offs between completeness and freshness of results. Impatience sort is implemented in Microsoft Trill.

1st generation vs. 2nd generation
The importance of event order in data stream processing became obvious in its early days [20], leading to the first wave of simple, intuitive solutions. Early approaches involved buffering and reordering arriving tuples using some measure for adjusting the frequency and lateness of data dispatched to a streaming system in order [7,43,119]. A few years later, the introduction of out-of-order processing [94] improved throughput, latency, and scalability for window operations by keeping track of processing progress without ordering tuples. In the meantime, revision processing [6] was proposed as a strategy for dealing with out-of-order data reactively. In the years to come, in-order, out-of-order, and revision processing were extensively explored, often in combination with one another [11,12,25,87,107]. Modern streaming systems implement a refinement of these original concepts. Interestingly, concepts devised several years ago, like low-watermarks, punctuations, and triggers, which advance the original revision processing, were popularized recently by streaming systems such as Millwheel [10] and the Google Dataflow model [11], Flink [32], and Spark [18]. Table 2 presents how both 1st generation and modern streaming systems implement out-of-order data management.

Open Problems
Managing data disorder entails architecture support and flexible mechanisms. There are open problems at both levels.
First, which architecture is better is an open debate. Although many of the latest streaming systems adopt an out-of-order architecture, opponents point to the architecture's implementation and maintenance complexity. In addition, revision processing, which is used to reconcile out-of-order tuples, is daunting at scale because of the challenging state size. On the other hand, in-order processing is resource-hungry and loses events if they arrive after the disorder bound.
Second, applications receiving data streams from different sources may need to support multiple notions of event time, one per incoming stream, for instance. Finally, data streams from different sources may have disparate latency characteristics that render their watermarks unaligned. Tracking the processing progress of those applications is challenging for today's streaming systems.

State Management
State is effectively what captures all internal side-effects of a continuous stream computation. This includes, for example, active windows, buckets of records, partial or incremental aggregates used in an application, as well as possibly some user-defined variables created and updated during the execution of a stream pipeline. A careful look into how state is exposed and managed in stream processing systems unveils an interesting trace of trends in computer systems and cloud computing as well as a revelation of prospects on upcoming capabilities in event-based computing. This section provides an overview of known approaches, modern directions, and discussions of open problems in the context of state management.

Managing Stream Processing State
Stream state management is still an active research field, incorporating methods on how state should be declared in a stream application, as well as how it should be scaled and partitioned. Furthermore, state management considers state persistence methods for infinite/long-running applications and defines system guarantees and properties to maintain whenever a change in the system occurs.
A change during a system's runtime often requires state reconfiguration. Such a change can be the result of a partial process or network failure, but also of actions that need to be taken to adjust compute and storage capacity (e.g., scaling up/down). Most of these research issues have been introduced in part within the context of pioneering DSMSs such as Aurora and Borealis [36]. Specifically, Borealis has set the foundations in formulating many of these problems, such as the need for embedded state, persistent store access, as well as failure recovery protocols. In Table 3 we categorize known data stream processing systems according to their respective state management approaches. The rest of this section offers an overview of each of the topics in stream state management along with past and currently employed approaches, all of which we categorize as follows:

Programmability & Responsibility
State in a programming model can be either implicitly or explicitly declared and used. We define state programmability as the ability of a streaming system to allow its users to declare user-defined state. State in this case can be, for example, a local variable within a stateful map function, representing a counter. Programmability in state requires support from the underlying execution engine, a feature that directly affects the engine's complexity. Different system trends have influenced both how state can be exposed in a data stream programming model as well as how it should be scoped and managed. In this section, we discuss different approaches and their trade-offs. As shown in Table 3, very few systems prevent their users from defining custom state. These systems focus more on providing a high-level SQL interface on top of a dataflow processor, allowing only their internal operators to define and use state within stateful operations (e.g., joins, windows, aggregates).

State Management Responsibility
An orthogonal aspect to programmability is state management responsibility, which entails the obligation of maintaining state by either the user or the system. State maintenance includes allocating memory/disk space for storing application variables, persisting changes to disk, and recovering state entries from durable storage if needed upon system recovery. The first generation of data-parallel stream processing systems, i.e., Storm [129] and S4 [108], required user-managed state. In such systems, stateful processing was either implemented with no guarantees, making use of custom in-memory data structures, or, often, implemented using external key-value stores, which cover certain scalability and persistence needs. For the rest of the systems available, state management concerns have been internally handled by the streaming systems themselves through the use of explicit state APIs or non-programmable, yet internally managed, state abstractions.

Table 3 (fragment; architecture columns: In-memory, Out-of-Core, External): Aurora/Borealis [47], System, ✓, No persistence; STREAM [14], System, ✓, No persistence; TelegraphCQ [116], System; [37,124], System, ✓, Batch-level.

Discussion
In the early days of data stream management, when main memory was scarce, state had a facilitating role, supporting the implementation of system operators, such as CQL's join, filter, and sort as employed in STREAM [14]. We term this type of state, defined by the designers of a given system and used by the internal operators of that system, system-defined state. A common term used to describe that type of state was "synopsis". Typically, users of such systems were oblivious of the underlying state, and its implicit nature resembled the use of intermediate results in DBMSs. Systems such as STREAM, as well as Aurora/Borealis [36], attached special synopses to a stream application's dataflow graph supporting different operators, such as a window max, a join index, or input source buffers for offsets. A noteworthy feature in STREAM was the capability to re-use synopses compositionally to define other synopses in an application internally in the system. Overall, synopses have been one of the first forms of state in early stream processing systems, primarily for stream processing over shared memory. Several of the issues regarding state, including fault tolerance and load balancing, were already considered back then, for example in Borealis. However, the lack of user-defined state limited the expressive power of that generation of systems to a subset of relational operations. Furthermore, the use of over-specialized data structures was somewhat oblivious to the needs of reconfiguration, which requires state to be flexible and easy to partition.
In the post-MapReduce era, there was a primary focus on compute scalability, with systems like Storm [2] allowing the composition of distributed pipelines of tasks. For application flexibility and simplicity, many of these systems did not provide any state management whatsoever, leaving everything regarding state in the hands of the programmer. That included both declaration and management of state. User-declared and managed state was either defined and used within the working memory and scope provided by the hosting framework or defined and persisted externally, using an existing key-value store or database system (e.g., Redis [5,92]). In summary, application-managed state offers flexibility and gives expert users implementation freedom. However, no state management capabilities are offered from the system's side. As a result, the user has to reason about persistence, out-of-core scalability, and all necessary third-party storage system dependencies. These are all complex choices to make and require a combination of deep expertise and additional engineering work to integrate stream and storage technologies.
Currently, most stream processing systems allow a level of freedom for user-defined state through a form of a stateful processing API. This allows stream applications to define their custom state, while also granting the underlying system access to state information in order to employ data management mechanisms for persistence, scalability, and fault tolerance. State information includes the types used, serializers/deserializers, and read and write operations known at runtime. The main limitation of user-defined, system-managed state is the lack of direct control over the data structures that materialize that state (e.g., for custom optimizations).

State Management Architecture
The state management architecture refers to the way that a streaming system stores and manages its internal or user-defined state. We observe three distinct stateful processing directions in the architecture of data stream runtime systems (depicted in Figure 4):
- In-memory architectures entail storing active state within in-memory data structures. This approach can sustain state that fits within the main memory available in each node executing stream operators.
- Out-of-core architectures make use of multiple levels of storage media, such as non-volatile memory, to manage state. This approach allows exploiting fast main memory access within each compute node while also supporting a growing number of state entries, which are split and archived in secondary storage. We observe that the out-of-core data structure of choice used in most systems is a variant of the LSM-Tree [110] such as FASTER [42] or RocksDB/LevelDB.
- External architectures decouple compute and state, allowing state to be handled by an external database or key-value store. This approach enables more modular system designs (state and compute decoupling, which is very Cloud-friendly) and effective re-use of several desired properties of database systems (e.g., ACID transactions, consistency guarantees, auto-scaling) in support of more complex guarantees in the context of data streaming. External state was predominantly used within applications in Apache Storm. The lack of system-managed state required users to store all of their state in an external system. In this architecture, when state access is needed, the streaming operator has to reach out to the external system, dramatically increasing its latency. Google's Millwheel, the cloud engine of Beam/Google Dataflow, is a representative example of a system-managed external state architecture. Millwheel builds on the capabilities of BigTable [45] and Spanner [48] (e.g., blind atomic writes). Tasks in Millwheel are effectively stateless. They do keep recent local changes in memory, but overall they commit every single output and state update to BigTable as a single transaction. This means that Millwheel uses an external store both for persisting every single working state per key and for all necessary logs and checkpoints needed for recovery and non-idempotent updates.

Persistence Granularity
The persistence granularity refers to the granularity at which a streaming system takes a snapshot of the state. While some older systems did not provide any guarantees for state persistence (e.g., Aurora/Borealis relied on duplicate/standby operators, as described in Section 5), most systems at the moment employ coarse-grained persistence.
Epoch-level persistence granularity is typically achieved in the form of application-level snapshots. Most commonly, systems employ a form of asynchronous consistent snapshotting, such as the Chandy-Lamport algorithm [44], as follows: at each epoch, i.e., either periodically or after a certain number of records have been ingested by the system, each operator acquires a copy of its state. The batch-level persistence seen in systems such as Spark Streaming and Trident/Storm adopts a strict micro-batching processing paradigm: a batch execution is submitted after collecting a sizable number of records, and the state of an operator is stored right after a given batch has been processed. In S-Store, the batch granularity is orchestrated as a series of ACID transactions on top of a relational database.
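The effect of epoch-level persistence on recovery can be sketched with a single operator that counts keys from a replayable input log. This is a deliberately simplified, single-operator illustration under stated assumptions: it does not implement the distributed Chandy-Lamport algorithm, the class and attribute names are hypothetical, and the "durable" snapshot is just an in-memory copy of the input offset and the state.

```python
import copy

class CountingOperator:
    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.counts = {}             # working state
        self.offset = 0              # position in the replayable input log
        self.snapshot = (0, {})      # stand-in for durable storage

    def process(self, log, crash_at=None):
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated failure")
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.epoch_size == 0:
                # Epoch boundary: persist the offset together with the state.
                self.snapshot = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        # Roll back to the last completed epoch; later input will be replayed.
        self.offset, state = self.snapshot
        self.counts = copy.deepcopy(state)

log = ["a", "b", "a", "c", "a", "b", "c"]
op = CountingOperator(epoch_size=3)
try:
    op.process(log, crash_at=5)      # fails in the middle of the second epoch
except RuntimeError:
    op.recover()
    recovered_offset = op.offset     # offset of the last snapshot
op.process(log)                      # replay from the snapshot and finish
```

Because the input offset is stored atomically with the state, the records processed after the last snapshot are replayed exactly once into the restored state, and the final counts match a failure-free run.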
At the other extreme from the epoch-based approach is record-level persistence. This approach, as seen in Millwheel [10], follows a record-level epoch model: it stores the state transition of each operator on every single output (detailed in Section 5). Section 4.6 offers an in-depth analysis of the implications between transactional stream processing and persistence granularity.

Discussion
Stream processing has been influenced by general trends in scalable computing. State and compute have gradually evolved from a scale-up task-parallel execution model to the more common scale-out data-parallel model, with related implications for the state representations and operations that can be employed. Persistent data structures have been widely used in database management systems ever since they were conceived. In data stream processing, the idea of employing internal and external persistence strategies was uniformly embraced in more recent generations of systems. Scalable state has been the main incentive of the second generation of stream processing systems, which automated deployment and partitioning of data stream computations. The need for scalable state was driven by the need to facilitate unbounded data stream executions, where the space complexity for stream state is linear in the ever-increasing input consumed by a stream processor at any point in time. This section discusses types of scalable state, as well as scalable system architectures that can sustain support for partitioning, persisting, and committing changes to large volumes of state.

Parallel vs. Global Stateful Operations
Scalable state takes two forms in a stream application, typically referred to as partitioned and non-partitioned state (also referred to as global state). Depending on the nature of a specific operation, one or both of these state types can be employed.
Partitioned State. Partitioned state is the de facto way to enable data-parallel computation on massive data streams. Partitioned state allows key-wise logical partitioning of state to compute tasks, where each logical task handles a specific key. This is enabled at the API level through an additional operation that is invoked prior to stateful processing, which lifts the scope from task- to key-based processing, such as "keyBy" in Apache Flink or "groupBy" in Beam and Kafka-Streams. At the physical level, multiple keys (or key ranges) can be assigned to a specific physical task or compute node.
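The mapping from logical keys to physical tasks can be sketched as follows. The sketch assumes a hypothetical two-step scheme in which keys are hashed into a fixed number of key groups and contiguous ranges of key groups are assigned to tasks; the constant `NUM_KEY_GROUPS` and the function names are illustrative and do not reproduce the exact scheme of any specific system.

```python
import zlib

NUM_KEY_GROUPS = 8   # hypothetical fixed number of logical key groups

def key_group(key: str) -> int:
    """Deterministically hash a key into a logical key group."""
    return zlib.crc32(key.encode("utf-8")) % NUM_KEY_GROUPS

def task_for(key: str, parallelism: int) -> int:
    """Assign contiguous ranges of key groups to physical tasks."""
    return key_group(key) * parallelism // NUM_KEY_GROUPS

def run_keyed_count(events, parallelism):
    """Each task keeps state only for the keys that are routed to it."""
    tasks = [dict() for _ in range(parallelism)]
    for key in events:
        state = tasks[task_for(key, parallelism)]
        state[key] = state.get(key, 0) + 1
    return tasks

task_states = run_keyed_count(["a", "b", "a", "c", "b", "a"], parallelism=3)
```

Since all records of a key land on the same task, the per-task states are disjoint and their union equals the result of a sequential count; changing the parallelism only changes which ranges of key groups a task owns.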
Non-partitioned State. Non-partitioned state is mapped as a singleton to physical compute tasks. Such non-partitioned state is typically used in two ways. First, it can be used to compute global aggregates over the complete input stream. Second, it can be used to calculate aggregates at the level of the physical operator (e.g., count how many keys have been processed per operator). Task-level state can also be useful for keeping offsets when consuming logs from a physical stream source task. Because non-partitioned state either deals with operator-local computations or with global aggregates, its use is not scalable and practitioners should employ it with caution.

Managing Consistency and Persistence
Consistent stream processing has long been an open research issue due to the challenging nature of distributed unbounded processing, but also due to the lack of a formal specification of the problem itself. Consistency relates to the guarantees a system can make in the face of failure as well as any need for change during its operation. In data streaming, changing or updating a running data stream application is a concept also known as reconfiguration. For example, this includes the case when one needs to apply a software update to a stream application or scale out to more compute nodes without loss of accuracy or computation. The underlying relation between fault tolerance and reconfiguration has been highlighted by several works in the past, such as the research behind the SEEP system [35], which considers an integrated approach to scale and recover tasks from failures. Currently, most stream processors are transactional processing systems governed by consistency rules and processing guarantees. This section highlights the types of guarantees offered by different stream processing systems and the implementation strategies that materialize them.
Past Challenges and The Lambda Architecture: When large-scale computing became mainstream, a design pattern emerged called "lambda architecture", which suggested the separation of systems across different layers according to their specialization and reliability capabilities. Hadoop and transactional databases were reliable in terms of processing guarantees; thus, they could take on all critical computation. Stream processing systems, in contrast, could achieve low latency and scale, but they did not offer a clear set of consistency guarantees. For example, in the state-oblivious Storm system, the fault-tolerance approach would solely consider which input events have been fully processed or not and which should be replayed on a timeout. Nevertheless, there was no clear picture of what level of consistency can be expected from stream processors. At the same time, databases had formal guarantees. For example, a set of transactions would be processed using ACID guarantees, which include atomicity across transactions, consistency for the valid states a database can have, isolation in terms of concurrent execution, and durability on what can be recovered after failure. To reason about consistency in the context of data streaming, there had been a need to lay out a set of assumptions (e.g., logged input) and a processing granularity for defining a concept related to transactions.

Consistent State in Stream Processing.
A stream processor today is a distributed system consisting of different concurrently executing tasks. Source tasks subscribe to input streams that are typically recorded in a partitioned log such as Kafka, and therefore input streams can be replayed. Sink tasks commit output streams to the outside world, and every task in this system can contain its own state. For example, source tasks need to keep the current position of their input streams in their state. A system execution can often be modeled through the concept of "concurrent actions". An action includes invoking stream task logic on an input event, mutating its state, and producing output events. Every action happening in such a system causes other actions. Effectively, just a single record sent by a source contributes to state updates throughout the whole pipeline and to output events created by the sinks. If a specific action is lost or happens twice, then the complete system enters an inconsistent state.
Fault tolerance is an integral aspect of streaming systems that significantly impacts their state consistency. We analyze the fault tolerance strategies of existing streaming systems in Section 5.1. In addition, due to causal dependencies on state, the order of action execution is also critical. Existing reliable stream processors define a transaction either out of each action or out of a coarse-grained set of actions that we call epochs. We explain these approaches in more detail next.

State Persistence at Event Granularity
A form of consistent processing in data streaming is employing a transaction per local action. Google's Millwheel, the cloud runtime for the Dataflow data streaming service, employs such a strategy. Millwheel uses BigTable to commit each full compute action, which includes input events, state transitions, and generated output. The act of committing these actions is also called a "strong production" in Millwheel.
Persisting the state of an operator per output event is an approach that seemingly induces high latency overhead. However, traditional database optimizations can be used to speed up commit and state read times. Write-ahead logging, blind writes, bloom filters, and batch commits at the storage layer can be used to reduce the commit latency. More importantly, since the order of actions is predefined at commit time, state persistence on a per-event basis also guarantees deterministic executions. In addition, this approach has important effects on consistency as perceived by applications that consume the system's output. This follows from the fact that "exactly-once processing" in this context relates to each action being atomically committed, as we discuss in Section 5.1.1.
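A transaction per action can be sketched with a toy in-memory store that commits the consumed input, the state transition, and the produced output together. The class `TransactionalStore`, its fields, and the deduplication by event identifier are assumptions made for illustration; they do not reproduce the interface of Millwheel or BigTable, and atomicity is trivially obtained here because the sketch is single-threaded.

```python
class TransactionalStore:
    """Toy stand-in for an external store with atomic multi-part commits."""

    def __init__(self):
        self.processed_ids = set()   # identifiers of committed input events
        self.state = {}              # per-key operator state
        self.outputs = []            # committed output events

    def commit(self, event_id, key, new_value, produced):
        # Input, state transition, and output are recorded as one unit.
        self.processed_ids.add(event_id)
        self.state[key] = new_value
        self.outputs.extend(produced)

def process(store, event):
    """Run one action: read state, compute, and commit everything at once."""
    event_id, key, amount = event
    if event_id in store.processed_ids:
        return False                 # replayed event: the action already happened
    total = store.state.get(key, 0) + amount
    store.commit(event_id, key, total, [(key, total)])
    return True

store = TransactionalStore()
# Event 2 is delivered twice, e.g. because it is replayed after a failure.
deliveries = [(1, "k", 5), (2, "k", 7), (2, "k", 7)]
applied = [process(store, event) for event in deliveries]
```

Because the record of the consumed input is committed together with the state and the output, a replayed event is recognized and skipped, so no action is lost or applied twice.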

State Persistence at Epoch Granularity
Instead of persisting state at per-record granularity, epoch-level approaches divide computation into a series of mini-batches, also known as "epochs".

[Figure 5. Labels: Logged Input, Durable Storage, Committed Output; epoch identifiers such as ep 3.]
In Figure 5 we depict the overall approach, marking input, system states and outputs with a distinct epoch identifier.
Epochs can be defined through markers at the logged input of the streaming application. A system execution can be instrumented to process each epoch and commit the state of the entire task graph after each epoch is processed. If a failure or other reconfiguration action happens during the execution of an epoch, then the system can roll back to a previously committed epoch and recover its execution. The term "exactly-once processing" in this context relates to each epoch being atomically committed. In Section 5.1, where we present the different levels of processing semantics in streaming, we call this flavor exactly-once processing on state. The rest of this section focuses on various approaches used to commit stream epochs.
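The epoch commit and rollback cycle can be sketched as follows. This is a simplified, single-task illustration under stated assumptions rather than the mechanism of any particular system: `EpochPipeline` is a hypothetical name, the state is a running sum, epochs are fixed-size slices of a replayable input log, and the `fail_at` parameter only simulates a failure at a given input offset.

```python
class EpochPipeline:
    """Toy single-task pipeline that commits its state once per epoch."""

    def __init__(self, log, epoch_size):
        self.log = log                # replayable logged input
        self.epoch_size = epoch_size  # number of records per epoch
        self.committed = (0, 0)       # (epochs committed, committed running sum)
        self.epoch, self.total = self.committed

    def run_epoch(self, fail_at=None):
        """Process one epoch; return True if it was committed."""
        start = self.epoch * self.epoch_size
        end = min(start + self.epoch_size, len(self.log))
        for offset in range(start, end):
            if offset == fail_at:     # simulated failure inside the epoch
                self.rollback()
                return False
            self.total += self.log[offset]
        self.epoch += 1
        self.committed = (self.epoch, self.total)  # atomic epoch commit
        return True

    def rollback(self):
        # Discard all uncommitted work and resume from the last committed epoch.
        self.epoch, self.total = self.committed
```

After a rollback, the next call to `run_epoch` replays the failed epoch from the log, so each record affects the committed state exactly once.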
Strict Two-Phase Epoch Commits. A common coordinated protocol to commit epochs is a strict two-phase commit, where Phase 1 corresponds to the full processing of an epoch and Phase 2 persists the state of the system at the end of the computation.
This approach was popularized by Apache Spark [140] through the use of periodic "micro-batching" and it is an effective strategy when batch processing systems are used for unbounded processing. The main downside of this approach is the risk of low task utilization due to synchronous execution, since tasks have to wait for all other tasks to finish their current epoch. Drizzle [135] mitigates this problem by chaining multiple epochs in a single atomic commit. A similar approach was also employed by S-Store [101], where each database transaction corresponds to an epoch of the input stream that is already stored in the same database.
Asynchronous Two-Phase Epoch Commits. For pure dataflow systems, strict two-phase committing is problematic since tasks are uncoordinated and long-running. Furthermore, it is feasible to achieve the same functionality asynchronously through consistent snapshotting algorithms, known from classic distributed systems literature [30]. Consistent snapshotting algorithms exhibit beneficial properties because they do not require pausing a streaming application. Furthermore, they acquire a snapshot of a consistent cut in a distributed execution [44]. In other words, they capture a global state of the system during a "valid" execution. Throughout different implementations we can identify i) unaligned and ii) aligned snapshotting protocols.
I. Unaligned / Chandy-Lamport [44] snapshots provide one of the most efficient methods to obtain a consistent snapshot. This approach is currently supported by several stream processors, such as IBM Streams and Flink. The core idea is to insert a punctuation, or "marker", into the regular stream of events and use that marker to separate all actions that come before and after the snapshot while the system is running. A caveat of unaligned snapshots is the need to record input (a.k.a. in-flight) events that arrive at individual tasks until the protocol is complete. In addition to the space overhead for logged inputs, unaligned snapshots require more processing during recovery, since logged inputs need to be replayed (similarly to redo logs in database recovery with fuzzy checkpoints).
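The marker handling of an unaligned snapshot at a single task can be sketched as follows. This is a simplified illustration of the Chandy-Lamport idea, not the code of IBM Streams or Flink; `UnalignedTask` is a hypothetical name, the task state is a running sum, and only one snapshot is taken.

```python
class UnalignedTask:
    """Task with several input channels that takes one unaligned snapshot."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.total = 0              # live task state (a running sum)
        self.snapshot_state = None  # state recorded at the first marker
        self.in_flight = []         # logged events that belong to the snapshot
        self.pending = set()        # channels whose marker has not arrived yet

    def on_event(self, channel, value):
        # Processing never pauses. Events that arrive on a channel still
        # awaiting its marker precede the cut and are logged as in-flight.
        if self.snapshot_state is not None and channel in self.pending:
            self.in_flight.append((channel, value))
        self.total += value

    def on_marker(self, channel):
        if self.snapshot_state is None:
            # First marker: record the state, start logging other channels.
            self.snapshot_state = self.total
            self.pending = self.channels - {channel}
        else:
            self.pending.discard(channel)

    def snapshot_complete(self):
        return self.snapshot_state is not None and not self.pending

    def recover(self):
        # Redo phase: restore the recorded state, then replay logged events.
        return self.snapshot_state + sum(v for _, v in self.in_flight)
```

The recorded state together with the logged in-flight events reflects exactly the events that precede the marker on every channel, which is why recovery needs the redo step.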

II. Aligned Snapshots
Aligned snapshots aim to improve performance during recovery and minimize the reconfiguration complexity exhibited by unaligned snapshots. The main difference is that they prioritize input streams that are expected before the snapshot and thus end up solely with states that reflect a complete computation of an epoch, with no in-flight events as part of a snapshot. For example, Flink's epoch snapshotting mechanism [31,33] resembles the Chandy-Lamport algorithm in terms of using markers to identify epoch frontiers. However, it additionally employs an alignment phase that synchronizes markers within tasks before disseminating them further. This is achieved by partially blocking the input channels on which markers were already received until all input channels have transferred all messages corresponding to a particular epoch.
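The alignment phase can be sketched in a few lines. The code below is an illustrative simplification and not Flink's implementation; `AlignedTask` is a hypothetical name, the task state is a running sum, and forwarding the marker downstream is only indicated by a comment.

```python
from collections import deque


class AlignedTask:
    """Task that aligns markers across input channels before snapshotting."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.total = 0            # live task state (a running sum)
        self.blocked = set()      # channels that already delivered the marker
        self.buffered = deque()   # events of the next epoch, held back
        self.snapshots = []       # one state per completed epoch

    def on_event(self, channel, value):
        if channel in self.blocked:
            # The event belongs to the next epoch: hold it during alignment.
            self.buffered.append((channel, value))
        else:
            self.total += value

    def on_marker(self, channel):
        self.blocked.add(channel)
        if self.blocked == self.channels:
            # All markers arrived: the state reflects one complete epoch.
            self.snapshots.append(self.total)
            self.blocked.clear()
            # (the marker would be forwarded downstream at this point)
            pending, self.buffered = self.buffered, deque()
            for ch, value in pending:
                self.on_event(ch, value)
```

In contrast to the unaligned variant, no in-flight events are stored with the snapshot; the cost is that blocked channels make no progress until the slowest marker arrives.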
In summary, unaligned snapshots are meant to offer the best runtime performance but sacrifice recovery time due to the redo phase needed upon recovery. Aligned snapshots, in contrast, can lead to slower commit times due to the alignment phase while providing a set of beneficial properties. First, aligned snapshots reflect a complete execution of an epoch, which is useful in use cases where snapshot-isolated queries need to be supported on top of data streaming [136]. Furthermore, aligned snapshots yield the lowest reconfiguration footprint and set the basis for live reconfiguration within the alignment phase, as exhibited by Chi [97].

1st vs. 2nd Generation
State is a concept that has been very central to stream processing. The notion of state itself has been addressed with many names, such as "summary", "synopsis", "sketch", or "stream table", and it reflects the evolution of data stream management over the years. Early DSMS systems [7,14,20,43] (circa 2000-2010) hid state and its management from the user. They declared and managed internally, in memory, all data structures needed to support a selected set of operations. This type of state, often referred to as "summary", was used to internally materialize continuous processing operators such as those of the time-varying relational model of CQL [16], as seen in STREAM [14].
A decade later, scalable data computing systems based on the MapReduce [53] architecture allowed for arbitrary user-defined logic to be scaled and executed reliably using distributed middleware and partitioned file systems. Following the same trend, many existing data management models were revisited and re-architected with scalability in mind (e.g., NoSQL, NewSQL databases). Similarly, a growing number of scalable data stream processing systems [10,11,32,104] married principles of scalable computing with stream semantics and models that were identified in the past (e.g., out-of-order processing [94,119]). This pivot helped stream management technology to lift all assumptions associated with limited state capacity and thus reach nearly its full potential of correctly executing continuous event-driven applications with arbitrary state.
As of today, modern stream processors can compile and execute graphs of long-running operators with complete, user-defined yet system-managed state that is fault-tolerant and reconfigurable given a clear set of transactional guarantees [10,31,35].

Open Problems
Data streaming covers many data management needs today that go beyond real-time analytics, which was the original purpose of the stream processing technology. New needs include support for more complex data pipelines with implicit transactional guarantees. Furthermore, modern applications involve Machine Learning, Graph Analysis and Cloud Apps, all of which have a common denominator: complex state and new access patterns. These needs have cultivated novel research directions in the emerging field of stream state management.
The decoupling of state programming from state persistence resembles the concept of data independence in databases. Systems are converging in terms of semantics and operations on state while, at the same time, many new methods employed in embedded databases (e.g., LSM-trees, state indexing, externalized state) are helping stream processors to evolve in terms of performance capabilities. A recent study [78] showcases the potential of workload-aware state management, adapting state persistence and access to the individual operators of a dataflow graph. To this end, an increasing number of "pluggable" systems [42,144] for local state management with varying capabilities are being adopted by stream processors. This opens new capabilities for optimization and sophisticated, yet transparent, state management that can automate the process of selecting the right physical plan and reconfigure that plan while continuous applications are executed.

Fault Tolerance & High Availability
Fault tolerance is a system's capacity to continue its operation in spite of failures, delivering the expected service as if no failures had happened. It is especially important for streaming systems for two reasons. First, streaming systems conduct stateful computations over potentially unbounded data streams. Without fault tolerance, streaming systems would have to redo computations from the beginning, given that the state or progress thus far would be lost during a failure. Besides the loss of processing progress accumulated over an arbitrary time period, recomputation is often infeasible because the already processed segment of a data stream has permanently vanished.
Second, contemporary streaming systems feature a distributed systems architecture for scalability. In a system deployed on multiple physical machines, failures occur commonly. Based on this motivation, a lot of exciting work has been performed on fault tolerance in streaming systems. We present it in Section 5.1.
In computer systems, availability is defined as the time period during which a system accomplishes its service relative to service interruption periods. It is typically quantified as a percentage, 100% being perfect availability [64]. The term high availability has been adopted to denote that a system achieves a very high percentage of availability, such as 99.999% or higher.
In stream processing, where systems are not probed by users as in the case of typical information systems like web applications, what service accomplishment means is open to interpretation. Surprisingly, no definition for high availability is provided in the stream processing literature. Existing research (Section 5.2) quantifies high availability using combinations of three metrics, namely recovery time, performance overhead in terms of throughput and latency, and resource utilization. We highlight the absence of a definition and suitable metric for high availability in the open problems in Section 5.4, where we propose a definition based on processing progress and a proxy for measuring high availability based on end-to-end latency. Before finishing with the open problems, we contrast 1st-generation and modern systems with respect to fault tolerance and high availability in Section 5.3.

Fault-tolerance
Many important challenges in stream processing manifest when we take failures into account. Managing failures in a distributed streaming system entails maintaining snapshots of state, migrating state, and scaling out operators while affecting the healthy parts of the system as little as possible. Table 4 presents the fault-tolerance strategies of eighteen streaming systems arranged in order of publication appearance from past to present. We analyse the strategies across the following four dimensions.
1. Processing semantics conveys how a system's data processing is affected by failures. Typically, all systems in the literature are able to produce correct results in failure-free executions. But masking a failure completely is hard, especially in the stream processing domain where, typically, output is delivered as soon as it is produced.
In recent years the stream processing domain has settled on the terms at-least-once and exactly-once to characterize the processing semantics [18,32,76,95,109]. At-most-once is also part of the nomenclature, but it is mostly obsolete as systems opt to support one of the two stronger levels. At-least-once processing semantics means that the system will produce the same results as a failure-free execution with the addition of duplicate records as a side effect of recovery.
Exactly-once lends itself to two different interpretations. A system may support exactly-once processing semantics within its boundaries, ensuring that any inconsistencies or duplicate executions carried out on recovery are not part of its state. We call that exactly-once processing semantics on state. It should be noted that most systems in this category still assume that the computations they apply as well as the system's functions are deterministic, which is often not the case; processing-time windows and operators processing input from multiple sources are two prime examples of nondeterminism. With nondeterminism at play, the system's state on recovery can diverge. Clonos [118] provides exactly-once processing including nondeterministic computations by means of causal consistency. It keeps determinants about nondeterministic computations in a resilient manner and uses them to regenerate the exact computational state following a failure.
While a system can restore its state to a consistent snapshot, the same is not feasible in general for the output published by the system. Once the output is out, it is available for consumption by external applications. Thus, a system with exactly-once processing semantics on state will still produce duplicate output on recovery. This problem has been termed the output commit problem [55] in the distributed systems literature. Systems that manage to produce the same output under failure as a failure-free execution have exactly-once processing semantics on output. In Section 5.1.1 we elaborate on how streaming systems treat the output commit problem.
2. Replication regards the use of additional computational resources for recovering an execution. We adopt the terminology of Hwang et al. [75], who classify replication as either active, where two instances of the same execution run in parallel, or passive, where each running stateful operator that is part of an execution dispatches its checkpointed state to a standby operator.

3. Recovery data addresses what data are regularly stored for recovery purposes. Data may include the state of each operator and the output it produces. In addition, many fault tolerance strategies need to replay tuples of input streams during recovery in order to reprocess them. For this purpose, input streams are persistently stored, typically in message brokers like Apache Kafka. However, we exclude this fact from the table to save space.

4. Storage medium states where recovery data is stored. It can be in a resilient store that is local to each stateful operator, in a remote resilient store, or in the memory space of a stateful operator. In-memory means that operators use their memory space as a primary storage medium for recovery data. Systems that cache data for recovery in memory, like output tuples, do not fall in this category.
The table is meant to be read both horizontally to describe a specific system's approach to fault tolerance and vertically to uncover how the different building blocks shape the landscape of fault tolerance in stream processing. Two remarks are necessary. First, the table contains three more annotations besides the self-explanatory checkmarks. Streamscope [95] presents and evaluates three distinct fault tolerance strategies: an active replication-based strategy, a passive one, and a strategy that relies on recomputing state by replaying data from input streams. Second, the state column in the recovery data dimension captures not only checkpointed state but also state metadata that allow recomputing the state, such as a changelog [109] or state dependencies [111].
The table reveals four interesting patterns. First, of all columns, two accumulate the majority of checkmarks: passive replication and storing state for recovery. This is perhaps the most visible pattern in the table, and it signifies that passive replication by storing state is, unsurprisingly, a very popular option for streaming systems. One typical recovery approach is to restore the latest checkpoint of a failed operator in a new node and replay input that appeared after the checkpoint. Variations of this approach include saving in-flight tuples along with the state and maintaining in-flight tuples in upstream nodes. Second, storing in-flight tuples for recovery is not preferred anymore, although it was a popular option for streaming systems in the past. Third, while past systems strived to support exactly-once output processing semantics, later systems opt for exactly-once semantics on state and outsource the deduplication of output to external systems. We will elaborate on this aspect in Section 5.1.1. Finally, among the various storage media for recovery data, a remote resilient store is the clear winner.

The output commit problem
The output commit problem [55] specifies that a system should only publish output to the outside world when it is certain that it can recover the state from which the output was published. Because output cannot be retracted once it is sent, this ensures that every output is published only once. If output is sent twice, then the system manifests inconsistent behavior with respect to the outside world. An important instance of this problem manifests when a system is restoring some previous consistent state due to a failure. In contrast to the system's state, its output cannot be retracted in general. Thus, under failures, systems must be careful not to produce duplicate output.
The output commit problem is relevant in streaming systems, which typically conform to a distributed architecture and process unbounded data streams. In this setting, the side effects of failures are difficult to mask. Streaming systems that solve the output commit problem provide output exactly-once. Other terms that refer to the same problem are processing output exactly-once and its paraphrases, as well as precise recovery [75] and strong productions [10].
Although the problem is relevant and hard, solutions in the stream processing domain are scattered in the literature pertaining to each system in isolation. We group the various solutions in three categories, transaction-based, progress-based, and lineage-based, and describe each, noting the assumptions it involves. Each of the three types of techniques uses a different trait of the input or computation to identify whether a certain tuple has appeared again. Transaction-based techniques use tuple identity, progress-based techniques use order, while lineage-based techniques use input-output dependencies. Finally, we provide two more categories of solutions, special sink operators and external sinks, that do solve the problem practically but, strictly speaking, do not meet the problem's specification because they are either specific or external to a streaming system.
Transaction-based. Millwheel [10] and Trident [3] rely on committing unique ids with records to eliminate duplicate retries. Millwheel assigns a unique id to each record entering the system and commits every record it produces to a highly available storage system before sending it downstream. Downstream operators acknowledge received records. If a delivered record is retried, it is ignored by checking the unique id that it carries. Millwheel assumes no input ordering or determinism. Trident, on the other hand, batches records into a transaction, which is assigned a unique transaction id and applies a state update to the state backend. Assuming that transactions are ordered, Trident can accurately ignore retried batches by checking the transaction id.
Progress-based. Seep [56] uses timestamp comparison to deliver output exactly-once relying on the order of timestamps. Each operator generates increasing scalar timestamps and attaches them to records. Seep checkpoints the state and output of each operator together with the vector timestamps of the latest records from each upstream operator that affected the operator's state. On recovery, the latest checkpoint is loaded to a new operator, which replays the checkpointed output records and processes replayed records sent by its upstream operators. Downstream operators discard duplicate records based on the timestamps. The system assumes deterministic computations that do not rely on system time or random input.
A previous version of Seep [35] applies the same process with the difference that a recovered operator rewinds its logical clock to the timestamp of the checkpoint it possesses before emitting records. The system assumes deterministic computations without side-effects and a monotonically increasing logical clock providing timestamps. It further assumes that records in a stream are ordered by their timestamps.
Lineage-based. Timestream [111] and Streamscope [76] use dependency tracking to provide exactly-once output.
During normal operation, both systems track operator input and output dependencies by uniquely identifying records with sequence numbers. Streamscope persists records with their identifiers asynchronously. Both systems store operator dependencies periodically in an asynchronous manner. In Streamscope, however, each operator checkpoints individually not only its dependencies but also its state. On recovery, Timestream retrieves the dependencies of failed operators by contacting upstream nodes recursively until all inputs required to rebuild the state are made available. Streamscope follows a similar process, but starts from a failed operator's checkpoint snapshot. For each input sequence number in that snapshot not found in persistent storage, Streamscope contacts upstream operators, which may have to recompute the record starting from their most relevant snapshot that can produce the output record given its sequence number. Finally, both systems use garbage collection to discard obsolete dependencies, but in a subtly different manner. Timestream computes the input records required by upstream operators in reverse topological order from the final output to the original input and discards those unneeded. Streamscope does the same, but instead of computing dependencies, it uses low watermarks per operator and per stream to discard snapshots and records that are behind. In Timestream, storing dependencies asynchronously can lead to duplicate recomputation, but downstream operators bearing the correct set of dependencies can discard them. Streamscope applies the same process only if duplicate records cannot be found in persistent storage. Both Timestream and Streamscope assume deterministic computation and input in terms of order and values.
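Of the three techniques, the progress-based one is the simplest to illustrate. The sketch below is not Seep's implementation; `DownstreamOperator` is a hypothetical name. It assumes, as the progress-based systems do, that records from each upstream operator carry increasing timestamps and arrive in timestamp order, so that any record at or below the last processed timestamp must be a replay.

```python
class DownstreamOperator:
    """Discards replayed records by comparing per-upstream timestamps."""

    def __init__(self):
        self.last_seen = {}  # upstream id -> highest timestamp processed
        self.accepted = []   # payloads that were actually processed

    def receive(self, upstream, timestamp, payload):
        """Process a record; return False if it was dropped as a duplicate."""
        if timestamp <= self.last_seen.get(upstream, -1):
            return False     # already processed before the failure
        self.last_seen[upstream] = timestamp
        self.accepted.append(payload)
        return True
```

The filter keeps one timestamp per upstream operator rather than one id per record, which is the practical benefit of relying on order instead of tuple identity. It breaks down if the ordering or determinism assumptions do not hold.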
The progress-based and lineage-based solutions are vulnerable to failures of the last operator(s) on the dataflow graph, which produce the final output, since both solutions rely on downstream operators for filtering duplicate records.
Special sink operators. Streams [76] implements special sinks for retracting output from files and databases. The application of this approach solves the output commit problem for specific use cases, but it is not applicable in general since it defies the core assumption of the problem that output cannot be retracted.
External sinks. Some systems like Streams [76], Flink [32], and Spark [18] provide exactly-once semantics on state and outsource the output commit problem to external sinks that support idempotent writes, such as Apache Kafka.
One way to categorise the solutions provided by special sink operators and external sinks is to distinguish optimistic from pessimistic output techniques. Optimistic output techniques push output immediately and retract or update it if needed. Pessimistic output techniques use a form of write-ahead log to hold the output they intend to publish until, if everything goes well, the output is permanently committed [31]. Optimistic output techniques, which resemble multi-version concurrency control from the database world, include modifiable and versioned output destinations, while pessimistic output techniques include transactional sinks and similar tools.
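The difference between the two families can be made concrete with a minimal sketch. Both classes below are hypothetical illustrations and do not correspond to the sink API of any specific system: `IdempotentSink` models a modifiable output destination keyed by record key, and `TransactionalSink` models a sink that stages output and exposes it only when the enclosing epoch commits.

```python
class IdempotentSink:
    """Optimistic: publish immediately; a repeated write overwrites itself."""

    def __init__(self):
        self.table = {}

    def write(self, key, value):
        self.table[key] = value  # writing the same record twice is harmless


class TransactionalSink:
    """Pessimistic: stage output and publish it only on commit."""

    def __init__(self):
        self.staged = []     # write-ahead buffer, invisible to consumers
        self.published = []  # output visible to the outside world

    def write(self, record):
        self.staged.append(record)

    def commit(self):
        # Called once the epoch that produced the output is committed.
        self.published.extend(self.staged)
        self.staged = []

    def abort(self):
        # Called on failure: uncommitted output is dropped, then regenerated.
        self.staged = []
```

The optimistic variant makes output visible with minimal delay but requires a destination that tolerates overwrites, whereas the pessimistic variant works with append-only consumers at the cost of delaying output until the commit.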
Active replication. Flux [116] implements active replication by duplicating the computation and coordinating the progress of the two replicas. Flux restores operator state and in-flight data of a failed partition while the other partition continues to process input. A new primary dataflow that runs following a failure quiesces when a new secondary dataflow is ready in a standby machine in order to copy the state of its operators to the new secondary. In contrast, Borealis [24] has nodes address upstream node failures by switching to a live replica of the failed upstream node. If a replica is not available, the node can produce tentative output for incomplete input to avoid the recovery delay. The approach sacrifices consistency to optimize availability, but guarantees eventual consistency.
Passive replication. Hwang et al. [74] propose that a server in a cluster has another server as backup where it ships independent parts of its checkpointed state. When a node fails, its backup servers that hold parts of its checkpointed state initiate recovery in parallel by starting to execute the operators of the failed node whose state they have and collecting the input tuples they have missed from the checkpointed state they possess. SGuard [90] and Clonos [118] save computational resources in another way by checkpointing state asynchronously to a distributed file system. Upon a failure, a node is selected to run a failed operator. The operator's state is loaded from the file system and its in-memory state is reconstructed before it can join the job. Beyond asynchronous checkpointing, a new checkpoint mechanism [65] preserves output tuples until an acknowledgment is received from all downstream operators. Next, an operator trims its output tuples and takes a checkpoint. The authors show that passive replication still requires longer recovery time than active replication, but with 90% less overhead due to reduced checkpoint size.
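The idea of preserving output tuples until all downstream operators have acknowledged them can be sketched as follows. This is an illustration of the general output-preservation pattern under stated assumptions, not the mechanism of [65]; `UpstreamBuffer` is a hypothetical name, records are identified by consecutive sequence numbers, and acknowledgments are cumulative.

```python
class UpstreamBuffer:
    """Keeps emitted records until every downstream operator acknowledged them."""

    def __init__(self, downstream_ids):
        self.acked = {d: -1 for d in downstream_ids}  # highest acked seq
        self.buffer = []                              # (seq, record) pairs
        self.next_seq = 0

    def emit(self, record):
        seq = self.next_seq
        self.buffer.append((seq, record))
        self.next_seq += 1
        return seq

    def ack(self, downstream, seq):
        # Cumulative acknowledgment: everything up to seq was processed.
        self.acked[downstream] = max(self.acked[downstream], seq)
        safe = min(self.acked.values())
        # Trim records that no downstream operator can ask for again.
        self.buffer = [(s, r) for s, r in self.buffer if s > safe]

    def replay(self, downstream):
        # Records a recovering downstream operator has not acknowledged yet.
        return [r for s, r in self.buffer if s > self.acked[downstream]]
```

Trimming up to the minimum acknowledged sequence number bounds the buffer by the progress of the slowest downstream operator, which in turn bounds the amount of data that has to be checkpointed.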
Hybrid replication. Zhang et al. [143] propose a hybrid approach to replication, which operates in passive mode under normal operation, but switches to active mode using a suspended pre-deployed secondary copy when a transient failure occurs. According to the provided experimental results, their approach saves 66% recovery time compared to passive replication and produces 80% less message overhead than active replication. Alternatively, Heinze et al. [70] propose to dynamically choose the replication scheme for each operator, either active replication or upstream backup, in order to reduce the recovery overhead of the system by limiting the peak latency under failure below a threshold. Similarly, Su et al. [120] counter correlated failures by passively replicating processing tasks except for a dynamically selected set that is actively replicated.

Modeling and simulations.
In their seminal work, Hwang et al. [75] model and evaluate the recovery time and runtime overhead of four recovery approaches, active standby, passive standby, upstream backup, and amnesia, across different types of query operators. The simulated experiments suggest that active standby achieves near-zero recovery time at the expense of high overhead in terms of resource utilization, while passive standby produces worse results in terms of both metrics compared to active standby. However, passive standby poses the only option for arbitrary query networks. Upstream backup has the lowest runtime overhead at the expense of longer recovery time. With a similar goal, Shrink [38], a distributed systems emulator, evaluates the models of five different resiliency strategies with respect to uptime SLA and resource reservation. The strategies differ across three axes: single-node vs. multi-node, active vs. passive replication, and checkpoint vs. replay. According to the experiments with real queries on real advertising data using Trill [39], active replication with periodic checkpoints proves advantageous in many streaming workloads, although no single strategy is appropriate for all of them.

1st generation vs. 2nd generation
In the early years, streaming systems put emphasis on high availability setups with a preference towards active replication. In contrast, modern systems tend to leverage passive replication, especially by allocating extra resources on demand, which is appropriate for Cloud setups. In addition, past systems provided approximate results, while modern systems maintain exactly-once processing semantics over their state under failures. Although past systems lacked in terms of consistency, mainly due to state management aspects, they strived to solve the output commit problem. Instead, an avenue that is gaining traction in modern systems is to outsource the deduplication of output to external systems. Finally, while streaming systems used to store their output in order to be able to replay tuples to downstream operators recovering from a failure, systems now rely increasingly on replayable input sources for replaying input subsets.

Open Problems
Many problems wait to be solved in the scope of fault tolerance and high availability in streaming systems. Three of them are novel solutions to the output commit problem, defining and measuring availability in stream processing, and configuring availability for different application requirements.
First, the importance of the output commit problem is likely to increase as streaming systems are used in novel ways, such as for running event-driven applications. Although we presented five different types of solutions, these suffer from computational cost, strong assumptions, limited applicability, and limited freshness of output results. New types of solutions are required that score better in these dimensions.
Second, the literature of high availability in stream processing has significantly enhanced the availability of streaming systems throughout the years. But, to the best of our knowledge, there has been scant research on what availability means in the area of stream processing. The generic definition of availability for computer systems by Gray et al. [64] relates availability merely to failures. According to the definition, a system is available when it responds to requests with correct results, which is termed service accomplishment. In streaming, however, processing is continuous and potentially unbounded. Responding with correct results becomes more challenging.
The factors that may impair availability in streaming include software and hardware failures, overload, backpressure, and types of processing stall, like checkpoints, state migration, garbage collection, and calls to external systems.The common denominator of those factors, is that the system falls behind input.This may not be a problem for other types of systems, like databases which can respond to queries with the historical data they keep, but streaming systems have to continuously catch up processing with the input in order to provide correct results, that is, in order to be available.
Thus, a more specific definition of availability for stream processing can be stated in the following way: a system is available when it can provide output based on the processing of its current input. This definition extends to how we measure availability. An appropriate way would be via progress tracking mechanisms, such as the slack between processing time and event time over time, which quantifies the system's processing progress with respect to the input as per Figure 6. The area in the plot signifies the slack between event time and processing time over time. The surface enclosing A amounts to 100% availability, while the surface containing B equals 60% availability.
Last, availability is a prime non-functional characteristic of a streaming system and, as we showed, non-trivial to reason about. Providing user-friendly ways to specify availability as a contract that the system will always respect during its operation will significantly improve the position of streaming systems in production environments. Configuring availability in this way will probably impact resource utilization, performance overhead during normal operation, recovery time, and consistency.
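To make the measurement idea concrete, the following minimal sketch computes availability from sampled slack values. It is one possible discretization, not the exact area-based measure of Figure 6: the function name `availability`, the `max_slack` threshold, and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def availability(slack_seconds: Sequence[float], max_slack: float) -> float:
    """Fraction of samples in which the system kept up with its input.

    Each sample is the slack (processing time minus event time) observed
    at a fixed sampling interval. A sample counts as "available" when the
    slack does not exceed max_slack (a hypothetical, user-chosen bound).
    """
    if not slack_seconds:
        raise ValueError("need at least one slack sample")
    within_bound = sum(1 for slack in slack_seconds if slack <= max_slack)
    return within_bound / len(slack_seconds)


# Hypothetical samples: the system falls behind for four intervals
# (e.g., during a checkpoint or a load spike) and then catches up.
samples = [0.5, 0.8, 1.2, 6.0, 9.5, 7.0, 3.0, 0.9, 0.7, 0.6]
print(availability(samples, max_slack=2.0))  # 6 of 10 samples are within the bound
```

Under this reading, an availability contract would amount to a bound on the slack together with a target fraction of time during which the bound must hold.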

Load management, elasticity, & reconfiguration
Due to the push-based nature of streaming inputs from external data sources, stream processors have no control over the rate of incoming events.Satisfying Quality of Service (QoS) under workload variations has been a long-standing research challenge in stream processing systems.
To avoid performance degradation when input rates exceed system capacity, the stream processor needs to take actions that will ensure sustaining the load.One such action is load shedding: temporarily dropping excess tuples from inputs or intermediate operators in the streaming execution graph.Load shedding trades off result accuracy for sustainable performance and is suitable for applications with strict latency constraints that can tolerate approximate results.
When result correctness is more critical than low latency, dropping tuples is not an option. If the load increase is transient, the system can instead choose to reliably buffer excess data and process it later, once input rates stabilize. Several systems employ back-pressure, a fundamental load management technique applicable to communication networks that involve producers and consumers. Nevertheless, to avoid running out of available memory during load spikes, load-aware scheduling and rate control can be applied.
A more recent approach that aims at satisfying QoS while guaranteeing result correctness under variable input load is elasticity.Elastic stream processors are capable of adjusting their configuration and scaling their resource allocation in response to load.Dynamic scaling methods are applicable to both centralized and distributed settings.Elasticity not only addresses the case of increased load, but can additionally ensure no resources are left idle when the input load decreases.
Next, we review load shedding (Section 6.1), load-aware scheduling and flow control (Section 6.2), and elasticity techniques (Section 6.3). As in previous sections, we conclude with a discussion of 1st generation vs. modern and open problems.

Load shedding
Load shedding [22,121,122,131] is the process of discarding data when input rates increase beyond system capacity.The system continuously monitors query performance and if an overload situation is detected, it selectively drops tuples according to a QoS specification.Load shedding is commonly implemented by a standalone component integrated with the stream processor.The load shedder continuously monitors input rates or other system metrics and can access information about the running query plan.Its main functionality consists of detecting overload (when to shed load) and deciding what actions to take in order to maintain acceptable latency and minimize result quality degradation.These actions presume answering the questions of where (in the query plan), how many, and which tuples to drop.
Detecting overload is a crucial task, as an incorrectly triggered shedding action can cause unnecessary result degradation.To facilitate the decision of when, load shedding components rely on statistics gathered during execution.The more knowledge a load shedder has about the query plan and its execution, the more accurate decisions it can make.For this reason, many stream processors restrict load shedding to a predefined set of operators, such as those that do not modify tuples, i.e. filter, union, and join [51,80,122].Other operator-restricted load shedding techniques target window operators [22,123], or even more specifically, query plans with SUM or COUNT sliding window aggregates [22].An alternative, operator-independent approach, is to frame load shedding as a feedback control problem [131].The load shedder relies on a dynamic model that describes the relationship between average tuple delay (latency) and input rate.
Once the load shedder has detected overload, it needs to perform the actual load shedding.This includes the decision of where in the query plan to drop tuples from, as well as which tuples and how many.The question of where is equivalent to placing special drop operators in the best positions in the query plan.In general, drop operators can be placed at any location in the query plan, however, they are often placed at or near the sources.Dropping tuples early avoids wasting work but it might affect results of multiple queries if the stream processor operates on a shared query network.Alternatively, a load shedding road map (LSRM) can be used [122].This is a pre-computed table that contains materialized load shedding plans, ordered by the amount of load shedding they will cause.
The question of which tuples to drop is relevant when load shedding takes into account the semantic importance of tuples with respect to results quality.A random dropping strategy has been applied to sliding window aggregate queries to provide approximate results by inserting random sampling operators in the query plan [22].Window-aware load shedding [123] applies shedding to entire windows instead of individual tuples, while concept-driven load shedding [82] is a semantic dropping strategy that selects tuples to discard based on the notion of concept drift.
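The decisions of when, how many, and which tuples to drop can be sketched with a simple random load shedder. This is an illustrative sketch only and not the method of any cited system: the class `RandomLoadShedder`, the capacity parameter, and the rule "drop exactly the fraction of load above capacity" are assumptions.

```python
import random


class RandomLoadShedder:
    """Drops tuples uniformly at random when the input rate exceeds capacity."""

    def __init__(self, capacity_tps: float, seed: int = 0):
        self.capacity_tps = capacity_tps   # sustainable tuples per second (assumed known)
        self.drop_probability = 0.0
        self._rng = random.Random(seed)    # seeded for reproducible runs

    def update(self, observed_input_tps: float) -> None:
        """'When' and 'how many': shed only the load above capacity."""
        if observed_input_tps <= self.capacity_tps:
            self.drop_probability = 0.0
        else:
            self.drop_probability = 1.0 - self.capacity_tps / observed_input_tps

    def admit(self, record) -> bool:
        """'Which': a uniform random choice that ignores tuple semantics."""
        return self._rng.random() >= self.drop_probability


shedder = RandomLoadShedder(capacity_tps=1000.0)
shedder.update(observed_input_tps=2000.0)          # input is twice the capacity
kept = sum(1 for i in range(10_000) if shedder.admit(i))
print(shedder.drop_probability, kept)              # about half of the tuples survive
```

A semantic strategy would replace `admit` with a function of the tuple's contents, and a window-aware strategy would make the decision once per window rather than once per tuple.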

Scheduling and flow control
When load bursts are transient and a temporary increase in latency is preferred to missing results, back-pressure and flow control can provide load management without sacrificing accuracy.Flow control methods include buffering excess load, load-aware scheduling that prioritizes operators with the objective to minimize the backlog, regulating the transmission rate, and throttling the producer.Flow control and back-pressure techniques do not consider application-level quality requirements, such as the semantic importance of input tuples.Their main requirement is availability of buffer space at the sources or intermediate operators and that any accumulated load is within the system capacity limits, so that it will be eventually possible to process the data backlog.
Load-aware scheduling tackles the overload problem by selecting the order of operator execution and by adapting the resource allocation.For instance, backlog can be reduced by dynamically selecting the order of executing filters and joins [19,23].Alternatively, adaptive scheduling [21,34] modifies the allocation of resources given a static query plan.The objective of load-aware scheduling strategies is to select an operator execution order that minimizes the total size of input queues in the system.The scheduler relies on knowledge about operator selectivities and processing costs.These statistics are either assumed to be known in advance, or need to be collected periodically during runtime.Operators are assigned priorities that reflect their potential to minimize intermediate results, and, consequently, the size of queues.
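As a rough illustration of priority-based operator scheduling, the sketch below always runs the ready operator that is expected to remove the most tuples per unit of processing time. The priority formula `(1 - selectivity) / cost`, the `Operator` fields, and the example values are assumptions chosen for this sketch, not the exact policy of the cited schedulers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operator:
    name: str
    selectivity: float  # output tuples per input tuple
    cost: float         # processing time per input tuple
    queued: int = 0     # tuples waiting in the input queue

    @property
    def priority(self) -> float:
        # Tuples eliminated per unit of processing time (assumed metric).
        return (1.0 - self.selectivity) / self.cost


def pick_next(operators: List[Operator]) -> Optional[Operator]:
    """Return the ready operator with the highest priority, if any."""
    ready = [op for op in operators if op.queued > 0]
    return max(ready, key=lambda op: op.priority) if ready else None


ops = [
    Operator("filter", selectivity=0.1, cost=1.0, queued=5),
    Operator("map", selectivity=1.0, cost=0.5, queued=5),
    Operator("join", selectivity=0.5, cost=4.0, queued=0),
]
print(pick_next(ops).name)  # the filter shrinks queues fastest
```

The sketch also hints at the starvation risk of such schedulers: as long as the filter has queued input, the map operator is never selected.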
Back-pressure and flow control.In a network of consumers and producers such as a streaming execution graph with multiple operators, back-pressure has the effect that all operators slow down to match the processing speed of the slowest consumer.If the bottleneck operator is far down the dataflow graph, back-pressure propagates to upstream operators, eventually reaching the data stream sources.To ensure no data loss, a persistent input message queue, such as Apache Kafka, and adequate storage space are required.
Buffer-based back-pressure implicitly controls the flow of data via buffer availability.Considering a fixed amount of buffer space, a bottleneck operator will cause buffers to gradually fill up along its dataflow path.Figure 7a demonstrates buffer-based flow control when the producer and the consumer run on the same machine and share a buffer pool.When a producer generates a result, it serializes it into an output buffer.If the producer and consumer run on the same machine and the consumer is slow, the producer might attempt to retrieve an output buffer when none will be available.The producer's processing rate will, thus, slow down according to the rate the consumer is recycling buffers back into the shared buffer pool.The case when the producer and consumer are deployed on different machines and communicate via TCP is shown in Figure 7b.If no buffer is available on the consumer side, the TCP connection will be interrupted.The producer can use a threshold to control how much data is in-flight and it is slowed down if it cannot put new data on the wire.
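The same-machine case can be illustrated with a small single-threaded simulation in which a producer and a consumer share a fixed buffer pool. The names `BufferPool` and `simulate` and the tick-based model are assumptions of this sketch; a real system would block the producer thread instead of counting stalls.

```python
from collections import deque
from typing import Tuple


class BufferPool:
    """A fixed number of buffers shared by a producer and a consumer."""

    def __init__(self, size: int):
        self.free = size

    def acquire(self) -> bool:
        if self.free == 0:
            return False   # no buffer available: the producer must wait
        self.free -= 1
        return True

    def release(self) -> None:
        self.free += 1     # the consumer recycles a buffer


def simulate(ticks: int, pool_size: int,
             produce_per_tick: int, consume_per_tick: int) -> Tuple[int, int]:
    """Return (records produced, stalled produce attempts)."""
    pool = BufferPool(pool_size)
    in_flight = deque()
    produced = stalled = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            if pool.acquire():
                in_flight.append(produced)
                produced += 1
            else:
                stalled += 1
        for _ in range(min(consume_per_tick, len(in_flight))):
            in_flight.popleft()
            pool.release()
    return produced, stalled


produced, stalled = simulate(ticks=100, pool_size=4,
                             produce_per_tick=3, consume_per_tick=1)
print(produced, stalled)
```

Once the pool is exhausted, the producer's effective rate drops to the rate at which the consumer recycles buffers, which is the implicit slow-down described above.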
Credit-based flow control (CFC) [89] is a link-by-link, per virtual channel congestion control technique used in ATM network switches.In a nutshell, CFC uses a credit system to signal the availability of buffer space from receivers to senders.This classic networking technique turns out to be very useful for load management in modern, highly-parallel stream processors and is implemented in Apache Flink [1]. Figure 8 shows how the scheme works for a hypothetical dataflow.Parallel tasks are connected via virtual channels multiplexed over TCP connections.Each task informs its senders of its buffer availability via credit messages.This way, senders always know whether receivers have the required capacity to handle data messages.When the credit of a receiver drops to zero (or a specified threshold), backpressure appears on its virtual channel.An important advantage of this per-channel flow control mechanism is that back-pressure is inflicted on pairs of communicating tasks only and does not interfere with other tasks sharing the same TCP connection.This is crucial in the presence of data skew where a single overloaded task could otherwise block the flow of data to all other downstream operator instances.On the downside, the additional credit announcement messages might increase end-to-end latency.
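The credit exchange on a single virtual channel can be sketched as follows. This is a simplified, single-threaded model written for illustration and not the implementation of Apache Flink or of ATM switches; the `Sender` and `Receiver` classes and their methods are hypothetical.

```python
from collections import deque


class Receiver:
    def __init__(self, buffers: int):
        self.buffers = buffers
        self.queue = deque()

    def initial_credit(self) -> int:
        return self.buffers            # one credit per free buffer

    def on_data(self, record) -> None:
        assert len(self.queue) < self.buffers, "sender exceeded its credit"
        self.queue.append(record)

    def process_one(self) -> int:
        """Consume one record and return the credit to announce upstream."""
        if self.queue:
            self.queue.popleft()
            return 1
        return 0


class Sender:
    def __init__(self, credit: int):
        self.credit = credit
        self.backlog = deque()         # records held back by back-pressure

    def send(self, record, receiver: Receiver) -> None:
        self.backlog.append(record)
        self._flush(receiver)

    def on_credit(self, amount: int, receiver: Receiver) -> None:
        self.credit += amount
        self._flush(receiver)

    def _flush(self, receiver: Receiver) -> None:
        # Transmit only while the receiver is known to have buffer space.
        while self.credit > 0 and self.backlog:
            receiver.on_data(self.backlog.popleft())
            self.credit -= 1


receiver = Receiver(buffers=2)
sender = Sender(credit=receiver.initial_credit())
for i in range(5):
    sender.send(i, receiver)                          # records 2, 3, 4 are held back
sender.on_credit(receiver.process_one(), receiver)   # one credit releases record 2
print(list(receiver.queue), list(sender.backlog), sender.credit)
```

Because credit is tracked per channel, a sender with zero credit on one channel can still transmit on its other channels, which is the property that limits back-pressure to pairs of communicating tasks.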

Elasticity
The approaches of load shedding and back-pressure are designed to handle workload variations in a statically provisioned stream processor or application.Stream processors deployed on cloud environments or clusters can make use of a dynamic pool of resources.Dynamic scaling or elasticity is the ability of a stream processor to vary the resources available to a running computation in order to handle workload variations efficiently.Building an elastic streaming system requires a policy and a mechanism.The policy component implements a control algorithm that collects performance metrics and decides when and how much to scale.The mechanism effects the configuration change.It handles resource allocation, work re-assignment, and state migration, while guaranteeing result correctness.Table 6 summarizes the dynamic scaling capabilities and characteristics of elastic streaming systems.

Elasticity policies
A scaling policy involves two individual decisions.First, it needs to detect the symptoms of an unhealthy computation and decide whether scaling is necessary.Symptom detection is a well-understood problem and can be addressed using conventional monitoring tools.Second, the policy needs to identify the causes of exhibited symptoms (e.g. a bottleneck operator) and propose a scaling action.This is a challenging task which requires performance analysis and prediction.It is common practice to place the burden of scaling decisions on application users who have to face conflicting incentives.They can either plan for the highest expected workload, possibly incurring high cost, or they can choose to be conservative and risk degraded performance.Automatic scaling refers to scaling decisions transparently handled by the streaming system in response to load.Commercial streaming systems that support automatic scaling include Google Cloud Dataflow [84], Heron [88], and IBM System S [62], while DS2 [79], Seep [35] and StreamCloud [66] are recent research prototypes.
In Table 6, we categorize policies into heuristic and predictive. Heuristic policies rely on empirically predefined rules and are often triggered by thresholds or observed conditions, while predictive policies make scaling decisions guided by analytical performance models.
Heuristic policy controllers gather coarse-grained metrics, such as CPU utilization, observed throughput, queue sizes, and memory utilization, to detect suboptimal scaling. CPU and memory utilization can be inadequate metrics for streaming applications deployed in cloud environments due to multi-tenancy and performance interference [113]. StreamCloud [66] and Seep [35] try to mitigate the problem by separating user time and system time, but preemption can
Predictive policy controllers build an analytical performance model of the streaming system and formulate the scaling problem as a set of mathematical functions. Predictive approaches include queuing theory [58,96,128], control theory [13,83,100], and instrumentation-driven linear performance models [79]. Thanks to their closed-form analytical formulation, predictive policies are capable of making multi-operator decisions in one step.
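The difference between the two policy families can be illustrated with a deliberately simplified sketch. Neither function reproduces the algorithm of a cited system: the thresholds, the one-instance step size, and the linear model "required parallelism equals target rate divided by per-instance processing rate" are assumptions.

```python
import math


def heuristic_policy(cpu_utilization: float, parallelism: int,
                     high: float = 0.8, low: float = 0.3) -> int:
    """Threshold rule: add or remove one instance per decision."""
    if cpu_utilization > high:
        return parallelism + 1
    if cpu_utilization < low and parallelism > 1:
        return parallelism - 1
    return parallelism


def predictive_policy(target_input_rate: float,
                      rate_per_instance: float) -> int:
    """Linear model: instances needed to sustain the target input rate."""
    return max(1, math.ceil(target_input_rate / rate_per_instance))


# An operator with 4 instances, each able to process 1800 records/s,
# faces 10000 records/s of input.
print(heuristic_policy(cpu_utilization=0.95, parallelism=4))
print(predictive_policy(target_input_rate=10_000, rate_per_instance=1_800))
```

In this example the threshold rule needs several consecutive decisions to reach a sufficient parallelism, whereas the model-based rule computes it in one step, provided that the per-instance rate is measured accurately.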

Elasticity mechanisms
Elasticity mechanisms are concerned with realizing the actions indicated by the policy. They need to ensure correctness and low-latency redistribution of accumulated state when effecting a reconfiguration. To ensure correctness, many streaming systems rely on the fault-tolerance mechanism to provide reconfiguration capabilities. When adding new workers to a running computation, the mechanism needs not only to re-assign work to them but also to migrate any state these new workers will now be in charge of. Elasticity mechanisms need to complete a reconfiguration as quickly as possible and at the same time minimize performance disruption. We review the main methods for state redistribution, reconfiguration, and state transfer next. We focus on systems with embedded state, as reconfiguration mechanisms are significantly simplified when state is external.

State redistribution.
State redistribution must preserve key semantics, so that existing state for a particular key and all future events with this key are routed to the same worker.For that purpose, most systems use hashing methods.Uniform hashing evenly distributes keys across parallel tasks.It is fast to compute and requires no routing state but might incur high migration cost.When a new node is added, state is shuffled across existing and new workers.It also causes random I/O and high network communication.Thus, it is not particularly suitable for adaptive applications.Consistent hashing and variations are more often preferred.Workers and keys are mapped to multiple points on a ring using multiple random hash functions.Consistent hashing ensures that state is not moved across workers that are present before and after the migration.When a new worker joins, it becomes responsible for data items from multiple of the existing nodes.When a worker leaves, its key space is distributed over existing workers.Apache Flink [32] uses a variation of consistent hashing in which state is organized into key groups and those are mapped to parallel tasks as ranges.On reconfiguration, reads are sequential within each key group, and often across multiple key groups.The metadata of key group to task assignments are small and it is sufficient to store key-group range boundaries.The number of key groups limits the maximum number of parallel tasks to which keyed state can be scaled.
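The key-group idea can be sketched in a few lines. The sketch is loosely modelled on the description above and is not Apache Flink's actual code: the function names are hypothetical and SHA-256 stands in for whatever hash function a real system uses.

```python
import hashlib


def key_group(key: str, max_parallelism: int) -> int:
    """Map a key to one of max_parallelism key groups (stable across runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_parallelism


def task_for_key_group(group: int, parallelism: int, max_parallelism: int) -> int:
    """Assign key groups to parallel tasks as contiguous ranges."""
    return group * parallelism // max_parallelism


def key_group_range(task: int, parallelism: int, max_parallelism: int) -> range:
    """The contiguous range of key groups owned by a task."""
    start = (task * max_parallelism + parallelism - 1) // parallelism
    end = ((task + 1) * max_parallelism + parallelism - 1) // parallelism
    return range(start, end)


MAX_PARALLELISM = 8
for parallelism in (2, 3):
    ranges = [key_group_range(t, parallelism, MAX_PARALLELISM)
              for t in range(parallelism)]
    print(parallelism, [(r.start, r.stop) for r in ranges])
```

A key never changes its key group, so rescaling from 2 to 3 tasks only changes which contiguous ranges each task reads. The routing metadata is limited to the range boundaries, and no more than `MAX_PARALLELISM` tasks can receive state.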
Hashing techniques are simple to implement and do not require storing any routing state, however, they do not perform well under skewed key distributions. Hybrid partitioning [61] combines consistent hashing and an explicit mapping to generate a compact hash function that provides load balance in the presence of skew. The main idea is to track the frequencies of the partitioning key values and treat normal keys and popular keys differently. The mechanism uses the lossy counting algorithm [99] in a sliding window setting to estimate heavy hitters, as keeping exact counts would be impractical for large key domains.
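A minimal sketch of the two ingredients, frequency estimation and differentiated routing, is given below. It simplifies the scheme in several ways that should be read as assumptions: lossy counting runs over the whole stream rather than a sliding window, plain modulo hashing replaces consistent hashing, and the names `LossyCounter` and `route` are hypothetical.

```python
import math
import zlib
from typing import Dict, Set


class LossyCounter:
    """Approximate frequency counts with error at most epsilon * n."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)  # bucket width
        self.n = 0                             # items seen so far
        self.entries = {}                      # key -> [count, max undercount]

    def add(self, key: str) -> None:
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            self.entries[key] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, evict keys that cannot be frequent.
            self.entries = {k: e for k, e in self.entries.items()
                            if e[0] + e[1] > bucket}

    def heavy_hitters(self, support: float) -> Set[str]:
        threshold = (support - self.epsilon) * self.n
        return {k for k, (count, _) in self.entries.items() if count >= threshold}


def route(key: str, num_tasks: int, explicit: Dict[str, int]) -> int:
    """Popular keys follow the explicit mapping; all others are hashed."""
    if key in explicit:
        return explicit[key]
    return zlib.crc32(key.encode("utf-8")) % num_tasks


counter = LossyCounter(epsilon=0.01)
for i in range(500):
    counter.add("hot")       # one key carries half of the stream
    counter.add(f"k{i}")     # the rest are distinct keys
hot_keys = counter.heavy_hitters(support=0.1)
explicit = {key: i % 4 for i, key in enumerate(sorted(hot_keys))}
print(hot_keys, route("hot", 4, explicit), route("k7", 4, explicit))
```

Only the popular keys require routing state, so the explicit mapping stays small even when the key domain is large.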
Reconfiguration strategy. Regardless of the re-partitioning strategy used, if the elasticity policy makes a decision to change an application's resources, the mechanism will have to transfer some amount of state across workers on the same or different physical machines.
The stop-and-restart strategy halts the computation, takes a state snapshot of all operators, and then restarts the application with the new configuration. Even though this mechanism is simple to implement and trivially guarantees correctness, it unnecessarily stalls the entire pipeline even if only one or a few operators need to be rescaled. As shown in Table 6, this strategy is very common in modern systems.
Partial pause and restart, introduced by FLUX [117], is a less disruptive strategy that only blocks the affected dataflow subgraph temporarily. The affected subgraph contains the operator to be scaled, as well as upstream channels and upstream operators. Figure 9 shows an example of the protocol. To migrate state from a source operator to a target operator, the mechanism will execute the following steps: (1) First, it pauses the source operator's upstream operators and stops pushing tuples to the source operator. Paused operators start buffering input tuples in their local buffers. The source operator continues processing tuples in its buffers until they are empty. (2) Once the source operator's buffers are empty, it extracts its state and sends it to the target operator. (3) The target operator loads the state and (4) sends a restart signal to upstream operators. Once upstream operators receive the signal, they can start processing tuples again.
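The four steps can be traced in a small single-threaded simulation. This is an illustrative sketch rather than the FLUX implementation: the `Operator` and `Upstream` classes, the per-key counting state, and the `arriving` parameter (tuples that show up while the subgraph is paused) are assumptions.

```python
from collections import Counter, deque
from typing import Iterable, List


class Operator:
    """A stateful operator that counts tuples per key."""

    def __init__(self, name: str):
        self.name = name
        self.inbox = deque()
        self.state = Counter()

    def drain(self) -> None:
        while self.inbox:
            self.state[self.inbox.popleft()] += 1


class Upstream:
    def __init__(self, downstream: Operator):
        self.downstream = downstream
        self.paused = False
        self.buffer = deque()

    def emit(self, key: str) -> None:
        if self.paused:
            self.buffer.append(key)             # buffer locally while paused
        else:
            self.downstream.inbox.append(key)

    def restart(self, new_downstream: Operator) -> None:
        self.downstream = new_downstream
        self.paused = False
        while self.buffer:                      # release buffered tuples
            self.downstream.inbox.append(self.buffer.popleft())


def migrate(source: Operator, target: Operator,
            upstreams: List[Upstream], arriving: Iterable[str] = ()) -> None:
    for up in upstreams:                        # (1) pause upstream operators
        up.paused = True
    for key in arriving:                        # input keeps arriving meanwhile
        upstreams[0].emit(key)
    source.drain()                              # (1) source empties its buffers
    state, source.state = source.state, Counter()  # (2) extract state
    target.state.update(state)                  # (3) target loads the state
    for up in upstreams:                        # (4) restart signal
        up.restart(target)


a, b = Operator("a"), Operator("b")
up = Upstream(a)
up.emit("x")
up.emit("x")
migrate(a, b, [up], arriving=["x"])
b.drain()
print(b.state["x"], len(a.state))
```

Only the upstream operators of the migrating operator are paused in this sketch; other parts of the dataflow would continue to run, which is what distinguishes the strategy from stop-and-restart.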
The proactive replication strategy maintains state backup copies in multiple nodes so that reconfiguration can be performed in a nearly live manner when needed. The state is organized into smaller partitions, each of which can be transferred independently. Each node has a set of primary state slices and a set of secondary state slices. Figure 10 shows an example of the protocol as implemented by ChronoStream [137].

State transfer.
Another important decision to make when migrating state from one worker to another is whether the state is moved all-at-once or in a progressive manner. If a large amount of state needs to be transferred, moving it in one operation might cause high latency during re-configuration. Alternatively, progressive migration [73] moves state in smaller pieces and flattens latency spikes by interleaving state transfer with processing. On the downside, progressive state migration might lead to longer migration duration.
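The trade-off between the two transfer modes can be illustrated with a toy model in which the length of each pause is proportional to the number of keys moved in one step. The function `migrate`, the dictionary-based state, and the chunk sizes are assumptions of this sketch and do not reproduce the mechanism of [73].

```python
from typing import Dict, Iterator, List


def chunks(keys: List[str], size: int) -> Iterator[List[str]]:
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


def migrate(source: Dict[str, int], target: Dict[str, int],
            chunk_size: int) -> List[int]:
    """Move all keys from source to target; return the keys moved per step."""
    pauses = []
    for batch in chunks(sorted(source), chunk_size):
        for key in batch:
            target[key] = source.pop(key)
        pauses.append(len(batch))
        # In a real system, regular tuple processing would resume here
        # before the next chunk is transferred.
    return pauses


all_at_once = migrate({f"k{i}": i for i in range(10)}, {}, chunk_size=10)
progressive = migrate({f"k{i}": i for i in range(10)}, {}, chunk_size=3)
print(all_at_once, progressive)  # [10] versus [3, 3, 3, 1]
```

The progressive variant bounds the longest pause by the chunk size but needs more steps, which mirrors the longer overall migration duration noted above.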

1st generation vs. 2nd generation
Comparing early to modern approaches, we make the following observations.While load shedding was popular among early stream processors, modern systems do not favor the approach of degrading results quality anymore.Another important difference is that load management approaches in 1st generation systems used to affect the execution of multiple queries as they formed a shared dataflow plan (cf.Section 2).Queries in modern systems are typically executed as independent jobs, thus, back-pressure on a certain query will not affect the execution of other queries running on the same cluster.Scaling down is a quite recent requirement that was not a matter of concern before cloud deployments.The dependence on persistent queues for providing correctness guarantees is another recent characteristic, mainly required by systems employing back-pressure.Finally, while early load shedding and load-aware scheduling techniques assume a limited set of operators whose properties and characteristics are stable throughout execution, modern systems implement general load management methods that are applicable even if cost and selectivity vary or are unknown.

Open Problems
Adaptive scheduling methods have so far been studied in the context of simple query plans with operators whose selectivities and costs are fixed and known.It is unclear whether these methods generalize to arbitrary plans, operators with UDFs, general windows, and custom joins.Load-aware scheduling can further cause starvation and increased per-tuple latency, as low-priority operators with records in their input buffers would need to wait a long time during bursts.Finally, existing methods are restricted to streams that arrive in timestamp order and do not support out-of-order or delayed events.
Re-configurable stream processing is a quite recent research area, where stream processors are designed to not only be capable of adjusting their resource allocation but other elements of their runtime as well. Elasticity, the ability of a stream processor to dynamically adjust resource allocation, can be considered a special case of re-configuration. Others include code updates for bug fixes, version upgrades, or business logic changes, execution plan switching, dynamic scheduling and operator placement, as well as skew and straggler mitigation. So far, each of the aforementioned re-configuration scenarios has been largely studied in isolation. To provide general re-configuration and self-management, future systems will need to take into account how optimizations interact with each other.

Conclusion
While early streaming systems strove to extend relational execution engines with time-based window processing, modern systems have evolved significantly in terms of architecture and capabilities. Table 1 summarizes the evolution of major streaming system aspects over the last three decades.
While approximate results were mainstream in early systems, modern systems have primarily focused on results correctness and have largely rejected the notion of approximation. In terms of languages, modern systems favor general-purpose programming languages, although we have recently witnessed a trend of returning to extensions for streaming SQL [26]. Over the years, execution has also gradually transitioned from mainly centralized to mainly distributed, exploiting data, pipeline, and task parallelism. At the same time, most modern systems construct independent execution plans per query and apply little optimization and sharing.
Regarding time, order, and progress, many inventions of the past have stood the test of time, since they continue to hold a place in modern streaming systems. In particular, MillWheel and the Google Dataflow Model popularized punctuations, watermarks, the out-of-order architecture, and triggers for revision processing. Streaming state management witnessed a major shift, from specialized in-memory synopses to the large partitioned and persistent state supported today. As a result, fault tolerance and high availability also shifted towards passive replication and exactly-once processing. Finally, load management approaches have transitioned from load shedding and scheduling methods to elasticity and backpressure coupled with persistent inputs.
In state management we identify the most radical changes seen in data streaming so far. The most obvious advances relate to the scalability of state and long-term persistence in unbounded executions. Yet, today's systems have also invested thoroughly in providing transactional guarantees that are on par with those that modern database management systems offer. Transactional stream processing has pivoted data streaming beyond the use for data analytics and has also opened new research directions in terms of efficient methods for backing and accessing state that grows without bound. Stream state and compute are gradually being decoupled, and this allows for better optimizations, wider interoperability with storage technologies, as well as novel semantics for shared and external state, with stream processors as the backbone of modern continuous applications and live scalable data services.
We believe the road ahead is still long for streaming systems.Emerging streaming applications in the areas of Cloud services [9,63], machine learning [59,103], and streaming graph analytics [8,27] present new requirements and are already shaping the key characteristics of the future generation of data stream technology.We expect systems to evolve further and exploit next-generation hardware [141,142], focus on transactions and iteration support, improve their reconfiguration capabilities, and take state management a step further by leveraging workload-aware backends [78], shared state and versioning.

Fig. 1: An overview of the evolution of stream processing and respective domains of focus.
Section 4.5 covers different architectures and presents examples of how modern systems can support large volumes of state, beyond what can fit in memory, within unbounded executions.Another foundational transitioning step in stream technology has been the development and adoption of transactional-level guarantees.Section 4.6 gives an overview of the state of the art and covers the semantics of transactions in data streaming alongside implementation methodologies.

…
Fig. 4: Scalable Architectures for Stateful Data Streaming

Fig. 6: Measuring availability with the slack between processing time and event time over time

Fig. 8: Credit-based flow control in a dataflow graph. Receivers regularly announce their credit upstream (gray and white squares indicate full and free buffers, respectively).

Fig. 9: An example of the partial-pause-and-restart protocol. To move state from a source operator to a target operator, the mechanism executes the following steps: (1) pause the source operator's upstream operators, (2) extract state from the source operator, (3) load state into the target operator, and (4) send a restart signal from the target operator to upstream operators.

Fig. 10: An example of the proactive replication protocol. To move slice #1 from a source operator to a target operator, the mechanism executes the following steps: (1) the leader instructs the target operator to load slice #1, (2) the target operator loads slice #1 and sends an ack to the leader, (3) the leader notifies upstream operators to replay events, (4) upstream operators start rerouting events to the target operator, (5) the leader notifies the source operator that the transfer is complete, and slice #1 is moved to the backup group.

Table 1: Evolution of streaming systems

Table 2: Event order management in streaming systems
streaming systems to date cannot support multiple time domains.

Table 3: State Management Features in Streaming Systems

Table 4: Fault-tolerance in streaming systems.

Table 5: Assumptions that systems make for solving the output commit problem [62]
Google Cloud Dataflow [84] relies on CPU utilization for scale-down decisions only but still suffers false negatives. Dhalion [57] and IBM Streams [62] also use congestion and back-pressure signals to identify bottlenecks. These metrics are helpful for identifying bottlenecks but they cannot detect resource over-provisioning.

Table 6: Elasticity policies and mechanisms in streaming systems