Scalable online first-order monitoring

Online monitoring is the task of identifying complex temporal patterns while incrementally processing streams of data-carrying events. Existing state-of-the-art monitors for first-order patterns, which may refer to and quantify over data values, can process streams of modest velocity in real-time. We show how to scale up first-order monitoring to substantially higher velocities by slicing the stream, based on the events’ data values, into substreams that can be monitored independently. Because monitoring is not embarrassingly parallel in general, slicing can lead to data duplication. To reduce this overhead, we adapt hash-based partitioning techniques from databases to the monitoring setting. We implement these techniques in an automatic data slicer based on Apache Flink and empirically evaluate its performance using two tools—MonPoly and DejaVu—to monitor the substreams. Our evaluation attests to substantial scalability improvements for both tools.


Introduction
In large-scale software systems, millions of events occur each second [25,41]. Identifying instances of interesting patterns in these high-velocity data streams is a central challenge in the area of runtime verification and monitoring. Often, this search must be performed online given the systems' continuous operation and the massive amounts of data they produce.
An online monitor takes as input a pattern and a stream of data, which it consumes incrementally, and it detects and outputs matches with the pattern. The specification language for patterns significantly influences the monitor's time and space complexity. For propositional languages, such as metric temporal logic or metric dynamic logic, existing state-of-the-art monitors are capable of handling millions of events per second in real time on commodity hardware [9,16,46,47]. Propositional languages, however, are severely limited in their expressiveness. Since they regard events as atomic, they cannot formulate dependencies between the data values stored in events. First-order languages, such as metric first-order temporal logic (MFOTL) [14], do not have this limitation. Various online monitors [6,8,14,17,36,48,50] can handle first-order languages for event streams, but only with modest velocities.
We improve the scalability of online first-order monitors using parallelization. There are two basic approaches regarding what to parallelize. Task parallelism adapts the monitoring algorithm to evaluate multiple subpatterns in parallel. The amount of parallelization offered is limited by the number of subpatterns of a given input pattern. The alternative is data parallelism, where multiple copies of the monitoring algorithm are run unchanged as a black box, in parallel, on different portions of the input data stream.
In this article we focus on data parallelism, which is attractive for several reasons. As it is a black-box approach, data parallelism allows us to reuse existing monitors, which implement heavily optimized sequential algorithms. It also offers a virtually unbounded amount of parallelization, especially on high-volume and high-velocity data streams. Finally, it caters for the use of general-purpose libraries for data-parallel stream processing. These libraries deal with common challenges in high-performance computing, such as deployment on computing clusters, fault-tolerance, and back-pressure induced by velocity spikes.
Data parallelism has previously been used to scale up the offline monitoring of systems (Sect. 2), which is performed after the systems completed their execution. Yet neither offline nor online monitoring is an embarrassingly parallel task in general. Thus, in some cases, the monitors executing in parallel must synchronize. Alternatively, careful data duplication across these monitors allows for a non-blocking parallel architecture. An important contribution of prior work on scalable offline monitoring is the development of a (data) slicing framework [10]. The framework takes as input an MFOTL formula (Sect. 3) and a splitting strategy that determines to which of the parallel monitors the data should be sent. The framework's output is a dispatcher that forwards events to appropriate monitors and ensures that the overall parallel architecture collectively produces exactly the same results that a single monitor would produce.
The previous slicing framework has three severe limitations. First, data can be sliced on only one free variable at a time. Although it is possible to compose multiple single-variable slices into multi-variable slices, this composition is less expressive than simultaneously slicing on multiple variables. We explain the difference in Sect. 4.3. Second, the user of the slicing framework must supply a splitting strategy, even when it is obvious what the best strategy is for the given formula. Third, the framework's implementation uses Google's MapReduce library for parallel processing, which restricts its applicability to just offline monitoring.
This article addresses all of the above limitations and thereby makes the following contributions:
- We generalize the offline slicing framework [10] to support simultaneous slicing on multiple variables and we also adapt it to online monitoring (Sect. 4).
- We instantiate the slicing framework with an automatic splitting strategy (Sect. 5) inspired by hash-based partitioning and the hypercube algorithm [3,38]. This algorithm has previously been used to parallelize relational join operators in databases. Skew, which is the presence of frequently occurring values, can cause imbalances in hash-based partitioning. Our automatic strategy also addresses this issue by separately handling events with frequently occurring values, using another database technique that we adapt to the monitoring setting.
- We implement our new slicing framework using the Apache Flink [4] stream processing engine (Sect. 6). We use both MonPoly [14,15] and DejaVu [36] as black-box monitors for the slices. A particular challenge was to efficiently checkpoint MonPoly's state within Flink to achieve fault-tolerance. (We do not address fault-tolerance and skew for DejaVu.)
- We evaluate the slicing framework and automatic strategy selection on both real-world data based on Nokia's data collection campaign [13] and synthetic data exercising difficult cases (Sect. 7). We show that the overall parallel architecture substantially improves the throughput. Although the optimality of the hypercube approach in terms of a balanced data distribution is out of reach for general MFOTL formulas, we demonstrate that our automatic splitting results in balanced slices and improved monitoring performance.
An earlier version of this work was presented at RV 2018 [51]. This article extends the conference paper with detailed proofs of the slicing framework's correctness (Sect. 4) and a significantly expanded description of the automatic strategy selection algorithm (Sect. 5). This includes background information on the standard hypercube algorithm from databases (Sect. 5.1), which we build upon. Moreover, we have integrated DejaVu as a second black-box monitor in addition to MonPoly in our Apache Flink-based implementation (Sect. 6). This demonstrates our framework's generality. Finally, we (re-)evaluate both versions of the resulting parallel online monitor (Sect. 7). For both, higher parallelism yields significantly improved performance.
All theorems stated in this article, namely those establishing our slicing framework's correctness, have been mechanically checked using the Isabelle proof assistant. Additionally, we provide detailed proofs in this article for the benefit of readers not familiar with Isabelle. Both our implementation [53] and formalization [55] are publicly available. The formal verification of an MFOTL monitor modeled after MonPoly has been addressed in a separate line of work [11,54].

Related work
Our work builds on the slicing framework introduced by Basin et al. [10]. This framework ensures the sound and complete slicing of the event stream with respect to MFOTL formulas. It prescribes the use of composable operators, called slicers, that slice data associated with a single free variable, or slice data based on time. As explained in the introduction, we have generalized their data slicers to operate simultaneously on all free variables in a formula. Moreover, the use of MapReduce in the original framework's implementation limited it to offline monitoring. In contrast, our Apache Flink implementation supports online monitoring. Finally, our implementation extends the framework with an automatic strategy selection that results in a balanced load distribution for the slices in our empirical evaluation.
Barre et al. [5], Bianculli et al. [22], and Bersani et al. [21] use task parallelism over subformulas to parallelize propositional offline monitors. The degree of parallelization in these approaches is limited by the formula's size.
Parametric trace slicing [50] lifts propositional monitoring to parametric specifications. To this end, a trace with parametric events is split into propositional slices with events grouped by their parameter instances, which can be monitored independently. Parametric trace slicing considers only non-metric policies with top-level universal quantification. Barringer et al. [6] generalize this approach to more complex properties expressed using quantified event automata (QEA). Reger and Rydeheard [48] delimit the sliceable fragment of first-order linear temporal logic (FO-LTL) that admits a sound application of parametric trace slicing. The fragment prohibits deeply nested quantification and using the "next" operator. These restrictions originate from the time model used, in which time-points consist of exactly one event. Hence, when an event is removed from a slice, information about that time-point is lost. Our time model, based on sequences of time-stamped sets of events, avoids such pitfalls. Parametric trace slicing produces an exponential number of propositional slices (in the domain's size), whereas we use as many slices as there are parallel monitors available.
Kuhtz and Finkbeiner [39] show that the LTL monitoring problem belongs to the complexity class AC^1(logDCFL) and hence can be efficiently parallelized. However, the Boolean circuits used to establish the lower bound must be built for each trace in advance, which limits these results to offline monitoring. A similar limitation applies to the work by Bundala and Ouaknine [23] and Feng et al. [31], who study variants of MTL and TPTL.
Complex event processing (CEP) systems analyze streams by recognizing composite events as (temporal) patterns built from simple events. These systems allow for ample parallelism. However, their languages are often based on SQL extensions without a clear semantics. An exception is BeepBeep [33,34]: a multi-threaded stream processor that supports LTL-FO+, a first-order variant of LTL. The parallelism in BeepBeep must, however, be arranged manually by the user.
Event stream processing systems have been extensively studied in the database community. We focus on the most closely related works. The hypercube algorithm (also known as the shares algorithm) was proposed by Afrati and Ullman [3] in the context of MapReduce. The algorithm is similar to the triangle counting algorithm by Suri and Vassilvitskii [56] and can be traced back to the parallel evaluation of datalog queries [32]. The hypercube algorithm is optimal for conjunctive queries with one communication round on skew-free databases [20], which do not contain heavy hitters (data values that occur more frequently than a fixed threshold).
The hypercube algorithm and other hash-based partitioning schemes are sensitive to skew. Rivetti et al. [49] suggest applying a greedy balancing strategy after identifying heavy hitters. This approach is restricted to conjunctive queries where all relations share a common join key. Joglekar et al. [37] improve asymptotically over the hypercube algorithm by using multiple communication rounds. Nasir et al. [42,43] balance skew for associative stream operators without explicitly identifying heavy hitters. Vitorovic et al. [58] combine the hash-based hypercube, prone to heavy hitters, with random partitioning [44], resilient to heavy hitters. Their combination only applies to conjunctive queries and limits the impact of skew without improving the worst-case performance. All these approaches are unsuitable for handling MFOTL formulas. Instead we follow a hypercube variant that is worst-case optimal in the presence of skew [38]. The heavy hitters must be known in advance in this approach. In contrast to the earlier algorithm by Beame et al. [19], it is sufficient to consider the heavy hitters of each attribute in isolation.

Metric first-order temporal logic
We briefly recall the syntax and semantics of our specification language, metric first-order temporal logic (MFOTL) [14].
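We fix a set of names E and for simplicity assume a single infinite domain D of values. The names r ∈ E have associated arities ι(r) ∈ N. An event r(d_1, ..., d_ι(r)) is an element of E × D*. We call 1, ..., ι(r) the attributes of the name r. We further fix an infinite set V of variables, such that V, D, and E are pairwise disjoint. Let I be the set of nonempty intervals over N. Formulas ϕ are constructed inductively, where t_i, r, x, and I range over V ∪ D, E, V, and I, respectively:

$$\varphi ::= r(t_1, \ldots, t_{\iota(r)}) \mid t_1 \approx t_2 \mid \neg\varphi \mid \varphi \lor \varphi \mid \exists x.\ \varphi \mid \bullet_I\, \varphi \mid \circ_I\, \varphi \mid \varphi\ \mathsf{S}_I\ \varphi \mid \varphi\ \mathsf{U}_I\ \varphi$$

Along with the Boolean operators, MFOTL includes the metric past and future temporal operators ● (previous), S (since), ○ (next), and U (until), which may be nested freely. We define other standard operators in terms of this minimal syntax, for example truth ⊤ := ∃x. x ≈ x, conjunction ϕ ∧ ψ := ¬(¬ϕ ∨ ¬ψ), universal quantification ∀x. ϕ := ¬∃x. ¬ϕ, eventually ◊_I ϕ := ⊤ U_I ϕ, and always □_I ϕ := ¬◊_I ¬ϕ.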
MFOTL formulas are interpreted over streams of time-stamped events. We group finite sets of events that happen concurrently into databases. An (event) stream ρ is an infinite sequence (τ_i, D_i)_{i∈N} of databases D_i with associated time-stamps τ_i. We assume discrete time-stamps, modeled as natural numbers τ ∈ N. The event source may use a finer notion of time than the one used for time-stamps: databases at different indices i ≠ j may have the same time-stamp τ_i = τ_j. The sequence of time-stamps must be non-strictly increasing (∀i. τ_i ≤ τ_{i+1}) and always eventually strictly increasing (∀τ. ∃i. τ < τ_i).
The relation v, i ⊨_ρ ϕ (Fig. 1) defines the satisfaction of the formula ϕ for a valuation v at an index i with respect to the stream ρ = (τ_i, D_i)_{i∈N}. The valuation v is a mapping V(ϕ) → D, assigning domain elements to the free variables of ϕ. Overloading notation, v is also the extension of v to the domain V(ϕ) ∪ D, setting v(t) = t whenever t ∈ D. We write v[x → y] for the function equal to v, except that x is mapped to y.
Let S be the set of streams. Although satisfaction is defined over streams, a monitor will always receive only a finite stream prefix. We write P for the set of prefixes and ≼ for the usual prefix order on streams and prefixes. For a prefix π and i < |π|, π[i] denotes π's i-th element.
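To make the stream model concrete, the following minimal Python sketch represents a prefix as a list of time-stamped databases and checks the two conditions that can be tested on a finite prefix: non-strictly increasing time-stamps and the prefix order. The type aliases, the helper `db`, and the function names are hypothetical and chosen only for illustration; data values are assumed to be integers.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical encoding: an event is (name, data values), a database is a
# finite set of events, and a prefix is a list of time-stamped databases.
Event = Tuple[str, Tuple[int, ...]]
Database = FrozenSet[Event]
Prefix = List[Tuple[int, Database]]


def db(*events: Event) -> Database:
    """Build a database from the given events."""
    return frozenset(events)


def is_valid_prefix(prefix: Prefix) -> bool:
    """Check that the time-stamps are non-strictly increasing."""
    stamps = [ts for ts, _ in prefix]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


def is_prefix_of(p: Prefix, q: Prefix) -> bool:
    """The prefix order: p is an initial segment of q."""
    return len(p) <= len(q) and q[:len(p)] == p


pi1: Prefix = [(11, db(("P", (7, 5))))]
pi2: Prefix = pi1 + [(12, db(("P", (5, 1)), ("Q", (7, 5))))]
```

The requirement that time-stamps eventually increase strictly concerns infinite streams and therefore cannot be checked on a finite prefix.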

Slicing framework
We introduce a general framework for parallel online monitoring based on slicing. Basin et al. [10] provide operators that split finite logs offline into independently monitorable slices, based on the events' data values and time-stamps. Each slice contains only a subset of the events from the original trace, which reduces the computational effort required to monitor the slice. We adapt this idea to online monitoring. Our framework is abstract. We start with a characterization of an online monitor's input-output behavior (Sect. 4.1). Slicing's fundamental property is that it preserves this behavior (Sect. 4.2). We then refine the framework and focus on the data in the events, since slicing with respect to time is more suitable for offline monitoring (Sect. 4.3).

Monitor functions
Abstractly, a monitor function M ∈ P → O maps stream prefixes to verdict outputs from some set O. A monitor is an algorithm that implements a monitor function. An online monitor receives incremental updates of a stream prefix and computes the corresponding verdicts. We consider time-stamped databases to be the atomic units of the online monitor's input. The monitor may produce the verdicts incrementally, too. To represent this behavior at the level of monitor functions, we assume that verdict outputs are equipped with a partial order ⊑, where o_1 ⊑ o_2 means that o_2 provides at least as much information as o_1. We also assume that M is a monotone map from the poset ⟨P, ≼⟩, i.e., stream prefixes ordered by the prefix relation, to the poset ⟨O, ⊑⟩. This captures the intuition that as the monitor function receives more input, it produces more output, and, depending on the partial order ⊑, it does not retract previous verdicts.
The standard application of monitors for runtime verification is detecting violations of a safety property of the form □ ∀x_1 ... x_n. ϕ. To do this, one can monitor the negation ¬ϕ to obtain the valuations of the variables x_1, ..., x_n that satisfy the negation. Such valuations correspond to the violations of the initial safety property. We call monitors that output valuations of the free variables informative.
Intuitively, the verdict of an informative monitor function M_ϕ is a set of tuples (v, i), where v is a valuation of the free variables of the MFOTL formula ϕ and i is an index in the event stream. We call these tuples satisfying valuations. Thus, we instantiate ⟨O, ⊑⟩ with ⟨P((V(ϕ) → D) × N), ⊆⟩ when we work with an informative monitor function. By using the subset relation as the partial order on verdicts, the granularity at which an online implementation can incrementally output its verdict is at the level of satisfying valuations. The following definition makes the above intuition more precise.

Definition 1 An informative monitor function
Soundness restricts the output to valuations that are satisfied independently of future events: the monitor may output a tuple (v, i) only if it is a satisfying valuation for all streams ρ extending the prefix π. This property is sometimes called impartiality [40]. Our definition of completeness is a weak form of anticipation [40]: once a valuation v is satisfied at an index i on every possible extension of the prefix π, the monitor must eventually output this fact. However, we allow the output to be delayed, which is generally necessary for formulas with future modalities. The delay may be unbounded with respect to either time or the number of databases alone. We therefore require that for any choice of the infinite stream extension ρ ≽ π, there is another prefix π′ ≼ ρ such that M_ϕ(π′) contains the satisfying valuation (v, i). Informative monitor functions are not unique because the output delay is not fixed.
As concrete examples, the MonPoly monitor [15] implements an informative monitor function for a practically relevant fragment of MFOTL [14]. MonPoly's output delay depends only on the future operators' intervals in the monitored formula. The DejaVu monitor [36] internally computes an informative monitor function for a past-only fragment of MFOTL, where all intervals are [0, ∞). It represents valuations as binary decision diagrams (BDDs), but does not output them. Instead, DejaVu's verdicts consist only of the indices where violations occurred. Since DejaVu does not support future operators, its verdict output is never delayed.
We briefly compare our informative monitor functions with another common type of monitor functions from the literature where O is the set {?, ⊥, ⊤} and the partial order ⊑ is the reflexive closure of {(?, ⊥), (?, ⊤)} [18,45]. The verdict ⊥ means that the monitored prefix is a bad prefix, i.e., all its infinite extensions violate the formula. Similarly, ⊤ denotes a good prefix, while ? indicates an inconclusive result. Every nonempty result from M_ϕ(π) corresponds to a ⊥ verdict for the formula □ ∀x_1 ... x_n. ¬ϕ (due to soundness), whereas an empty result could either mean ? or ⊤.

Abstract slicing
Parallelizing a monitor should not affect its input-output behavior. We formulate this correctness requirement abstractly using the notion of a slicer for a monitor function. The slicer specifies how to split a stream prefix into independently monitorable substreams, called slices, and how to combine the verdict outputs of the parallel submonitors into a single verdict.

Definition 2
A slicer for a monitor function M ∈ P → O is a tuple (K, M, S, J), where K is a set of slice identifiers, the submonitor family M ∈ K → (P → O) is a K-indexed family of monitor functions, the splitter S ∈ P → (K → P) splits prefixes into K-indexed slices, and the joiner J ∈ (K → O) → O combines K-indexed verdicts into a single one, satisfying:
- Monotonicity: for all π_1, π_2 ∈ P and all k ∈ K, π_1 ≼ π_2 implies S(π_1)_k ≼ S(π_2)_k.
- Correctness: for all π ∈ P, J(λk. M_k(S(π)_k)) = M(π).
For an input prefix π, let S(π) denote the collection of its slices. Each slice is identified by an element k ∈ K, which we write as a subscript. We require the splitter S to be monotone so that the submonitors M_k, which may differ from the monitor function M, can process the sliced prefixes incrementally. Composing the splitter, the corresponding submonitor for each slice, and the joiner as shown in Fig. 2 yields the parallelized monitor function J(λk. M_k(S(π)_k)). This function is correct if and only if it computes the same verdicts as M.
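The composition of splitter, submonitors, and joiner can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the toy monitor, the parity-based splitter, and the union joiner are hypothetical and are not the monitors or strategies used in this article. Every event is assumed to carry at least one integer data value.

```python
def parallelize(keys, submonitors, splitter, joiner):
    """Return the parallelized monitor function J(lambda k. M_k(S(pi)_k))."""
    def monitor(prefix):
        slices = splitter(prefix)                         # K-indexed slices
        verdicts = {k: submonitors[k](slices[k]) for k in keys}
        return joiner(verdicts)                           # single verdict
    return monitor


def toy_monitor(prefix):
    """Toy monitor: indices whose database contains an event named 'Bad'."""
    return {i for i, (_, db) in enumerate(prefix)
            if any(name == "Bad" for name, _ in db)}


def toy_splitter(prefix):
    """Toy splitter: route each event by the parity of its first data value.

    Databases are kept (possibly empty) so that indices are preserved.
    """
    return {k: [(ts, frozenset(e for e in db if e[1][0] % 2 == k))
                for ts, db in prefix]
            for k in (0, 1)}


def union_joiner(verdicts):
    """Combine the K-indexed verdicts by taking their union."""
    return set().union(*verdicts.values())


keys = (0, 1)
parallel_monitor = parallelize(keys, {k: toy_monitor for k in keys},
                               toy_splitter, union_joiner)

prefix = [
    (0, frozenset({("Bad", (1,)), ("Ok", (2,))})),
    (1, frozenset({("Ok", (3,))})),
    (1, frozenset({("Bad", (4,))})),
]
```

For this toy instance, the correctness condition holds because every event is sent to exactly one slice and the toy monitor inspects events one at a time; first-order formulas generally need the more careful slicing described in the rest of this section.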
For example, parametric trace slicing [48,50] can be seen as a particular slicer for monitor functions that arise from sliceable FO-LTL formulas [48,Section 4]. Thereby, K is the Cartesian product of finite domains for the formulas' variables. The elements of K are thus valuations and the splitter is defined as the restriction of the trace to the values occurring in the valuation. The submonitor M k is a propositional LTL monitor and the joiner simply takes the union of the results (which may be marked with the valuation).
The splitter S as defined above is overly general. A concrete instance of S may determine each event's assignment to slices based on all previous events. In practice, we would like an efficient implementation of S. For example, parametric trace slicing determines the target slice for an event by inspecting events individually (and not as part of the entire prefix). We call a splitter with this property event-separable. Event-separable splitters are desirable because they cater for a parallel implementation of the splitter itself.

Joint data slicer
We now describe an event-separable slicer for informative monitor functions M_ϕ. Our joint data slicer distributes events according to the valuations they induce in the formula. Recall that the output of M_ϕ consists of all valuations that satisfy the formula ϕ at some index. For a given valuation, only a subset of the events is relevant to evaluate the formula. We would like to evaluate ϕ separately for each valuation to determine whether it is satisfied by that valuation, as this would allow us to exclude some events from each slice. However, there are infinitely many valuations in the presence of infinite domains. Therefore, the joint data slicer uses finitely many (possibly overlapping) slices associated with sets of valuations, which taken together cover all possible valuations.
We assume without loss of generality that the bound variables in ϕ are disjoint from the free variables V(ϕ). Given an event e = r(d_1, ..., d_ι(r)), the set matches(ϕ, e) contains all valuations v ∈ V(ϕ) → D for which there is a subformula r(t_1, ..., t_ι(r)) in ϕ where v(t_i) = d_i for all i ∈ {1, ..., ι(r)} with t_i ∈ V(ϕ) ∪ D. Intuitively, v is in matches(ϕ, e) if the event e is possibly relevant for evaluating ϕ over the valuation v.
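The membership test for matches(ϕ, e) can be sketched as follows. The sketch assumes that a valuation matches an event if some atom of ϕ with the same name agrees with the event on every argument that is a constant or a free variable, while bound variables impose no constraint. The encoding (atoms as name/argument pairs, variables as strings, constants as integers) and the names `unify` and `in_matches` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set, Tuple, Union

Term = Union[str, int]                 # str: variable name, int: constant
Atom = Tuple[str, Tuple[Term, ...]]    # atomic subformula r(t_1, ..., t_n)
Event = Tuple[str, Tuple[int, ...]]    # event r(d_1, ..., d_n)


def unify(atom_args: Sequence[Term], free_vars: Set[str],
          event_args: Sequence[int]) -> Optional[Dict[str, int]]:
    """Constraints on the free variables under which the event fits the atom.

    Returns None if no valuation can make the atom agree with the event.
    """
    if len(atom_args) != len(event_args):
        return None
    constraints: Dict[str, int] = {}
    for t, d in zip(atom_args, event_args):
        if isinstance(t, int):
            if t != d:                         # constant must equal the value
                return None
        elif t in free_vars:
            if constraints.setdefault(t, d) != d:
                return None                    # same variable, two values
        # bound variables are unconstrained
    return constraints


def in_matches(atoms: Sequence[Atom], free_vars: Set[str],
               event: Event, valuation: Dict[str, int]) -> bool:
    """Decide whether the valuation is in matches(phi, event)."""
    name, args = event
    for atom_name, atom_args in atoms:
        if atom_name != name:
            continue
        c = unify(atom_args, free_vars, args)
        if c is not None and all(valuation[x] == d for x, d in c.items()):
            return True
    return False


# Atoms of a hypothetical formula with free variables x and y.
ATOMS = [("P", ("x", "y")), ("P", ("y", "x")), ("Q", ("x", "y"))]
FREE = {"x", "y"}
```

With these atoms, the event P(7, 5) matches both the valuation x = 7, y = 5 (through P(x, y)) and the valuation x = 5, y = 7 (through P(y, x)), whereas Q(7, 5) matches only the former.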

Definition 4
Let ϕ be an MFOTL formula and f ∈ (V(ϕ) → D) → P(K) be a mapping from valuations to nonempty sets of slice identifiers. The joint data slicer for ϕ with splitting strategy f is the tuple (K, λk. M_ϕ, S_f, J_f), where

$$S_f(\pi)_k = \bigl(\tau_i,\ \{e \in D_i \mid k \in \hat{S}_f(e)\}\bigr)_{i < |\pi|} \quad \text{for } \pi = (\tau_i, D_i)_{i < |\pi|},$$

$$\hat{S}_f(e) = \bigcup_{v \in \mathit{matches}(\varphi, e)} f(v), \qquad J_f(s) = \bigcup_{k \in K} \bigl(s_k \cap (\{v \mid k \in f(v)\} \times \mathbb{N})\bigr).$$

The splitting strategy f associates valuations to slices (more precisely, slice identifiers). Accordingly, Ŝ_f assigns the event e to all slices k for which there exists v ∈ matches(ϕ, e), i.e., a valuation v for which e may be relevant, with k ∈ f(v). The joiner J_f takes the union of the verdicts from all slices, keeping only those verdicts that the corresponding slice is responsible for. Note that {v | k ∈ f(v)} × N is the set of all verdicts whose valuation is associated with the slice k.
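The splitter and joiner described above can be sketched as follows. The sketch is illustrative only: valuations are encoded as sorted tuples of (variable, value) pairs so that they can be stored in sets, the event-level assignment `slices_of_event` is passed in as a parameter, and the example strategy and the raw verdicts are hypothetical.

```python
def val(**bindings):
    """Hashable encoding of a valuation as sorted (variable, value) pairs."""
    return tuple(sorted(bindings.items()))


def split(prefix, keys, slices_of_event):
    """Event-separable splitter: each event is routed on its own.

    slices_of_event(e) must return the set of slice identifiers k for which
    some valuation v in matches(phi, e) satisfies k in f(v).
    """
    return {k: [(ts, frozenset(e for e in db if k in slices_of_event(e)))
                for ts, db in prefix]
            for k in keys}


def join(verdicts, strategy):
    """Joiner: union of the slices' verdicts, restricted to the valuations
    that the respective slice is responsible for."""
    joined = set()
    for k, s_k in verdicts.items():
        joined |= {(v, i) for (v, i) in s_k if k in strategy(v)}
    return joined


def strategy(v):
    """Hypothetical strategy f: one dedicated slice for x = 5, y = 7."""
    return {1} if v == val(x=5, y=7) else {2}


# Hypothetical raw submonitor outputs; slice 1 reports a verdict for a
# valuation it is not responsible for, which the joiner drops.
raw = {
    1: {(val(x=7, y=5), 0), (val(x=5, y=7), 2)},
    2: {(val(x=7, y=5), 1)},
}
joined = join(raw, strategy)
```

The filter inside `join` corresponds to the intersection with {v | k ∈ f(v)} × N; dropping it would turn the joiner into a plain union.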
The following example demonstrates why the intersection in the definition of J f is needed for some formulas, for example those involving equality. Intuitively, these formulas may be satisfied if and only if certain events are absent. The problem occurs if the input prefix contains these events, but a slice does not.
where a is a constant, and consider a stream ρ with the prefix π = (0, {P(a)}). Obviously, v, 0 ⊭_ρ ϕ for all v. However, the event P(a) will be omitted from each slice k that does not have an associated valuation mapping x to a. (A splitting strategy with such a slice exists whenever |K| ≥ 2.) Hence v[x → a], 0 ⊨_ρ ϕ for all v and all extensions ρ of the slice S_f(π)_k to a stream. The result will be unsound if we do not filter the erroneous satisfying valuations v[x → a] that are necessarily output by the k-th submonitor (due to its completeness).
We now show that M^f_ϕ(π) = J_f(λk. M_ϕ(S_f(π)_k)), the parallelized monitor that uses the joint data slicer, is an informative monitor function, i.e., it is monotone, sound, and complete. As a first step, given a formula ϕ and a set of valuations R, we define the formula's relevant events with respect to R as E_ϕ(R) = {e | matches(ϕ, e) ∩ R ≠ ∅}. The following lemma justifies this name: if we restrict the databases in a stream to (a superset of) the formula's relevant events with respect to R, the satisfying valuations within R remain unchanged.

a set of valuations R, and a set of events E, with E
Proof By structural induction over the formula ϕ, generalizing over v, R, and i. We only show the base cases, which are the most interesting ones, and the step case for ∃. The other step cases all follow easily from the induction hypothesis because the evaluation only depends on the evaluation of the recursive subformulas (covered by the induction hypothesis) and the time-stamps in the streams. Note that the latter are the same in ρ and σ.
The step marked with * is justified as follows. Either ). This in turn implies that r (v(t 1 ), . . . , v(t n )) ∈ E ϕ (R) ⊆ E using the fact that v ∈ R and the lemma's assumption.
The step marked with * is justified using the induction hypothesis for the formula ψ, The relevant events provide an alternative characterization of the joint data slicer's splitter: is an informative monitor function.

Proof
The monotonicity of M^f_ϕ follows directly from M_ϕ's monotonicity. For soundness, fix i, v, and π and assume For completeness, fix i, v, π, and ρ = ( Because ρ ≽ π, we have v, i ⊨_ρ ϕ by our assumption, and

The monitor functions M_ϕ and M^f_ϕ may differ. However, both are informative, i.e., they produce correct verdicts (and eventually all verdicts by completeness) for the formula ϕ. Yet they may output verdicts with different delays. In general, the joint data slicer is only a slicer for M^f_ϕ but not for M_ϕ.
Proof Monotonicity follows from Lemma 1; correctness follows from M f ϕ 's definition.
The joint data slicer is also a slicer for the original monitor function M ϕ , i.e., it produces the same output as the original monitor function, under an additional assumption on M ϕ .
This assumption is satisfied by MonPoly's and DejaVu's concrete monitor functions: the indices at which these monitors output satisfying valuations depend only on the sequence of time-stamps, which slicing does not affect. It follows from Lemma 2 that they are sliceable.

∨ Q(x, y)). We apply the joint data slicer with K = {1, 2} and a splitting strategy f that maps the valuation x = 5, y = 7 to the first slice and all other valuations to the second slice. We obtain the following slices for the prefix π = (11, {P(7, 5)}), (12, {P(5, 1), Q(7, 5)}), (21, {P(5, 7), Q(5, 7)}): The events P(5, 7) and P(7, 5) are duplicated across the slices because both x = 5, y = 7 and x = 7, y = 5 are matching valuations for either event. The joiner is crucial for the slicer's correctness in this example. Because of the subformula P(y, x), the first slice receives the event P(7, 5) but not the event Q(7, 5), which is sent to the second slice instead. This results in the spurious verdict x = 7, y = 5 at index 0, which the joiner's intersection filters out.
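The distribution of events in this example can be replayed with a short Python sketch. It assumes that the formula's atoms are P(x, y), P(y, x), and Q(x, y), which are the atoms mentioned in the example, that all atom arguments are free variables, and that the domain is infinite, so that an atom leaving a variable unconstrained always admits a matching valuation other than x = 5, y = 7. All names in the code are hypothetical.

```python
# Assumed atoms of the example formula; every argument is a free variable.
ATOMS = [("P", ("x", "y")), ("P", ("y", "x")), ("Q", ("x", "y"))]
FREE = ("x", "y")
V0 = {"x": 5, "y": 7}   # the valuation mapped to slice 1


def constraints(atom_args, event_args):
    """Variable bindings forced by matching the atom against the event."""
    c = {}
    for x, d in zip(atom_args, event_args):
        if c.setdefault(x, d) != d:
            return None          # same variable would need two values
    return c


def slices_of_event(event):
    """Slice identifiers that receive the event under the example strategy."""
    name, args = event
    target = set()
    for atom_name, atom_args in ATOMS:
        if atom_name != name or len(atom_args) != len(args):
            continue
        c = constraints(atom_args, args)
        if c is None:
            continue
        if all(V0[x] == d for x, d in c.items()):
            target.add(1)        # V0 itself is a matching valuation
        if len(c) < len(FREE) or c != V0:
            target.add(2)        # some matching valuation differs from V0
    return target


def split(prefix):
    """Slice the prefix, keeping (possibly empty) databases at each index."""
    return {k: [(ts, {e for e in db if k in slices_of_event(e)})
                for ts, db in prefix]
            for k in (1, 2)}


prefix = [
    (11, {("P", (7, 5))}),
    (12, {("P", (5, 1)), ("Q", (7, 5))}),
    (21, {("P", (5, 7)), ("Q", (5, 7))}),
]
slices = split(prefix)
```

Under these assumptions, slice 1 contains P(7, 5) at time-stamp 11, nothing at time-stamp 12, and P(5, 7) and Q(5, 7) at time-stamp 21, while slice 2 contains P(7, 5), then P(5, 1) and Q(7, 5), then P(5, 7). This agrees with the duplication of P(5, 7) and P(7, 5) described in the text.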
The data slicer used in the offline slicing framework [10] is defined for a single free variable x and a collection (S_k)_{k∈K} of subsets of D that cover the domain. This single variable slicer is a special case of our joint data slicer. To see this, define f(v) to be the set of all k with v(x) ∈ S_k. At least one such k must exist because the S_k cover the domain. In contrast, some instances of the joint data slicer cannot be simulated by composing single variable slicers. This limitation affects formulas where the same predicate symbol appears in multiple atoms that each miss at least one free variable to slice on. As a result, single variable slicers are ineffective for some formulas as they add unnecessary data duplication.

Example 3
Consider the formula P(x) ∧  P(y) and the splitting strategy that maps v to the slice (v(x) mod 2, v(y) mod 2) such that there are four slices in total. Any single variable slicer will send each P event to all slices, and this extends to their composition. The joint data slicer sends each event P(d) to exactly three slices, excluding the slice (z, z), where (z mod 2) ≠ (d mod 2). This example generalizes to other splitting strategies as we show in Example 7 in Sect. 5.3.
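The count of three slices can be checked with a small Python sketch of the strategy in this example. The sketch assumes that the formula's atoms are P(x) and P(y) and that data values are integers; a variable that an atom leaves unconstrained may take any value, so every residue is possible for its coordinate. The names are hypothetical.

```python
from itertools import product

VARS = ("x", "y")                        # slice coordinates, in this order
MOD = 2                                  # two residues per variable
ATOMS = [("P", ("x",)), ("P", ("y",))]   # assumed atoms of the formula


def slices_of_event(event):
    """Slices (v(x) mod 2, v(y) mod 2) of all valuations matching the event."""
    name, args = event
    target = set()
    for atom_name, atom_args in ATOMS:
        if atom_name != name or len(atom_args) != len(args):
            continue
        fixed = {}
        if any(fixed.setdefault(x, d) != d for x, d in zip(atom_args, args)):
            continue                     # same variable would need two values
        # A constrained variable fixes its coordinate; an unconstrained one
        # ranges over all residues.
        choices = [[fixed[x] % MOD] if x in fixed else range(MOD)
                   for x in VARS]
        target |= set(product(*choices))
    return target


even = slices_of_event(("P", (4,)))
odd = slices_of_event(("P", (3,)))
```

For an even value d, the event P(d) reaches the slices (0, 0), (0, 1), and (1, 0) but not (1, 1); for an odd value, the slice (0, 0) is the one left out.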
Finally, we revisit the intersection with {v | k ∈ f(v)} × N in the definition of J_f. Examples 1 and 2 demonstrate the need for it in general. A valid question is for which formulas and splitting strategies the intersection can be omitted, i.e., when can we replace J_f with J′(s) = ⋃_{k∈K} s_k? For example, this replacement is necessary when using DejaVu as a submonitor (see Sect. 6). We give a sufficient condition stemming from the following lemma. The lemma ensures that a formula's satisfying valuations on streams restricted to relevant events with respect to a given set of valuations R come from precisely this set of valuations R.

3. no event name occurs twice in ϕ; and If v, i ⊨_ρ ϕ for some i ∈ N, a valuation v, and a stream

Proof (Sketch) By induction on the structure of safe formulas. The base cases are straightforward using the assumptions (2) and (3). Note that safe formulas only allow negation to occur in formulas of the form (¬ϕ) ∧ ψ (i.e., ¬(ϕ ∨ (¬ψ))), (¬ϕ) S_I ψ, and (¬ϕ) U_I ψ with all the free variables of the negated subformula ¬ϕ being contained in the free variables of ψ. This ensures that the satisfying valuations of these formulas are a subset of the satisfying valuations of ψ, allowing for a straightforward use of the induction hypothesis. The case ϕ ∧ ψ (where both subformulas are not negated) requires joining the satisfying valuations of ϕ and ψ. Condition (4) makes sure that this join operation produces a valuation in R.
The safety assumption requires that any negated subformula is guarded by a non-negated subformula, such that ϕ can be monitored using finite relations [14,54]. (Safe formulas are called monitorable in these references.) The safety assumption is standard for monitors operating on finite tables. For instance, the MonPoly monitor only supports safe formulas [14]. In contrast, DejaVu supports unsafe formulas for the past-only non-metric fragment of MFOTL [36]. Observe that condition (2) of Lemma 3 rules out the formula from Example 1. We conclude this section with the main property of J′.
Proof The left-to-right inclusion is obvious. For the right-

Automatic slicing
The joint data slicer is parameterized by a splitting strategy. Ideally, the chosen strategy optimally utilizes the available computing resources: computation costs should be evenly distributed and any overhead kept low. In this section, we present our approach to automatically selecting a suitable strategy. It is inspired by results from database theory and leverages stream statistics to optimize the submonitors' event rates, i.e., the number of events in a time period.
In online monitoring, the monitor's throughput must be high enough to process the incoming events with bounded delay, especially if its buffering capacity is limited. The goal of slicing is to supply the submonitors with substreams that can be monitored more efficiently than the entire event stream. Under the assumption that slicing and the communication to the submonitors do not pose a bottleneck, the parallel monitor will thus achieve a higher throughput than the sequential monitor. Another related benefit is the improved worst-case latency in the presence of bursty event streams, where the events are not distributed evenly in time. Low latency is important in online monitoring to obtain timely verdicts.
The key problem we solve is to find a splitting strategy that achieves the above goal. Ideally, the improvements in throughput and worst-case latency scale with the number of submonitors. To approximate this ideal within our slicing framework, the splitting strategy should minimize the event rates observed by a fixed number of submonitors. This in turn maximizes the parallel monitor's throughput if we make the simplifying assumption that the submonitors' throughput solely depends on their input event rate. Under the same assumption, the submonitors require less memory. We do not optimize the communication cost in this article. However, the number of slices is a parameter that affects the communication cost due to data duplication.

Recap of the hypercube algorithm
Our automatic splitting strategy is based on the observation that the hypercube algorithm [3,32,56], which is used to parallelize relational queries in databases, can be generalized to the online monitoring of MFOTL formulas.
We start by recalling the standard notion of full conjunctive queries [1], which represent a substantially less expressive language than MFOTL. The computational properties of conjunctive queries are well understood. In particular, researchers have devised and analyzed (near-)optimal distributed algorithms for computing conjunctive queries [2,3,19,20,37,38]. Afterwards, we focus on the hypercube algorithm and recall previous results. The terminology we use has been adjusted slightly to match the monitoring setting.
A database instance (or database for short) represents a finite set of events. This coincides with the definition that we previously gave for the stream elements in MFOTL's semantics. In the database context, we also call the names r ∈ E relation names. A relation D(r) in a database D is the set of all events in D with the name r. Its size |D(r)| is the cardinality of the set D(r). The degree of a value d ∈ 𝔻 with respect to an attribute i ∈ {1, ..., ι(r)} of the relation name r is the number of events r(d_1, ..., d_ι(r)) ∈ D(r) with d_i = d.

A query q is a syntactic expression in a given query language. It defines a mapping q(D) from databases D to finite sets of valuations over some finite set of variables V(q) ⊂ V. An atom is an expression r(y_1, ..., y_ι(r)), where r ∈ E, and the variables y_i are elements of V. The image of an atom a = r(y_1, ..., y_ι(r)) under a valuation v is the event v(a) = r(v(y_1), ..., v(y_ι(r))). We write V(a) for the set of variables {y_1, ..., y_ι(r)}. A full conjunctive query q is a finite set of atoms. Such a query maps to valuations that have as their domain the variables occurring in the query's atoms, i.e., $V(q) = \bigcup_{a \in q} V(a)$. The semantics of q is then given by

$q(D) = \{\, v \in V(q) \to \mathbb{D} \mid v(a) \in D \text{ for all } a \in q \,\}.$

Note that we overload q above and refer to it both as a set (denoting the query's syntax) and as a mapping (denoting the query's semantics). In the following, we assume that there is a linear ordering x_1, ..., x_n on the variables V(q).
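To make the semantics of full conjunctive queries concrete, the following is a minimal sketch of a naive evaluator. The encoding of atoms and events as Python tuples, the function name `evaluate`, and the sample three-atom query and database are assumptions made for illustration only. Candidate values are restricted to those occurring in the database, which is sufficient because every variable of a full conjunctive query occurs in some atom.

```python
from itertools import product

def evaluate(query, database):
    """Return all valuations v (as dicts) such that v(a) is in `database` for every atom a.

    query:    list of atoms (name, (variable, ...))
    database: set of events (name, (value, ...))
    """
    variables = sorted({y for _, args in query for y in args})
    # Only values that occur in the database can appear in a satisfying valuation.
    domain = sorted({d for _, values in database for d in values})
    results = []
    for values in product(domain, repeat=len(variables)):
        v = dict(zip(variables, values))
        if all((name, tuple(v[y] for y in args)) in database for name, args in query):
            results.append(v)
    return results

# Illustrative three-atom query and database (assumed, not taken from the text above).
q_T = [("P", ("x1", "x2")), ("Q", ("x2", "x3")), ("R", ("x3", "x1"))]
D = {("P", (0, 1)), ("Q", (1, 7)), ("R", (7, 0)), ("P", (2, 3))}
result = evaluate(q_T, D)
```

The evaluator is exponential in the number of variables and only serves to spell out the definition; it is not how a database engine or a monitor would compute the result.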
The basic hypercube algorithm [3] computes a full conjunctive query q on a distributed, MapReduce-like system [28] with p parallel workers.  3. In the reduce phase, the workers evaluate q locally on the events that they received in the first phase. The query's result is the union of all local results, which may optionally be sent to a centralized worker.
In general, the basic hypercube algorithm duplicates events, namely those matching an atom that does not contain a variable x_i with p_i > 1. The total number of events that each worker receives (and on which it computes q) depends on the input database and the shares. Beame et al. [20] analyze the maximum worst-case load of the workers, given fixed relation sizes and shares. They define the load as the total size of the messages (in bits) received by a worker before the algorithm's reduce phase. Based on the number of workers p, Beame et al. distinguish between skewed and skew-free databases. A database is skewed if it contains heavy hitters, which are values whose degree with respect to some attribute i and relation name r exceeds |D(r)|/p. For skew-free databases, they show that the maximum load generated by the hypercube algorithm is asymptotically bounded (up to a factor polylogarithmic in p) by

$L = \sum_{a \in q} |D(r)| \,/ \prod_{x_i \in V(a)} p_i$

with high probability, where r is the relation name of the atom a. (In fact, they prove the bound for a more general notion of skew-free databases.) Beame et al. [20] also show that the shares can be optimized using linear programming. The input to the optimization is the full conjunctive query and the relation sizes of the database on which the query should be computed. Using a single round of communication and the optimized shares, the hypercube algorithm matches the lower bound for the maximum load that is necessary to compute the query. A single round of communication means that only one communication step is allowed after the initial communication phase. Afterwards, each worker can only perform local computations. The lower bound holds under the assumption that p workers can send arbitrary messages over private channels, have unbounded computational power, and have access to a common source of randomness.
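The skew test and the load bound can be illustrated with a small sketch. It assumes that a relation is given as a list of value tuples and that shares are given per variable; the function names and the sample data are hypothetical and not taken from Beame et al. [20]. The bound is computed as the sum, over all atoms, of the relation size divided by the product of the shares of the atom's variables.

```python
from collections import Counter
from fractions import Fraction

def heavy_hitters(relation, p):
    """Map each attribute index (1-based) to the values whose degree exceeds |D(r)|/p."""
    threshold = Fraction(len(relation), p)
    arity = len(relation[0]) if relation else 0
    result = {}
    for i in range(arity):
        degree = Counter(t[i] for t in relation)
        result[i + 1] = {d for d, n in degree.items() if n > threshold}
    return result

def load_bound(query, sizes, shares):
    """Sum over atoms of |D(r)| divided by the product of the shares of the atom's variables."""
    total = Fraction(0)
    for name, args in query:
        denominator = 1
        for x in set(args):
            denominator *= shares[x]
        total += Fraction(sizes[name], denominator)
    return total

# Assumed sample data: the value 0 occurs in 3 of 4 tuples in the first attribute.
P = [(0, 1), (0, 2), (0, 3), (4, 5)]
hh = heavy_hitters(P, p=2)

q_T = [("P", ("x1", "x2")), ("Q", ("x2", "x3")), ("R", ("x3", "x1"))]
bound = load_bound(q_T, {"P": 1024, "Q": 1024, "R": 1024}, {"x1": 4, "x2": 4, "x3": 4})
```

Exact rational arithmetic is used so that comparisons against the threshold are not affected by rounding.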
However, optimizing the shares with linear programming does not yield integer values in general. As an alternative, Chu et al. [26] propose a simple exhaustive search over all possible integer shares, selecting the shares that minimize L. We present a modified version of their algorithm in Sect. 5.3 (Algorithm 2).

(x_2, x_3), R(x_3, x_1)} on the same database as before, the optimal shares are p_1 = p_2 = p_3 = p^{1/3}. Thus each worker receives approximately 3m/p^{2/3} events. If p is not a cubic number, we must approximate p^{1/3} by a combination of integers. For example, for p = 16, the algorithm by Chu et al. selects p_1 = 4 and p_2 = p_3 = 2 (or a permutation of these numbers).
Next, we show how the events are distributed to the workers for q T . We assume p = 64 (hence p 1 = p 2 = p 3 = 4) and simplify the hash functions to h(x) = x mod 4 for the purpose of this example. The slices are thus identified by three coordinates between 0 and 3, with one coordinate for each variable x 1 , x 2 , and x 3 . The events P (0, 1), Q(1, 7), and R(7, 0) are sent to the worker with coordinates 013. This ensures that the valuation x 1 = 0, x 2 = 1, x 3 = 7 ∈ q T (D) is produced by at least this worker.
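The distribution step of this example can be replayed with a short sketch. It assumes the atoms P(x1, x2), Q(x2, x3), and R(x3, x1), as suggested by the three events above, uses the simplified hash h(x) = x mod 4, and represents a worker by its coordinate triple. The function name `destinations` and the data encoding are illustrative assumptions.

```python
from itertools import product

def destinations(event, query, variables, shares, hashes):
    """Coordinates of all workers that receive `event` in the map phase."""
    name, values = event
    targets = set()
    for atom_name, args in query:
        if atom_name != name or len(args) != len(values):
            continue
        # Bind the atom's variables to the event's values; skip on conflicting bindings.
        binding = {}
        if any(binding.setdefault(y, d) != d for y, d in zip(args, values)):
            continue
        # A bound variable fixes its coordinate; an unbound one ranges over all coordinates.
        axes = [
            [hashes[x](binding[x])] if x in binding else range(shares[x])
            for x in variables
        ]
        targets.update(product(*axes))
    return targets

VARIABLES = ["x1", "x2", "x3"]
q_T = [("P", ("x1", "x2")), ("Q", ("x2", "x3")), ("R", ("x3", "x1"))]
shares = {x: 4 for x in VARIABLES}
hashes = {x: (lambda d: d % 4) for x in VARIABLES}

p_dest = destinations(("P", (0, 1)), q_T, VARIABLES, shares, hashes)
q_dest = destinations(("Q", (1, 7)), q_T, VARIABLES, shares, hashes)
r_dest = destinations(("R", (7, 0)), q_T, VARIABLES, shares, hashes)
common = p_dest & q_dest & r_dest
```

Each of the three events is sent to four workers, because one of the three coordinates is left unconstrained by its atom, and the only worker that receives all three events is the one with coordinates 013.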
When the database is skewed, i.e., it contains heavy hitters, the basic hypercube algorithm sketched above is not optimal: applying a hash function h i with share p i > 1 to a heavy hitter does not distribute the value evenly over the coordinates [ p i ]. Koutris et al. [38] propose an extension, which we simply call the hypercube algorithm, that is worst-case optimal also for skewed databases. Our automatic splitting strategy adjusts this algorithm to the online monitoring setting (Algorithm 1 in Sect. 5.3). Koutris et al. assume that all heavy hitters in the database are known in addition to the relation sizes. In the database setting, as well as in offline monitoring, this is a reasonable assumption since computing statistics is asymptotically dominated by querying.
In the hypercube algorithm, a copy of the basic algorithm is executed in parallel for every subset H ⊆ V (q) of the query's variables. Each copy uses its own set of shares p H ,i and hash functions h H ,i , but with the constraint that p H ,i = 1 if x i ∈ H . A valuation v is heavy in variable x if there exists an atom a = r (y 1 , . . . , y ι(r ) ) ∈ q and i where y i = x and v(x) is a heavy hitter in the attribute i of r . We write heavy(q, v) for the set of variables in which v is heavy. The event e is processed by those instances of the basic hypercube algorithm that are associated with the variable sets heavy(q, v) for which there exists an atom a ∈ q with v(a) = e. For every H , the corresponding shares can be optimized as in the basic hypercube algorithm by considering the residual query q H , which is obtained from q by removing all occurrences of variables in H .

Example 5
Suppose that the database D from Example 4 is skewed. We analyze the optimal shares for the triangle query q_T and instances of the basic hypercube algorithm for the variable subsets H_1 = {}, H_2 = {x_1}, H_3 = {x_1, x_2}, and H_4 = {x_1, x_2, x_3}. If no variable has a heavy hitter (H_1), the shares from Example 4 apply. The remaining variable sets have symmetric solutions. For the algorithm instance H_2, the optimal shares are p_{H_2,1} = 1 and p_{H_2,2} = p_{H_2,3} = p^{1/2}. Each worker then receives at most 1/p^{1/2} of the events for which only x_1 is assigned a heavy hitter. For the algorithm instance H_3, the optimal shares are p_{H_3,1} = p_{H_3,2} = 1 and p_{H_3,3} = p, so at most 1/p of the corresponding events are sent to the workers. Finally, for the algorithm instance H_4, one must broadcast the events to all workers. Note that there can be at most p different heavy hitters per attribute. Therefore, there are at most 3p^2 events to which the set H_4 applies. The overall fraction of events received by each worker is asymptotically equal to the maximum of the three cases, which is O(1/p^{1/2}).

Stream statistics for slicing
To adapt the hypercube algorithm to the monitoring setting, we first generalize the notions of relation size and heavy hitters to event streams. Our automatic splitting strategy is selected based on these statistics. Since streams are unbounded, we consider non-overlapping time intervals of a fixed size Δ. Non-overlapping means that all intervals begin at multiples θ · Δ. We call θ ∈ N the interval's time index. The interval size Δ is a parameter of our model.
The choice of Δ represents a tradeoff. Larger values smooth out irregularities in the stream and thus reduce the variability of the characteristics. The downside is lower precision, which can impact monitoring latency. For example, consider a stream where the events are spaced uniformly and can be monitored without additional latency. In the worst-case input with the same event rate, all events in an interval arrive simultaneously, such that one of the events is delayed by the combined processing time of all events. The larger Δ is, the larger the difference between this maximal latency and the best case.
Recall that an event stream (τ_i, D_i)_{i∈N} is an infinite sequence of time-stamped databases. Given an arbitrary event stream and time index θ, the r-event rate γ_θ(r) is the average number of events with name r ∈ E and a time-stamp in the interval I_θ = [θ · Δ, (θ + 1) · Δ) per time unit, i.e.,

$\gamma_\theta(r) = \frac{1}{\Delta} \sum_{i \,:\, \tau_i \in I_\theta} |D_i(r)|.$

As before, D_i(r) denotes the set of events with the name r in the database D_i. The event rate at time θ is $\gamma_\theta = \sum_{r \in E} \gamma_\theta(r)$, and the relative r-event rate is the ratio γ_θ(r)/γ_θ. For all names r ∈ E and attributes i ∈ {1, ..., ι(r)}, the frequency Let f ∈ (V(ϕ) → D) → P(K) be a splitting strategy as in Definition 4. The load λ_θ(k, f) of the slice identified by k ∈ K is the average rate of events in that slice relative to γ_θ, i.e., The maximum load λ_θ(f) is taken over all slices, λ_θ(f) = max_{k∈K} λ_θ(k, f).
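The event rates can be computed from a finite stream prefix as in the following sketch. It follows the verbal definition above (the number of r-events with a time-stamp in the interval, divided by the interval size) and assumes integer time-stamps and an integer interval size; the function name `event_rates` and the stream encoding are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

def event_rates(stream, delta, theta):
    """Return (per-name rates, overall rate, relative rates) for the interval with index theta.

    stream: iterable of (timestamp, database), where a database is a list of (name, values).
    """
    lo, hi = theta * delta, (theta + 1) * delta
    counts = Counter()
    for timestamp, database in stream:
        if lo <= timestamp < hi:  # the interval is closed on the left and open on the right
            counts.update(name for name, _ in database)
    rates = {r: Fraction(n, delta) for r, n in counts.items()}
    total = sum(rates.values(), Fraction(0))
    relative = {r: g / total for r, g in rates.items()} if total else {}
    return rates, total, relative

# Assumed sample prefix with interval size 10.
stream = [
    (0, [("P", (1,)), ("P", (2,)), ("Q", (1,))]),
    (3, [("Q", (4,))]),
    (10, [("P", (7,))]),
]
rates0, total0, relative0 = event_rates(stream, delta=10, theta=0)
rates1, total1, relative1 = event_rates(stream, delta=10, theta=1)
```

The time-stamp 10 falls into the second interval because the first interval excludes its right endpoint.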
We consider the problem of finding a splitting strategy that minimizes the maximum load for all event streams with given relative r-event rates, heavy hitters, and number of submonitors. Since these rates and the load are relative to the overall event rate γ_θ, we thus maximize the throughput of the parallelized monitor and the utilization of the submonitors. We do not aim at optimal splitting strategies for arbitrary MFOTL formulas. Instead, we are interested in heuristics providing strategies that are effective in practice. Moreover, we restrict our discussion to event streams with constant relative r-event rates and heavy hitters (constant with respect to θ). Equivalently, the choice of the splitting strategy applies to a single interval of size Δ. We therefore omit the index θ and write γ(r), λ(f), and so forth. We have started to address time-varying statistics in a separate work [52].

Slicing using the hypercube algorithm
We instantiate our joint data slicer (Sect. 4.2) with a strategy that is derived from the hypercube algorithm for database queries (Sect. 5.1). Observe that monitoring an MFOTL formula without any temporal operators corresponds to evaluating a database query for each index in the event stream. In this case, the subproblem of computing the satisfying valuations at any given index on parallel workers (i.e., submonitors) is solved by the hypercube algorithm. We show below that the mapping phase of the algorithm can be rephrased as a splitting strategy for the joint data slicer. Since we have established this slicer's correctness for all MFOTL formulas, we can thus apply the hypercube approach to temporal formulas and event streams, too.
Recall that several copies of the basic hypercube algorithm are executed in its heavy hitter-aware extension. Each copy sends the event e to a set T(e) of workers (Equation (1) in Sect. 5.1), which depends on the query q. We will now run all copies in parallel on a single set K of workers. For every variable subset H ⊆ V(q), we assume a bijection ξ_H : [p_{H,1}] × · · · × [p_{H,n}] → K. We can then describe the mapping phase of the extended algorithm by the single equation such that e is sent to the workers in T(e). Note that the right-hand side of the equation has the same structure as the one for the joint data slicer Ŝ_f(e) in Definition 4 once we replace matches(q, e) with matches(ϕ, e). Both these sets contain the valuations for which the event e is potentially relevant, i.e., for which the containment in the query result and the satisfaction of the formula ϕ, respectively, may depend on e.
To complete the transition from queries to MFOTL formulas, we determine the equivalent of heavy(q, v) for ϕ. Recall that the set heavy(q, v) contains a variable x if the image of an atom in the query q under v contains a heavy hitter in the corresponding relation. The variable is treated differently because it might not be possible to distribute the relation evenly by hashing the variable. We see that heavy(q, v) depends on the heavy hitters in all events e with v ∈ matches(q, e). Let heavy var (ϕ, x) be the union of all H(r , i) for which there is a subformula r (y 1 , . . . , y ι(r ) ) in ϕ with y i = x. We then define heavy(ϕ, v) as {x | v(x) ∈ heavy var (ϕ, x)}.
The following set is nonempty and thus a valid splitting strategy (see Definition 4): We call f the hypercube strategy for ϕ given h_{H,j}, ξ_H, and H.

Algorithm 1 outputs the slice identifiers to which the joint data slicer (Sect. 4.3) sends a given event according to the hypercube strategy f. We write for the partial map that is undefined everywhere and codom(h) for h's codomain. Concretely, Algorithm 1 computes the union of f(v′) for all v′ matching the event. To this end, it computes a partial valuation v for each of the formula's predicates by matching the event with the predicate. The valuation v assigns values to those variables that occur in the predicate. The algorithm subsequently iterates over all full valuations v′ (which assign values to all free variables) that extend v. This is done in two steps because the set of these valuations may be infinite. First, the algorithm iterates over all H = heavy(ϕ, v′), of which there are finitely many. We skip those sets H that contain a variable x with heavy_var(ϕ, x) = {} because there is no valuation v′ where H = heavy(ϕ, v′). Second, for each H, the algorithm constructs the finite set of coordinates (h_{H,1}(v′(x_1)), ..., h_{H,n}(v′(x_n))) directly by enumerating the codomain of h_{H,j} if x_j is not assigned by v.
What remains is to choose the hash functions h H , j and the mappings ξ H . As with databases, we select h H , j uniformly at random with a given codomain [ p H , j ]. The shares p H , j thus parametrize a randomized family of splitting strategies. We select the hash functions anew for every run of the parallel monitor, such that they are independent of the input trace. The mappings ξ H can be arbitrary; in practice, we map coordinates to slice identifiers in [ p]:

Example 7
Assume that there are no heavy hitters in the event stream and that p = q^2 for some q ∈ N. Let ϕ = P(x 1 ) ∧  P(x 2 ) with shares p_1 = p_2 = q. We conceptually arrange the slices in a square with side length q. Each P event is assigned to one coordinate in the square's first dimension by the first atom, and to another coordinate in the second dimension by the second atom. Each coordinate is associated with q slices, and there is a single slice that agrees on both coordinates. Therefore, 2q − 1 slices receive the event. The load is approximately λ = (2q − 1)/q^2. The average event rate per slice is lower than the event rate of the input stream if λ < 1, i.e., q ≥ 2. This improves over any combination of single variable slicers (see Sect. 4.3).
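The count of 2q − 1 slices and the resulting load can be checked with a small sketch. The slices are represented as coordinate pairs in a q × q square; the two hash functions passed to `slices_for_event` are placeholders (here a simple modulus), and all names are assumptions made for illustration.

```python
from fractions import Fraction

def slices_for_event(d, q, h1, h2):
    """Slices (as coordinate pairs) that receive the event P(d) in a q-by-q arrangement."""
    by_first_atom = {(h1(d), c) for c in range(q)}   # x1 is bound, x2 is unconstrained
    by_second_atom = {(c, h2(d)) for c in range(q)}  # x2 is bound, x1 is unconstrained
    return by_first_atom | by_second_atom

def load(q):
    """Approximate fraction of the input events that a single slice receives."""
    return Fraction(2 * q - 1, q * q)

q = 4
targets = slices_for_event(6, q, lambda d: d % q, lambda d: d % q)
```

With q = 4, the event is sent to 7 of the 16 slices, which corresponds to a load of 7/16.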

Example 8
We extend the triangle query q T and the database from Example 5 to the formula ϕ = (( [0,10] 10] R(x 3 , x 1 ) and some event stream with γ (P) = γ (Q) = γ (R) = m, having H(P, 1) = {0} as the only heavy hitter. We can reuse the optimal shares from Example 5 because q T and ϕ consist of the same atoms, and the stream statistics correspond to the database statistics. Let p = 64. We simplify the hash functions to the modulus (e.g., h {x 1 },2 (x) = x mod 8, since p {x 1 },2 = p 1/2 = 8). Before applying the mappings ξ H , we obtain the following assignment of events to coordinate vectors (h H ,1 (v(x 1 )), h H ,2 (v(x 2 )), h H ,3 (v(x 3 ))). Note that there are no coordinates for all other H , since heavy var (ϕ, x) is nonempty only for x = x 1 . If these events are within 10 time units of each other, the valuation x 1 = 0, x 2 = 1, x 3 = 7 will be recognized successfully as satisfying: the events P (0, 1), Q(1, 7), and R(7, 0) are all part of the slice with the identifier ξ {x 1 } (017) = 57.
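The slice computation of this example can be replayed with a simplified sketch. It is not the paper's Algorithm 1: it assumes that all predicate arguments are free variables, that hash functions are the modulus, and that ξ_H encodes a coordinate vector with the first coordinate as the least significant digit. The last choice is an assumption that happens to reproduce the identifier 57 mentioned above. All function names are hypothetical.

```python
from itertools import combinations, product

def subsets(variables):
    for k in range(len(variables) + 1):
        for combo in combinations(variables, k):
            yield frozenset(combo)

def xi(coords, shares):
    """Assumed encoding: mixed-radix number with the first coordinate least significant."""
    ident, weight = 0, 1
    for c, p in zip(coords, shares):
        ident += c * weight
        weight *= p
    return ident

def target_slices(event, atoms, variables, heavy_var, shares, hash_fn):
    name, values = event
    targets = set()
    for atom_name, args in atoms:
        if atom_name != name or len(args) != len(values):
            continue
        v = {}  # partial valuation obtained by matching the event with the atom
        if any(v.setdefault(y, d) != d for y, d in zip(args, values)):
            continue
        for H in subsets(variables):
            # H must be the heavy set of some full valuation that extends v.
            consistent = all(
                ((v[x] in heavy_var[x]) == (x in H)) if x in v
                else (x not in H or bool(heavy_var[x]))
                for x in variables
            )
            if not consistent:
                continue
            p_H = [shares[H][x] for x in variables]
            axes = [[hash_fn(v[x], p)] if x in v else range(p)
                    for x, p in zip(variables, p_H)]
            targets.update(xi(c, p_H) for c in product(*axes))
    return targets

VARS = ["x1", "x2", "x3"]
ATOMS = [("P", ("x1", "x2")), ("Q", ("x2", "x3")), ("R", ("x3", "x1"))]
HEAVY = {"x1": {0}, "x2": set(), "x3": set()}
SHARES = {frozenset(): {"x1": 4, "x2": 4, "x3": 4},
          frozenset({"x1"}): {"x1": 1, "x2": 8, "x3": 8}}
mod = lambda d, p: d % p

s_P = target_slices(("P", (0, 1)), ATOMS, VARS, HEAVY, SHARES, mod)
s_Q = target_slices(("Q", (1, 7)), ATOMS, VARS, HEAVY, SHARES, mod)
s_R = target_slices(("R", (7, 0)), ATOMS, VARS, HEAVY, SHARES, mod)
```

Under these assumptions, the Q event is sent to slices for both H = {} and H = {x1}, because its atom does not bind x1 and the slicer cannot tell whether a matching valuation assigns the heavy hitter to x1.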
We apply an additional optimization to the hash functions. The shares for two variable subsets H_1 ≠ H_2 may be equal and hence there is no need to distinguish them. This occurs if the variables in the symmetric difference of H_1 and H_2 receive a share of 1. If we choose the hash functions independently, however, there is a large probability that the slice sets computed with H_1 and H_2 differ for a given event. We reduce this unnecessary event duplication by using the same hash functions for H_1 and H_2, as shown in Example 9 below.

[26] to optimize the shares. The transfer is based on the following observation: applying our hypercube strategy algorithm to an interval of an event stream incurs the same load as using the hypercube algorithm (Sect. 5.1) on the database constructed from that interval. This database is the (multiset) union of all databases in the stream that belong to the interval. Therefore, relation sizes correspond to r-event rates γ(r). We overapproximate the load by summarizing the partial loads induced for each choice of the variable set H. We further simplify the analysis by using the rate γ(r) for each H, even though only a subset of the r-events may be sliced according to this H. Let r(y_1, ..., y_ι(r)) ≤ ϕ denote the fact that r(y_1, ..., y_ι(r)) is a subformula of ϕ. The maximum load λ is bounded from above by

Input: ϕ with free variables x_1, ..., x_n; number of submonitors p, relative event rates (γ(r))_r
Output: parameters (h_{H,i})

(p_{H,1}, ..., p_{H,n}), where a share vector is called valid if $\prod_{1 \le i \le n} p_{H,i} \le p$, and p_{H,i} = 1 for all x_i ∈ H. Note that we allow the shares' product to be smaller than p, which may be beneficial if p cannot be factorized optimally [26]. The maximal number of submonitors p is the input to the optimization, together with the relative r-event rates γ(r).
We choose the share vector with the smallest value for

$\mathrm{Cost}(p_H) = \sum_{r(y_1,\dots,y_{\iota(r)}) \in \varphi} \frac{\gamma(r)}{\prod_{x_i \in \{y_1,\dots,y_{\iota(r)}\} \cap V} p_{H,i}},$

thereby minimizing the bound on λ. We adopt a heuristic by Chu et al. [26] and break ties by choosing the vector with smallest maximum share max_i p_{H,i}. This favors a more even distribution of shares to increase resilience against heavy hitters that are not accounted for in the statistics provided. Once the shares have been computed, Algorithm 2 samples random hash functions RandomHash(q) with codomain [q]. It implements the optimization mentioned above, where the hash functions with the same codomain are reused.
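The share selection for a single variable set H can be sketched as an exhaustive search in the spirit of Chu et al. [26]. This is a simplified illustration rather than the paper's Algorithm 2: it assumes that every predicate argument is a free variable, reads the cost as the sum over atoms of γ(r) divided by the product of the shares of the atom's variables, and omits the sampling and reuse of hash functions. The function names are hypothetical.

```python
from fractions import Fraction

def share_vectors(variables, H, p):
    """Enumerate all integer share vectors with product at most p and share 1 for variables in H."""
    def go(i, budget):
        if i == len(variables):
            yield ()
            return
        top = 1 if variables[i] in H else budget
        for s in range(1, top + 1):
            for rest in go(i + 1, budget // s):
                yield (s,) + rest
    return go(0, p)

def cost(atoms, variables, rates, vector):
    share = dict(zip(variables, vector))
    total = Fraction(0)
    for name, args in atoms:
        denominator = 1
        for x in set(args):
            denominator *= share[x]
        total += Fraction(rates[name]) / denominator
    return total

def optimize_shares(atoms, variables, rates, p, H=frozenset()):
    """Pick the valid vector with minimal cost; break ties by the smallest maximum share."""
    return min(
        share_vectors(variables, H, p),
        key=lambda vec: (cost(atoms, variables, rates, vec), max(vec)),
    )

VARS = ["x1", "x2", "x3"]
ATOMS = [("P", ("x1", "x2")), ("Q", ("x2", "x3")), ("R", ("x3", "x1"))]
RATES = {"P": 1, "Q": 1, "R": 1}

best_64 = optimize_shares(ATOMS, VARS, RATES, 64)
best_16 = optimize_shares(ATOMS, VARS, RATES, 16)
best_64_heavy_x1 = optimize_shares(ATOMS, VARS, RATES, 64, H=frozenset({"x1"}))
```

For the triangle-shaped set of atoms with equal rates, the search returns equal shares of 4 for p = 64, a permutation of 4, 2, 2 for p = 16, and shares 1, 8, 8 when x1 is treated as heavy, which agrees with the shares discussed in the examples of this section.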

Discussion
Algorithm 1, which computes the hypercube strategy, iterates over all combinations of the formula's predicates with the subsets of its free variables. For each combination, it enumerates up to p slice identifiers. Therefore, Algorithm 1's complexity is bounded by O(|ϕ| · 2^n · n · p), where |ϕ| is the size of the formula ϕ and n is the number of free variables in ϕ. We assume n, p ≥ 1 and that all operations that involve D and slice identifiers in [p] are computed in O(1) time, including the hash functions. The linear factor p is unavoidable: events may need to be broadcast to all p slices, e.g., if their arity is zero. The exponential complexity in n stems from the generic treatment of heavy hitters.
A possible optimization is to enumerate only subsets of those variables x_i which have a share p_{H,i} > 1 for some H. This does not decrease the complexity for all formulas though. By bounding the number of possible share combinations with product q from above by n^{log_2 q}, we find that Algorithm 2's complexity is in O(|ϕ| · (4^n · n + 2^n · p · n^{log_2 p})). The 4^n factor can be improved to 2^n by avoiding the innermost loop in line 8 and by iterating over the list of p_H in lexicographic order instead (lines 5-10). We omit this optimization for clarity. Note that Algorithm 2 runs only once when the monitor is initialized, whereas Algorithm 1 is invoked for every event.
The minimum possible load achieved using the hypercube strategy depends on the pattern of free variables in the formula's atoms. A detailed discussion is provided by Koutris et al. [38]. The ideal case is a formula in which all atoms with a significant event count share a variable, together with a stream that never assigns a heavy hitter to that variable. Then the load per slice is 1/p. Atoms with missing variables, and equivalently variables with heavy hitters, increase the fraction to 1/p^z for some exponent z < 1.
The (worst-case) optimality of the hypercube algorithm for conjunctive queries does not extend to full MFOTL. This already becomes evident for simple non-temporal formulas with disjunctions, such as P(x_1, x_2) ∧ (Q(x_1) ∨ Q(x_2)). If γ(P) = γ(Q) and in the absence of heavy hitters, our approach will have load (√p + 1/2)/p ≈ 1/√p with p submonitors. However, the formula is equivalent to (P(x_1, x_2) ∧ Q(x_1)) ∨ (P(x_1, x_2) ∧ Q(x_2)), and thus we can process each disjunct independently. By using the optimal hypercube strategy for each disjunct (with shares p_1 = p and p_2 = p, respectively), we would obtain a total load of 2/p, which is asymptotically better. The load can be further improved to 3/(2p) by using the same hash function for x_1 in the first and x_2 in the second disjunct, such that the Q events are not duplicated.
Overall, it is unclear how this technique can be generalized to MFOTL formulas with arbitrarily nested temporal operators. In general, optimality for arbitrary formulas is out of reach because it would require us to decide MFOTL: if the formula is contradictory, the best possible slicer simply drops all events. We therefore settle for a more pragmatic solution and only focus on syntactic aspects of the formulas' structure.
We assumed that the submonitors' throughput does not depend on the events. It was therefore sufficient to minimize the load to optimize the throughput. This simplification is not always appropriate for monitors like MonPoly. The reason is that MonPoly constructs intermediate results, whose size depends on the monitor's input and which affects the complexity of further operations inside the monitor. It might be possible to achieve even higher throughput by taking the events' distribution and its impact on the monitoring performance into account. We leave such optimizations for future work.
In contrast to offline monitoring, stream statistics (such as γ(r) and H(r, i)) cannot be obtained for the entire stream. Still, our approach assumes that these statistics are already available before the start of monitoring. In practice, this is a reasonable assumption since organizations have access to historical data that can serve as a good source of representative stream statistics before starting online monitoring.
Moreover, the statistics may change over time. In this case, one must obtain stream statistics during monitoring. This can be done using approximate algorithms [27], which have minimal impact on the monitor's performance. Furthermore, a reasonable extension of the slicing framework is to adaptively modify the splitting strategy whenever the statistics change significantly. Thus, the monitor could start with a default strategy and refine it as more data is processed. (Event-separable slicers as defined in Sect. 4.2 cannot be adaptive because they must behave uniformly on the event stream.) We have already made first steps towards computing stream statistics online [30] and performing adaptive slicing [52].
Our approach affects only the event rate, but not the index rate, which is the number of databases per unit of time. The index rate impacts the performance of monitors such as MonPoly because each database triggers an update step. For a syntactic fragment of MFOTL, MonPoly reduces the number of update steps by skipping empty databases [10]. In this case, we could already filter empty databases in the splitter.

Implementation
We implemented a parallel online monitoring framework based on the joint data slicer and built on top of the Apache Flink stream processing framework. The source code consists of roughly 3100 lines of Java and Scala and is publicly available [53]. Given a formula, our framework instantiates a parallel online monitor, which then reads events from a TCP socket or a text file, monitors the events in parallel, and writes all satisfying valuations to an output socket or file. The parallel monitor delegates the monitoring of individual slices to external tools, called submonitors. Our implementation supports the tools MonPoly [15] and DejaVu [36] as submonitors.
To instantiate a parallel online monitor, our framework uses the Flink API to construct a dataflow graph, whose nodes are stream operators. These operators retrieve data streams from external sources, apply processing functions to stream elements, and output the elements to sinks. Operators can execute in parallel. Stream elements can be partitioned according to user-specified keys. At runtime, Flink deploys the graph to a distributed computing cluster. We chose Flink for its low latency stream processing and its support for fault-tolerant computing. Fault tolerance is ensured using a distributed checkpointing mechanism [24]. The system recovers from failures by restarting from regularly created checkpoints. Operators must therefore expose their state to the framework to enable checkpointing.
The inputs to our monitoring framework are the formula, the number and type of parallel submonitors, the stream statistics for the shares' optimization, and the heavy hitter values. The framework precomputes the shares using Algorithm 2 and creates a parallel monitor instance as the dataflow graph shown in Fig. 4, where each node is labeled with a Flink operator (e.g., flatMap) and a description of its functionality.
During the dataflow's execution, the input events are read, line by line, as strings. We support both MonPoly's and DejaVu's input formats, as well as the CSV format used in the RV competition [7]. The parser then converts the input lines into an internal datatype that stores the event name and the list of data values. The parser's results are flattened into a stream of single events because a single line in MonPoly's format may describe several events at once.
After parsing, the splitter computes the set of target slices for each event. To do so, it executes Algorithm 1 using the optimized shares, precomputed by the framework, and heavy hitter sets as well as the heavy hitter values. For each event and each of its target slices, a copy of the event is sent to the next operator along with the target slice identifier. Then, the stream is partitioned into slices based on the slice identifiers and the slices are sent to the parallel submonitors. We use the custom externalProcess operator in each parallel flow. This operator is responsible for initiating and interacting with an external process, in our case MonPoly or DejaVu. The operator prints, in MonPoly or DejaVu format, one database at a time to the standard input of the external process. (For DejaVu, which expects exactly one event at a time, empty databases are encoded as an event with a name that does not occur in the formula.) The operator simultaneously reads verdicts from the standard output of the process and applies the intersection from J f 's definition (Definition 4), thereby filtering the monitor's output. Finally, all remaining verdicts are combined into a single stream, which is written to an output socket or file.
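The output filtering performed after each submonitor can be pictured with a minimal sketch. It is plain Python rather than the framework's Flink code, and it assumes that a verdict is a pair of a valuation and an index and that the splitting strategy is available as a function from valuations to sets of slice identifiers. The names `filter_verdicts` and `f` and the sample strategy are hypothetical.

```python
def filter_verdicts(slice_id, verdicts, f):
    """Keep the verdicts (v, i) reported for slice `slice_id` whose valuation v is assigned to it."""
    return [(v, i) for v, i in verdicts if slice_id in f(v)]

# Hypothetical strategy: slice a single variable x over 4 slices by its remainder.
def f(v):
    return {v["x"] % 4}

# Assumed raw output of the submonitor that processes slice 1.
raw_output = [({"x": 5}, 0), ({"x": 2}, 3), ({"x": 9}, 3)]
kept = filter_verdicts(1, raw_output, f)
```

In this sketch, the verdict for x = 2 is dropped because the strategy assigns that valuation to slice 2, so slice 1 is not responsible for reporting it.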
The above communication with the external process is asynchronous with respect to the Flink pipeline, which prevents these operations from blocking other operators. Flink's AsyncWaitOperator supports asynchronous requests to external processes, but it does not manage their state. To optionally provide fault-tolerance, we must checkpoint the submonitors' states because they summarize the events seen so far. Our implementation of the externalProcess operator extends the AsyncWaitOperator with an interface to retrieve and restore an external state.
We have extended MonPoly with control commands that implement the interface for retrieving and restoring an external state. Whenever Flink instructs the externalProcess operator to create a checkpoint, the operator first waits until all prior events have been processed. Then, the command for saving the state is sent to the external process. In response, MonPoly writes its state to a temporary file. The part of the monitor's output received after the checkpoint instruction's arrival at the externalProcess operator is also included in the checkpoint. This ensures that no output is lost when other operators create their own checkpoint concurrently. We did not implement a state interface for DejaVu, since we opted to use DejaVu in a black-box manner to demonstrate our framework's generality. Therefore, our parallel monitor is currently not fault-tolerant if DejaVu is used as a submonitor. We conjecture that implementing the state interface in DejaVu is possible with modest effort.
DejaVu monitors closed formulas only and reports violating instead of satisfying valuations. Therefore, when using DejaVu, our framework first closes the input formula ϕ by adding a prefix of existential quantifiers. Then it negates the closed formula before passing it to the parallel monitor. Thus it ensures that DejaVu's output is consistent with MonPoly's output whenever they are used as submonitors within our framework. The splitter uses the original formula ϕ because it is only effective if there are free variables. As the output of DejaVu consists only of the violating indices for the closed and negated formula, we cannot compute the intersection from J f 's definition with ϕ's valuations. Hence, we must use the simplified joiner J , which is correct under the assumptions of Theorem 3. This limits the applicability of our approach using DejaVu to monitor certain formulas, and we cannot account for heavy hitters because otherwise the hypercube strategy would not satisfy condition (4) of Lemma 3.
The parts of the dataflow preceding the submonitors currently operate sequentially. This is a bottleneck that limits scalability, since all input events must be processed sequentially by the splitter. Despite this limitation of our implementation, the splitter and the surrounding operators could be parallelized too: Our splitter processes events separately because it implements the event-separable joint data slicer (Sect. 4.2). A parallel splitter would be particularly effective if the event source itself is distributed. However, we must ensure that events arrive at the submonitors in chronological order. This order is no longer guaranteed if the splitter is partitioned into concurrent tasks. In a separate line of work [12], we propose a possible solution that buffers and reorders events before forwarding them to each submonitor.
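To illustrate what buffering and reordering could look like, the following sketch merges events from several concurrent splitter tasks back into chronological order. It is not the solution from [12]; it is a generic watermark-style buffer, and it assumes that every task emits its own events in non-decreasing time-stamp order. The class name `ReorderBuffer` and its interface are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Releases events in time-stamp order once no task can still precede them."""

    def __init__(self, num_sources):
        # Largest time-stamp seen so far from each source (splitter task).
        self.watermarks = [float("-inf")] * num_sources
        self.heap = []
        self.counter = 0  # tie-breaker so that events themselves are never compared

    def push(self, source, timestamp, event):
        heapq.heappush(self.heap, (timestamp, self.counter, event))
        self.counter += 1
        self.watermarks[source] = max(self.watermarks[source], timestamp)
        return self._release()

    def _release(self):
        # Every source has progressed at least to `safe`, so nothing
        # with a smaller time-stamp can still arrive.
        safe = min(self.watermarks)
        released = []
        while self.heap and self.heap[0][0] <= safe:
            ts, _, event = heapq.heappop(self.heap)
            released.append((ts, event))
        return released
```

The price of this design is added latency: an event is held back until the slowest task has caught up to its time-stamp.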

Evaluation
We structure our evaluation to answer the following research questions, which assess the scalability, practicality, overhead, and generality of our framework. The scalability (RQ frm and RQ rate ) of our framework is its ability to handle growing event rates by using more submonitors. This includes the framework's ability to leverage its knowledge about the event stream to further improve monitoring performance (RQ stats and RQ skew ). The framework is practical (RQ real ) if it can be used in a real-world setting, i.e., to scalably monitor a real event stream. The overhead of the framework is the fraction of its time and memory usage that is not spent on running the submonitors (RQ oh and RQ ft ). Finally, the framework's generality is its ability to be used with different first-order (sub)monitors (RQ gen ).
To answer the above questions, we organize our evaluation into two families of experiments, each monitoring a different type of input stream, either synthetic or real-world. The synthetic streams are used to analyze the effects of individual parameters, such as the event rate, whereas the real-world streams attest to our framework's ability to scalably solve realistic problems. Figure 5 summarizes the parameters used for each experiment, which we explain next.

Synthetic Experiments.
In the experiments with synthetic streams (Fig. 5), we monitor the three formulas star, linear, and triangle and their past-only, non-metric variants star-past, linear-past, and triangle-past (Fig. 6). Different occurrence patterns of free variables in the formulas are used to test RQ frm . The formulas cover common patterns in database queries [20], which we additionally extend with temporal operators.
We focus on variable occurrence patterns over other formula features (e.g., formula size) since they affect our framework directly, rather than just the submonitors.
We have implemented a stream generator tailored to each of the three formulas. The generator takes a random seed and synthesizes streams with configurable characteristics. Specifically, the synthesized streams on average have constant characteristics across all time indices θ. The streams contain binary events labeled with P, Q, or R and have configurable event rates and index rates. This setup allows us to test RQ rate . Figure 5 summarizes the event rates used in our experiments. Note that we evaluate only those combinations of event rates and number of submonitors that do not take too long to execute. Specifically, we limit individual monitoring runs to 5 minutes of total execution time. For example, in the Synthetic MonPoly experiments, we monitor the star formula with the standalone MonPoly instance on streams with event rates up to 20,000 (denoted as 20k in Fig. 5).
To test RQ stats and RQ skew , the generator can also synthesize streams with configurable relative event rates (γ θ (P), γ θ (Q), γ θ (R)) and force some event attribute values to be heavy hitters. Attribute values are sampled from two possible types of distributions. Non-heavy hitter values are selected uniformly at random from the set $\{0, 1, \dots, 10^9 - 1\}$; heavy hitter values are drawn from a Zipf distribution. The Zipf distribution's probability mass function is $p(x) = x^{-z} / \sum_{n=1}^{10^9} n^{-z}$ for $x \in \{1, 2, \dots, 10^9\}$, i.e., the larger the exponent $z > 0$ is, the fewer values have a large relative frequency. To prevent excessive monitor output, all Zipf-distributed values of R events are increased by $10^6$. The distribution type (uniform or Zipf) and the exponent $z$ are defined per variable $x$ (the exponent is thus denoted $z_x$) and can be supplied as inputs to the generator.
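The two value distributions can be sketched as follows. This is an illustrative sampler, not the generator used in the experiments: the function names are hypothetical, and the Zipf sampler precomputes a cumulative table, so the sketch is meant to be run with a support size n much smaller than the 10^9 stated in the text.

```python
import bisect
import itertools
import random

DOMAIN_SIZE = 10**9  # size of the uniform domain, as stated in the text
R_OFFSET = 10**6     # shift applied to Zipf-distributed values of R events


def make_zipf_sampler(z, n, rng):
    """Return a sampler for p(x) = x^-z / sum_{k=1..n} k^-z on {1, ..., n}."""
    cumulative = list(itertools.accumulate(x ** -z for x in range(1, n + 1)))
    total = cumulative[-1]

    def sample():
        u = rng.random() * total
        # Index of the first cumulative weight strictly greater than u.
        idx = bisect.bisect_right(cumulative, u)
        return min(idx, n - 1) + 1  # guard against floating-point rounding

    return sample


def sample_uniform(rng):
    """Non-heavy hitter value, uniform on {0, ..., 10^9 - 1}."""
    return rng.randrange(DOMAIN_SIZE)


rng = random.Random(42)            # fixed seed for reproducibility
zipf = make_zipf_sampler(2.0, 1000, rng)
r_value = zipf() + R_OFFSET        # a Zipf-distributed value for an R event
```

With exponent z = 2, the value 1 alone receives roughly 60% of the probability mass, which shows how strongly a single heavy hitter can dominate a stream.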
All synthetic streams in our experiments are generated with relative event rates γ θ (P) = 0.01 and γ θ (Q) = γ θ (R) = 0.495 and with attribute values sampled uniformly at random. In the Synthetic heavy hitters experiments (Fig. 5), we also generate streams with heavy hitter values in valuations of variable a in the star formula and variable b in the linear and triangle formulas, with their Zipf exponents set to 2.

Real-world Experiments.
To test RQ real , we use logs from Nokia's Data Collection Campaign [13]. The campaign collected data from the mobile phones of 180 participants and propagated the data between three databases, db1, db2, and db3. The phones uploaded the data directly to db1, then a synchronization script script1 periodically copied the data from db1 to db2. Next, db2's triggers anonymized and copied the data to db3. The participants could query and delete their own data from db1. Deletions were propagated to all databases.
To obtain streams suitable for online monitoring, we have developed a tool (called replayer) that replays log events and simulates the event rate at the log creation time, which is captured by the events' time-stamps. The tool can also replay the log proportionally faster than its event rate, which is useful for evaluating the monitor's performance while retaining the log's other characteristics. Since the log from the campaign spans a year, to evaluate our tool in a reasonable amount of time, we pick a one-day fragment with a high average event rate from the log, starting at time-stamp 1,282,921,200. We use the replayer to accelerate the fragment up to 5000 times. The fragment contains roughly 9.5 million events with an average event rate of 110 events per second. Using the acceleration, we have subjected our tool to streams of over half a million events per second. The logs used [57] and the scripts that synthesize and replay streams [53] are publicly available.
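The effect of the acceleration can be made concrete with a short calculation. The sketch below is not the replayer itself; the function names are hypothetical and the example time-stamps are made up, apart from the fragment's starting time-stamp.

```python
def replay_schedule(timestamps, acceleration):
    """Wall-clock offsets (in seconds) at which events would be emitted."""
    start = timestamps[0]
    return [(ts - start) / acceleration for ts in timestamps]


def accelerated_rate(avg_rate, acceleration):
    """Average event rate observed by the monitor under acceleration."""
    return avg_rate * acceleration


# Events 5 s and 5000 s after the fragment's start, replayed 5000 times faster.
offsets = replay_schedule([1282921200, 1282921205, 1282926200], 5000)

# 110 events per second accelerated 5000 times.
rate = accelerated_rate(110, 5000)  # 550000 events per second
```

The product 110 × 5000 = 550,000 is consistent with the stated rate of over half a million events per second.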
We monitor the formulas insert, delete, and custom (Fig. 6). The formulas insert and delete come from Nokia's Data Collection Campaign, where they proved to be challenging to monitor. Specifically, the two formulas are the negated versions of the ins-1-2 and del-1-2 formulas from Basin et al.'s formalization [13], which require a large amount of memory when monitored by a single MonPoly instance. We used our knowledge of the data set also to craft the past-only, non-metric custom formula with an expensive temporal join involving the (very frequently occurring) insert event.
Since we monitor only a one-day fragment of the Nokia log, we must initialize our monitor with the appropriate state to obtain the correct output. Therefore, we monitor each formula once on the part of the log preceding the chosen fragment and spanning an appropriate amount of time as defined by each formula's temporal reach. We store the monitor's state obtained at the end of the preceding fragment and initialize the monitor with the stored state in the experiments.
We have additionally computed the relative event rates for all events, and identified all heavy hitter values in the one day fragment of the Nokia log. We run our framework both with and without this information to answer RQ stats and RQ skew .

Monitors.
To test RQ oh and RQ gen , we use MonPoly and DejaVu as parallel submonitors within our framework, and also as standalone monitors for comparison. To accommodate DejaVu, which implements a slightly different monitor function than MonPoly, we need to adapt the parameters of our two families of experiments (see the Synthetic DejaVu and Nokia DejaVu experiments in Fig. 5). First, we use the formulas star-past, linear-past, triangle-past, and custom (Fig. 6), which belong to the past-only non-metric fragment of MFOTL supported by DejaVu. The formulas are closed and negated prior to invoking DejaVu, since it only monitors closed formulas and just reports violations. DejaVu expects input streams without time-stamps and with databases containing exactly one event. Thus, we modify the streams in our experiments accordingly: each database with more than one event is linearized, i.e., translated into a sequence of singleton databases with all time-stamps set to 0. The verdicts of the used formulas are not affected by this transformation. Moreover, we run the experiments both with and without Flink's fault tolerance mechanism to determine its impact on performance (RQ ft ). This is only done when MonPoly is the submonitor, since DejaVu does not support checkpointing.
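The linearization of databases can be sketched as follows. The stream representation (a list of pairs of a time-stamp and a list of events) and the function name `linearize` are assumptions for illustration. The sketch also drops empty databases, which is a choice made here and is not specified in the text.

```python
def linearize(stream):
    """Turn each database into a sequence of singleton databases.

    `stream` is a list of (timestamp, events) pairs. All output
    time-stamps are set to 0, and the order of events is preserved.
    """
    linearized = []
    for _timestamp, events in stream:
        for event in events:
            linearized.append((0, [event]))
    return linearized


stream = [(10, ["P(1)", "Q(1,2)"]), (11, ["R(2,3)"])]
singletons = linearize(stream)
```

Discarding the time-stamps is harmless only for non-metric formulas such as those used with DejaVu here, since their verdicts do not depend on time differences between events.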

Measurements.
We ran all our experiments on a server with two sockets, each containing twelve Intel Xeon 2.20 GHz CPU cores with hyperthreading, which effectively gives us 48 independent computation threads.
To assess our framework's scalability, we measure the (maximal) latency and throughput achieved during our experiments. Latency is the difference between the time a monitor consumes an event and the time it is done processing it. Throughput is the number of events that a monitor processes in a unit of time. We use the wall-clock time values provided by the UNIX time command to measure the total execution time, i.e., the time between the moment when the replayer starts emitting events to the monitor and the moment the monitor processes the last emitted event. We also measure the execution time and maximal memory usage of each submonitor. To measure the latency during execution, our replayer injects a special event, called a latency marker, into the stream. Every second, the replayer generates a latency marker, which is tagged with the current time. The marker is then propagated by our framework, preserving its order with respect to the databases containing other events from the input stream. We measure the latency at the framework's output by comparing the current time with the time in the marker's tag. Besides measuring the current latency, we also calculate the maximum latency up to the current point in the experiment.
Since MonPoly's unit of input is a database of events (rather than a single event), it does not perform any processing before it receives an entire database. Its particular input format allows MonPoly to detect that the currently received database is complete only once the first event from the next database is received. This means that our latency measurements as described above would treat the time-stamp difference between two consecutive databases in the input as the monitor's processing latency. Thus, we task our replayer to additionally send watermark events as part of the input, signaling to MonPoly whenever the currently received database is complete. This allows us to measure the monitor's actual processing latency, excluding any delays already present in the input.
When the latency is higher than one second, the latency marker gets delayed too and a timely value cannot be produced. Flink reports zeros for the current latency in this case, while we consider the latest non-zero value. This significantly reduces the noise in our measurements.
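The treatment of latency values can be sketched as a small post-processing step. The function names and the representation of the reported values as a list of floats are assumptions for illustration; the sketch does not use Flink's metrics API.

```python
def marker_latency(marker_tag, now):
    """Latency of one marker: current time minus the time in its tag."""
    return now - marker_tag


def clean_latencies(reported):
    """Replace zero readings by the latest non-zero value and track the maximum.

    Returns the cleaned current latencies and the running maximum latency.
    """
    cleaned, running_max = [], []
    last, peak = 0.0, 0.0
    for value in reported:
        if value != 0:
            last = value  # a fresh measurement arrived
        cleaned.append(last)
        peak = max(peak, last)
        running_max.append(peak)
    return cleaned, running_max


current, maximum = clean_latencies([0.2, 0, 1.5, 0, 0, 0.4])
```

Carrying the last non-zero value forward avoids artificial drops to zero in the latency plots while a delayed marker is still in flight.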
In addition to online experiments, where we use our replayer to simulate event streams, we also execute all our synthetic experiments offline. Specifically, we directly supply the monitored log as a file to the monitor. The monitor consumes the log at a rate defined by its current processing speed. We can then calculate our framework's throughput as the ratio of the total number of events and the measured offline execution time. The stage [29] (offline or online) at which we run our monitor in each of the experiments is specified in Fig. 5.
Since we focus on performance measurements, we discard the tool's output during all of our experiments. Each run of a monitor with a specific configuration is repeated three times and the collected metrics are averaged to minimize the noise in the measurements.

Results.
Figure 7 shows the results of using our framework with MonPoly to monitor synthetic streams. We show the results when fault tolerance is enabled, since they are less favorable for our framework. Plots labeled with Tool N denote that our framework used N instances of Tool as submonitors. Omitting the number of submonitors indicates a standalone run of the Tool. Our experiments demonstrate our framework's low overhead (RQ oh ): a standalone run of a Tool exhibits the same performance as a run of our framework with one submonitor (Tool 1 ).
Figure 7a shows the achieved throughput (top), the maximum latency (middle), and the maximal memory consumption across all submonitors (bottom) when monitoring the formula star with different numbers of submonitors. For example, our tool exhibits a latency of 27 s for an event rate of 15,000 events per second if a single submonitor is used. Similar latency is exhibited with 4 submonitors when monitoring event rates above 45,000 events per second. In contrast, using 16 submonitors achieves sub-second latency for all event rates in our experiments. With an increasing number of submonitors, each submonitor receives fewer events and hence uses less memory, while collectively the submonitors handle larger throughput. This experiment answers RQ rate : our tool handles significantly higher event rates by using more parallel submonitors.
Figure 7b shows the achieved throughput (top), the maximum latency (middle), and the maximal memory consumption (bottom) of our tool when monitoring star, triangle, and linear formulas using 4 submonitors. The plots show six graphs, where each graph shows the results of monitoring one of the three formulas over a stream with an index rate of either 1 or 1000. Since the index rate affects the performance of MonPoly [14], the overall framework is also affected (RQ rate ). The event rate gain due to parallel monitoring depends on the variable occurrence patterns in the monitored formula (RQ frm ). Namely, the variable pattern in the star formula is the one that exhibits the best scalability due to variable a's occurrence in all the formula's atoms.
In the experiments described so far, we did not supply our framework with the relative event rates for the event names. Figure 7c positively answers RQ stats by showing that our tool's performance substantially increases when using 4 and 8 submonitors and when the statistics about the stream are known in advance. We use Tool N stats to denote that our framework runs Tool on N submonitors with relative event rates provided ahead of time.
Figure 8 shows the results of the same experiments as in Fig. 7 but now using our framework with DejaVu as the submonitor. Fault tolerance was disabled in these experiments. As before, the experiments show that our framework can handle higher event rates by using more parallel submonitors (RQ rate ). Regarding RQ gen , our results demonstrate improved throughput, latency, and memory consumption with two different first-order monitors. Figures 7a and 8a both answer RQ oh : they show that our framework achieves better performance than MonPoly and DejaVu on their own, except when only a single submonitor is used, where it exhibits essentially the same performance.
Figure 9 summarizes the results of using our framework with MonPoly to monitor the real-world log from the Nokia case study (RQ real ). The event and index rates are defined by the log; we only control the acceleration used by the replayer. As we anticipated earlier, the custom formula is the hardest to monitor (top right plot), followed by the delete and insert formulas, respectively. The other plots focus on the delete formula as it comes from the real use case and was not crafted by us. In contrast to the synthetic experiments, our framework's performance does not improve beyond 4 submonitors. However, if one considers the acceleration (up to 5000) and the log's average event rate (110 events per second), our framework can process event rates higher than 500,000 events per second on average. At this point, the centralized parsing and slicing become the main performance bottleneck.
The top left and middle plots in Fig. 9 contrast the performance overhead for fault tolerance (RQ ft ). The maximal latency is most visibly affected when the framework uses a single submonitor. The bottom three plots show how the latency changes over time during monitoring. These plots correspond to three individual runs while monitoring the delete formula. The leftmost plot shows the monitoring of the formula with respect to the stream sped up 1000 times, with fault tolerance disabled. The middle and rightmost plots show runs with fault tolerance enabled for the accelerations of 1000 and 2000. The regularly occurring spikes in the latency graphs are caused by Flink's state snapshot algorithm, which is invoked every 10 s.
Figure 10 compares the performance of our framework using MonPoly and DejaVu as submonitors when monitoring the custom formula on the log from the Nokia case study. Namely, MonPoly has lower maximum latency, and in both cases our framework improves the latency (RQ gen ) when more submonitors are used. Figure 10's rightmost plot shows how our framework improves DejaVu's current latency when monitoring the custom formula. The regular increases in latency seen in each run are due to DejaVu's internal garbage collection, which tries to reduce its memory usage when storing previously seen parameter values [35].
Interestingly, using our framework with a single submonitor (MonPoly 1 ) and without fault tolerance lowers the maximum latency compared to a standalone run of MonPoly (top left plots in Figs. 9 and 10). We conjecture that this results from the more efficient parsing and filtering of irrelevant events in our framework.
Finally, Fig. 11a shows the number of events sent per submonitor when no skew is present in the stream. In the presence of skew, the event distribution is much less uniform (Fig. 11b). When our framework is aware of the variables in the formula whose instantiations in the stream are skewed, it can balance the events evenly (Fig. 11c), effectively reducing the maximum load of the submonitors (RQ skew ).
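To make the notion of maximum load concrete, the following sketch counts the events sent to each submonitor when events are partitioned on a single integer attribute. The modulo-based assignment and the function names are simplifying assumptions; this is not the framework's hypercube strategy, and it does not model the skew-aware balancing shown in Fig. 11c.

```python
from collections import Counter


def loads(values, num_submonitors):
    """Number of events assigned to each submonitor under value % k partitioning."""
    counts = Counter(v % num_submonitors for v in values)
    return [counts.get(i, 0) for i in range(num_submonitors)]


def max_load(values, num_submonitors):
    """Load of the busiest submonitor, which bounds the overall performance."""
    return max(loads(values, num_submonitors))


uniform_values = [0, 1, 2, 3, 4, 5, 6, 7]   # no skew
skewed_values = [7, 7, 7, 7, 7, 7, 0, 1]    # value 7 is a heavy hitter

balanced = loads(uniform_values, 4)    # [2, 2, 2, 2]
unbalanced = loads(skewed_values, 4)   # [1, 1, 0, 6]
```

In the skewed case, one submonitor receives six of the eight events, so adding submonitors does not reduce the maximum load unless the heavy hitter is treated separately.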

Conclusion and future work
Our work takes a substantial step towards efficient, parallel online monitoring of event streams with respect to policies written in expressive first-order languages. This entailed generalizing the offline slicing framework [10] to support online monitoring and the simultaneous slicing with respect to all free variables in the formula. Our work also builds a bridge to related research on query processing for databases and data streams. We adapted hash-based partitioning techniques from databases to obtain an automatic splitting strategy. We implemented a general approach to automatic slicing in Apache Flink and instantiated it with two existing tools for monitoring events with data, namely MonPoly and DejaVu. Our results demonstrate a significant performance improvement. For example, 16-fold parallelization allows us to increase the event rate from 10,000 to 75,000, while retaining sub-second maximum latency (Fig. 7a).
In this article, we assumed that the stream's statistics are fixed. However, the automatic splitting strategy can be dynamically reconfigured by redistributing the submonitors' states coupled with the online collection of the statistics. We have already made some progress in implementing this extension and analyzing the tradeoff between the reconfiguration costs and the cost of using an imperfect splitting strategy [30,52]. We also plan to refine our automatic splitting strategy to account explicitly for communication costs and to evaluate our approach in a distributed cluster. To achieve maximal scalability, it will be necessary to parallelize the splitter and to process events from multiple independent input streams [12].
Funding Open Access funding provided by ETH Zurich.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.