Automated Analyses of IoT Event Monitoring Systems



Introduction
AWS IoT Events is a managed service for monitoring fleets of IoT devices. Customers use AWS IoT Events in diverse use cases, such as monitoring self-driving wheelchairs and monitoring a device's network connectivity, humidity, temperature, pressure, oil level, and oil temperature. Customers use the service by creating a detector model that detects events occurring on IoT devices and notifies an external service so that a corrective action can be taken. An example is an industrial boiler that constantly reports its temperature to a detector. The detector tracks the boiler's average temperature over the past 90 minutes and notifies a human operator when the boiler is running too hot.
Each detector model is defined as a finite state machine with dynamically typed variables and timers, where timers allow detectors to keep track of state over time. A model processes inputs from IoT devices to update internal state and to notify other AWS services when events are detected. Customers can use a single detector model to instantaneously detect events in thousands of devices. Ensuring well-formedness of a detector model is crucial as ill-formed detector models can miss events in every monitored device.
Starting from a survey that identified sources of well-formedness problems in customer models, we identified the most common mistakes made by customers and detect them using type- and model-checking. To use a model checker for checking well-formedness of a detector model, we formalize the execution semantics of a detector model and translate this semantics into the source-language notation of the JKind model checker [1]. Model checking [2][3][4][5][6][7][8][9] verifies desirable properties over the behavior of a system by performing the equivalent of an exhaustive enumeration of all the states reachable from its initial state. Most model checking tools use symbolic encodings and some form of induction [6] to prove properties of very large finite or even infinite state spaces.
We have implemented type-checking and model-checking as an analysis feature in the production AWS IoT Events service. Our analyzers have reported well-formedness property violations in 22% of submitted detector models. 93% of customers of AWS IoT Events have checked their detector models using our analyzers. Our analyzers report property violations to customers with an average latency of 5.6 seconds (see Section 4).
Our contributions are as follows: 1. We formalize the semantics of AWS IoT Events detector models. 2. We identify six well-formedness properties whose violations detect common customer mistakes. 3. We create fast, push-button analyzers that report property violations to customers.

Overview
Consider a user of AWS IoT Events who wants to monitor the temperature of an industrial boiler. If the boiler overheats, it can cause fires and endanger human lives. To catch an early warning of an overheating event, the user wants to automatically identify two different alarming events on the boiler's temperature. The first alarm should be triggered if the boiler's reported temperature is outside the normal range for more than 1 minute. The second alarm should be triggered if the temperature remains outside the normal range for another 5 minutes after the first alarm. A user might try to implement these requirements by creating the (flawed) detector model shown in Figure 1. This detector receives temperature data from the boiler and responds by sending a text message to the user. The detector model contains four states.
Understanding the bug: Every state in the detector model consists of actions. An action changes the internal state of a detector or triggers an external service. For example, the GettingTooHot state contains an action that starts a one-minute timer named Wait1Min; the user can edit such actions through the interface shown in Figure 2. Note that timers are accessible from every state in the detector model: even though the Wait1Min timer is created in the GettingTooHot state of Figure 1, it can be checked for expiration in all four states of Figure 1.
The detector model in Figure 1 has a fatal flaw caused by a typo. The user has written timeout("Wait1Min") instead of timeout("Wait5Min") when transitioning out of TooHot. This is allowed because timers are globally referenceable. However, it is a bug: each global timer has a unique name, the Wait1Min timer has already been used and expired, and a timer can expire at most once. This makes StillTooHot unreachable, so the second alarm will never fire.
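To make the failure concrete, the following sketch replays the scenario in plain Python. The state and timer names follow Figure 1, but the temperature threshold and the event trace are invented for illustration; this is not the AWS IoT Events evaluation engine, only a toy replay of its timer semantics.

```python
def run(events, use_typo=True):
    """Replay ('temp', value) and ('tick', minutes) events against a toy
    version of the boiler detector of Figure 1 (hypothetical threshold 100)."""
    state, timers = "Normal", {}              # timer name -> minutes remaining
    for kind, val in events:
        if kind == "temp":
            if state == "Normal" and val > 100:
                state, timers["Wait1Min"] = "GettingTooHot", 1
        else:                                 # 'tick': advance time
            expired = set()
            for name in list(timers):
                timers[name] -= val
                if timers[name] <= 0:
                    expired.add(name)
                    del timers[name]          # a timer expires at most once
            if state == "GettingTooHot" and "Wait1Min" in expired:
                state, timers["Wait5Min"] = "TooHot", 5
            elif state == "TooHot":
                # The buggy guard checks Wait1Min, which already expired.
                guard = "Wait1Min" if use_typo else "Wait5Min"
                if guard in expired:
                    state = "StillTooHot"
    return state

trace = [("temp", 120), ("tick", 1), ("tick", 5)]
```

On this trace, the buggy detector ends in TooHot and never reaches StillTooHot, while the corrected guard (use_typo=False) reaches StillTooHot after the five-minute timer expires.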
Related Work Languages such as IOTA [10], SIFT [11], and the system from Garcia et al. [12] use trigger-condition-action rules [13] to control the behavior of Internet of Things applications. These languages have the benefit of being largely declarative, allowing users to specify desired actions under different environmental stimuli. Similar to our approach, SIFT [11] automatically removes common user mistakes and compiles specifications into controller implementations without user interaction, and IOTA [10] is a reasoning calculus that allows custom specifications to be written about both why something should occur and why it should not. AWS IoT Events is designed explicitly for monitoring, rather than control, and our approach is imperative, rather than declarative: detector models do not have the same inconsistencies as rule sets, because they are disambiguated using explicit priorities on transitions. On the other hand, customers may still construct machines that do not match their intentions, motivating the analyses described in this paper.

Technique
In this section, we present a formal execution semantics of an AWS IoT Events detector model and describe specifications for the correctness properties.
Formalization of Detector Models Defining the alphabet and the transition relation for the state machine is perhaps the most interesting aspect of our formalization. Since detector models may contain global timers, timed automata [14] might seem like an apt candidate abstraction. However, AWS IoT Events users are not allowed to change the clock frequency of timers, nor specify arbitrary clock constraints. These observations allow us to formalize the detector models as a regular state machine, with timeout durations as additional state variables.
Formally, we represent the state machine for a detector model M as a tuple ⟨S, S₀, I, G, T, E_E, E_X, E_I⟩. The sets I, G, and T are assumed to be pairwise disjoint, and we define the set V ≜ I ∪ G to represent the input and global variables in the model.
We denote by V the set of values for global (G) and input (I) variables; V ranges over the values of the primitive types: integers, decimals (rationals), booleans, and strings. Integers and rationals are assumed to be unbounded, and rationals are arbitrarily precise. We use N as the domain for time and timeout values. The sets V_⊥ and N_⊥ extend V and N with the value ⊥ to represent an uninitialized variable.
Fig. 3: Types, expressions, actions, and events in IoT Events Detector Models (excerpt): τ ::= int | dec | str | bool; κ ::= event(e, a*); µ ::= transition(e, a*, s).
The grammar for types (τ ), expressions (ϵ), actions (α), events (κ), transitions (µ) and input triggers (ι) is shown in Figure 3. In the grammar, metavariable e stands for an expression, l stands for a literal value in V, v stands for any variable in V, t is a timer variable in T, a is an action, and i is an input in I. The unary and binary operators include standard arithmetic, Boolean, and relational operators. The timeout expression is true at the instant timer t expires, and the isundefined expression returns true if the variable or timer in question has not been assigned. Actions (α) describe changes to the system state: setTimer starts a timer and sets the periodicity of the timer, while the resetTimer and clearTimer reset and clear a timer (without changing the periodicity of the timer). The setGlobal action assigns a global variable. Events (κ) describe conditions under which a sequence of actions occur.
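As a reading aid, the action rules can be sketched as pure configuration transformers. The dict-of-components representation, the (periodicity, remaining) timer pairs, and the use of None for ⊥ are assumptions of this sketch, not the paper's notation:

```python
def set_timer(C, name, period):
    """setTimer: establish a timer's periodicity and start it."""
    C = dict(C, t=dict(C["t"]))
    C["t"][name] = (period, period)
    return C

def reset_timer(C, name):
    """resetTimer: restart an existing timer; periodicity unchanged."""
    C = dict(C, t=dict(C["t"]))
    p, _ = C["t"][name]
    C["t"][name] = (p, p)
    return C

def clear_timer(C, name):
    """clearTimer: deactivate a timer; periodicity unchanged."""
    C = dict(C, t=dict(C["t"]))
    p, _ = C["t"][name]
    C["t"][name] = (p, None)          # None stands in for an inactive timer
    return C

def set_global(C, var, value):
    """setGlobal: assign a global variable."""
    C = dict(C, g=dict(C["g"]))
    C["g"][var] = value
    return C
```

Each function returns a fresh configuration rather than mutating in place, mirroring the way the operational rules relate an input configuration to an output configuration.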
We define configurations C for the state machine as tuples C = ⟨s, i, t, g⟩. Each configuration tracks: a state s ∈ S in the detector model; the input valuation i ∈ (I → V_⊥) containing the values of inputs; the timer valuation t ∈ (T → (N_⊥ × N_⊥)) for user-defined timers, where each timer has both a periodicity and (if active) a time remaining; and the global valuation g ∈ (G → V_⊥) for global variables in the detector model.
To define the execution semantics, we create a structural operational semantics for each of the grammar rules and for the interaction with the external environment, as shown in Figure 4. We distinguish semantic rules by decorating the turnstiles with the grammar type that they operate over (ϵ, α, κ, µ, E_I, and ι). The variables e, a, k, m, i stand for elements of the appropriate syntactic class defined by the turnstile. For lists of elements, we decorate the syntactic class with * (e.g., ⊢ α*) and the variables with 'l' (e.g., al). We use the following notational conventions: given C = ⟨s, i, t, g⟩, we write C.s = s, and similarly for the other components of C. We also write C[s ← s′] for ⟨s′, i, t, g⟩, and similarly for the other components of C.
Expressions (⊢ ϵ ) evaluate to values, given a configuration. We do not present the expression rules (they are simple), but illustrate the other rule types in Figure 4. For actions (⊢ α ), the setTimer rule establishes the periodicity of a timer and also starts it. The resetTimer and clearTimer rules restart an existing timer given a periodicity p or clear it, respectively, and the setGlobal rule updates the value of a global variable. Events (κ) are used as entry and exit events for states. The list rules for actions (α * ) and events (κ * ) are not presented but are straightforward: they apply the relevant rule to the head of the list and pass the updated configuration to the remainder of the list, or return the configuration unchanged for nil. Transition event lists (µ * ) cause the system to change state, executing (only) the first transition from the list whose guard e evaluates to true. Finally, the top-level rule ⊢ ι describes how the system evolves according to external stimuli.
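The transition-list rule can be sketched as follows. Modeling guards as Python predicates over the configuration is purely for illustration; the real rule evaluates guard expressions under ⊢ ϵ:

```python
def step_transitions(C, transitions):
    """Fire only the first transition whose guard holds.
    transitions: ordered list of (guard, actions, target_state), where
    list position is the transition's priority."""
    for guard, actions, target in transitions:
        if guard(C):
            for act in actions:          # run the fired transition's actions
                C = act(C)
            return dict(C, s=target)     # move to the target state
    return C                             # no guard held: configuration unchanged
```

The early return is what makes rule sets unambiguous here: even if several guards are true, only the highest-priority transition executes.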
A run of the machine is any valid sequence of configurations produced by repeated applications of the ⊢ ι rule. Timeout inputs advance time to the earliest active timeout, as described by the matchesEarliest predicate. The subtractTimers function subtracts t i from each timer in C, and the clearTimers function calls the clearTimer action for any timer whose time remaining is equal to zero.
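The timeout step can be sketched as follows, reusing the assumed (periodicity, remaining) timer representation from earlier (None marks an inactive timer); matchesEarliest, subtractTimers, and clearTimers become three small functions:

```python
def earliest(timers):
    """matchesEarliest amount: smallest remaining time among active timers."""
    active = [rem for _, rem in timers.values() if rem is not None]
    return min(active) if active else None

def subtract_timers(timers, dt):
    """subtractTimers: decrement every active timer by dt."""
    return {n: (p, rem - dt if rem is not None else None)
            for n, (p, rem) in timers.items()}

def clear_expired(timers):
    """clearTimers: deactivate every timer whose remaining time is zero."""
    return {n: (p, None if rem == 0 else rem)
            for n, (p, rem) in timers.items()}

def timeout_step(timers):
    """Advance time to the earliest active expiry; return the new timer
    valuation and the set of timers that expired at that instant."""
    dt = earliest(timers)
    if dt is None:                        # no active timer: nothing happens
        return timers, set()
    expired = {n for n, (_, rem) in timers.items() if rem == dt}
    return clear_expired(subtract_timers(timers, dt)), expired
```

Note that several timers can expire at the same instant, which is why the step returns a set of expired timers rather than a single name.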

Well-formedness Properties
To find common issues with detector models, we surveyed (i) detector models across customer tickets submitted to AWS IoT Events, (ii) questions posted on internal forums like the AWS re:Post forum [15], and (iii) feedback submitted via the web-based console for AWS IoT Events. Based on this survey, we determined that the following correctness properties should hold over all detector models. For more details about this survey, please refer to Appendix A.
The model does not contain type errors: The AWS IoT Events expression language is untyped and thus may contain ill-typed expressions, e.g., expressions performing arithmetic operations on Booleans. A large class of such bugs can be readily detected and prevented using a type inference algorithm. Our algorithm follows the standard Hindley-Milner type unification approach [16][17][18]: it generates (and solves) a set of type constraints, or reports an error if no valid typing is possible. Every type error is reported as a warning to the customer.
When our type inference successfully infers types for all expressions, we use them to construct a well-typed abstract state machine using the formalization of Section 3. For the remaining well-formedness properties we use model checking. We introduce one or more indicator variables into our global abstract state to track certain kinds of updates in the state machine, and then assert temporal properties on these indicator variables. Because we use a model checker that checks only safety properties, in many cases we invert the property of interest and check that its negation is falsifiable, using the same mechanism often used for test-case generation [19].
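A toy version of the constraint-solving step is sketched below. Representing types as strings with single lowercase letters as type variables is an assumption of this sketch; the production implementation is a full Hindley-Milner unifier over the grammar of Figure 3.

```python
def unify(constraints):
    """Solve a list of (type, type) equality constraints, returning a
    substitution for type variables, or raise TypeError if the
    constraints are unsatisfiable (an ill-typed model)."""
    subst = {}

    def resolve(t):
        while t in subst:                # chase the substitution chain
            t = subst[t]
        return t

    def is_var(t):
        return len(t) == 1 and t.islower()   # 'a', 'b', ... are type variables

    for lhs, rhs in constraints:
        a, b = resolve(lhs), resolve(rhs)
        if a == b:
            continue
        if is_var(a):
            subst[a] = b                 # bind the variable to the other type
        elif is_var(b):
            subst[b] = a
        else:                            # e.g. bool vs int: arithmetic on Booleans
            raise TypeError(f"cannot unify {a} with {b}")
    return subst
```

For example, the expression x + 1 yields the constraint (type of x, int), which the solver discharges by binding x's type variable; true + 1 yields (bool, int), which fails and would be surfaced to the customer as a warning.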

Every Detector Model State is Reachable and Every Detector Model Transition and Event can be Executed: For each state s ∈ S, we add a new Boolean reachability indicator variable v_s_reached to our abstract state; it is initially false and is assigned true when the state is entered (similarly for transitions and events). To encode the property in a safety-property checker, we express the corresponding unreachability property in LTL and check that it is falsifiable. If it is instead provable, the tool warns the user.
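An explicit-state sketch of this check follows. The production analysis encodes the same indicator variables symbolically for JKind; this small graph search is only an illustrative stand-in over a transition graph given as (source, target) pairs:

```python
def unreachable_states(states, initial, transitions):
    """Return the states whose v_s_reached indicator can never become
    true.  transitions: set of (source, target) pairs over `states`."""
    reached = {s: False for s in states}     # the v_s_reached indicators
    reached[initial] = True                  # entering a state latches it
    frontier = [initial]
    while frontier:
        s = frontier.pop()
        for src, dst in transitions:
            if src == s and not reached[dst]:
                reached[dst] = True          # indicator latches on entry
                frontier.append(dst)
    return {s for s, r in reached.items() if not r}
```

Run on a graph shaped like the buggy model of Figure 1, where no transition into StillTooHot can ever fire, the check reports exactly that state as unreachable.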
Every Variable is Set Before Use: To check that variables are properly initialized, we first identify the places where variables are assigned and used. In detector models, variables are used in three places: in the evaluation of conditions for events, in the evaluation of conditions for transitions, and in the setGlobal action (which occurs because of an event or transition). We want to show that variables used in these contexts are never equal to ⊥ during evaluation. We can reuse the reachability variables created for events and transitions to encode that variables always have defined values when they are used. We first define functions to extract the set of variables used in expressions and action lists. The function Vars(e) : ϵ → V set simply extracts the variables in the expression. For action lists, the definition is slightly more complex, because variables are both defined and used. Every event or transition can be executed at most once during a computation step, so we can use the execution indicator variables to determine when a variable might be used.
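The extraction functions can be sketched over a toy tuple-based AST. The AST shapes and the restriction of action lists to setGlobal triples are assumptions of this sketch, not the paper's definitions:

```python
def vars_of(e):
    """Vars(e): the set of variables mentioned in expression e.
    Toy AST: ('var', v), ('lit', l), ('isundefined', v),
    ('unop', op, e1), ('binop', op, e1, e2), ('timeout', t)."""
    kind = e[0]
    if kind in ("var", "isundefined"):
        return {e[1]}
    if kind == "unop":
        return vars_of(e[2])
    if kind == "binop":
        return vars_of(e[2]) | vars_of(e[3])
    return set()                          # literals and timer timeouts

def used_before_def(actions, already_defined):
    """For an action list of ('setGlobal', var, expr) triples, return the
    variables read before any assignment in the list defines them: these
    are the uses that may evaluate to ⊥."""
    defined, suspicious = set(already_defined), set()
    for _, var, expr in actions:
        suspicious |= vars_of(expr) - defined
        defined.add(var)                  # this setGlobal defines var
    return suspicious
```

The second function captures why action lists are "slightly more complex": a setGlobal both reads its right-hand side and defines its target, so order within the list matters.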
Input Read Only on Message Trigger: This property is checked by the same mechanism as the previous property, with one small change. To enforce it, we modify the translation of the semantics slightly so that at the beginning of each step, prior to processing the input message, all input variables are assigned ⊥.
Message Triggered Between Consecutive Timeouts: We conservatively approximate a liveness property (no infinite path consisting of only timeout events) with a safety property: the same timer should not time out twice without an input message occurring between the timeouts. This formulation may flag models that do not have infinite paths with no input events, but our customers consider it a reasonable indicator. We begin by defining, for each timer t_i, an indicator variable v_timeout_i (of type integer rather than Boolean) and initialize it to zero. We modify the translation of updateTimers to increment this variable when its timer variable equals zero, and modify the translation of the message rule to reset all v_timeout_i variables to zero. The property of interest is then that no v_timeout_i counter ever exceeds one.
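On a single explicit trace, the counter encoding amounts to the following sketch. The event-trace representation is an assumption; in the real encoding the counters live in the Lustre translation and JKind searches all traces symbolically:

```python
def check_trace(events, timers):
    """Check one trace of ('timeout', timer_name) / ('message', payload)
    events.  Returns False iff some timer times out twice with no input
    message in between."""
    count = {t: 0 for t in timers}       # the per-timer v_timeout counters
    for kind, val in events:
        if kind == "timeout":
            count[val] += 1
            if count[val] > 1:           # same timer fired twice with no
                return False             # message in between: violation
        else:                            # any input message resets counters
            count = {t: 0 for t in timers}
    return True
```

A trace that alternates timeouts with messages passes, while back-to-back timeouts of the same timer fail, matching the safety approximation described above.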

Experiments
In this section, we evaluate the performance of model-checking safety properties on detector models, with a focus on model-checking latency. Low analysis latency is crucial because our tool warns customers of property violations while they are editing their detector model. Our type inference implementation runs with an average latency of 10 milliseconds across all the detector models in our experiments. Since type inference is much faster than model checking and runs successfully on all detector models, we do not evaluate it further in this section.
AWS IoT Events has a commercial feature [20] that uses the type checking and model checking described in Section 3. The feature's implementation first infers types using the type inference algorithm. Next, it translates the detector model into the Lustre language [21]; the translation is straightforward and follows directly from the semantics presented in Section 3. The safety properties described in Section 3.1 are attached to the model, along with location information. The feature then analyzes the model using the JKind [1] tool suite, an open-source industrial model checker. If JKind invalidates a safety property, the feature decodes the location from the safety property and includes it in the warning.
To evaluate this implementation, we randomly selected 210 detector models previously analyzed by the commercial feature. We checked the five properties described in Section 3.1 in parallel on a c4.8xlarge EC2 instance running Amazon Linux 2 (x86_64), using JKind version 4.4.1 with a timeout of 60 seconds.
Of the safety properties that we were able to translate to Lustre, JKind resolved 96% within our timeout of 60 seconds, with 80% completing in less than 10 seconds. Table 1 shows that checking the no-unreachable-action safety property requires the most time to complete. The detector models analyzed in the evaluation include models for monitoring self-driving wheelchairs, device connectivity, humidity, temperature, pressure, oil level, oil temperature, doors, motion, refrigerator temperature, dough fermentation, and vehicle speed sensing. They contained between 1 and 7 states and between 0 and 14 state changes. The no-unreachable-action safety property is checked on every action, generating an average of 17 safety properties per detector model, the most of any kind of safety property. This large number of properties to check on every detector model gave the no-unreachable-action property the highest average latency (5.6 seconds per analysis).
Table 1 also shows that about 13% of the properties could not be translated to Lustre. In 2% of the detector models, translation failures arose from type errors or incorrect use of the AWS IoT Events expression language. The remaining translation failures occurred due to: (1) use of operations not supported by Lustre, (2) no types being inferred for inputs or variables in the detector model, or (3) use of non-linear arithmetic, which is unsupported in JKind. Bitwise functions, strings, and array data types are supported in the AWS IoT Events expression language but not in Lustre; this language gap prevented us from translating 19 of the 210 detector models. Failing to infer a type for a variable prevented translation of 6 of the 210 detector models. JKind's lack of support for non-linear arithmetic prevented model-checking 2 of the 210 detector models.
We are actively working to support more functions, string and array data types, type annotations, and non-linear arithmetic in our model-checking of detector models.

Conclusion
Our analyzers have been running in the AWS IoT Events production service since December 2021. Since then, 93% of AWS IoT Events customers have used our implementation to check their detector models for well-formedness, without needing any knowledge of the underlying type checking and model checking. Our analyzers successfully complete on 85% of real-world detector models, and we are actively working to improve this coverage as explained in Section 4. Overall, our implementation has reported well-formedness property violations in 22% of submitted detector models in the production service, with an average latency of 5.6 seconds. We find that giving customers push-button access to fast verification, without requiring any knowledge of the underlying techniques, enables adoption of automated-reasoning-based tools.