Predictive intelligence to the edge through approximate collaborative context reasoning

We focus on Internet of Things (IoT) environments where a network of sensing and computing devices is responsible for locally processing contextual data, reasoning, and collaboratively inferring the appearance of a specific phenomenon (event). Pushing processing and knowledge inference to the edge of the IoT network allows the complexity of the event reasoning process to be distributed into many manageable pieces and to be physically located at the source of the contextual information. This enables a huge amount of rich data streams to be processed in real time, which would be prohibitively complex and costly to deliver on a traditional centralized Cloud system. We propose a lightweight, energy-efficient, distributed, adaptive, multiple-context perspective event reasoning model under uncertainty on each IoT device (sensor/actuator). Each device senses and processes context data and infers events based on different local context perspectives: (i) expert knowledge on event representation, (ii) outliers inference, and (iii) deviation from locally predicted context. This novel approximate reasoning paradigm is achieved through a contextualized, collaborative belief-driven clustering process, where clusters of devices are formed according to their belief on the presence of events. Our distributed and federated intelligence model efficiently identifies any localized abnormality in the contextual data in light of event reasoning by aggregating local degrees of belief, and it updates and adjusts its knowledge through contextual data outliers and novelty detection. We provide a comprehensive experimental and comparative assessment of our model over real contextual data against other localized and centralized event detection models and show the benefits stemming from its adoption: up to three orders of magnitude less energy consumption and high quality of inference.


Introduction
We envisage an IoT environment, where things at the edge of the network convey locally inferred knowledge to the IoT applications. We focus on a setting that involves networks of distributed wireless devices (e.g., sensor nodes and actuators, smart meters) capable of sensing and locally processing & reasoning about events. Each node performs measurements and locally extracts and infers knowledge over these measurements in light of event reasoning, e.g., wireless sensors spread on a geographical area are responsible for inferring fire or flood incidents. The fundamental requirement to materialize predictive intelligence at the edge of the network is the autonomous nature of nodes to locally perform data sensing & inference, and disseminate only inferred knowledge (e.g., minimal sufficient statistics) to their neighbors and concentrators. Nodes convey intelligence to concentrators for event inference.
Many critical IoT applications have been developed on top of contextual data streams captured by nodes for event identification and reasoning. Events are related to critical aspects, e.g., security issues or violations of predefined constraints. For instance, in security and environmental monitoring applications, it is imperative that the monitoring infrastructure applies an efficient mechanism to derive alerts when specific criteria are satisfied [1,8,10,12,23]. We can identify two main orientations in terms of data acquisition, transfer and contextual reasoning:
-Orientation 1: Centralized Context Reasoning. Nodes transfer their measurements to a concentrator, e.g., a sink node, back-end system, or Cloud center, which processes the data and possesses the intelligence to infer events.
-Orientation 2: Collaborative Context Reasoning. Nodes locally process data, locally infer knowledge, and have the intelligence for event reasoning in a collaborative manner.
In this paper, we elaborate on the second orientation through a collaborative, intelligent, and adaptive model for local data processing and event reasoning. This federated reasoning among nodes involves three perspectives of the captured information: (i) predicted context, (ii) contextual inference of outliers, and (iii) context fusion based on expert knowledge. These different Context Perspectives (CPs) are aggregated into a Type-2 Fuzzy Sets inference engine, which locally concludes on an event. Then, through a proposed knowledge-centric nodes clustering scheme, nodes disseminate only pieces of inferred knowledge among themselves in a federated way to unanimously reason about an event based on their local view. In turn, representative nodes of such collaborating clusters locally reason about context and then report the aggregated inference to the concentrators. The concentrators form a contextual event map and apply strategies to handle the inferred events, e.g., warn/trigger flood first responders. The key strength of our model is that it combines local context processing & inference at the network edge with knowledge-centric nodes clustering. The challenge is to collaboratively process & infer events while minimizing the false alarms / erroneous inferences that affect decision making, i.e., that lead to unsuitable decisions when handling hazardous phenomena.

Related work
Event processing & inference is adopted to support the development of IoT applications [8]. From the sensing and processing perspective, the sensing devices in the literature normally monitor a specific area and deliver the captured data to a back-end system for processing, event inference, and alerts/decision making [10,24]. An analysis of architectural solutions and case studies on event inference mechanisms is provided in [3]. The back-end system in [18] adopts aggregate methodologies for event inference, while the one in [17] supports IoT applications for air quality monitoring in indoor environments. That system collects contextual data from temperature, humidity, light, and air quality sensors and then centrally infers events. Moreover, the centralized context reasoning systems in [6], [12], and [23] provide early inference of forest fire events based on vision-enabled sensors, home monitoring based on the received signal strength of sensors, and surveillance of critical areas, respectively. In wireless sensor network deployments, e.g., [2,5,11,16], the back-end systems centrally provide event inference for specific areas while minimizing false alerts.
From the quality of inference perspective, event inference utilizing the principles of approximate reasoning, like Fuzzy Logic (FL), has proved a useful technique for delivering high quality of inference. The model in [7] predicts the peak particle velocity of ground vibration levels. This model adopts a FL-based inference scheme and utilizes the parameters of distance from blast face to the vibration monitoring point. The FL-based context reasoning model in [22] estimates the radiation levels in the air. The adoption of FL aims to handle missing values and, thus, to derive a mechanism capable of delivering alerts. The FL-based fusion model in [35] reduces uncertainty and false positives within the process of fault detection. In [4], a specific FL-based inference system is proposed for ambient intelligence environments. This system learns the users' behavior in order to adapt to the users' profiles. In [13,14], the authors propose a centralized reasoning system that derives immediate identification of events based only on univariate data. This system adopts data fusion and prediction for efficiently aggregating sensor measurements and then adopts FL for handling the uncertainty in event reasoning.
In all the aforementioned efforts, the edge devices transfer their data to a back-end system, where the latter based on certain computing and reasoning paradigms, e.g., data aggregation, FL-based reasoning, infers events and provides alerts/warnings to IoT applications. The clear major difference of our collaborative machine learning mechanism compared to the aforementioned efforts is the localized event processing & inference at the network edge instead of a centralized reasoning approach. In all research efforts, the back-end system centrally undertakes the responsibility of event reasoning [15] and alerts generation once all contextual data are delivered throughout the network [19].
Our federated reasoning approach drastically departs from the centralized predictive intelligence paradigm towards a fully distributed intelligence perspective. Our challenge is to push the intelligence for event processing & inference to the edge nodes, which are equipped with computing and sensing capabilities and provide partial awareness of an event. By enhancing this local event inference with different CPs, our mechanism (i) avoids raw data transfer from IoT nodes to a back-end system, (ii) favors conveying the minimal inferred knowledge from the edge to concentrators by introducing a knowledge-centric nodes clustering, (iii) minimizes the false alarm rate by introducing advanced approximate inference over the CPs, and (iv) reduces, through localized inference, the communication overhead induced by transferring huge data volumes from sensors to concentrators. In our orientation, the edge nodes do not share and/or relay contextual information. Instead, they conditionally transfer inferred pieces of knowledge, if necessary, in light of high quality of inference. Furthermore, from the quality of inference perspective, our mechanism adopts Type-2 Fuzzy Sets over multivariate contextual data instead of Type-1 Fuzzy Sets over univariate data, as e.g., in [13], to cope with the induced uncertainty of event knowledge representation.

Research excellence & contribution
To the best of our knowledge, our collaborative machine learning mechanism is a first attempt to materialize the concept of federated reasoning by conveying predictive intelligence for real-time event inference to the edge of the network. This is achieved by exploiting to the fullest the computing & sensing capabilities of IoT nodes based on different CPs. Our vision of intelligent edge computing is materialized by conditionally delivering inferred knowledge from the network edge with high quality of inference, rather than transferring data to the back-end system. In combination with the proposed knowledge-centric clustering scheme, our novel mechanism is robust against erroneous event inference (false alerts) and reduces the communication overhead between nodes and the back-end system. The obtained outcome of this research is: (i) accurate event inference close to the source of the contextual information, (ii) significantly low communication overhead through localized belief-centric groupings, thus avoiding data transfer to the back-end systems, and (iii) energy-efficient inference that is robust to imprecise and faulty data streams.
The major technical contributions of this research are:
-A temporal nearest-neighbors exponential smoothing model for localized context prediction;
-A conditionally growing adaptive vector quantization model for localized context outliers inference based on the Adaptive Resonance Theory;
-A time-optimized stochastic novelty detection & adaptation model based on the Optimal Stopping Theory. We provide the theoretical analyses for the abovementioned statistical learning and optimization models;
-A collaborative knowledge-centric nodes clustering scheme and a Type-2 FL-based event inference combining predicted and fused context with outliers identification;
-Asymptotic time and space complexities of the proposed algorithms and collaborative methods and a comprehensive evaluation of the nodes energy consumption in terms of communication and computation/processing cost;
-Performance and comparative assessment of our mechanism with: (i) the local voting scheme and (ii) the centralized aggregation-based event detection schemes, achieving up to three orders of magnitude less energy consumption in an IoT environment.

Organization
The paper is organized as follows: Section 2 presents the rationale and overview of our federated reasoning approach. Sections 3 and 4 introduce the local context prediction and outliers detection, respectively. Section 5 proposes a novelty & adaptation mechanism, while Sections 6 and 7 introduce context fusion and Type-2 FL-based inference. Section 8 discusses the collaborative knowledge-centric nodes clustering. Section 9 reports on the asymptotic time and space complexities of the proposed algorithms and methods and discusses the nodes energy consumption in terms of communication and computation/processing cost. Section 10 presents a comprehensive performance and comparative assessment with other event identification mechanisms. Section 11 concludes the paper.

Overview
We model the topology of an edge network of sensing and computation nodes (nodes) by an undirected communication graph with node set $\mathcal{N}$ and edge set $\mathcal{E}$, as shown in Fig. 1. Let $\mathcal{N}_i = \{j \in \mathcal{N} : \{i, j\} \in \mathcal{E}\}$ denote the set of neighbors of node $i$. Let also $\mathcal{C} = \{1, \ldots, c\}$ be a set of concentrator nodes that act as sink nodes for a specific subset of nodes in $\mathcal{N}$. Concentrators gather (digested) context knowledge from certain nodes in order to provide the IoT applications with the reasoning results of those nodes on the presence of an event of interest. The concentrators could directly connect to a fixed Internet infrastructure, e.g., a cloud platform for predictive analytics. The nodes monitor a specific area by sensing multiple contextual variables like ambient temperature, humidity, and wind speed, and perform local reasoning to infer an event of interest, e.g., a fire or flood event. We assume that nodes observe the same phenomenon. The degree of occurrence or degree of belief of an event, denoted by $\mu_i$, is locally inferred by node $i$. This belief is disseminated by node $i$ to its neighbors $\mathcal{N}_i$ to further enhance the contextual knowledge of its neighborhood. This leads to a clustering of nodes according to their view, thus contributing to distributed event reasoning.
The nodes clustering is achieved by the election of a node, referred to as Cluster Head (CH), based only on the disseminated degrees of belief. Groups of nodes are formed each one involving a unique CH. Each CH aggregates its members' degrees of belief and communicates with its concentrator delivering an inference result. In this case, no centralized process is adopted for clustering and data aggregation on event identification. The CHs convey aggregated knowledge to concentrators, thus, minimizing the messages circulated in the network. Note, the messages exchanged among members and CHs are not raw data. Instead, they are pieces of inferred context represented by the degrees of belief as it will be elaborated later. The overall proposed architecture is shown in Fig. 1 (left).

Rationale
Our multi-perspective collaborative context reasoning model for each node builds on top of a local FL-based inference engine (Type-2 FL System; introduced later) that combines three perspectives of context: (i) current fused context, (ii) predicted context, and (iii) outliers context. This model locally derives the degree of belief $\mu_i$ for node $i$ each time a vector of contextual values is captured, hereinafter referred to as the context vector. A node $i$ orchestrates the following reasoning processes to infer an event:
-Context Fusion evaluates the event inference rule defined by experts from the current context vector.
-Context Prediction utilizes the trend of historical context vectors experienced on node $i$ for a short-term forecast of context.
For the current context vector $x(t)$, the CPs quantify:
-(CP1) the current belief of an event by evaluating the expert's knowledge over $x(t)$,
-(CP2) how much $x(t)$ deviates from the predicted context given a short history,
-(CP3) to what degree $x(t)$ is considered an outlier given the statistical distribution of patterns.
Figure 1 (right) shows all context processing and reasoning processes for node $i$: from context sensing to local event inference.
Concerning CP1, our model evaluates the belief of an event from the current context. Since CP1 constitutes a rule-based baseline solution for event inference, we move a step further to incorporate knowledge from CP2 and CP3. As we show in our evaluation, the fusion of these CPs results in more sophisticated event reasoning.
Concerning CP2, node $i$ stores the most recent $m$ vectors $x(t-m), x(t-m+1), \ldots, x(t-1)$. Based on this history, node $i$ predicts the context vector at time $t$, $\hat{x}(t)$, as the conditional expectation given the recently observed history, i.e.,

$$\hat{x}(t) = E[x(t) \mid x(t-1), \ldots, x(t-m)]. \quad (1)$$

Node $i$ then captures the actual context $x(t)$, and the prediction error is $e(t) = \|x(t) - \hat{x}(t)\|$, where $\|\cdot\|$ denotes the Euclidean norm. The rationale in CP2 is that the prediction error gives an insight into how much the actual vector deviates from the expected vector based on a short-term history experienced on node $i$. If the current context deviates from the expected context, then this instantaneously indicates that the recently observed normal state changes. However, we should take into consideration the statistical patterns from the entire history of context vectors to enhance our belief on event inference.
Concerning CP3, node $i$ incrementally estimates the probability distribution of context $p(x)$. This unknown distribution is approximated by specific pattern vectors $w_k \in \mathbb{R}^d$, $k \in [K]$ (a compact notation for $k = 1, \ldots, K$ adopted in the paper), which represent the so-far observed vector space $D \subset \mathbb{R}^d$. The number $K$ of those patterns is not necessarily fixed and is initially unknown. Each pattern $w_k$ is the representative of the (convex) vector subspace $D_k \subset D$. The $p(x)$ is approximated by patterns based on the probability $p(x|w_k)$ of observing $x$ being derived from subspace $D_k$ represented by $w_k$. As will be discussed, this probability depends on the distance between $x$ and $w_k$. The rationale in CP3 is that node $i$ infers whether the current $x$ deviates significantly from the (so far) statistical patterns. In turn, node $i$ assesses whether or not $x$ lies outside the observed vector space utilizing the assignment probability $p(w^*|x) \propto p(x|w^*)p(w^*)$, with respect to its closest pattern $w^*$, i.e.,

$$w^* = \arg\min_{k \in [K]} \|x - w_k\|. \quad (2)$$

As will be discussed, the assignment probability $p(w^*|x)$ quantifies the instantaneous belief that the context is (i) either an outlier, (ii) or a novelty, thus expanding our current knowledge, (iii) or a normal instance of the space $D$, thus updating our current knowledge. Node $i$, to support such reasoning, is equipped with a time-optimized mechanism to incrementally update/adjust to possible novel vector subspaces identified, thus augmenting its current knowledge. This augmentation is achieved by increasing the number of patterns to better reflect the new vector subspaces, thus minimizing the risk of falsely considering vectors as outliers, which corresponds to false alarms under event inference. Before proceeding with the three CPs, we provide some preliminaries on unsupervised statistical learning and optimal stopping theory adopted in our analysis.

Adaptive vector quantization
Adaptive Vector Quantization (AVQ) refers to an unsupervised learning (clustering) algorithm [31] that partitions a $d$-dimensional space $\mathbb{R}^d$ into a fixed number of $K$ subspaces. AVQ distributes $K$ patterns $w_1, \ldots, w_K$ in $\mathbb{R}^d$. A pattern $w_k$ represents a subspace of $\mathbb{R}^d$. AVQ learns as $w_k$ changes in response to a random vector $x \in \mathbb{R}^d$. Competition selects which $w_k$ the vector $x$ modifies. The $k$-th pattern 'wins' if $w_k$ is the closest to $x$. During partition, vectors $x$ are projected onto their closest patterns and patterns adaptively move around the space to form optimal partitions (subspaces of $\mathbb{R}^d$) that minimize the Expected Quantization Error (EQE):

$$E\left[\min_{k} \|x - w_k\|^2\right].$$

On-line machine learning & stochastic gradient descent
Stochastic Gradient Descent (SGD) [27] is widely adopted in on-line machine learning as an optimization method for incrementally minimizing an objective function $J(a)$, where $a \in A$ is a parameter from a parameter space $A$ and $a^* \in A$ minimizes $J$. SGD leads to fast convergence to $a^*$ by adjusting the estimated $a$ so far in the direction (negative gradient $-\nabla J$) that improves the minimization of $J$. SGD gradually changes $a$ upon reception of a new training sample. The standard gradient descent algorithm updates $a$ as $\Delta a = -\eta \nabla_a E[J(a)]$, where the expectation is approximated by evaluating $J$ and its gradient over all training pairs and $\eta \in (0, 1)$. On the other hand, SGD simply does away with the expectation in the update of $a$ and computes the gradient of $J$ using only a single training sample at step $t = 1, 2, \ldots$. The update of $a_t$ at step $t$ is given by:

$$\Delta a_t = -\eta_t \nabla_a J(a_t).$$

In SGD, the learning rate $\{\eta_t\} \in (0, 1)$ is a step-size schedule, which defines a slowly decreasing sequence of scalars that satisfy:

$$\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$

Choosing the proper learning schedule is not trivial; a practical method is the hyperbolic schedule $\eta_t = \frac{1}{t+1}$ [27].
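As an illustration of single-sample updates with the hyperbolic schedule, the following minimal sketch applies SGD to the scalar objective $J(a) = \frac{1}{2}(x - a)^2$. The objective, the zero initial estimate, and the function name `sgd_estimate` are assumptions chosen for illustration and are not part of the proposed model.

```python
def sgd_estimate(samples, a0=0.0):
    """SGD on the illustrative objective J(a) = (x - a)^2 / 2, one sample per step.

    Uses the hyperbolic schedule eta_t = 1 / (t + 1) for t = 1, 2, ...
    The objective and the initial estimate a0 are assumptions of this sketch.
    """
    a = a0
    for t, x in enumerate(samples, start=1):
        eta = 1.0 / (t + 1)      # hyperbolic learning-rate schedule
        grad = -(x - a)          # gradient of (x - a)^2 / 2 with respect to a
        a = a - eta * grad       # step along the negative gradient
    return a


if __name__ == "__main__":
    print(sgd_estimate([2.0, 4.0, 6.0]))
```

With this schedule and a zero initial estimate, the estimate after $n$ samples equals the sum of the samples divided by $n + 1$, so it approaches the sample mean as $n$ grows.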

Optimal stopping theory
The Optimal Stopping Theory (OST) [28] deals with the problem of choosing the best time instance to take the decision of performing a certain action. This decision is based on sequentially observed random variables in order to maximize the expected reward. For random variables $X_1, X_2, \ldots$ and measurable functions $Y_t = \psi_t(X_1, X_2, \ldots, X_t)$, $t = 1, 2, \ldots$, and $Y_\infty = \psi_\infty(X_1, X_2, \ldots)$, the problem is to find a stopping time $\tau$ that maximizes $E[Y_\tau]$. The $\tau$ is a random variable with values in $\{1, 2, \ldots\}$ such that the event $\{\tau = t\}$ is in the Borel field (filtration) $\mathcal{F}_t$ generated by $X_1, \ldots, X_t$, i.e., the only available information we have obtained up to $t$. The decision to stop at $t$ is a function of $X_1, \ldots, X_t$ and does not depend on future observables $X_{t+1}, \ldots$. The problem is to find the optimal stopping time $t^*$ such that the supremum of $E[Y_\tau]$ is attained, i.e.,

$$t^* = \inf\{t \geq 1 : Y_t = \operatorname{ess\,sup}_{\tau \geq t} E[Y_\tau \mid \mathcal{F}_t]\}.$$

The (essential) supremum is taken over all stopping times $\tau$ such that $\tau \geq t$. The optimal stopping time $t^*$ is obtained through the principle of optimality [30]. The theorem in [39] refers to the existence of the optimal stopping time.

Context prediction
The major concept of this CP is to interpret the deviation between the expected context and the actual context on node $i$ as a reliable indication of an event. Context prediction (Fig. 1 (right)) involves a multidimensional time-series vector forecast at node $i$ to locally predict the upcoming context $x(t+1)$ given a sliding history window of $m$ observed vectors $x(t-m), \ldots, x(t-1)$ and the current context $x(t)$.
We enhance the multivariate Holt-Winters Double Exponential Smoothing (DES) with an $h$-Nearest Neighbors smoothing (hNN) at time $t$, $1 \leq h \leq m$. DES takes into account the possibility of a time series exhibiting some form of trend with an updated slope component. In our case, we attempt to capture the temporal correlation of the noisy contextual data by exploiting the values of the temporal nearest neighbors. The proposed temporal smoothing functionality over DES encapsulates the correlation of values ahead of time, which aligns with our idea of event reasoning using instantaneous context deviation. This deviation should involve the trend and slope, already captured by DES, and the temporal correlation of consecutive contextual values. By involving this temporal correlation between recent past and future values, we enhance event reasoning.
Our idea is to substitute each value $x_i$ with the average $\bar{x}_i$ of the hNN backward and forward values, $\forall i$. That is, given the values $x_i(k)$, $k = t-h+1, \ldots, t-1$, the corresponding temporal hNN smoothed values $\bar{x}_i(k)$ are given by Eq. (7). Once the $x_i$ values are smoothed, the forecast of the $i$-th variable at time $t$, $\hat{x}_i(t)$, is achieved using DES, $\forall i$. Evidently, when $h = 1$, our approach reduces to DES, i.e., without the temporal NN smoothing. In turn, we obtain:

$$y_i(t) = \delta\,\bar{x}_i(t) + (1 - \delta)\,\big(y_i(t-1) + u_i(t-1)\big),$$
$$u_i(t) = \kappa\,\big(y_i(t) - y_i(t-1)\big) + (1 - \kappa)\,u_i(t-1),$$

where $\bar{x}_i(t)$ is the actual smoothed value from our hNN method at $t$ as in (7), and $y_i(t)$ and $y_i(t-1)$ are the intercepts at time $t$ and $t-1$, respectively. The $u_i(t)$ and $u_i(t-1)$ are the slopes (time series trends) at time $t$ and $t-1$, respectively. The $\delta$ and $\kappa$ are smoothing constants in $(0,1)$. The $\delta$ value is used to smooth the new actual and trend-adjusted previously smoothed intercept, while the $\kappa$ value is used to smooth the trend. The smoothing constants determine the weight given to the most recent past values and control the degree of smoothing. Values close to 1 give more weight to recent values, while values near 0 distribute the weights to consider values from the more distant past within the window. We set $\delta = 0.7$ and $\kappa = 0.9$ as in [32]. The predicted context vector is $\hat{x}(t) = y(t) + u(t)$ and the prediction error is

$$e(t) = d^{-\frac{1}{2}}\,\|x(t) - \hat{x}(t)\|,$$

where the factor $d^{-\frac{1}{2}}$ is a normalization factor over the Euclidean norm $\|\cdot\|$ to get a value in $[0,1]$ given that $x \in [0,1]^d$, i.e., each $x_i$ value is scaled in $[0,1]$.
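The DES forecast and the normalized prediction error can be sketched as follows for the special case $h = 1$, where the approach reduces to plain DES. The sketch assumes the standard Holt recurrences for the intercept and slope, forecasts each vector one step ahead from the previous intercept and slope, and initializes the intercept with the first vector and the slope with zero; these choices and the name `des_prediction_errors` are illustrative assumptions.

```python
import math


def des_prediction_errors(series, delta=0.7, kappa=0.9):
    """Normalized one-step-ahead DES prediction errors for context vectors.

    series: list of d-dimensional vectors with every component scaled to [0, 1].
    Assumptions of this sketch: standard Holt recurrences, h = 1 (no hNN
    smoothing), intercept initialized to the first vector, slope to zero.
    """
    d = len(series[0])
    y = list(series[0])          # intercepts y_i
    u = [0.0] * d                # slopes u_i
    errors = []
    for x in series[1:]:
        # Forecast built from the latest intercept and slope.
        x_hat = [y[i] + u[i] for i in range(d)]
        dist = math.sqrt(sum((x[i] - x_hat[i]) ** 2 for i in range(d)))
        errors.append(dist / math.sqrt(d))   # d^(-1/2) keeps e(t) in [0, 1]
        for i in range(d):
            y_prev = y[i]
            y[i] = delta * x[i] + (1 - delta) * (y_prev + u[i])
            u[i] = kappa * (y[i] - y_prev) + (1 - kappa) * u[i]
    return errors


if __name__ == "__main__":
    steady = [[0.2, 0.4]] * 5
    jump = [[0.0, 0.0], [1.0, 1.0]]
    print(des_prediction_errors(steady))
    print(des_prediction_errors(jump))
```

A steady context yields zero error, while a jump from one corner of the unit hypercube to the opposite corner yields the maximum normalized error.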

Context outliers inference
This CP infers whether the current context is an outlier, which highly impacts the event reasoning (Fig. 1 (right)). We study the case where an outlier context deviates significantly from the up-to-now statistical patterns learned locally on a node. If this deviation occurs regularly, then our model considers the possibility of a novelty and thus adapts to new knowledge; see Section 5.

Conditionally growing context vector quantization
Consider a node $i$, which captures context vectors $x$ drawn from a space $D$. Based on those vectors, we identify the vector subspaces $D_k$, $k \in [K]$, and estimate their patterns $w_k$ and their number $K$, through which $p(x)$ can be approximated. This is achieved by incrementally partitioning the space $D = \cup_{k=1}^{K} D_k$. We study an incremental AVQ for partitioning $D$ into $K$ (unknown) subspaces $D_k$. The quantization of $D$ operates as a mechanism to project $x$ to the closest pattern $w_k$. Node $i$ incrementally minimizes the EQE:

$$J(\{w_k\}) = E\left[\min_{k \in [K]} \|x - w_k\|^2\right]. \quad (11)$$

We seek the best possible approximation of vectors $x$ out of a set $\{w_k\}_{k=1}^{K}$ of (finite) $K$ patterns such that $x$ is projected to its closest pattern $w^* \in D^* \subset \{x \in D : \|x - w^*\| = \min_k \|x - w_k\|\}$. We incrementally minimize $J$ in (11) in the presence of a random $x$ and update only the closest pattern $w^*$. However, the number of subspaces (and, thus, patterns) $K > 0$ is completely unknown and not necessarily constant. The key problem is to decide on an appropriate $K$ value to minimize (11).
A variety of AVQ methods exist in the literature, but they are not suitable for incremental implementation because $K$ must be supplied in advance. We propose a conditionally growing AVQ algorithm (i) in which the patterns are sequentially updated and (ii) which is adaptively growing, i.e., it increases $K$ if a criterion holds true. Given that $K$ is not available a priori, our algorithm minimizes $J$ with respect to a threshold $\rho$. Initially, the vector space has a unique (random) pattern, i.e., $K = 1$. Upon the presence of $x$, our algorithm (i) finds the closest pattern $w^*$ and (ii) updates $w^*$ only if the condition $\|x - w^*\| \leq \rho$ holds true. Otherwise, $x$ is currently considered as a new pattern, thus increasing $K$ by one. This conditional quantization leaves random vectors to self-determine the resolution of quantization. Evidently, a high $\rho$ results in coarse space quantization, while a low $\rho$ yields fine-grained quantization. The parameter $\rho$ is associated with the stability-plasticity dilemma, and is also known as vigilance in Adaptive Resonance Theory [29]. In our case, $\rho$ represents a threshold of similarity between vectors and patterns, thus guiding us in determining whether a new pattern should be formed. To give a physical meaning to $\rho$, we express it through a set of percentages $a_i \in (0, 1)$ of the value ranges of each $x_i$. Then, $\rho = [a_1, \ldots, a_d]$ and, if we let $a_i = a$, $\forall i$, then $\rho = (ad)^{1/2}$. A high $a$ over a high-dimensional space results in a low number of patterns and vice versa. The outcome is a set of $K$ patterns $W = \{w_k\}_{k=1}^{K}$. The incremental minimization in (11) given a series of $x(t)$, $t \in T$, is achieved by SGD. Our algorithm processes successive $x(t)$ until a termination criterion is met. $\Gamma(t)$ refers to the distance between successive estimates of the patterns at steps $t-1$ and $t$, and the algorithm stops at the first $t$ where $\Gamma(t) \leq \gamma$. The update rules of patterns $w_k$ are provided in Theorem 2.

Theorem 2
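Given context $x$ and its closest pattern $w^* \in W$, the patterns $\{w_k\}_{k=1}^{K}$ converge to the optimal estimates if updated as:

$$\Delta w_k = \begin{cases} \eta_t\,(x - w_k), & \text{if } w_k = w^*, \\ 0, & \text{otherwise}, \end{cases}$$

where $\eta_t$ is the SGD learning rate.
Proof See Appendix A.1.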
A fundamental characteristic of our quantization algorithm is that each pattern w k ∈ W corresponds to the centroid E[x|x ∈ D k ] of those vectors x assigned to w k . This is utilized for estimating the probability of an outlier as discussed in Section 5.

Theorem 3 (Centroid Convergence) If x is the centroid of the vector subspace $D_k$ and pattern $w_k$ is the closest pattern
Our Algorithm 1 processes (random) context vectors one at a time. In the initialization phase, there is only one pattern $w_1$, i.e., $K = 1$, which is the first vector. For the $t$-th context $x(t)$ and onwards, $t \geq 2$, the algorithm: (i) updates the closest pattern to $x(t)$ (out of $K$ patterns) given that the distance is at most $\rho$; otherwise, (ii) a new pattern is added (increasing $K$ by one). The algorithm stops updating the patterns at the first step $t$ where $\Gamma(t) \leq \gamma$. From that time onwards, the algorithm returns the set of patterns $W$ and no further modification is performed.
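The steps described above can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: it assumes the winner update $w^* \leftarrow w^* + \eta_t (x - w^*)$ with the hyperbolic schedule $\eta_t = 1/(t+1)$, and it takes $\Gamma(t)$ to be the displacement of the updated pattern at step $t$. The names `growing_avq` and `euclid` are hypothetical.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def growing_avq(vectors, rho, gamma=1e-3):
    """Conditionally growing vector quantization over a stream of vectors.

    Assumptions of this sketch: winner update w* += eta_t * (x - w*) with
    eta_t = 1 / (t + 1), and Gamma(t) equal to the displacement of the
    updated pattern at step t.
    """
    patterns = [list(vectors[0])]                 # K = 1: the first vector
    for t, x in enumerate(vectors[1:], start=2):
        # Competition: index of the closest pattern to x.
        k = min(range(len(patterns)), key=lambda j: euclid(x, patterns[j]))
        if euclid(x, patterns[k]) <= rho:
            eta = 1.0 / (t + 1)
            old = patterns[k]
            patterns[k] = [w + eta * (xi - w) for w, xi in zip(old, x)]
            if euclid(old, patterns[k]) <= gamma:  # termination criterion
                break
        else:
            patterns.append(list(x))              # grow: K increases by one
    return patterns


if __name__ == "__main__":
    stream = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
    for w in growing_avq(stream, rho=0.3, gamma=0.0):
        print(w)
```

In the usage example, the third vector lies farther than $\rho$ from the only existing pattern, so it seeds a second pattern, which the fourth vector then pulls slightly towards itself.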

Outliers inference
We study how this CP detects a change in the pattern subspaces $D_k$, $\forall k$, based only on $\{w_k\}$ from Section 4.1. Consider an incoming $x$ to node $i$. The CP rationale lies in two components: first, decide whether $x$ is an outlier with respect to the current quantization of $D$; second, track over time the number of such outliers and decide that subspaces have changed when this number becomes high.
Consider the probability assignment $p(w_k|x)$ of $x$ to a pattern. Since we do not have any prior knowledge about $p(w_k|x)$, we apply the principle of maximum entropy: among all possible probability distributions, we choose the one that maximizes the entropy [34] given an optimal quantization of $D$. Specifically, $p(w_k|x)$ conforms to the Gibbs distribution:

$$p(w_k|x) = \frac{p(w_k)\,\exp(-\beta \|x - w_k\|^2)}{\sum_{j=1}^{K} p(w_j)\,\exp(-\beta \|x - w_j\|^2)},$$

where $\beta \geq 0$ will be explained later. Assuming that each $w_k$ has the same prior $p(w_k)$, we obtain that:

$$p(w_k|x) = \frac{\exp(-\beta \|x - w_k\|^2)}{\sum_{j=1}^{K} \exp(-\beta \|x - w_j\|^2)}.$$

Note that $p(w_k|x)$ explicitly depends on the distance of the context from the patterns. By varying the parameter $\beta$, the probability assignment $p(w_k|x)$ can range from completely fuzzy ($\beta = 0$, each vector belongs equally to all patterns) to crisp ($\beta \to \infty$, each vector belongs to only one pattern, or precisely, is uniformly distributed over the set of equidistant closest patterns). As $\beta \to \infty$, this probability becomes a delta function around the pattern closest to $x$. The probability $p(w^*|x)$ quantifies the belief that $x$ is an outlier with $w^*$ being its closest pattern in the quantized space.
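The fuzzy-to-crisp behavior of the assignment probability can be illustrated with a short sketch. It assumes equal priors and a Gibbs weight of the form $\exp(-\beta \|x - w_k\|^2)$; the use of the squared Euclidean distance and the name `assignment_probabilities` are assumptions made for illustration.

```python
import math


def assignment_probabilities(x, patterns, beta):
    """Gibbs assignment probabilities p(w_k | x) under equal priors.

    Assumption of this sketch: the weight of pattern w_k is
    exp(-beta * ||x - w_k||^2), normalized over all patterns.
    """
    sq = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in patterns]
    shift = min(sq)   # subtract the minimum so that exp() cannot underflow to all zeros
    weights = [math.exp(-beta * (s - shift)) for s in sq]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    pats = [[0.0, 0.0], [1.0, 1.0]]
    print(assignment_probabilities([0.1, 0.1], pats, beta=0.0))
    print(assignment_probabilities([0.1, 0.1], pats, beta=50.0))
```

With $\beta = 0$ every pattern receives the same probability, whereas a large $\beta$ concentrates almost all probability on the closest pattern.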

Context space change detection
Consider the probability assignment $p(w^*|x)$ given that $x$ is assigned to $w^*$. To decide whether $x$ can be properly represented by $w^*$, we associate $w^*$ with a dynamic vigilance $\rho^* > 0$, which depends on the distance of the assigned $x$ to $w^*$. This vigilance is a normalized distance ratio of $\|x - w^*\|_2$ over the average distance of all context vectors $x_\ell$, $\ell = 1, \ldots, L$, that were assigned to $w^*$:

$$\rho^* = \frac{\|x - w^*\|_2}{\frac{1}{L}\sum_{\ell=1}^{L} \|x_\ell - w^*\|_2}.$$

Based on this ratio, if $\rho^*$ is less than a threshold $\rho > 0$, $x$ is properly represented by its closest pattern. Otherwise, $x$ is deemed to be an outlier. A $\rho$ value normally ranges between 2.5 and 5 [33]. Hence, for $x$, which is assigned to $w^*$, we define as outlier indicator of $x$ with respect to $w^*$ the random variable:

$$I(x) = \begin{cases} 1, & \text{if } \rho^* \geq \rho, \\ 0, & \text{otherwise}. \end{cases}$$

Let us now move to keeping track of the outlier indicators $I(x(1)), \ldots, I(x(t))$ over time, focusing on their closest pattern $w^* = \arg\min_k \|x(t) - w_k\|$, $\forall t$. To simplify the notation, we set $I_t = I(x(t))$. A cumulative sum of $I_t$'s with a high portion of 1's causes node $i$ to consider that $p(x|w^*)$ might have changed. Upon observation of $x$, node $i$ observes for pattern $w^*$ the random variables $\{I_1, \ldots, I_t\}$. Node $i$ detects a change in $p(x|w^*)$ based on the cumulative sum $S_t$ of $I_1, I_2, \ldots, I_t$ up to the $t$-th assigned vector:

$$S_t = \sum_{\tau=1}^{t} I_\tau. \quad (17)$$

$I_t$ is a discrete random process with independent and identically distributed (i.i.d.) samples. Each $I_t$ follows an unknown probability distribution depending on the distance of $x$ to $w^*$. $I_t$ has finite mean $E[I_t] < \infty$, $t = 1, \ldots$, which depends on $\|x(t) - w^*\|_2$, and the expectation of an outlier indicator is:

$$E[I_t] = P(I_t = 1) = P(\rho^* \geq \rho).$$

Our knowledge of that distribution, which is not trivial to estimate, will provide insight to judge whether $p(x|w^*)$ has changed in the subspace determined by $w^*$. We should 'follow' the trend of that change by either updating $w^*$ to continuously represent its subspace, or creating a new pattern in the novel vector subspace.
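The outlier indicator and its running sum can be sketched as follows. The sketch assumes that the vigilance is the distance of the incoming vector to its closest pattern divided by the mean distance of the vectors previously assigned to that pattern, and that a vector counts as an outlier when this ratio is not below the threshold. The function names and the threshold value 3.0 (picked from the 2.5 to 5 range) are illustrative assumptions.

```python
from itertools import accumulate


def vigilance_ratio(dist, assigned_dists):
    """Distance of x to w* divided by the mean distance of vectors assigned to w*."""
    return dist / (sum(assigned_dists) / len(assigned_dists))


def outlier_indicator(dist, assigned_dists, rho=3.0):
    """Return 1 if x is deemed an outlier with respect to w*, else 0.

    rho = 3.0 is an illustrative threshold within the 2.5 to 5 range.
    """
    return 1 if vigilance_ratio(dist, assigned_dists) >= rho else 0


def cumulative_sums(indicators):
    """Running sums S_1, ..., S_t of the outlier indicators."""
    return list(accumulate(indicators))


if __name__ == "__main__":
    history = [0.1, 0.1, 0.1]    # distances of vectors already assigned to w*
    flags = [outlier_indicator(dist, history) for dist in (0.5, 0.1, 0.6, 0.7)]
    print(flags, cumulative_sums(flags))
```

A steadily growing running sum is the piece of evidence that the later stopping-time decision builds on.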
By observing $I_t$ and the sum $S_t$ up to $t$, the challenge is to decide how large the sum should get before deciding that $p(x|w^*)$ has changed. If we decide at an early stage that $p(x|w^*)$ has changed, this might be a 'premature' decision; a relatively small number of 'outliers' might not correspond to a change in $p(x|w^*)$. If we 'delay' our decision, we might get erroneous event inference (high false alarm rate), since we avoid adapting $w^*$ to 'follow' the trend of the vector subspace change.
The rationale for this CP is as follows. To decide when $p(x|w^*)$ has changed, we could wait for an unknown finite horizon $t^*$ in order to be more confident of a change. During the $t^*$ horizon, we only observe the cumulative sum $S_\tau$, $\tau = 1, \ldots, t^*$. We propose a stochastic optimization algorithm that postpones a vector space change decision through additional observations of $I_\tau$. At time $t^*$, a decision on a possible $p(x|w^*)$ change has to be taken. The problem is to find the optimal stopping time $t^*$ such that $p(x|w^*)$ can be considered changed for those $x(t)$ assigned to $w^*$ at $t > t^*$.
We define our confidence $Y_t$ of a decision on a change of $p(x|w^*)$ based on the cumulative sum $S_t$ in (17). $Y_t$ is directly connected to the performance improvements that a timely decision yields. $Y_t$ is a random variable generated by the sum of $I_\tau$ up to $t$, $S_t = \sum_{\tau=1}^{t} I_\tau$, discounted by a risk factor $\alpha \in (0, 1)$. Our algorithm has to find $t^*$ in order to either (i) start adapting $w^*$ after considering that $p(x|w^*)$ has changed or (ii) create a new pattern, with respect to vigilance $\rho$ (see Section 4), for those vectors that arrive at $t > t^*$. If we never start this adaptation, our confidence that we follow the new trend (patterns) is zero, $Y_\infty = 0$. This indicates that we do not 'follow' the trend of a possible change over the subspace and/or do not further augment our knowledge of possibly new vector subspaces. Furthermore, we will never start adapting $w^*$ at some $t$ with $S_t = 0$, since there is no evidence of any outlier up to $t$. When $I_t$ assumes unity values at many time instances, $S_t$ increases at a high rate, thus indicating a possible change due to a significant number of outliers. Our problem is to decide how large $S_t$ should get before we start adapting $w^*$ or augment our current knowledge of the underlying vector space distribution by adding extra patterns. We have to find a time $t > 0$ that maximizes our expected confidence, i.e., the time at which the supremum is attained:
$$\sup_{t > 0} E[Y_t]. \qquad (20)$$
The semantics of the risk factor $\alpha$ are as follows. A high $\alpha$ indicates a conservative adaptation model; it requires additional observations before concluding on a change decision. This, however, comes at the expense of possible outlier prediction inaccuracies during this period, since $w^*$ might not be representative of its corresponding assigned vectors. A low $\alpha$ denotes a rather optimistic model, which reaches premature decisions on a $p(x|w^*)$ change.
This means that once we have concluded on a change, we have to adapt $w^*$ by actually exploiting every incoming vector assigned to $w^*$ and/or considering $x$ as a new pattern. This continues until the updated $w^*$ converges.
We propose a solution for the problem in (20). Firstly, we prove the existence of $t^*$ in our case, then report on the corresponding optimal stopping time, and finally elaborate on the optimality of the proposed solution. A decision taken at time $t$ is:
- either to assert that a change in $p(x|w^*)$ holds true and, then, start the adaptation of $w^*$ or insert $x$ as a new pattern,
- or to continue the observation process at time $t + 1$ and, then, proceed with a decision.
Based only on $S_t = \sum_{\tau=1}^{t} I_\tau$, we determine a stopping time that maximizes (20).
Theorem 4 An optimal stopping time t * for the problem in (20) exists.
Proof For proof, see Appendix A.3.
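In our case, the $I_t$ are non-negative; thus, the problem is monotone [28]. This means that $t^*$, which exists by Theorem 4, is obtained by the 1-stage look-ahead optimal rule (1-sla) [28]. That is, we should start adapting $w^*$ at the first stage $t$ at which the current confidence is at least the expected confidence after one additional observation, i.e., $Y_t \geq E[Y_{t+1} \mid I_1, \ldots, I_t]$. For our monotone stopping problem with observations $I_1, I_2, \ldots$, the 1-sla rule is optimal (see Theorem 4).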

Theorem 5 The optimal stopping time t * for the problem in
Proof For proof, see Appendix A.4.
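To illustrate how a 1-sla rule turns into a threshold on the cumulative sum, the following sketch assumes, for illustration only, that the discounted confidence has the multiplicative form $Y_t = \alpha^t S_t$. This form is an assumption (it is consistent with $Y_\infty = 0$ and with $Y_t = 0$ whenever $S_t = 0$), and the resulting threshold should not be read as the exact statement of Theorem 5.

```latex
\begin{align*}
E[Y_{t+1} \mid I_1, \ldots, I_t]
  &= \alpha^{t+1}\bigl(S_t + E[I]\bigr)
  && \text{(i.i.d. indicators)} \\
Y_t \geq E[Y_{t+1} \mid I_1, \ldots, I_t]
  &\iff \alpha^{t} S_t \geq \alpha^{t+1}\bigl(S_t + E[I]\bigr) \\
  &\iff S_t \geq \frac{\alpha}{1-\alpha}\, E[I] \\
t^{*} &= \inf\Bigl\{\, t \geq 1 : S_t \geq \frac{\alpha}{1-\alpha}\, E[I] \Bigr\}
\end{align*}
```

Under this assumption the threshold grows with $\alpha$, which matches the stated semantics of the risk factor: a high $\alpha$ demands more accumulated outliers before a change is declared. Because $S_t$ is non-decreasing, once the inequality holds it keeps holding, which is the monotonicity the 1-sla rule relies on.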

C. Anagnostopoulos, K. Kolomvatsos
To derive $t^*$ from Theorem 5 we need to estimate the expectation $E[I] = P(\{I = 1\})$. Empirically, $P(\{I = 1\})$ can be calculated as the fraction of assigned vectors whose distance ratio with respect to their closest patterns is at least $\rho$; refer to (15). Moreover, we provide an estimate for $P(\{I = 1\})$ based on our quantization algorithm in Section 4. The probability of $\{I_t = 1\}$ refers to the conditional probability of $x(t)$ being an outlier given that it is assigned to $w^*$ with $p(w^*|x(t))$. $P(\{I_t = 1\})$ is, therefore, associated with the probability that the distance $\|x(t) - w^*\|_2 > \theta$, with the scalar $\theta$ defined in (22). If we define the vector $z(t) = x(t) - w^*$, then we seek the probability distribution of its squared Euclidean norm $\|z(t)\|^2$. Based on the centroid convergence in Theorem 3, $w^*$ refers to the centroid of its assigned vectors and, under the assumption of normally distributed random components, $\|z(t)\|^2$ follows a non-central chi-squared distribution. Let $Q$ be the monotonic, log-concave Marcum Q-function, with parameters $\kappa_1$, $\kappa_2$, and $\kappa_3$. Then, the CDF of the non-central $\chi^2$ distribution is obtained by substitution in the Q-function, which yields the estimate of $P(\{I = 1\})$ in (23). For an analytical expression of (23), refer to Appendix A.7. Hence, the optimal stopping time is obtained once we substitute $E[I]$ in Theorem 5 by the $P(\{I = 1\})$ estimated in (23).
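The empirical route to $E[I]$ and its use in a stopping rule can be sketched as below. This is a hedged illustration: the function names are hypothetical, and the threshold $\frac{\alpha}{1-\alpha}E[I]$ follows from assuming a multiplicatively discounted confidence $\alpha^t S_t$, which is an assumption made here rather than a formula quoted from the text.

```python
def empirical_outlier_probability(indicators):
    """Fraction of observed indicators equal to 1; an estimate of E[I]."""
    return sum(indicators) / len(indicators)

def stopping_time(indicators, alpha, expected_indicator):
    """First t (1-based) at which S_t reaches the assumed 1-sla threshold
    alpha / (1 - alpha) * E[I]; None if the stream ends first.
    Never stops while S_t == 0, since there is no evidence of outliers."""
    threshold = alpha / (1.0 - alpha) * expected_indicator
    s_t = 0
    for t, i_t in enumerate(indicators, start=1):
        s_t += i_t
        if s_t > 0 and s_t >= threshold:
            return t
    return None
```

With a fixed indicator stream, raising `alpha` postpones (or suppresses) the decision, which mirrors the conservative versus optimistic behaviour attributed to the risk factor.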

Context adaptation
Once node $i$ has detected a change in at least one vector subspace, it initiates a process that adapts the patterns by modifying $w^*$. A change in a vector subspace indicates that new patterns can be formed or existing patterns should be updated. For every incoming $x$ appearing at $t > t^*$, node $i$ either updates $w^*$ to follow the trend or creates a new pattern $w_{K+1} = x$, as described in Algorithm 1.
Algorithm 2 shows the change detection and adaptation process.

Expert knowledge context fusion
This CP evaluates the belief of an event based on experts' knowledge (Fig. 1 (right)). Consider the context $x$ at node $i$. Each variable $x_j$, $j = 1, \ldots, d$, in $x$ affects the event reasoning in a different way, as interpreted by human expert knowledge. For instance, consider the identification of a fire event. A fire event can be inferred based on temperature $x_1$, humidity $x_2$, and (ionization) smoke $x_3$ measurements, i.e., $x = [x_1, x_2, x_3]$. A human expert can express a fire event through an increase in temperature and smoke, with humidity remaining at relatively low levels. Let the row vector $x_P$ be constructed from the variables of $x$ that proportionally affect the presence of an event, i.e., the event is expressed by an increase in the values of those variables. Similarly, let the row vector $x_N$ be constructed from the variables of $x$ that inversely affect the presence of the event, i.e., the event is expressed by a decrease in the values of those variables. In this case, we obtain $x = [x_P; x_N]$, where in our example $x_P = [x_1, x_3]$ and $x_N = [x_2]$. This classification of the $x_j$ variables into the $x_P$ and $x_N$ vectors is provided directly by the human interpretation of an event. Based on this representation, we introduce a vector fusion function that produces a unified view on the event identification. We introduce the normalized 'state' $v_j \in [0, 1]$ of each $x_j$ from $x_P$ and $x_N$. The state $v_j$ indicates whether $x_j \in x_P$ (or $\in x_N$) has reached its maximum (or minimum) value and, thus, it partially expresses the existence of an event. Define the fusion function $f$ in (25). Function $f$ fuses the current context vector into a scalar indicating the presence of an event through the normalized states $v_j$. The $\lambda_1, \lambda_2 \in \mathbb{R}$ parameters are application specific. Through the adopted sigmoid function, we can either suppress or emphasize the contribution of a given variable $x_j$ to the fusion result.
For instance, we count a high impact of v j when its value is only above threshold λ 1 by setting λ 2 → 0 (tuning the steepness of the sigmoid function).
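A possible shape for the normalized states and the sigmoid-based fusion is sketched below. Everything here is an assumption made for illustration: min-max normalization for the states, a logistic sigmoid centred at `lam1` with width `lam2`, and a plain average over the variables. The function names are hypothetical and the paper's own definition in (25) may differ.

```python
import math

def normalized_state(x, lo, hi, proportional=True):
    """Min-max state in [0, 1]: near 1 when a 'proportional' variable is
    close to its maximum, or an 'inverse' variable is close to its minimum."""
    v = (x - lo) / (hi - lo) if proportional else (hi - x) / (hi - lo)
    return min(1.0, max(0.0, v))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fuse(states, lam1, lam2):
    """Average sigmoid response of the states; a small lam2 makes each
    response approach a step at the threshold lam1."""
    return sum(sigmoid((v - lam1) / lam2) for v in states) / len(states)
```

For the fire example, temperature and smoke would be passed with `proportional=True` and humidity with `proportional=False`, so that a hot, smoky, dry context drives all three states towards 1.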

Fuzzy contextual knowledge base
Based on the CPs in Sections 3, 4, and 6, node i locally achieves event inference at time t by considering (i) the current context fusion f (v(t)) in (25), (ii) the current assignment probability p(w * |x(t)) in (14) w.r.t. to closest pattern w * , and (iii) the current deviation e(t) in (10) for x(t). We attempt to fuse these CPs through a finite set of Fuzzy Inference Rules (FIR). Each FIR reflects the degree of belief for a specific event inferred locally on node i. For instance, a FIR is: 'when the local sensed temperature is high then the degree of belief for a fire event might be also high'. We propose a T2FLS, which defines the fuzzy knowledge base of FIRs for node i. In this work, we do not rely on a Type-1 FLS (T1FLS) as such an inference model has specific drawbacks when applied in dynamic environments and, more interestingly, when the construction of the FIRs involves uncertainty due to partial knowledge in representing the output of the inference result [21]. In our case, this corresponds to the uncertainty of defining the occurrence of an event based only on the local available knowledge: current context, predicted context, and possible outliers. The limitation in a T1FLS is on handling uncertainty in representing knowledge through FIRs [9,21]. In a T1FLS, the experts define exactly the membership degree of the involved input and output variables in a FIR, e.g., the characterization of a value as 'high' or 'low'. However, when even the definition of a membership function involves uncertainty, the experts cannot be certain about the membership grade. In such cases, uncertainty is observed not only on the environment of the examined problem, e.g., we classify a value as 'high' or 'low' or the degree of belief as 'high', but also on the description of the term e.g., 'high', itself in a FIR.
In a T2FLS, the membership functions that characterize the terms of the three CPs are themselves 'fuzzy', which leads to the definition of FIRs incorporating such uncertainty [21]. This approach seems appropriate in our case, as FIRs cannot explicitly reflect knowledge on whether incoming measurements correspond to the occurrence of an event. Our FIRs take into consideration the uncertainty in the definition of an event by the human expert, enhanced with the CPs: deviation of predicted context and outliers inference. Such FIRs refer to a non-linear mapping $F(f(v), p(w^*|x), e)$ between the three CPs (inputs) and one output, i.e., the degree of belief $\mu_i \in [0, 1]$. The antecedent part of a FIR is a linguistic conjunction of the CPs and the consequent part is the degree of belief that the event actually occurs. The structure of the $k$-th FIR is as follows:
IF $f(v)$ is $A_{1k}$ AND $e$ is $A_{2k}$ AND $p(w^*|x)$ is $A_{3k}$ THEN $\mu_i$ is $B_k$,
where $A_{1k}$, $A_{2k}$, $A_{3k}$ and $B_k$ are membership functions for the $k$-th FIR mapping the values of $f(v)$, $e$, $p(w^*|x)$ and $\mu_i$ into unity intervals, respectively, by characterizing these values through the linguistic terms: low, medium, and high. If a linguistic term, e.g., 'high', were represented through one fuzzy set in a T1FLS, then we would use one membership function $g(x) \in [0, 1]$ mapping the real value (input) $x \in [0, 1]$ to a discrete set of pairs $(x_j, g(x_j))$. In a T2FLS, each term $A_{1k}$, $A_{2k}$, $A_{3k}$ and $B_k$ in the FIRs is represented by two membership functions corresponding to lower and upper bounds [20]. For instance, the term 'high', unlike in a T1FLS, where the membership for each $x$ is a number $g(x)$, is represented by two membership functions. That is, each value $x$ is assigned to an interval $[g_L(x), g_U(x)]$ corresponding to a lower and an upper membership function $g_L$ and $g_U$, respectively. E.g., the membership of $x = 0.25$ is the interval $[0.05, 0.2]$.
The interval areas [g L (x j ), g U (x j )] for each input x j reflect the uncertainty in defining the term, e.g., 'high', which is useful when it is difficult to determine the exact membership function for each term or in modeling the diverse opinions from different CPs in defining the occurrence of an event, in our case. If g L (x) = g U (x), ∀x, we obtain a FIR in a T1FLS. Following the above FIR structure, each A jk , j = 1, 2, 3, and B k , for each k-th FIR, corresponds to a set of intervals. The interested reader could also refer to [20] for fuzzy reasoning in T2FLS.
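The interval-valued membership described above can be sketched with a pair of triangular functions. The breakpoints used in the usage note are made-up illustrative values, not the membership functions of the paper, and the function names are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership with support (a, c) and peak 1.0 at b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def interval_membership(x, lower, upper):
    """Membership interval [g_L(x), g_U(x)] of x for one linguistic term.
    `lower` and `upper` are (a, b, c) triples of the two triangles."""
    g_l = triangular(x, *lower)
    g_u = triangular(x, *upper)
    return (min(g_l, g_u), max(g_l, g_u))
```

For a 'Medium' term one could take, hypothetically, `upper=(0.1, 0.5, 0.9)` and `lower=(0.3, 0.5, 0.7)`; passing the same triple for both collapses the interval to a single number, which is the Type-1 special case mentioned in the text.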

Determination of local degree of belief
A $\mu_i$ value close to unity denotes the case where the belief is at high levels, i.e., there is a high belief that a hazardous phenomenon, like fire or flood, occurs in the area of interest based on the agreement of the three CPs (all of them assume values close to unity). The opposite holds when $\mu_i$ tends to zero. We consider three fuzzy linguistic terms for the FIRs: Low, Medium, and High. Low represents that a variable (input or output) takes values close to 0, while High depicts the case where a variable takes values close to 1. Medium depicts the case where the variable takes values around 0.5. For instance, a Low fuzzy value for $e$ indicates that the current and predicted context are close enough; thus, the current context follows the trend of its recent historical context. A High fuzzy value for $p(w^*|x)$ denotes that the current context does not significantly deviate from its regular statistical pattern. A High fuzzy value for $f(v)$ indicates a positive inference on the presence of an event as represented by an expert's knowledge. For each fuzzy term, human experts define the upper and the lower membership functions. Here, we consider triangular membership functions $g_L$ and $g_U$, as they are widely adopted in the literature. Our T2FLS is generic; thus, any type of membership function can be adopted to better suit the application domain. Table 1 shows the proposed fuzzy knowledge base for event inference. Upon receiving the current context $x(t)$, node $i$ produces its corresponding (i) fused context $f(v)$, (ii) deviation $e(t)$ and (iii) assignment probability $p(w^*|x)$. Then, the T2FLS is activated as follows: (Step 1) calculation of the interval (based on the membership functions) for each input; (Step 2) calculation of the active interval of each FIR; (Step 3) performance of 'type reduction' to combine the active interval of each FIR and the corresponding consequent.
Step 3 produces the interval of the consequent, and accordingly, the defuzzification phase determines a scalar value for the local degree of belief $\mu_i$ at time $t$. The most common method for 'type reduction' is the center of sets type reducer [21], which generates a Type-1 Fuzzy Set as output, which is then converted into a scalar value for $\mu_i$ after defuzzification. When $\mu_i$ is over a pre-defined belief threshold $\epsilon \in [0, 1]$, the T2FLS engine locally infers an event occurrence with degree of belief $\mu_i$.

Belief-centric clustering
In our federated reasoning approach, groups of nodes are formed based on their local degrees of belief $\mu_i$, $i \in N$. The clustering process is repeated at a clustering era $T_1, T_2, T_3, \ldots$, $T_n \in \mathbb{T}$. $T_n$ is a variable time index in $\mathbb{T}$, which is triggered by a node $i$ that is the first to locally believe in an event presence (i.e., $\mu_i \geq \epsilon$ for the belief threshold $\epsilon$) and, thus, asks for the opinions of its local neighbors before reaching a conclusion. In each group, a node is elected as the Cluster Head (CH) and is responsible for exchanging the aggregated degrees of belief (discussed later) with a concentrator from set $C$ after a belief revision/update of the initial opinion on an event presence. Hence, the number of messages circulated in the network is reduced, as it is not necessary for each node to relay messages to a concentrator. In the election process, a node $i$ becomes a CH if it experiences the highest $\mu_i$ related to an observed phenomenon among its neighbors $N_i$. The CH notifies its members about its appointment as a CH, thus avoiding redundant message dissemination. The CH node, after its appointment, aggregates the degrees of belief of its neighbors, resulting in enhanced neighborhood contextual knowledge by unanimously inferring a possible event.
The primary objectives of the federated election process are:
- (i) Appointment of a subset of nodes as CHs responsible for determining and disseminating a unanimous (aggregated) degree of belief to the concentrators.
- (ii) Dynamically changing the CH appointment among nodes. Evidently, this prolongs the network lifetime by changing CH appointments and, thus, balancing energy consumption for the event inference process and the transmission of messages to the members and concentrators.
- (iii) Termination of the election process within a constant number of iterations (exchanged messages).
It should be noted that the description of the CH replacement process (i.e., objective (ii)) is beyond the scope of this paper. It is also worth noting that we do not make any assumption about the spatial distribution of IoT nodes in the area. Every node can act as either CH or member. This calls for an efficient CH election algorithm.

Belief-centric cluster-head election
A baseline solution for the election process involves nodes exchanging their $\mu_i$ with all neighbors. The node with the highest $\mu_i$ is elected to become the CH of the neighborhood. However, this solution requires a significant number of messages exchanged among nodes. Moreover, since the election process is re-initiated after a time interval $T$, a high energy budget is required for that type of communication. There are certain election algorithms which could be adopted. In our case, neighboring nodes exchange their $\mu_i$ values and then 'elect' the CH. To this end, we follow the concept of the CH election algorithm in [37] by modifying the election criteria to reflect the knowledge exchange over a neighborhood. At each node, the election process requires a number of iterations $L > 0$. In every iteration, nodes send and receive specific small-sized messages from neighbors containing their degrees of belief. Before node $i$ starts the election process, it configures a local probability of becoming a CH, $\xi_i$, hereinafter referred to as Election Probability (EP), as a function of $\mu_i$, i.e., $\xi_i = \max(\xi_{\min}, \mu_i)$, where $\xi_{\min}$ is a minimum EP for each node: $\xi_i$ is not allowed to fall below $\xi_{\min}$, e.g., $10^{-3}$. This restriction is essential for terminating the election process in $L = O(1)$ iterations; see Lemma 1. Node $i$ with a high EP $\xi_i$ starts the following process: it sends announcement messages of the form $\langle \xi_i, i \rangle$ to its $N_i$ neighbors to be a CH. A node $j$ with a low EP $\xi_j$ delays the transmission of announcement messages and considers itself 'non-CH' if it has heard $\langle \xi_i, i \rangle$ with $\xi_i > \xi_j$. During iteration $\ell$, $1 \leq \ell \leq L$, every node $i$ decides to become a CH with EP $\xi_i$. Through the process, node $i$ can either be elected to become a CH according to its EP $\xi_i$ or remain at the same status (i.e., non-CH) according to overheard announcement messages within its neighborhood $N_i$.
A node $j$ selects as its CH the node $i$ with the highest $\mu_i$; this is achieved by the comparison of $\xi_i$ and $\xi_j$. Every node $i$ then multiplies its EP $\xi_i$ by a factor $\chi > 1$ and goes to the next step $\ell + 1$, and so on, i.e., $\xi_i(\ell + 1) = \min(\chi \xi_i(\ell), 1)$.
If node i decides to become a CH since its EP ξ i has reached unity, it sends an announcement message 'CH i' to its neighbors N i . A node j ∈ N i , then, considers itself 'non-CH' if it has heard from node i a 'CH i' message and terminates the election process. Note, this election process is completely distributed. Node i either decides to become a CH since μ i is the highest among its neighbors, or be a member which awaits a message by its unique CH.

Lemma 1 The belief-centric election process requires O(1) iterations.
Proof For proof, see Appendix A.5.
The number of iterations for each node does not depend on the number of neighbors and is bounded by a constant. Indicatively, when $\xi_{\min} = 10^{-3}$ and $\chi = e$, a node needs at most eight iterations to elect or be elected as a CH.
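The bound on the number of iterations can be checked numerically by replaying the EP update rule. The sketch below counts the first iteration (with the initial EP) plus every multiplication until the EP reaches unity; the function names are hypothetical, and the counting convention is an assumption chosen to be consistent with the 'at most eight iterations' figure quoted above.

```python
import math

def initial_ep(mu, xi_min):
    """Initial Election Probability: the local belief, floored at xi_min."""
    return max(xi_min, mu)

def election_iterations(xi_start, chi):
    """Iterations needed for the EP to reach 1 when it is multiplied by
    chi after each iteration and capped at 1."""
    if chi <= 1.0:
        raise ValueError("chi must be greater than 1 for the process to terminate")
    xi, iterations = xi_start, 1
    while xi < 1.0:
        xi = min(chi * xi, 1.0)
        iterations += 1
    return iterations
```

The worst case is a node that starts at the minimum EP, e.g. `election_iterations(initial_ep(0.0, 1e-3), math.e)`; a node with a higher local belief starts closer to unity and finishes sooner.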

Lemma 2 The message exchange complexity in the belief-centric election process is $O(1)$ per node and $O(|N|)$ for the network.
Proof For proof, see Appendix A.6.

Aggregated degree of belief & federated event reasoning
Once node $i$ is appointed as a CH, it locally determines the average degree of belief of its members $j \in N_i$:
$$\bar{\mu}_i = \frac{1}{|N_i|}\sum_{j \in N_i} \mu_j.$$
The $\bar{\mu}_i$ reflects a degree of consensus of the neighborhood on event inference. CH $i$, based on the pair $(\mu_i, \bar{\mu}_i)$, determines an aggregated degree of belief $\hat{\mu}_i$. We adopt a reward-idle methodology to reason on the aggregated degree of belief $\hat{\mu}_i$, which will be delivered by CH $i$ to its concentrator. Let $\epsilon$ be the belief threshold. If CH $i$ and its neighbors unanimously agree on the presence of an event, i.e., if the logical expression $(\mu_i \geq \epsilon) \wedge (\bar{\mu}_i \geq \epsilon)$ holds true, then we reward CH $i$'s belief on the event by sending to the concentrator $\hat{\mu}_i = \mu_i$. When CH $i$ and its neighbors unanimously agree on the absence of an event, i.e., if it holds true that $(\mu_i < \epsilon) \wedge (\bar{\mu}_i < \epsilon)$, then $\hat{\mu}_i$ is the average value of all degrees of belief, and CH $i$ does not notify the concentrator. If there is a disagreement between CH $i$ and its neighborhood, i.e., if it holds true that $(\mu_i \geq \epsilon) \wedge (\bar{\mu}_i < \epsilon)$, then CH $i$ notifies its concentrator after regulating its local opinion by a factor of $r \in (0, 1)$ towards the neighbors' average belief. The concentrator then acquires knowledge for a specific region of the area of interest about the appearance of an event, and to what extent this local inference from nodes $\{i, N_i\}$ is of high belief, by receiving $\hat{\mu}_i$. Note that, since $\mu_i \geq \max_{j \in N_i}\{\mu_j\}$, the case $(\mu_i < \epsilon) \wedge (\bar{\mu}_i > \epsilon)$ never arises.
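The reward-idle reasoning can be sketched as a small decision function. Several details are assumptions made for illustration: the threshold is called `eps`, agreement is tested with `>=` and `<`, the 'average of all degrees of belief' includes the CH itself, and the regulation towards the neighbors' average is modelled as the convex combination `(1 - r) * mu_ch + r * mean_neighbors`. The function name is hypothetical.

```python
def aggregate_belief(mu_ch, neighbor_beliefs, eps, r):
    """Return (aggregated_belief, notify_concentrator) for a cluster head.

    mu_ch: local degree of belief of the CH
    neighbor_beliefs: non-empty list of the members' degrees of belief
    eps: belief threshold; r: regulation factor in (0, 1)
    """
    mean_neighbors = sum(neighbor_beliefs) / len(neighbor_beliefs)
    if mu_ch >= eps and mean_neighbors >= eps:
        # Unanimous presence: reward the CH's own belief.
        return mu_ch, True
    if mu_ch < eps and mean_neighbors < eps:
        # Unanimous absence: average over CH and members, stay idle.
        overall = (mu_ch + sum(neighbor_beliefs)) / (len(neighbor_beliefs) + 1)
        return overall, False
    # Disagreement: pull the CH's opinion towards the neighbors' average.
    # (mu_ch < eps with mean_neighbors >= eps cannot occur when the CH
    # holds the highest belief in its neighborhood.)
    return (1.0 - r) * mu_ch + r * mean_neighbors, True
```

For example, a CH with belief 0.75 whose two members both report 0.25 would, with `eps=0.5` and `r=0.5`, report the regulated value 0.5 to its concentrator.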

Computational complexity, energy & communication cost
In this section we present the time and space computational complexities for both Algorithms 1 and 2 and the energy and communication cost of the processes for each node $i$: (i) event inference (local derivation of the degree of belief $\mu_i$), (ii) election and clustering era (appointment of cluster-heads (CHs) and cluster members), (iii) derivation of the aggregated degree of belief by the CHs, and (iv) report to the concentrators from the CHs.

Computational complexity
We report on the time and space complexities of the processes that are needed for each node $i$ to locally infer the degree of belief. Such processes include: (i) context vector quantization for patterns derivation, (ii) change detection of the quantized data subspace, (iii) context adaptation, and (iv) degree of belief inference including context prediction and fuzzy inference.

Time & space complexity for context vector quantization
Algorithm 1 is an incremental partitioning algorithm which updates its closest current pattern $w^*$ based on the incoming context vector $x(t)$ at time instance $t$. The closest pattern update stops when the algorithm has converged with respect to a convergence threshold $\gamma$. That is, the patterns' updates are stopped at the first time instance (vector observation) $t'$ at which the convergence criterion with respect to $\gamma$ is met. During the training phase, at every observation $x(t)$ at time instance $t$, the algorithm finds the closest pattern $w^*$ to the context vector $x(t)$. This requires $O(dK)$ time per observation for searching for the closest pattern out of the current $K$ patterns $\{w_k\}_{k=1}^{K}$. The whole training process requires $O(dKt')$ time. After convergence, i.e., at time instance $t > t'$, the structures of the patterns are used for outliers detection and, in certain cases, for adaptation based on the optimal stopping time methodology in Section 5. In this phase, the calculation of the probability $p(w^*|x)$ requires $O(d \log K)$ given a k-d tree structure for searching the closest pattern. The space complexity of Algorithm 1 refers to the storage of the $K$ $d$-dimensional patterns $w_k$, which is $O(dK)$.

Time & space complexity for change detection and adaptation
The Algorithm 2 is an incremental algorithm, which processes the sensed context vector x(t) at time instance t to determine whether there is a context change detection after several observations. Algorithm 2 requires a pre-calculation of the K scalars θ k , k ∈ [K] using (22). Those scalars derive from the variances of the data subspaces represented by the patterns w k from Algorithm 1. This requires O(dK) time for the K variances θ k . The Algorithm 2, at every time instance t, calculates the outlier indicator I t using (16), which requires O(1), given the closest pattern w * , which requires O(d log K). When the optimal stopping criterion holds true, which is determined in O(1) time, then the closest pattern w * is either updated or a new pattern is inserted in the pattern set W in O(1). In the adaptation, the dynamic vigilance ρ * is updated in O(dK). The space complexity of Algorithm 2 refers to the storage of the K scalars (variances) θ k and the K dynamic vigilances ρ * k , which is O(K). Table 2 shows the asymptotic time and space complexities for the Algorithms 1 and 2.

Time & space complexity for degree of belief inference
Each node $i$, upon sensing a $d$-dimensional context vector $x(t)$ at time instance $t$, performs event inference to derive the degree of belief $\mu_i$. Based on Table 2, a node $i$ requires $O(d(m + \log K) + R)$ to provide the degree of belief $\mu_i$ on an event, including any possible adaptation. It is worth noting that node $i$ requires $O(d(K + m))$ space to store the patterns and the most recent context vectors. Given the belief-centric clustering, a node $i$ after local inference can initiate a clustering era for determining the aggregated degree of belief. In each clustering era, every node $i$ requires $O(1)$ messages to either be appointed as a CH or not (member of the cluster); see Lemmas 1 & 2. For a CH node, the calculation of the aggregated degree of belief depends on the cardinality of its neighborhood, i.e., the number of cluster members, which requires $O(|N|)$ using (30). Every CH node then transmits to its concentrator the aggregated degree of belief, requiring $O(1)$ messages (network communication). Table 3 summarizes the overall asymptotic complexities per node for the engaged processes: event inference, belief-centric election and report of the aggregated degree of belief to the concentrator.

Communication cost & computation energy consumption
The nodes must accomplish their assigned sensing and inference tasks by using the limited energy resources carried by them. The energy refers to a number of operations: (i) wireless communication, (ii) sensing the environment, and (iii) local computation. In our study, the energy and communication model reflects three facets: energy for communication required for the belief-centric clustering process, energy for computation, i.e., event inference and degree of belief derivation, and communication energy of the clusterheads to report the aggregated degree of belief to their assigned concentrators.
Each node $i$ consumes processing power for locally inferring the degree of belief $\mu_i$ of a possible event as described in Section 7.2. We denote by $E_{\mu_i}$ the energy cost in Joules per CPU instruction corresponding to the executable inference algorithm for the local degree of belief per node $i$. Moreover, when nodes initiate a clustering era, some nodes are appointed as cluster-heads after computing their EP values. During a clustering era, a node is either dynamically appointed as a CH or acts as a member. In a clustering era, the energy for in-cluster communication $E_{c,i}$, in Joules per bit, is the energy consumption incurred on node $i$ by transmitting (TX) and receiving (RX) election messages. After the election, each CH node has to calculate the aggregated degree of belief $\hat{\mu}_i$ of its neighborhood, with energy cost $E_{\hat{\mu}_i}$ in Joules per CPU instruction, and then transmit (TX) this value to its assigned concentrator, thus incurring an additional communication cost $E_{CH,i}$.
Let the CH indicator $J_i = 1$ if node $i$ is appointed as a CH after a clustering era; otherwise $J_i = 0$ when node $i$ is a cluster member. Then, we define the total cumulative energy consumption $C_i$ per node $i$ as the cumulative computation consumption for event inference and/or aggregated degree of belief, plus the communication consumption for clustering and for transmitting the aggregated degree of belief to the concentrators (in the case of CHs only), up to time instance $t$, where $E_0$ is the energy cost for node $i$ transiting from idle to standby operational modes [36]. Up to time instance $t$, the communication and computation costs for all nodes and the overall cost are the sums of the corresponding per-node costs over all nodes. For the sensing, communication and computation energy consumption, we adopted the energy model from the Mica2 sensor board (see footnote 4). This energy model assumes two AA batteries that supply approximately 2200 mAh with an effective average voltage of 3 V. The board consumes 20 mA if running a sensing application continuously. The communication cost for transmitting (TX) a bit is 720 nJ/bit and for receiving (RX) a bit is 110 nJ/bit. Moreover, the packet header of the communication protocol adopted by Mica2 is 9 bytes (MAC header and CRC) and the maximum payload is 29 bytes. Therefore, the per-packet overhead equals $9/(9 + 29) \approx 23.7\%$ (lowest value). For each transmitted data value, i.e., a value component $x$ of a $d$-dimensional vector $x$ and the EP value in an election message, the assumed payload is set to 4 bytes (floating point number) and 2 bytes, respectively. Finally, the energy cost for a single CPU instruction (energy per instruction) is 4 nJ/instruction in Mica2. Table 4 shows all the energy consumption values, in nJ per bit for communication and in nJ per CPU instruction for computation.
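The arithmetic behind the Mica2 figures can be reproduced with a few lines. The constants are the ones quoted in the text; the helper names are hypothetical, and charging the 9-byte header on every message is an assumption of this sketch rather than a stated part of the energy model.

```python
HEADER_BYTES = 9           # MAC header and CRC
MAX_PAYLOAD_BYTES = 29
TX_NJ_PER_BIT = 720
RX_NJ_PER_BIT = 110
CPU_NJ_PER_INSTRUCTION = 4

def packet_overhead():
    """Header share of a full-size packet (the lowest possible overhead)."""
    return HEADER_BYTES / (HEADER_BYTES + MAX_PAYLOAD_BYTES)

def tx_energy_nj(payload_bytes):
    """Energy in nJ to transmit one packet, header included (assumption)."""
    return (HEADER_BYTES + payload_bytes) * 8 * TX_NJ_PER_BIT

def rx_energy_nj(payload_bytes):
    """Energy in nJ to receive one packet, header included (assumption)."""
    return (HEADER_BYTES + payload_bytes) * 8 * RX_NJ_PER_BIT

def battery_energy_joules(capacity_mah=2200, voltage=3.0):
    """Total stored energy: mAh -> A*s (coulombs), times volts."""
    return capacity_mah * 3600 * voltage / 1000

def lifetime_hours(capacity_mah=2200, current_ma=20):
    """Runtime when drawing a constant current."""
    return capacity_mah / current_ma
```

Under these assumptions an election message carrying a 2-byte EP value costs 63,360 nJ to send and 9,680 nJ to receive, the batteries hold about 23.76 kJ, and continuous sensing at 20 mA would drain them in roughly 110 hours.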

Performance metrics
We assess the performance of our mechanism in terms of: (i) probability of false (erroneous) event inference φ ∈ [0, 1], (ii) event time index τ ∈ T of recognizing an event, (iii) communication overhead (number of aggregated degree of belief messages) M required for CHs to inform the concentrators for event inference, (iv) energy consumption for event inference C p and communication cost C c per node i and the total IoT environment, and (v) efficiency of our mechanism in delivering event inference with a low false rate being communication and energy aware.
The false probability $\varphi$ represents the rate of erroneous inference (false alerts) that the mechanism generates, defined as the ratio of the number of false alerts out of the total number of inference results. Note that event inference is obtained at every time $t \in \mathbb{T}$ corresponding to the reception of a context vector $x$ at any node $i$. A value of $\varphi \to 1$ indicates a high rate of false alerts; thus, no conclusion can be drawn about the true state of the phenomenon.
Footnote 4: http://www.tinyos.net/scoop/special/hardware#mica2platform
The event time index τ ∈ T refers to the time index of the measurement that actually corresponds to an event. Through that metric, we assess how 'close' to the real case an event is inferred by our mechanism; not at early stages in order to avoid false alerts and not many stages after the real event. The τ is evaluated by the rate of the identification for real events.
The number of messages $M$ refers to the total number of messages (aggregated degrees of belief) sent from CHs to their concentrators, including the total number of messages sent for the belief-centric clustering. The lower $M$ is, the fewer energy resources are spent on communication. Let us denote the lifetime of the entire network as $T$ (in terms of energy) and let $N_{CH}$ be the set of CHs, i.e., $|N_{CH}| \leq |N|$. Since at each clustering era $T_1, T_2, \ldots$ our mechanism assigns certain nodes as CHs, the number of clustering eras realized within the network lifetime $T$ is determined by the expected number of clustering initiations out of the total number of observations. By adopting our belief-centric clustering, only $|N_{CH}|$ messages with aggregated degrees of belief are delivered to the concentrators per clustering era to keep the concentrators up-to-date about the event inference, along with $O(|N|)$ messages circulated locally for building the clusters, as proved in Lemma 2. Without clustering, all nodes would send their $\mu$ values to the concentrators, i.e., $|N|$ messages per clustering era. The energy consumption $C_p$ refers to the energy consumed for computational processing per node $i$ to locally infer the degree of belief after observing a $d$-dimensional context vector. The energy consumption $C_c$ refers to the communication overhead cost for nodes during the clustering eras due to the message exchange for CH election. These messages include the EP values. Moreover, this cost includes the energy consumption for the appointed CHs to transmit the aggregated degrees of belief (from their neighborhood) to their concentrators. The energy model for computation and communication derives from the Mica2 energy model presented in Section 9.2. Finally, we define as efficiency the total amount of energy consumed by our mechanism, $C = C_p + C_c$, to deliver event inference with a low false rate $\varphi$.
We desire to obtain a low energy expenditure along with a low false rate. We compare our mechanism with other mechanisms in terms of energy consumption (communication and computation) and efficiency, as shown in Section 10.5.

Experiment setup
We experiment with a real multivariate dataset [38] adopted from the Microsoft Research open datasets. The dataset contains meteorological data retrieved in the cities of Beijing and Shanghai. The collected context variables are: temperature, humidity, barometric pressure and wind strength. In our experiments, we adopt 2-dim. context vectors with x_1 = 'temperature' and x_2 = 'humidity' recorded by |N| = 50 nodes deployed in the field and observe 50,000 context vectors. We consider one observation at each discrete time instance t ∈ T and assume one concentrator acting also as the back-end system for those nodes. All vectors are scaled, i.e., x ∈ [0, 1] × [0, 1].
In the dataset, no hazardous events are identified, i.e., the probability of a true event is zero. To define an event, we exploit the expert knowledge in [25] stating that a high temperature, e.g., around 600 Celsius, along with a low humidity, e.g., below 30%, defines a fire incident. Firstly, we inject 'faulty' values to examine whether our mechanism produces erroneous inferences/false alerts. Our target is to obtain φ → 0. To simulate a setting where nodes deliver faults/outliers, we randomly inject faulty measurements as indicated by the 'faulty rules' in [26] with some fault probability p_F > 0. On a node i, an actual temperature value x_1 at time t is replaced as x_1 ← (1 + a_F) x_1 and a humidity value as x_2 ← (a_F / (1 + a_F)) x_2, with a_F ∈ {2, 3, 5}, and we assume different faulty probabilities p_F ∈ {5%, 10%, 20%, 40%, 60%, 80%}. In addition, we inject a set of fire events represented by a state temperature value v_1 close to 1 and a state humidity value v_2 close to zero, as depicted in [25]. Note that we increase the temperature value and decrease the humidity value of the same context vector. The event time index τ_k of a predefined fire event E_k is pre-recorded. We define 10 fire events randomly placed in the dataset, where the duration of each event is drawn from an Exponential distribution with a mean of 10 time units. Through this setup, we examine whether our mechanism is capable of (i) inferring the events E_k given fault probability p_F and (ii) producing a time index of E_k as close to τ_k as possible, i.e., whether the proposed mechanism identifies E_k at the right time.
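The fault-injection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' experimental code: the function names (`inject_fault`, `corrupt_stream`, `sample_event_durations`) are hypothetical, a single a_F value is assumed per run, and no re-scaling to [0, 1] is applied after injection because the text does not specify one.

```python
import random


def inject_fault(x1, x2, a_f):
    """Apply the faulty rule: x1 <- (1 + a_F) x1 and x2 <- a_F / (1 + a_F) x2."""
    return (1 + a_f) * x1, (a_f / (1 + a_f)) * x2


def corrupt_stream(vectors, p_f, a_f, seed=0):
    """Replace each (temperature, humidity) vector by a faulty one with probability p_F.

    Returns the corrupted stream and the number of injected faults.
    """
    rng = random.Random(seed)  # seeded for repeatability (an assumption of this sketch)
    corrupted, n_faults = [], 0
    for x1, x2 in vectors:
        if rng.random() < p_f:
            x1, x2 = inject_fault(x1, x2, a_f)
            n_faults += 1
        corrupted.append((x1, x2))
    return corrupted, n_faults


def sample_event_durations(n_events=10, mean_duration=10.0, seed=0):
    """Draw event durations from an Exponential distribution with the given mean."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_duration) for _ in range(n_events)]


if __name__ == "__main__":
    stream = [(0.2, 0.6)] * 1000
    _, faults = corrupt_stream(stream, p_f=0.4, a_f=3)
    print("injected faults:", faults)
    print("event durations:", sample_event_durations())
```

Seeding the generator keeps a run repeatable, which is convenient when the same corrupted stream has to be fed to several models for comparison.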
The parameter values are presented in Table 5. Specifically, the default values are: belief threshold 0.7, convergence threshold γ = 0.001, context history m = 10, h = 5 in hNN DES, vigilance percentage a = 0.1 with vigilance threshold ρ = (ad)^(1/2) = 0.44 for 2-dim. context, initial learning rate η = 0.5, assignment probability factor β = 0.1, risk factor α = 0.95, opinion factor r = 0.5, and number of FIRs R = 27. The justification of those values is discussed in the remainder.

Comparison models
We compare our mechanism, hereinafter referred to as Model (M), with the local Voting Scheme (VS) and the centralized Aggregation Scheme (AS).
In the local VS model, a node i locally infers an event at time t based only on the expert knowledge fusion function, i.e., when the inference policy in (38) holds true, thus neglecting all other CPs in reasoning about the final decision. Then, each node i transmits only its inference result (event vote) to a central node, which gathers the votes and centrally infers an event based on the majority of votes.
In the centralized AS model, each node i transmits its current context data vector x_i(t) to the central node. The central node then aggregates all the received context data vectors from the |N| nodes and centrally infers an event based on the inference policy in (39), where v_i(t) is the state context vector corresponding to node i's context and g{·} is the average operator.
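The two baseline policies can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: `toy_fusion` is a hypothetical logistic stand-in for the expert fusion function f(v) (its exact form is not reproduced in this section), the decision threshold `theta` is hypothetical, and the function names are not the authors'. Only the structure follows the text: VS takes a majority over local votes, AS applies the fusion function to the averaged state vector.

```python
import math


def toy_fusion(v, lam1=0.7, lam2=2.0):
    """HYPOTHETICAL stand-in for f(v): logistic in a fire score with
    threshold lam1 and steepness lam2. Not the paper's fusion function."""
    score = 0.5 * (v[0] + (1.0 - v[1]))  # high temperature, low humidity
    return 1.0 / (1.0 + math.exp(-lam2 * (score - lam1)))


def vs_infer(states, fusion=toy_fusion, theta=0.5):
    """Voting Scheme: each node votes locally, the central node takes the majority."""
    votes = [fusion(v) >= theta for v in states]
    return sum(votes) > len(votes) / 2


def as_infer(states, fusion=toy_fusion, theta=0.5):
    """Aggregation Scheme: the central node averages the state vectors (g = mean)
    and applies the fusion function once to the average."""
    n = len(states)
    avg = (sum(v[0] for v in states) / n, sum(v[1] for v in states) / n)
    return fusion(avg) >= theta


if __name__ == "__main__":
    fire = [(0.95, 0.05)] * 5
    normal = [(0.30, 0.60)] * 5
    one_faulty = [(0.95, 0.05)] + [(0.30, 0.60)] * 4
    for name, states in [("fire", fire), ("normal", normal), ("one faulty", one_faulty)]:
        print(name, "VS:", vs_infer(states), "AS:", as_infer(states))
```

Note the difference in what travels over the network: VS nodes send one vote each, whereas AS nodes send their full context vectors, which is why AS carries the highest communication load in the comparison.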

Quality of event inference
We analyze the event inference performance of model M for different numbers of nodes |N|, faulty probabilities p_F, fusion parameter λ_2, belief threshold, and vigilance ρ. In Table 6, we examine the robustness of model M in terms of false rate φ for different values of faulty probability p_F, 5% ≤ p_F ≤ 80%, and number of nodes |N|. Model M is robust, achieving a very low φ (less than 1.5%) even for data streams involving a huge number of faulty values, i.e., p_F = 80%. This indicates the capability of model M to reason under uncertainty through the involvement of the three CPs. Moreover, the knowledge fusion of the local degrees of belief depends on the number of opinions, i.e., the number of nodes involved in the event reasoning. The higher |N| is, the lower φ becomes. The reason is that each node i locally processes context and infers an event w.r.t. the three CPs and shares its local view/degree of belief with its neighbors through our CH-based consensus approach. Then, by voting among those aggregated degrees of belief μ that actually relate to an event (sent only by CHs), the back-end system concludes on that event with high accuracy. Model M takes into consideration the groups' perspectives, i.e., an event is locally agreed on at a CH only when a large percentage of neighboring nodes support that event's presence. When |N| increases, the team is more 'compact', meaning that many more nodes support an event's presence with more certainty, in contrast to the case where |N| is small. In the case where only one node i is present, model M is based on node i's belief alone; thus, false alerts could arise more easily (as node i could have a faulty view on an event's presence).
In addition, we examine the impact of |N| on the time lag τ between the actual event time index and the identified/inferred time index. Model M obtains an average time lag τ = 2.2 time units with standard deviation σ_τ = 0.77 for 5 ≤ |N| ≤ 50. This indicates that all events are identified in near real-time. Table 7 presents the effect of the expert knowledge fusion (CP1) on producing false alerts. Recall that expert knowledge fusion depends on parameters λ_1 and λ_2 that affect the result for f(v). Of these two parameters, λ_1 'defines' the threshold value of the fusion function as provided by the expert, while λ_2 defines the steepness of the function. We experiment with the steepness λ_2 ∈ {2.0, 4.0, 6.0} for a fixed threshold λ_1. We observe in Table 7 that a high λ_2 results in a high false rate φ, while for λ_2 = 2.0 the false rate φ is limited (equal or very close to 0). A high λ_2 leads to a more 'relaxed' identification of the event. However, this leads to an increased number of false alerts by overestimating the CP1 f(v) at the expense of the other two CPs (error e and assignment probability p(w*|x)), which is passed to the T2FLS engine. A low λ_2 value regulates the impact of CP1 on the other two CPs; thus, model M exploits all CPs to avoid a high rate of erroneous inference results.
We also examine the impact of the belief threshold on model M in terms of false rate φ. Table 8 shows the results. In Table 8, we observe the results for a belief threshold of 0.9. In this case, φ is minimized, especially when |N| > 5. A high belief threshold makes event inference more insensitive and events harder to discriminate; thus, a limited number of nodes agree on an event's presence. This behavior has obvious consequences on the identification of real events, as τ becomes high. We set the belief threshold to 0.7 in our experiments, as explained later. In addition, we experiment with the average number of context patterns K per node that are required to quantize the vector data space to materialize CP2 and CP3. Table 9 shows the number of K patterns (mean value avg(K) and standard deviation σ_K out of |N| = 50 nodes) that quantize the context spaces needed for outliers and novelty detection against the vigilance percentage a, i.e., ρ = (ad)^(1/2). A low a value, which corresponds to a low ρ, results in a high quantization resolution in terms of patterns; a high number of patterns are generated to better represent the vector space. This, however, comes at the expense of a high number of patterns that need to be stored on a node. But, even in the case of a = 0.1, this number is significantly low (K ∼ 49). Hence, to achieve highly accurate inference results and keep model M up-to-date w.r.t. novelty vector subspaces, we set a = 0.1.
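The relation between the vigilance percentage a and the number of patterns K can be illustrated with a minimal sketch. It assumes a simplified vigilance-based online quantizer, which is not the paper's Algorithm 1: a new pattern is created when the closest pattern is farther than ρ = (ad)^(1/2), otherwise the closest pattern moves toward the input with a constant learning rate η (the paper mentions only an initial learning rate, so the constant rate is an assumption). The data are uniform random vectors in [0, 1]², not the meteorological dataset, and the function names are hypothetical.

```python
import math
import random


def vigilance_threshold(a, d):
    """rho = (a * d)^(1/2) for vigilance percentage a and dimension d."""
    return math.sqrt(a * d)


def quantize(stream, rho, eta=0.5):
    """Simplified vigilance-based online vector quantization (assumed form)."""
    patterns = []
    for x in stream:
        if not patterns:
            patterns.append(list(x))  # the first vector becomes the first pattern
            continue
        # Index of the closest pattern in Euclidean distance.
        k = min(range(len(patterns)), key=lambda j: math.dist(x, patterns[j]))
        if math.dist(x, patterns[k]) > rho:
            patterns.append(list(x))  # novelty: no pattern within the vigilance radius
        else:
            # Move the winning pattern toward x by a fraction eta.
            patterns[k] = [w + eta * (xi - w) for w, xi in zip(patterns[k], x)]
    return patterns


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.random(), rng.random()) for _ in range(2000)]
    for a in (0.01, 0.1, 0.35):
        rho = vigilance_threshold(a, d=2)
        print(f"a={a:<5} rho={rho:.3f} K={len(quantize(data, rho))}")
```

A smaller a shrinks the vigilance radius, so more inputs fall outside every existing pattern's neighborhood and K grows, which mirrors the storage versus resolution trade-off discussed above.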

Communication & computation cost
In terms of communication overhead (number of messages circulated in the IoT environment), we examine the capability of model M to achieve low false rates by transferring to the concentrator not the context data but only the minimal sufficient knowledge for event reasoning, as a function of the belief threshold. Table 9 shows the impact of the belief threshold on: (i) the number of messages M, (ii) the average number of CHs |N_CH| per clustering era, and (iii) the number of clustering eras T. A belief threshold close to the cutoff value of 0.5 results in many clustering eras (T > 1000); thus, many messages are sent between clusters and from CHs to concentrators, along with a high φ value (see Table 8). Evidently, a belief threshold greater than 0.5 is adopted to narrow and clarify the inference results. On the other hand, with a high belief threshold, model M increases its tolerance in assessing an event's presence, thus being communication efficient. However, in this case, events are difficult to identify, which does not reflect the actual situation on the IoT network. To balance between communication load, accuracy of inference, and capability of event identification, we set a belief threshold of 0.7 in our experiments. For a belief threshold of 0.7, model M initiates T = 10 clustering eras in which 9% of nodes (CHs) transfer their aggregated knowledge to concentrators, achieving a low φ value. In all these 10 clustering eras, the nodes successfully detect all 10 events.
In terms of energy consumption due to the computational cost of local inference and the communication cost for each clustering era, we present in Fig. 2 (left) the total cost C_M in nJoule and its breakdown into the processing and communication costs given in (34) and (37), respectively, for |N| = 50 nodes, after observing 10,000 context vectors each. We observe that the energy consumed for computation is lower than the communication cost, which indicates the advantage of distributed inference in reducing the network overhead. This is attributed to the fact that the energy required to transmit (TX) and receive (RX) pieces of data is higher than the energy consumed for local computations in each node. Moreover, our proposed Algorithms 1 and 2 are on-line, incremental learning algorithms, thus providing a lightweight solution for localized context inference, which is our major goal: 'to push the intelligence to the edge of the network', reducing unnecessary data transfer to the back-end system and/or to the concentrators. Since each node i can locally reason about a contextual event, instead of transmitting the actual sensed multidimensional contextual data towards a centralized system (as discussed later in the comparative assessment in Section 10.5), it transmits, only if needed, the local inference results, i.e., the degree of belief. Moreover, the proposed clustering scheme involves localized message exchange among neighboring nodes, which further reduces the network overhead by avoiding the transmission of data values from the edge of the network to the concentrators. Even in this localized information dissemination process, the nodes transmit only inferred knowledge, i.e., local degrees of belief and not data values. The appointed CHs are the only nodes responsible for transmitting the aggregated degrees of belief to the concentrators, and they correspond to 9% of the total number of nodes in the network.
By pushing this intelligence to the edge, the computational cost amounts to 30% of the total consumed energy, while the remaining portion is devoted to localized communication during the clustering eras plus the communication of the CHs with the concentrators, as shown in Fig. 2 (right). Figure 3 (left) shows the total consumed energy C_M for different numbers of nodes |N|, while Fig. 3 (right) illustrates the impact of the vector quantization (vigilance percentage a) in each node i on the processing/computation cost C_p out of the total cost C_M for different numbers of nodes. It is worth mentioning that when we increase the resolution of the vector quantization, i.e., the number of patterns K that can be estimated during the vector quantization process (Algorithm 1), node i spends more energy on computation. This corresponds to identifying the closest pattern and calculating the assignment probability. Obviously, the more patterns each node derives from the quantization process, the higher the quality of inference, however, at the expense of the computational energy consumption. Nonetheless, the quality of inference is related to reducing the false rate φ. To achieve a significantly low φ value, i.e., φ < 0.001, our mechanism requires a vigilance percentage a = 0.35. In this case, the processing ratio is approximately 30% of the total energy consumption. There is then a trade-off between quality of inference (due to high quality of vector quantization) and the energy required for achieving this high quality. Our mechanism is flexible in tuning this trade-off (as shown in Fig. 6) and attains the lowest false rate while being energy efficient (in both communication and computation) compared with the VS and AS models described in Section 10.5.

Comparative assessment
We compare model M with the models VS and AS, where their inference policies are provided in (38) and (39), respectively, focusing on: (i) quality of event inference, (ii) energy consumption in terms of computational cost and communication overhead, and (iii) efficiency.

Comparison in quality of inference
For the quality of inference, we evaluate the false rate of each model given a probability of faulty data values to examine their robustness. In the comparison experiments, we take |N| ∈ {5, 10, 50}. Table 10 shows the false rate φ for |N| ∈ {5, 10, 50} and different p_F values. We observe that model M outperforms the VS and AS models when p_F ≥ 40%. This is interesting, as it shows that model M achieves a bounded erroneous inference probability even when nodes experience multiple faulty measurements. For p_F = 80%, indicating high uncertainty, model M achieves 80.00% and 82.76% fewer false alerts compared to VS and AS, respectively. We can also observe from Table 10

Comparison in energy consumption, cost & efficiency
Model M obtains significantly low false rates with a significantly low number of messages sent from CHs to the back-end systems, compared to the VS and AS models. Specifically, for model M there are T = 13 clustering eras out of the total 5 · 10^4 observations, and we obtain number of messages M = (2.32 · 10^6, 24.6 · 10^4, 3.42 · 10^3) for model AS, model VS and model M, respectively (we obtain, on average, |N_CH| = 4.65 cluster heads per clustering era). This indicates that, even for uncertain and faulty data streams, i.e., p_F > 40%, model M achieves an 83.67% lower false rate than both the AS and VS models while requiring three and one orders of magnitude less communication overhead, respectively.
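The communication gap can be checked directly from the reported message counts. The short sketch below uses only the three values of M quoted above; the helper name `orders_of_magnitude` is hypothetical.

```python
import math

# Number of messages reported for each model.
messages = {"AS": 2.32e6, "VS": 24.6e4, "M": 3.42e3}


def orders_of_magnitude(baseline, ours):
    """Base-10 logarithm of the ratio between two message counts."""
    return math.log10(baseline / ours)


gap_as = orders_of_magnitude(messages["AS"], messages["M"])
gap_vs = orders_of_magnitude(messages["VS"], messages["M"])

if __name__ == "__main__":
    print(f"AS / M: ratio {messages['AS'] / messages['M']:.0f}, log10 = {gap_as:.2f}")
    print(f"VS / M: ratio {messages['VS'] / messages['M']:.0f}, log10 = {gap_vs:.2f}")
```

The ratios come out at roughly 678 for AS and 72 for VS, i.e., close to three orders of magnitude against AS and between one and two against VS.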
In terms of energy consumption in computation and communication, Fig. 4 (left) shows the total cost C_M, C_VS, and C_AS for model M, VS, and AS, respectively, for |N| = 50 in logarithmic scale. Our model saves energy by at least two orders of magnitude compared to the localized inference model VS and the centralized inference model AS. This supports the vision of pushing intelligence to the edge of the network by exploiting the computing capability of the nodes to infer events, thus avoiding data transfers from the source of information to the back-end systems. Moreover, even in the case of the localized VS model, our model requires significantly less energy (two orders of magnitude), since the 'instant' inference achieved by a node executing the VS model is in many cases erroneous compared to our model. We address this by introducing intelligent context reasoning processes such that the CHs infer an event only if the neighboring nodes reach a consensus, thus minimizing the false rate. This, however, requires some additional computational cost and communication. But, as illustrated in Fig. 4 (right), the ratio of the energy consumed by our model is ∼10^−3 and ∼10^−2 of the energy consumed by the centralized and localized models, respectively.
In Fig. 5 (left) we examine the scalability of our model in terms of the number of CHs as a percentage of the total number of nodes, in comparison with the AS and VS models. Specifically, we present the total consumed energy (computation and communication) for a CH percentage |N_CH| ranging from 10% to 100% of the total number of nodes |N|. We observe a significantly lower impact on the total cost compared with the other models. Moreover, the case where |N_CH| = |N| depicts the capability of each node to infer an event as accurately as possible, thus minimizing the φ value. It is worth comparing this scalability performance with the VS model, where all the nodes act independently based on the inference policy in (38). This indicates the capability of our model not only to scale with the number of CHs but also to deliver inference results of high quality. Figure 5 (right) shows the impact of the belief threshold on the consumed energy for all the models with |N| = 50. The higher the belief threshold, the less sensitive each model is in inferring an event. However, this comes at a lower cost, since both model M and model VS avoid inferring events, thus reducing the communication with the back-end system (transmitting the inference results from the CH nodes in model M and from the individual nodes in model VS). Evidently, model AS is not influenced by this threshold, since the nodes just deliver the sensed contextual vectors and do not perform any computations. On the other hand, this has a high impact on the quality of inference, which is quantified by the φ value. Given a significantly low value of φ, we set the belief threshold to 0.7, which results in three and two orders of magnitude less energy consumption for our model compared with the AS and VS models, respectively.
In that case, we define the efficiency indicator to examine the consumed energy of each model and its corresponding performance in terms of quality of inference. Figure 6 shows the total energy consumption for all models against false rate φ for data faulty probability p_F ∈ {40%, 80%}. It is worth noting the efficiency of our model compared to the AS and VS models: it achieves a very low false rate with the lowest total energy consumption by a significant margin. The AS model appears to be the least efficient in terms of energy consumption and achieved false rate, while model VS is moderately efficient for p_F = 40%. When the faulty probability is high, model VS significantly increases its false rate, due to the lack of any reasoning algorithm to deal with highly faulty data values, while it also consumes significantly more energy than model M. Model AS cannot reduce its energy consumption even if the data faulty probability decreases, since that model does not take into consideration any characteristic of the captured contextual data streams (it only forwards data vectors to the back-end system). Our model appears very robust in terms of efficiency even if p_F is high. Overall, our concept of pushing predictive intelligence and data processing to the edge devices offers: (i) accurate event inference close to the source of the information, (ii) significantly low communication overhead through localized belief-centric groupings, thus avoiding data transfer to the back-end systems, and (iii) energy-efficient and robust inference in terms of data faulty probability.

Conclusions
We propose a novel federated event reasoning scheme by pushing predictive intelligence to the edge of the IoT network. This is achieved by an energy-efficient, real-time event reasoning mechanism, where data processing and predictive intelligence are pushed to the edge devices equipped with sensing and computing capabilities. Edge predictive intelligence and collaborative reasoning are materialized by the autonomous nature of nodes to locally perform data sensing & inference, and convey only inferred knowledge to their neighbors and concentrators. Nodes possess intelligence to reason about events, thus avoiding transferring raw data, while the complexity of inference is physically distributed to the sources of contextual information. Nodes are capable of locally processing and inferring events from contextual data streams enhanced with different context perspectives: predicted context, outliers context inference, and context fusion. The approximate event inference of each node is derived through Type-2 Fuzzy Logic inference to handle uncertainty. Finally, a knowledge-centric clustering scheme is introduced, where the clusters of nodes are formed according to their degrees of belief. The cluster heads then disseminate the minimal sufficient knowledge to the concentrators/systems for event inference.
We provide mathematical analyses of our statistical learning and stochastic optimization models, asymptotic complexities and energy consumption models for computation and communication cost, evaluate the model's performance, and provide a comprehensive comparative assessment with other local & centralized event inference mechanisms. The evidence shows that exploiting the computing and sensing capabilities of nodes to provide 'intelligence at the edge' is appropriate for real-time applications in IoT environments.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

closest w_k with probability 1, that is, P(‖x − w_k‖ ≤ ρ) = 1, which means that no other patterns are generated. By Markov's inequality, P(‖x − w_k‖ ≥ ρ) ≤ E[‖x − w_k‖]/ρ or, equivalently, P(‖x − w_k‖ < ρ) ≥ 1 − E[‖x − w_k‖]/ρ. To obtain P(‖x − w_k‖ ≤ ρ) → 1, we need either ρ → ∞ or E[‖x − w_k‖] → 0. However, since ρ is a small real number that expresses the concept of neighborhood, we require that E[‖x − w_k‖] → 0, i.e., E[(x − w_k)] = 0 or E[Δw_k] = 0, which completes the proof.

A.2 Proof of Theorem 3
Let the pattern w_k reach equilibrium, i.e., Δw_k = 0, which in this case holds with probability 1. Then, from the update rule in Theorem 2, by taking the expectation of both sides of Δw_k = 0 at equilibrium we have that: By solving E[Δw_k] = 0, w_k = x̄ = ∫_{D_k} x p(x) dx, i.e., the centroid of all vectors in D_k.

A.3 Proof of Theorem 4
We have to prove that the optimal stopping time t* exists and is derived from the principle of optimality, i.e., prove that (i) lim_{t→∞} sup_t Y_t ≤ Y_∞ a.s. and (ii) E[sup_t Y_t] < ∞. Note that I_t are non-negative and from the strong law of large numbers (

A.5 Proof of Lemma 1
Consider a multiplication factor χ > 1 and that node i starts with the minimum EP of being a cluster head, i.e., ξ_i = ξ_min > 0. Since at each iteration step the node just multiplies its current EP ξ_i by χ, in the worst case that node will be either a cluster head or a member when the process stops at the first iteration step L such that χ^(L−1) ξ_min ≥ 1. That is, the maximum number of iteration steps is L = min{ℓ > 0 : χ^(ℓ−1) ξ_min ≥ 1}. Hence, the required number of iterations is L = ⌈log_χ(1/ξ_min)⌉ + 1, which maps to O(1) iterations. Now, if node i starts the election process with ξ_i > ξ_min, then O(1) iterations are the maximum number of steps for the election process.

A.6 Proof of Lemma 2
In the election process, a node which is about to become a cluster head generates at most L = O(1) messages. On the other hand, a node which is about to become a member delays sending messages and sends just one message to join its cluster head after considering itself a 'non-cluster head'. Obviously, the number of those messages (member messages) is strictly less than |N|, since at least one node will decide to be a cluster head. Hence, the number of messages exchanged in the network is upper-bounded by L · |N|, which is O(|N|).
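The iteration bound of Lemma 1 and the message bound of Lemma 2 can be checked numerically with a minimal sketch. It assumes the election rule exactly as stated in the proofs (the EP is multiplied by χ at each step until it reaches 1) and takes the closed form with a ceiling, L = ⌈log_χ(1/ξ_min)⌉ + 1. The function names and the example values of χ and ξ_min are illustrative choices, not values from the paper.

```python
import math


def election_steps(xi_min, chi):
    """First step L such that chi^(L-1) * xi_min >= 1, found by iteration."""
    if not 0 < xi_min <= 1 or chi <= 1:
        raise ValueError("expected 0 < xi_min <= 1 and chi > 1")
    step, ep = 1, xi_min
    while ep < 1.0:
        ep *= chi  # the node multiplies its current EP by chi
        step += 1
    return step


def closed_form_steps(xi_min, chi):
    """Closed form L = ceil(log_chi(1 / xi_min)) + 1."""
    return math.ceil(math.log(1.0 / xi_min) / math.log(chi)) + 1


def max_messages(n_nodes, xi_min, chi):
    """Upper bound L * |N| on the messages exchanged during one election."""
    return election_steps(xi_min, chi) * n_nodes


if __name__ == "__main__":
    for xi_min, chi in [(0.1, 2.0), (0.05, 1.5), (1.0, 2.0)]:
        print(xi_min, chi, election_steps(xi_min, chi), closed_form_steps(xi_min, chi))
    print("message bound for |N| = 50:", max_messages(50, 0.1, 2.0))
```

Because L depends only on χ and ξ_min and not on the network size, the bound L · |N| grows linearly in |N|, which is the O(|N|) claim of Lemma 2.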