Introduction

Real-time streaming empowers an organization to process live data feeds generated through an online data production system [1]. In the late 1990s, the American computer scientist Peter J. Denning presented a streaming idea in which in-process bits are saved so that complex calculations can be solved much faster than with traditional machine processing. This method helps create, process, and observe the data stream of an instrument and generate a statistical result set [2]. Nowadays, streaming appears in several enhanced forms, such as live radio, streaming media, HTTP-based adaptive streaming, instant streaming services, HTTP live streaming, HD streamed video, Full HD (1080p), and streamed 4K content [3]. Record-keeping, on the other hand, also adopted streaming techniques to build various data stream management systems such as STREAM, Aurora, TelegraphCQ, and NiagaraCQ [4].

These management systems store data records using one-time queries, long-running queries, dataflow queries, and query streams; however, it becomes complex for them to manage large-scale dataset queries in a heterogeneous distributed computing environment [5]. This complexity also includes managing an enormous number of indices in non-tabular datasets, which ultimately gave rise to the concept of big data management systems that can handle large-scale datasets [6]. Several enterprises in the market offer big data management systems, such as SQLstream [7], TIBCO [8], IBM [9], Striim [10], and the Apache Software Foundation [11]. Among these, the Apache group offers several open-source big data stream engines licensed under the Apache License 2.0, i.e., Flume [12], Spark [13], Storm [14], NiFi [15], Apex [16], Kafka [17], Samza [18], Flink [19], Beam [20], and Ignite [21], which include various streaming features as shown in Table 1.

Table 1 IoT-based application attribute feature model

These streaming engines are programmed to handle several forms of data types, such as structured, unstructured, and semi-structured data [22]. These data types are generated through sources that include sensory devices and web-based intelligent portals [23]. An Internet of Things (IoT) device is a sensory device that consists of an intelligent processor, a sensor to detect and store records in its cache storage, and an interface to exchange datasets with global networks [24]. Such a device also generates a continuous flow of data that requires persistent storage, and streaming engines categorize its data into three forms: unprocessed, processed, and replicas [25]. The unprocessed data is a non-filtered collection that holds an association of tuples with indices only, whereas the processed data is the result of queries executed on the unprocessed data. The replica is a block of processed data ready to be exchanged with streaming engines to perform real-time analytics in a distributed computing environment [26], as shown in Fig. 1.

Fig. 1 IoT dataset categorization through Stream Engines

IoT devices also generate several metadata events, e.g., monitoring the temperature of factory devices through smart meters, recording a credit card transaction, and detecting an unwanted object in a surveillance camera [27]. These events are a crucial part of the metadata, along with logs and routing-path information, and direct streaming queries to identify data tuples in the repository [28]. By default, streaming through Apache engines involves a few steps: (i) stream sourcing, (ii) stream ingestion, (iii) stream storage, and (iv) stream processing [29]. Stream sourcing represents an IoT device that provides a continuous flow of datasets, and stream ingestion consumes the same source data chunks to queue the tasks inside a streaming engine systematically. The stream storage then formulates a micro-batch, a collection of live data feed of adequate size s gathered sequentially over time t, and stream processing lets the system execute queries and retrieve a real-time result set [30], as shown in Fig. 2.
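As a minimal sketch of these four steps (assuming a local Spark installation and using Spark's built-in rate source as a stand-in for an IoT feed; names and durations are illustrative), the following PySpark snippet ingests a stream, groups it into micro-batches over a window of time t, and runs a query per batch:

```python
# Minimal sketch of stream sourcing -> ingestion -> storage -> processing.
# Assumes a local Spark installation; the "rate" source stands in for an IoT feed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.appName("iot-microbatch-sketch").getOrCreate()

# (i) stream sourcing: a continuous flow of (timestamp, value) rows
source = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# (ii)/(iii) stream ingestion and storage: rows are queued and grouped into
# micro-batches covering a window of size t (here 10 seconds)
batched = source.groupBy(window("timestamp", "10 seconds")).agg(avg("value"))

# (iv) stream processing: execute the query per micro-batch and emit results
query = (batched.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())
query.awaitTermination(30)  # run for 30 seconds in this sketch
```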

Fig. 2 Default Directed Acyclic Graph (DAG) workflow in Streaming Engine

The data transformation phase divides micro-batches into four further subtypes, i.e., local generation, file system (HDFS) generation, dataset-to-dataset generation, and cache generation [31]. This transformation process is considered lazy because it only records an abstract extraction of datasets without performing any real action. Thus, stream processing requires a task-route mapper that redirects the dataset extraction for each query to the respective repository. For this, the streaming engine uses a built-in feature, the directed acyclic graph (DAG), which extracts micro-batches into the respective column fields without directed cycles [32]. The DAG workflow consists of n MapReduce stages and transforms micro-batches through a scheduler, which transports the dataset through resource allocations using stage functions. By default, a simple DAG consists of Stage0→1 stages, whereas a multi-purpose DAG involves Stage0→n stages to transform a stream into a dataset, as shown in Fig. 2a and b.
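The lazy character of these transformations and the resulting stage boundaries can be illustrated with a small PySpark sketch (the sample records are illustrative): flatMap and map only record lineage within Stage 0, reduceByKey introduces the shuffle boundary that opens Stage 1, and only the final action triggers execution.

```python
# Sketch of lazy DAG construction in Spark (illustrative data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["error sensor-1", "backup sensor-2", "info sensor-1"])

# Transformations are lazy: nothing executes yet, the DAG only records lineage.
pairs = (lines.flatMap(lambda line: line.split())        # Stage 0: flatMap
              .map(lambda token: (token, 1)))            # Stage 0: map

# reduceByKey introduces a shuffle boundary, i.e. a second stage (Stage 0 -> 1).
counts = pairs.reduceByKey(lambda a, b: a + b)

# Only the action below forces the scheduler to materialize the stages.
print(counts.collect())
```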

This workflow facilitates the extraction of live queries from a micro-batch; however, it does not recognize the type of IoT data tuples during micro-batch formation. Thus, when processing IoT stream events, it encounters four problems: (i) homogeneous micro-batches, (ii) dataset diversification, (iii) heterogeneous data tuples, and (iv) a linear DAG workflow [33].

This article proposes an IoT-enabled Directed Acyclic Graph (I-DAG) for heterogeneous stream events that minimizes the processing discrepancy issue in data transformation. The presented I-DAG enhances the workflow operation by reading labeled stream tags in heterogeneous event stream containers and scheduling workflow task processing in a Spark cluster. Thus, I-DAG adds the capability of processing IoT tuples while preserving the existing DAG properties mentioned below.

The significant contributions of I-DAG are highlighted as:

  • A novel event stream tag manager

  • A novel parser to filter heterogeneous event streams in the stream engine

  • An innovative workflow manager that bypasses the unnecessary tasks queued in the stages of the MapReduce operation:

    • Stage0→1 I-DAG workflow

    • Stage0→n I-DAG workflow

The remainder of the paper is organized as follows. The “Motivation” section discusses the benefits and complications; the “IoT-enabled directed acyclic graph (I-DAG)” section explains the proposed I-DAG model that addresses the motivation; the “Performance evaluation” section presents the experimental evaluation over the Spark cluster; and the “Conclusion” section presents the conclusion with future work.

Motivation

I-DAG is an enhancement of the existing workflow for executing event streams in Spark clusters. Let us discuss the benefits and complications of a smart meter use case in a smart grid.

Smart meters cope with on-ground streaming that includes the continuous submission of record streams for grid analytics. A smart grid evaluates the functional and procedural performance of distribution end units through that stream. It simultaneously observes the performance of the smart meters themselves, i.e., stream accuracy, optimal workload management, and proper functioning of components. A smart grid thus creates a complicated bi-directional processing scenario, where the system confirms the accuracy of a stream through the functionality of the source object. Hence, a smart grid cannot verify the accuracy of streaming analytics through the transformed dataset alone; it must also monitor the error accuracy of the smart meters. It therefore requires a streaming event analyzer that copes with Stage0→n transformations concurrently, and I-DAG provides such features through label-based stream event analytics [34, 35].

Smart meters generate heterogeneous IoT events concurrently through bi-directional streaming, which creates asynchronous problems in the smart grid, i.e., far more metadata than traditional processing produces and overwhelmed analytical accuracy. Thus, when the I-DAG technique is applied, it acquires cache containers to skip the few MapReduce tasks that a developer would usually have to exclude from the programming model by hand [36, 37].

Nowadays, the world is moving towards an unpredictable scale of managing IoT devices and their streaming event analytics. This growth will increase drastically with time, and the demand for resource management will become a vital issue that must be handled on a priority basis. At that point, a customized directed acyclic graph for IoT event stream processing would fulfill this demand. This IoT-enabled directed acyclic graph would address future heterogeneous workflow event stream operations in the Spark cluster [38].

IoT-enabled directed acyclic graph (I-DAG)

From a functional perspective, we divide I-DAG into three sub-components:

  • Label-based event streaming

  • Heterogeneous stream transformation

  • IoT-enabled DAG workflow

Label-based event streaming

Let IoT device events be a sequence of error, backup, and information messages represented as Ei, Bi, and Ii, where each message belongs to a sensory device Devicei in the distributed computing environment, as shown in Fig. 3. At each time interval t, the stream generated through a function fi holds an array of event messages G[1..(Ei,Bi,Ii)] with G[i]=fi. Therefore, when a new occurrence of event messages arrives, the function representation changes to G[i++], and the individual event message collection at each node can be represented as,

$$ G\left[i++\right]=G\left [\left(E_{i}, B_{i}, I_{i}\right) ++\right] $$
(1)
Fig. 3 Label-based Heterogeneous Streaming Workflow

where G[i++] is a container managing the arrival of multiple event messages with x≥0.

In order to approximate the inner function elements of G[i++], implicit vectors x(E[1..n]), y(B[1..n]), and z(I[1..n]) are added into the stream instruction set with proportions (Ei,x)++, (Bi,y)++, and (Ii,z)++, returning an output approximation as,

$$ Event_{m}=\sum_{i=1}^{n}E_{i}\ast B_{i} \ast I_{i} $$
(2)

where Eventm>0 represents the container of processed heterogeneous event messages.
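A minimal Python sketch of such a container (the class and method names are our own illustration, not part of any streaming engine) appends each arriving (Ei, Bi, Ii) message as in Eq. (1) and reports the aggregate of Eq. (2):

```python
# Hypothetical sketch of the event container G[i++] of Eqs. (1)-(2).
class EventContainer:
    def __init__(self):
        self.errors, self.backups, self.infos = [], [], []

    def append(self, e_i, b_i, i_i):
        """G[i++]: store one heterogeneous (E_i, B_i, I_i) arrival."""
        self.errors.append(e_i)
        self.backups.append(b_i)
        self.infos.append(i_i)

    def event_m(self):
        """Approximation of Eq. (2): sum of E_i * B_i * I_i over arrivals."""
        return sum(e * b * i for e, b, i in
                   zip(self.errors, self.backups, self.infos))

g = EventContainer()
g.append(2, 1, 3)   # e.g. 2 error, 1 backup, 3 information messages at time t
g.append(1, 4, 2)
print(g.event_m())  # 2*1*3 + 1*4*2 = 14
```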

Lemma-1: \(SE_{o,p,q}=\sum_{i=1}^{n}\left \{ \left (E_{i}\times N_{o}\right), \left (B_{i}\times N_{p}\right), \left (I_{i}\times N_{q}\right)\right \}\)

The individual data segments Ei, Bi, and Ii arrive at nodes No, Np, and Nq through an incremental function G[i++] that assembles the segments in order of formation. This order summarizes the stream segments in such a way that G[i++] stores SEo,p,q≤0.

Lemma-2: \(E\left[S\right]=PP\left(E_{i},B_{i},I_{i}\right)\)

Since \(SE_{o,p,q}= \sum_{i=1}^{n}\left \{ \left (E_{i}\times N_{o}\right), \left (B_{i}\times N_{p}\right), \left (I_{i}\times N_{q}\right)\right \}\), but \(\sum_{i=1}^{n}\left \{ \left (E_{i}\times N_{o}\right)\times \left (B_{i}\times N_{p}\right)\times \left (I_{i}\times N_{q}\right)\right \}\neq N_{o,p,q}\times \left (\sum_{i=1}^{n}\left (E_{i},B_{i},I_{i}\right)\right)\), the constraints reside within (Ei,Bi,Ii). Moreover, if o=p=q then E[No,p,q]=E[1]=1, and if o≠p≠q, then the E[No,p,q] are independent and can be retrieved as \(E\left [N_{o,p,q}\right ]=\tfrac {1}{2}\left(1\right)+\tfrac {1}{2}\left (-1\right)\). After that, the linearity of expectation can be represented as,

$$ {\begin{aligned} E\left[SE_{o,p,q}\right]=E\left[\left(\sum_{i=1}^{n}\left(E_{i}, B_{i}, I_{i}\right)\right)\right]\left(\sum_{i=1}^{n}\left(N_{o}, N_{p}, N_{q}\right)\right) \end{aligned}} $$
(3)
$${\begin{aligned} =E\left[\sum_{o,p,q}^{n}\left(N_{o}, N_{p}, N_{q}\right)\left(E_{i},B_{i},I_{i}\right)\right] \end{aligned}} $$
$${\begin{aligned} &=\sum_{o}^{n}\left(N_{o}\right)E\left [E_{i},B_{i},I_{i}\right] +\sum_{o \neq p }^{n}\left(N_{p}\right)E\left [E_{i},B_{i},I_{i}\right]\\&\quad+\sum_{o \neq p \neq q}^{n}\left(N_{q}\right)E\left [E_{i},B_{i},I_{i}\right] \end{aligned}} $$

where E[SEo,p,q] manages the heterogeneous events with independent expectation parameters.

Lemma-3: \(V\left [SE_{o,p,q}\right ]\leq 2E\left [SE_{o,p,q}\right ]^{2}\)

Since,

$$V\left[SE_{o,p,q}\right]=E\left[\left(SE_{o,p,q}\right)^{2}\right]-E\left[SE_{o,p,q}\right]^{2}$$
$$=\left(\sum_{o,p}^{n}...N_{o}N_{p}\right)\times \left(\sum_{p,q}^{n}...N_{p}N_{q}\right) $$
$${\begin{aligned} &=\sum_{o,p,q}^{n}\left(...N_{o}N_{p}N_{q}\right)\leq 2\left(\sum_{o}^{n}E_{i},B_{i},I_{i}\right)\\&\quad\times \left(\sum_{p}^{n}E_{i},B_{i},I_{i}\right)\times \left(\sum_{q}^{n}E_{i},B_{i},I_{i}\right) \end{aligned}} $$
$$=2 E \left[SE_{o,p,q}\right]^{2} $$

Lemma-4: the average over T1T2 repetitions of SEo,p,q

Let A be the output of algorithm-1, so

$$E\left[S\right]=PP\left(E_{i}, B_{i}, I_{i}\right), V\left(A\right)\leq 2E\left [A\right]^{2} $$

and that equals to the,

$$\sigma \left(A\right)=\sqrt{V\left(A\right)}\leq \sqrt{2}E\left[A\right] $$

Therefore, the bound of stream segment could be obtained as,

$$PE\left[\left| A-E\left[A\right]\right|> \varepsilon E\left[A\right]\right] $$

Thus,

$${\begin{aligned} &PE\left[\left| A-E\left[A\right]\right|> \varepsilon E\left[A\right]\right]\\&\leq PE\left[\left| A-E\left[A\right]\right|> \sqrt{2} \varepsilon \sigma \left(A\right)\right] \end{aligned}} $$

In order to reduce the variance, we apply the Chebyshev inequality [40] with \( \sqrt {2}\varepsilon > 1\) and obtain,

$$E\left[A_{i}\right]=PP\left(E_{i}, B_{i}, I_{i}\right), V\left(A_{i}\right)\leq 2E\left [A_{i}\right]^{2} $$

So, if B is the average of \(A_{i},...,A_{T_{1}T_{2}}\), then

$$E\left[B\right]=PP\left(E_{i}, B_{i}, I_{i}\right), V\left(B\right)\leq \frac{2E\left [B\right]^{2}}{T_{1}T_{2}} $$

Now, by Chebyshev’s inequality, as \(T_{1}T_{2}\geq \frac {16}{\varepsilon ^{2} }\),

we get,

$$PE\left[\left| B-E\left[B\right]\right|> \varepsilon E\left[B\right]\right]\leq \frac{V\left(B\right)}{\left(\varepsilon E\left[B\right]\right)^{2}} $$
$$PE\left[\left| B-E\left[B\right]\right|> \varepsilon E\left[B\right]\right]\leq \frac{2E\left[B\right]^{2}}{\left(T_{1}T_{2}\varepsilon^{2}E\left[B\right]^{2}\right)}\leq \frac{1}{8} $$

At this point, the streaming bound δ could be obtained, but since a dependence of \(\frac {1}{\delta }\) is present, we apply the Hoeffding lower-bound inequality [41] to E[B]=PP(Ei,Bi,Ii) and get,

$$PE\left[\left(1-\varepsilon\right)E\left[B\right] \leq B\leq \left(1+\varepsilon\right)E\left[B\right]\right]\geq \frac{7}{8} $$

Now we execute the median function Z of T1T2 onto \(B,B_{1},...,B_{T_{1}T_{2}}\) and get,

$$ PE\left[\left| Z-E\left[B\right]\right|\geq \varepsilon E\left[B\right]\right]\leq \delta $$
(4)

when,

$$T_{1}T_{2}\geq \frac{32}{9}ln\frac{2}{\delta } $$

The stream approximation could be obtained as,

$$ \left(E_{i}\right)_{N_{o}}=O\left(\frac{1}{\varepsilon^{2}}ln\frac{2}{\delta}\right)_{HN_{i,i}} $$
(5)
$$ \left(B_{i}\right)_{N_{p}}=O\left(\frac{1}{\varepsilon^{2}}ln\frac{2}{\delta}\right)_{BN_{o,p}} $$
(6)
$$ \left(I_{i}\right)_{N_{q}}=O\left(\frac{1}{\varepsilon^{2}}ln\frac{2}{\delta}\right)_{IN_{i,k}} $$
(7)

This stream approximation establishes that heterogeneous parameters can be managed in the I-DAG.

Heterogeneous stream transformation

The distributed stream elements with probability α(t) are sampled at time t with a computing average of,

$$\alpha \left(t\right)=\alpha,\ \text{constant}: error\simeq \frac{1}{\sqrt{\alpha \times t}}\rightarrow 0 $$

and,

$$\alpha \left(t\right)\simeq \frac{1}{\varepsilon^{2}\times t}: error\simeq \varepsilon,\ \text{constant over time} $$

In order to perform encapsulation, reservoir sampling is used because it adds the first k stream elements to the sample and keeps each of the t items seen so far with probability \(\frac {k}{t}\). Thus, for every t and i≤t, the sample probability is evaluated as,

$$ P_{i,t}=PE\left[s_{i}\ in\ sample\ at\ time\ t\right]=\frac{k}{t} $$
(8)

and for t+1, the sample probability becomes,

$$P_{t+1,t+1}=PE\left[s_{t+1}\ sampled\right]=\frac{k}{t+1} $$

This is mandatory because the inter-connected heterogeneous IoT tuples must be incorporated within the time interval. Processing t+1 together with i≤t eventually reduces the role of si and returns st+1, so that,

$$ P_{i,t+1}=\frac{k}{t}\times \left(1-\frac{k}{t+1}\times \frac{1}{k}\right) $$
(9)
$$=\frac{k}{t}\times \left(1-\frac{1}{t+1}\right) $$
$$=\frac{k}{t}\times \frac{t}{t+1}=\frac{k}{t+1} $$
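The reservoir update of Eqs. (8) and (9) can be sketched in a few lines of generic Python (not engine-specific code): each new element st+1 enters the sample with probability k/(t+1) and evicts a uniformly chosen resident, which preserves Pi,t = k/t for every earlier element.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform sample of k elements; each element ends up
    in the sample with probability k/t after t arrivals (Eqs. 8-9)."""
    sample = []
    for t, element in enumerate(stream, start=1):
        if t <= k:
            sample.append(element)            # first k elements always kept
        else:
            j = random.randrange(t)           # uniform in [0, t)
            if j < k:                         # happens with probability k/t
                sample[j] = element           # evict a uniformly chosen resident
    return sample

# e.g. sample 5 tagged IoT events out of a stream of 10,000
print(reservoir_sample(range(10_000), 5))
```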

The frequency table of stream events feeds the event arrival probability Pi,t+1 into a space-saving count-min sketch to bring order to the transformed heterogeneous stream events, as shown in Fig. 3. This space-saving function provides an approximation fx′ of fx for every x and consumes memory equal to \(O\left (\frac {1}{\Theta }\right)\). Therefore, when a stream vector G[n] is processed with G[i]≥0 for ∀i ∈ t, it estimates the heterogeneous stream G′ of G as,

$$G\left[i\right]\leq G'\left[i\right]\ \ \ \forall i $$

and,

$$G'\left[i\right]\leq G\left[i\right]+\varepsilon \left| G\right|_{1}\ \ \ \forall i,\ \text{with probability}\ \geq 1-\delta $$

where \(\left | G\right |_{1} =\sum _{i} G\left [i\right ]\) is the stream length, requiring \(O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{HN_{i,i}}\), \(O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{BN_{o,p}}\), and \(O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{IN_{i,k}}\) memory with \(O\left (ln\frac {n}{\delta }\right)\) update time t.

The heterogeneous event stream \(\sum _{i} G\left [i\right ]\) uses d independent hash functions h1...hd:[1..n]→[1..w], where each stream element holds a counter gp(i) that is updated through the instruction set G[i]+=(Ei,Bi,Ii), i.e., gp(i)+=(Ei,Bi,Ii) for ∀ j ∈ 1..d, and the frequency table of the heterogeneous event stream can be retrieved as,

$$ G'\left[i\right]=min\left \{ g_{p}\left(i\right)|j=1..d \right \} $$
(10)

This establishes that the heterogeneous event stream is accessible and enlisted in the I-DAG.
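The frequency table of Eq. (10) behaves like a standard count-min sketch. The following generic Python sketch (the hash choice is illustrative; the sizes w = 2/ε and d = log2(1/δ) follow the bounds derived below) updates d counter rows and answers queries with the minimum, so the estimate G′[i] never undercounts and overcounts by at most ε|G|1 with probability at least 1 − δ.

```python
import hashlib
import math

class CountMinSketch:
    """Generic count-min sketch: G'[i] >= G[i], and
    G'[i] <= G[i] + eps*|G|_1 with probability >= 1 - delta."""
    def __init__(self, eps, delta):
        self.w = math.ceil(2 / eps)                 # row width, w = 2/eps
        self.d = math.ceil(math.log(1 / delta, 2))  # rows, d = log2(1/delta)
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += count

    def query(self, item):
        # Eq. (10): take the minimum over the d counters
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.d))

cms = CountMinSketch(eps=0.01, delta=0.01)
for event in ["error", "error", "backup", "info", "error"]:
    cms.update(event)
print(cms.query("error"))  # >= 3, usually exactly 3
```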

Lemma-5: G′[i]≥G[i]

The minimum count of the heterogeneous event stream G[i] remains ≥ 0 for ∀ i, with an update frequency of update(gp(i)). The indicator variable Io,p,q equals 1 if gp(i)=gp(k) and 0 otherwise; its expectation can be retrieved as,

$$ H\left[I_{o,p,q}\right]\leq \frac{1}{range\left(g_{p}\right)}=\frac{1}{w} $$
(11)

By definition \(A_{o,p}=\sum _{k}H\left [I_{o,p,q}\right ]\times G\left [k\right ]\), the heterogeneous events stream can be represented as,

$$ A_{o,p}=\sum_{k}H\left[I_{o,p,q}\right]\times G\left[k\right]\leq \frac{\left| G\right|{~}_{1}}{w} $$
(12)

Now, this stream is well connected and cannot be read independently. Therefore, we apply the Markov inequality and pairwise independence as,

$$ PE\left[A_{o,p} \geq \varepsilon \left| G\right|{~}_{1}\right]\leq \frac{H\left[A_{o,p}\right]}{\varepsilon \left| G\right|{~}_{1}}\leq \frac{\left(\frac{\left| G\right|{~}_{1}}{w}\right)}{\left(\varepsilon \left| G\right|{~}_{1}\right)}\leq \frac{1}{2} $$
(13)

if \(w=\frac {2}{\varepsilon }\) then,

$$PE\left[G'\left[i\right]\geq G\left[i\right]+\varepsilon \left| G\right|{~}_{1}\right] $$
$$=PE\left[\forall\ j \ :G\left[i\right] +A_{o,p}\geq G\left[i\right]+\varepsilon \left| G\right|{~}_{1}\right] $$
$$ =PE\left[\forall\ j \ : A_{o,p}\geq \varepsilon \left| G\right|{~}_{1}\right] \leq \left(\frac{1}{2}\right)^{d}=\delta $$
(14)
$$if \ d=log\left(\frac{1}{\delta}\right) $$

for a fixed value of i, as shown in Figs. 4 and 5. Thus, we observe that the events are synchronized to a central container with independence of accessibility.

Fig. 4 Homogeneous IoT Events with Node Representation

Fig. 5 Heterogeneous Event Stream Transformation

I-DAG workflow

The events generated through IoT devices with a sequential order of PE[∀ j : Ao,p ≥ ε|G|1] are scheduled onto the I-DAG, which consists of an identifier LocatorI-DAG that reads the event labels \(\left (E_{i}\right)_{N_{o}}=O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{HN_{i,i}}\), \(\left (B_{i}\right)_{N_{p}}=O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{BN_{o,p}}\) and \(\left (I_{i}\right)_{N_{q}}=O\left (\frac {1}{\varepsilon ^{2}}ln\frac {2}{\delta }\right)_{IN_{i,k}}\) in the source file and shuffles the pointer between n stages, as shown in Fig. 6.

Fig. 6 IoT-enabled Directed Acyclic Graph (I-DAG) workflow in Streaming Engine

In order to perform the stage predictor evaluation, the workflow targets PE[∀ j : Ao,p ≥ ε|G|1]: stage(n)→stage(n+1) with LocatorI-DAG: stage(n)→stage(n+1), keeping the error under a loss function \(\vartheta :stage\left (n+1\right)\times stage\left (n+1\right)\rightarrow \mathbb {R}\). The predictor error can be obtained as,

$$ {\begin{aligned} {}& H_{stage\left(n\right)}\left[\vartheta\left(P\left[\forall\ j \ : A_{o,p}\geq \varepsilon \left| G\right|{~}_{1}\right]\right.\right. \\& \left.\left.\left(stage\left(n\right)\right),Locator_{I-DAG}\right)\right] \end{aligned}} $$
(15)

This predictor error manages the discrepancies of inter-connection in the I-DAG workflow.

The LocatorI-DAG with approximated finite heterogeneous event labels can be sampled as SI-DAG=((stage(n)1,stage(n+1)1),...,(stage(n)n,stage(n+1)n)) through \(\frac {1}{n}\sum _{i=1}^{n}\vartheta \left (PE\left [ \forall \ j \ : A_{o,p}\geq \varepsilon \left | G\right |_{1}\right ], stage\left (n+1\right)\right)\). The workflow loss functions are categorized into two types: (i) regression and (ii) classification. The regression loss on the predictor LocatorI-DAG is expressed as,

$$ \vartheta\left(a,b\right)=\left(a-b\right)^{2} $$
(16)

and the classification loss on the predictor LocatorI-DAG is expressed as,

$$ \vartheta\left(a,b\right)=\begin{cases}0 & \text{if}\ a=b\\ 1 & \text{otherwise}\end{cases} $$
(17)

Thus, I-DAG is ready to facilitate independent heterogeneous IoT entries with the prediction locator.
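A toy illustration of the stage predictor and the two losses of Eqs. (16) and (17) follows; locator here is a hypothetical stand-in for LocatorI-DAG, and the bypass rule it encodes is our own simplification.

```python
# Hypothetical stand-in for the Locator_{I-DAG} stage predictor and its losses.
def regression_loss(a, b):
    """Eq. (16): squared loss between predicted and scheduled stage index."""
    return (a - b) ** 2

def classification_loss(a, b):
    """Eq. (17): 0 if the predicted stage matches the schedule, 1 otherwise."""
    return 0 if a == b else 1

def locator(event_label, current_stage):
    """Toy locator: labeled events whose work is already done skip one stage."""
    return current_stage + 2 if event_label in {"E", "B", "I"} else current_stage + 1

predicted = locator("E", current_stage=0)         # jumps Stage 0 -> Stage 2
scheduled = 2
print(classification_loss(predicted, scheduled))  # 0: prediction matches schedule
print(regression_loss(predicted, scheduled))      # 0
```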

Performance evaluation

The I-DAG technique is incorporated into a Spark cluster with a virtualized distributed environment, as shown in Table 2.

Table 2 Apache spark cluster

Environment

The Spark cluster includes an Intel Xeon node with a computation capacity of 8 CPU cores, 64 GB RAM, and persistent storage media of a 2 TB disk and a 1 TB SSD. The remaining worker nodes consist of Intel Core i7 processors with 4 cores, 16 GB RAM, and persistent storage media of a 1 TB disk along with a 500 GB SSD. The virtual environment consists of VirtualBox 5.2 hosting five virtual machines, as mentioned in Table 3.

Table 3 Virtual machines over spark cluster

Experiments

The dataset used to evaluate I-DAG belongs to the Amazon Web Services (AWS) public datasets repository [42–47]. It contains a collection of 4500 files storing stream data with a total volume of 8.6 GB.

The experiments performed on the AWS dataset consist of (i) events labeling, (ii) labeling error factor, (iii) joining heterogeneous streams, (iv) heterogeneous data frames, (v) workflow endurance, and (vi) cluster performance.

Metrics of evaluation

I-DAG is evaluated on two performance metrics, i.e., (i) merging of disjoint streams and (ii) stage bypass. The disjoint stream merging overlaps the individual elements and strengthens connectivity between heterogeneous streams. The stage bypass reduces unnecessary RAM consumption and decreases the redundant garbage values that appear as a result of regular stage processing.

Results

This section discusses the experimental results generated through the proposed I-DAG task-processing approach.

Events labeling

IoT devices generate events of errors, backups, and record information in the form of text data that the stream engine receives for micro-batch transformation. Event labels mark the stream elements with an I-DAG tag sequence carrying a hash function. This tagging establishes trust, so that no pair is required in a prefix or postfix, and the transformation function uses the same hash to bundle the stream elements into the core engine. The labeling function consists of several sub-routines, sketched below: (i) data ingest, (ii) element queuing, (iii) stream chunk tagger, (iv) hash element, and (v) element dispatcher. The data_ingest function fetches an enormous number of individual stream elements from several devices and uses heap memory to enlist the element arrivals in the stream engine. The element_queuing feature then assigns indices to the respective heap entries in FCFS (first come, first served) order. The stream_chunk_tagger method assigns a label \(Stream_{E_{i},B_{i},I_{i}}\) to each indexed entry and allocates a hash_element value for identifying any particular index in the stream, and finally the dispatcher encapsulates the tags and transforms the event streams, as shown in Table 4.

Table 4 Heterogeneous events labeling (seconds)
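The five sub-routines can be pictured with the following Python sketch; the function names mirror those in the text, but the bodies are our own simplification rather than the engine's implementation.

```python
import hashlib
from collections import deque

queue = deque()          # FCFS queue standing in for the heap-backed ingest list

def data_ingest(elements):
    """(i) enlist arriving stream elements in FCFS order."""
    queue.extend(elements)

def element_queuing():
    """(ii) assign indices to queued entries in first-come-first-serve order."""
    return list(enumerate(queue))

def stream_chunk_tagger(indexed):
    """(iii)+(iv) attach a Stream_{E,B,I} label and a hash per indexed entry."""
    tagged = []
    for idx, (kind, payload) in indexed:
        label = f"Stream_{kind}"          # e.g. Stream_E, Stream_B, Stream_I
        h = hashlib.md5(f"{idx}:{payload}".encode()).hexdigest()
        tagged.append({"index": idx, "label": label, "hash": h, "payload": payload})
    return tagged

def element_dispatcher(tagged):
    """(v) encapsulate tags and hand the labeled events to the stream engine."""
    return tagged  # the real engine would publish these to the transformation layer

data_ingest([("E", "sensor-1 error"), ("B", "sensor-2 backup"), ("I", "sensor-1 info")])
print(element_dispatcher(stream_chunk_tagger(element_queuing())))
```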

The tagged events are recognized by the stream engine much more effectively than regular heterogeneous events, as shown in Fig. 7.

Fig. 7 Heterogeneous Events Processing

Labeling error factor

The error in the event labeling process appears due to improper placement of a tag. It occurs during the application of event labeling for several reasons: (i) improper ingest, (ii) queue out of bound, (iii) abnormal tagging, (iv) inaccuracy in the tag, and (v) partial release of an element. During the tag formation process, a stream element could suffer improper ingestion due to concurrent intakes at the same time. The queue responsible for managing the stream may run into a buffer overflow problem if the tagging time interval grows beyond the usual timeline. Also, the stream could be released without a proper index and hash function due to continuously inaccurate tag application. The errors in label-based events, as well as healthy stream formation, can be observed in Fig. 8.

Fig. 8 Errors during Heterogeneous Event Processing

Heterogeneous streams join

The tagged stream elements require a join operation to combine like events in the stream engine. This requirement is a must because of the live ingestion of heterogeneous stream feeds from numerous IoT devices. The functional aspect of a join operation consists of parsing tagged stream elements adjacent to each other, so that the streaming ingestion must fall within the same range of time, along with a conjunctive condition that joins elements with similar tagging. This conjunction function correlates element n with n+1 through a forward-feed chain in the data transformation environment. The stream element join is executed through the syntax \(Stream_{E_{i}}=join\left (parse\left (Tag_{n},Tag_{n+1}\right)\rightarrow \left (\left |Tag_{n},Tag_{n+1}\right |\right)\right)\), keeping the group-by phrase as a priority along with aggregate operators. The heterogeneous streams join of error, backup, and information record events through query operators can be observed in Table 5.

Table 5 Heterogeneous stream join through query operator
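The conjunctive join described above can be approximated in plain Python (a simplified illustration, not the engine's operator): elements whose tags match and whose timestamps fall within the same time range are paired and then aggregated per tag.

```python
from collections import defaultdict

def join_tagged(stream_a, stream_b, window_seconds=10):
    """Conjunctive join: pair elements with the same tag whose timestamps
    fall within the same time window, then group by tag."""
    grouped = defaultdict(list)
    for tag_a, ts_a, val_a in stream_a:
        for tag_b, ts_b, val_b in stream_b:
            if tag_a == tag_b and abs(ts_a - ts_b) <= window_seconds:
                grouped[tag_a].append((val_a, val_b))
    # aggregate operator: count joined pairs per tag (group-by priority)
    return {tag: len(pairs) for tag, pairs in grouped.items()}

errors  = [("Stream_E", 100, "disk fault"), ("Stream_E", 104, "fan fault")]
backups = [("Stream_E", 102, "retry log"), ("Stream_B", 103, "snapshot")]
print(join_tagged(errors, backups))   # {'Stream_E': 2}
```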

In the same way, the heterogeneous streams join of error, backup, and information record events through the diff operator can be observed in Table 6. The comparative effectiveness of the tagged heterogeneous stream joins can be observed in Fig. 9.

Fig. 9 Heterogeneous Streams Join Computing Percentile

Table 6 Heterogeneous stream join through Diff operator

Heterogeneous data frames

The label-based stream elements are stored in a heterogeneous data frame that comprises a table with data-structure properties. This data table assigns a sequence of indices to the stream elements, which are declared as equal-length vectors. The frame is categorized into several sub-sections: (i) header, (ii) data row, and (iii) cell. The header represents the top line of the tabular structure and manages column names only. The data row depicts a stream element with a prefixed index value, and a cell is a stream element member of a row. The data frame supports event labeling transformation through a prior metadata information set of the stream elements. Thus, the tagged stream elements are retrieved much more efficiently than traditional stream elements, as shown in Fig. 10.
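A minimal sketch of such a data frame using pandas (the column names and values are illustrative; the paper does not prescribe a particular library) shows the header, the indexed data rows, and the cells holding tagged stream elements.

```python
import pandas as pd

# Header: column names only; each data row is a tagged stream element.
frame = pd.DataFrame(
    {
        "label": ["Stream_E", "Stream_B", "Stream_I"],
        "hash": ["a1f3", "b27c", "c9d0"],          # illustrative hash prefixes
        "payload": ["disk fault", "snapshot ok", "temp=41C"],
    },
    index=[0, 1, 2],                               # prefixed index per data row
)

# Retrieval by label is a direct indexed lookup rather than a full stream scan.
print(frame.loc[frame["label"] == "Stream_E"])
```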

Fig. 10 Heterogeneous Data Frame Computing Percentile

Workflow endurance

The issues encountered by in-process heterogeneous data streams measure the workflow endurance during stage processing. The IoT-enabled workflow uses the data frames to learn about tagged stream elements already enlisted in the data table. Therefore, when a stream join processes the source file, the table allows the I-DAG workflow to skip unnecessary steps wherever they are encountered. This step-skipping practice can be illustrated through two case studies: (i) Stage0→1 and (ii) Stage0→n. Stage0→1 consists of two stages with three operations in total: flatMap, Map, and Reduce. If a labeled stream element has already been processed through the Map functionality, the control can jump from flatMap to the Reduce operation. In the case of Stage0→n, when the compiler parses a source file that contains a schedule, the control bypasses unscheduled operations in the stages. Thus, it reduces energy consumption and the computing capacity required of the cluster, while avoiding functional latency issues. The Stage0→1 and Stage0→n performance can be observed in Tables 7 and 8.
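As a sketch of the Stage0→1 case (plain Python standing in for the Spark operations; the cached-map bypass rule is our own simplification of the behaviour described above), elements whose map output is already known jump from the flatMap output straight to the reduce step.

```python
from collections import Counter

def flat_map(batch):
    """Stage 0, flatMap: expand each record into labeled elements."""
    return [(rec["label"], rec) for rec in batch]

def map_step(label):
    """Stage 0, map: emit a (label, 1) pair."""
    return (label, 1)

def reduce_step(pairs):
    """Stage 1, reduce: aggregate the (label, count) pairs per label."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def run_stage_0_1(batch, mapped_cache):
    """I-DAG bypass: if a label's map output is already cached,
    jump from flatMap directly to reduce and skip the map call."""
    pairs = []
    for label, _ in flat_map(batch):
        if label in mapped_cache:
            pairs.append(mapped_cache[label])   # bypass the Map operation
        else:
            pair = map_step(label)
            mapped_cache[label] = pair
            pairs.append(pair)
    return reduce_step(pairs)

cache = {}
batch = [{"label": "Stream_E"}, {"label": "Stream_E"}, {"label": "Stream_B"}]
print(run_stage_0_1(batch, cache))   # {'Stream_E': 2, 'Stream_B': 1}
```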

Table 7 Heterogeneous stream I-DAG workflow Stage0→1
Table 8 Heterogeneous stream I-DAG workflow Stage0→n

Cluster performance

The parameters measuring cluster performance comprise stage activity, which includes map and reduce task processing and the exchange of I/O operations. I-DAG enables a cluster to switch between stage tasks depending on the source file's requirements. If a task does not need to produce map values, it bypasses the operation and moves to the next task, unlike a traditional DAG, which has to go through each individual operation, producing I/O latency along with additional operational cost, as shown in Fig. 11.

Fig. 11 Heterogeneous Event Processing on Spark Cluster

Conclusion

This paper proposes a novel technique that identifies different IoT device stream events over a graph processing layer in a Spark cluster. The proposed approach provides a broad analytical perspective of how the stream events are generated, followed by their convergence into heterogeneous form. In the end, the I-DAG workflow processes the individual IoT devices' stream events with a cost-effective mechanism. It reduces the graph workload and decreases the I/O traffic load in the Spark cluster.