1 Introduction

The Internet of Things (IoT) is emerging as a paradigm capable of enabling the seamless interconnection of personal and industrial electronic devices, both with each other and with online cloud services. This phenomenon includes home automation and environmental sensing, as well as most industries, such as heavy machinery and automobiles, where each device is connected to monitoring and actuation systems. The IoT can hence be described as a system (of systems) that includes a collection of both tangible and intangible (i.e., Cyber-Physical) components that are interconnected, either wirelessly or by wire. The whole system is usually provided with a connection to the Internet, including cloud-based services. The components may include a range of sensors, which can use various types of local and wide area connectivity, and can record environmental, human or other machine-based events and activities, which can then be relayed to other components, such as data collection, information processing and event correlation systems. All this information can be used to support decisions in order to operate the system effectively, to gather data, to troubleshoot problems in the system, or to perform any other relevant tasks (Kramp et al. 2013). The deployment diagram of a typical IoT system is shown in Fig. 1. Multiple means of connectivity between entities can be observed; some devices are connected wirelessly, whereas others have a wired connection. Moreover, the sub-systems belong to different manufacturers. However, all of them work together as an orchestration to perform functions of different complexity and criticality, in which resilience properties can be essential (Flammini 2019).

As in Fig. 1, most of the components in an IoT system are manufactured by different vendors. Although many failures within an individual component may be handled by its manufacturer, a failure arising from the interaction between different components in an IoT system, such as a simple connectivity problem, is unlikely to be considered by any single manufacturer when developing its own products. This means that problems due to interactions between different devices are generally not covered under any warranty.

A popular report from Forrester (Pelino et al. 2018) predicts huge growth in the IoT industry in the coming years: enterprises will increase their efforts to introduce voice-based services to consumers, and new European guidelines will allow commercializing IoT data. In 2010, the number of objects connected to the Internet surpassed the earth’s human population (Mohammadi et al. 2015). Reports from Gartner and Juniper Research suggest that worldwide spending on IoT security was 1.1 billion USD and skyrocketed to more than 3.1 billion USD in 2021. Furthermore, it is projected to reach 6 billion USD in 2023, a direct result of the increase in vulnerabilities in these systems. The above discussion leads us to believe that diagnosing and troubleshooting failures in an IoT system will be a huge problem in the near future, unless all the entities within the system belong to the same company.

IoT applications operate in complex distributed systems, and one of the hardest challenges once such systems are deployed is their maintenance (Coulouris et al. 2011). While offline testing for verification and validation of these systems is possible using a wide range of unit and integration testing techniques (Sommerville 2016), including efficient approaches such as abstract testing (Flammini et al. 2009), run-time maintenance is possible only by continuously monitoring these systems. For instance, Caporuscio et al. (2020) introduce the concept of smart-troubleshooting, in which the system life-cycle is decomposed into four sequential activities: Prevention, Detection, Recovery and Adaptation. All of these activities can be performed for the run-time maintenance of IoT systems.

When a system is capable of autonomously detecting anomalies and recovering from them, it is said to possess so-called Self-healing capabilities. Self-healing refers to the automatic recovery process of detecting and diagnosing faults and subsequently correcting them in a temporary or permanent manner. Self-healing systems are of particular interest because Self-healing directly improves dependability. Self-adaptive systems are systems that monitor their execution environment and react to changes by modifying their behavior in order to maintain an appropriate quality of service. There is therefore a substantial intersection between Self-healing and Self-adaptiveness: Self-healing systems can be considered a specific kind of Self-adaptive system.

When performing run-time maintenance, the primary goal is to detect errors in the system. There are many ways to detect whether an error has occurred in a system (Silva 2008). One of them is system-level monitoring, a technique used by several commercial enterprise applications. Another notable method is to detect errors/failures at the application layer. However, one of the most common ways to examine a system for errors or failures is to analyze the logs it generates. Logs may be produced either by collecting raw sensory data or by tracing the activities performed during system operation. Considering this last option, Process Mining is an emerging research field that analyzes the activities traced in the log (also referred to as the Event Log) to perform different kinds of analyses. In this paper, we focus on the idea of analyzing Event Logs for the Self-healing of IoT by automatically diagnosing and troubleshooting errors and failures in IoT devices. The choice of the approaches addressed in this paper is supported by a recent survey of the state-of-the-art addressing open issues, challenges and opportunities in this research field (Caporuscio et al. 2020).

The organization of this paper is as follows: techniques employable for Prevention, Detection, and error/failure Diagnosis and Recovery in IoT systems are briefly surveyed in Sect. 2, together with an inspection of the state-of-the-art in error/failure detection. Specific methods for Event Log analysis using Process Mining are addressed in Sect. 3. Finally, Sect. 4 draws conclusions and hints at future developments.

Fig. 1 A typical smart home deployment diagram

2 Prevention, detection and diagnosis in IoT

Prevention, detection and diagnosis of anomalies play key roles in the run-time maintenance of IoT systems. These activities can be performed autonomously when the environment in which the application operates is smart and can exploit highly-capable computing and storage resources. This is the case for cyber-physical systems (CPS) (Caporuscio et al. 2020). CPS integrate a large number of sensors, actuators, and highly-capable computing and storage nodes. This integration is possible due to advancements in interconnecting heterogeneous nodes. This aggregation and interconnection of heterogeneous Things is, as stated before, commonly referred to as the IoT (Baheti and Gill 2011). Currently, IoT mostly refers to the network interfacing and communication of physical objects, devices and peripherals with a central processing unit based on a cloud computing network. However, this centralized cloud structure restricts the feasible applications for confidentiality and latency reasons. To address these issues, data processing and communication should be served closer to the sensors and actuators themselves, or should be moved back to the edge of the network. This shifts the environment the application operates in from the scalable and homogeneous cloud environment to a restricted, heterogeneous edge network (Al-Fuqaha et al. 2015).

Despite the increasing popularity of IoT, security and reliability challenges impose a notable impediment to the pervasive adoption and application of these devices (Sfar et al. 2018). As the number of dynamic heterogeneous resources involved in a cloud framework grows, reliability and security concerns become more critical (Thamilarasu and Chawla 2019). In other words, the large number of heterogeneous, resource-restricted (e.g., in energy, power and memory), usually cheap and often unreliable components in the environment threatens reliability and security.
Moreover, when resources keep varying, the manual management of configuration, healing, protection, and maintenance of these networks of devices is difficult even for experienced personnel. For this reason, another main challenge is the autonomous management of resources. This concern can be addressed by Autonomic Computing (Psaier and Dustdar 2011), which introduces the Self-* concepts to address concerns such as Self-configuration, Self-healing, Self-protection, and Self-optimization:

  • Self-configuration The ability of the system to readjust itself on the fly.

  • Self-healing The ability of the system to discover, diagnose, and react to disruptions.

  • Self-optimization The ability of the system to maximize resource utilization to meet end-user needs.

  • Self-protection The ability of the system to anticipate, detect, identify, and protect itself from attacks.

Self-healing enables the system to detect errors and failures, possibly recover from them, and continue to operate smoothly. A Self-healing system must detect errors and failures and take the actions required to ensure that a failure does not impact the correctness of the system. Errors, and possibly failures, can be induced by several causes, such as hardware faults, software faults and network faults. A Self-protecting system detects intrusive behavior and takes autonomous actions to protect itself against it; such a system must be able to detect and protect its resources from both internal and external attacks. The principal requirements for designing a Self-protected system are the following (Chopra and Singh 2014):

  • By defining its own normal operation, a Self-protecting system should be capable of discriminating legal behaviors from illegal ones. In other words, a Self-protected system should be capable of detecting intrusions.

  • After detecting the attack, the system should be able to react by blocking the attack or logging a warning.

  • Prevent the Self-protected components from being compromised.

2.1 Fault-tolerant techniques for preventing, detecting, and diagnosing anomalies

According to the definition of Chandola et al. (2009), Anomaly Detection is “the problem of finding patterns in data that do not conform to expected behavior”. The non-conforming patterns are the so-called anomalies. Chandola et al. (2009) systematically approached how an Anomaly Detection technique can be implemented considering (1) the type of anomalies, e.g., software errors due to human-introduced bugs or hardware crashes, (2) the attributes of anomalies (point, contextual and collective anomalies), and (3) the research field from which suitable Anomaly Detection techniques can be extracted and applied. For the sake of consistency, in this paper we follow the taxonomy for dependable and secure computing proposed by Avizienis et al. (2004): this taxonomy allows us to reason with the familiar terms fault, error and failure. Therefore, instead of using the general term “anomaly”, we address inconsistent system states as errors and deviations from correct service operation as failures. This helps in determining one of the points Chandola et al. (2009) highlight when building an Anomaly Detection technique, i.e., defining the type of anomalies. Introducing faults in a system is inevitable: since the expected number of bugs in a system is proportional to its lines of code (Lipow 1982), it can be safely assumed that it is impossible to exhaustively enumerate all possible faults in a system. Therefore, more emphasis is given to developing fault-tolerant systems rather than building fault-less ones. A system can be termed fault-tolerant if it is able to prevent an error from turning into a failure. A brief literature review on fault-tolerant techniques for preventing, detecting and diagnosing errors and failures in IoT systems follows.

Su et al. (2014) propose a fault-tolerant mechanism implemented on the WuKong middleware. Gia et al. (2015) propose a fault-tolerant architecture for healthcare IoT systems consisting of wireless sensor networks (WSN). Misra et al. (2012) propose a method for fault-tolerant routing in IoT: a fault-tolerant routing protocol is suggested for IoT systems based on a mix of cross-layer design and learning automata (LA). In his thesis, van der Kouwe (2016) proposes fault injection, where artificial faults are injected into a system to learn its behavior with activated faults, possibly preventing system failure when recovery actions can be performed.

Cyber-Physical Systems such as SCADA (Supervisory Control And Data Acquisition) systems are an appropriate example of the current state-of-the-art in fault-tolerant Internet of Things systems (Sajid et al. 2016). These systems are capable of offering flexibility, stability and fault tolerance. Exploiting cloud computing services integrated with the Internet of Things, they can be regarded as Smart Industrial Systems, which are predominantly employed in smart grids, smart transportation, eHealthcare and smart medical systems.

Intentional attacks are common threats in IoT. Thus, it is important to employ an IoT intrusion prevention mechanism to counter those threats. Various methods of intrusion detection are discussed, and a taxonomy to classify them is provided, by Zarpelão et al. (2017). Bertino and Islam (2017) propose various guidelines that can prevent an IoT system from being compromised. Kasinathan et al. (2013) propose a DoS detection architecture for 6LoWPAN IoT systems using Suricata, an open-source intrusion detection system, which detects and eliminates attacks using appropriate countermeasures before network operations are disrupted.

Another efficient way of preventing intrusions is the installation of Honeypots in a system. Likewise, Honeynets are aggregations of Honeypots that are intended to imitate usual servers and network services (Provos and Holz 2007). In general, honeypots are essentially a deception technique, where the defender purposely hoodwinks the attacker into acting in the defender’s favor (Daniel and Herbig 2013).

Yu et al. (2015) reject the idea of Honeypots in an IoT setup due to non-scalability and dependency issues, whereas La et al. (2016) consider the possibility and use game theory to analyze the situation where both the attacker and the defender try to deceive each other.

While virtual patching is a type of firewall often referred to as a web application firewall (WAF), an IoT system also needs a fully-fledged firewall, as most of the embedded systems that are part of an IoT system have little to no security. A recent improvement on traditional firewalls for IoT is the Smart Firewall which, unlike traditional software-based firewalls, is a hardware-based device. Gupta et al. (2017) have proposed an implementation of a firewall for the Internet of Things using a Raspberry Pi as a gateway.

2.1.1 Data-driven intrusion, error and failure detection techniques for IoT

In the previous section, a brief review of possible approaches for building fault-tolerant IoT systems was presented. All fault-tolerant approaches make use of data-driven error and failure detection techniques: data collected from the deployed environment are inspected and classified according to the detection technique used. How data are regarded as anomalous, i.e., how it is determined whether there is an error in the state of the system or the system itself has failed, depends on the detection technique. Most Anomaly Detection techniques, i.e., techniques for error and failure detection, make use of machine learning (ML). ML is defined by Witten et al. (2011) as “the acquisition of structural description from examples.” This description can be used to infer classification rules (or other types of unsupervised rules, e.g., clustering rules). These rules can be applied to run-time data for error and failure detection, as the following brief literature review explores.

Recent research makes it evident that the application of ML to intrusion and fault detection in IoT devices is growing quickly. The most applicable and popular conventional ML algorithms for detecting IoT errors and failures are decision trees, support vector machines (SVM), K-nearest neighbors, Bayes classifiers and neural networks (Nisioti et al. 2018).

Silva and Schukat (2014) used a KNN classifier to design an intrusion detection system based on the Modbus/TCP protocol. Even though the developed mechanisms could achieve acceptable performance to some extent, they were dedicated to particular protocols and exhibited high false positive rates (FPRs).
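
To make the idea concrete, the following is a minimal pure-Python sketch of the nearest-neighbor principle (the k = 1 case of KNN) behind such classifiers; the feature vectors (packet length, inter-arrival time) and labels are invented for illustration and do not come from the cited works.

```python
import math

def nearest_neighbor_label(sample, training_set):
    """Classify a sample with the label of its closest training vector
    (Euclidean distance) -- the k = 1 case of KNN."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(((dist(sample, vec), lab) for vec, lab in training_set),
                   key=lambda t: t[0])
    return label

# Hypothetical feature vectors: (packet length, inter-arrival time in ms)
training = [((60, 5.0), "normal"), ((62, 4.8), "normal"),
            ((1500, 0.1), "attack"), ((1400, 0.2), "attack")]

print(nearest_neighbor_label((1450, 0.15), training))  # -> attack
```

In practice, larger values of k and careful feature scaling are used to keep the false positive rate down, which is precisely where the cited protocol-specific systems struggled.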

Anthi et al. (2018) proposed an intrusion detection system for the IoT. To achieve this purpose, different ML classifiers have been developed to properly detect network scanning, probing and simple forms of denial of service (DoS) attacks.

Pajouh et al. (2016) proposed an intrusion detection model for detecting suspicious behaviors with the two-tier classification module based on the Naive Bayes and the Certainty Factor version of K-Nearest Neighbor algorithms. The designed model was also able to identify malicious activities like User to Root (U2R) and Remote to Local (R2L) attacks.

Stewart et al. (2017) developed an adaptive intrusion detection system fitting the dynamic architectures of SCADA systems. For this purpose, the authors used different OCSVM models to choose the most appropriate one for effectively detecting various attacks. Nevertheless, the proposed model produced a high false alarm rate and required a large amount of computational resources.

With the evolution of the Industrial IoT (IIoT), the number of interconnected devices in a network will grow drastically, giving rise to a notable amount of data. In contrast to traditional IoT networks, in the Industrial IoT both the volume and the characteristics of the produced data are important. Accordingly, the Big Data produced by Industrial IoT networks needs intelligent real-time processing systems (Liang et al. 2020). However, on large-scale Industrial IoT networks, traditional ML methods show low accuracy and limited scalability for attack and fault detection. To solve this problem, deep learning (DL) can play a major role in developing intelligent processing systems that handle such huge amounts of data from Industrial IoT networks to detect and diagnose faults and attacks. Currently, DL algorithms are broadly used and obtain great performance in detecting cyberattacks (Meidan et al. 2018). For instance, to improve accuracy in detection tasks, Nicolau and McDermott (2018) used an Auto-Encoder approach to project the original data onto a new latent representation space. Nonetheless, the performance of DL approaches depends highly on the amount of training data. Typically, achieving a highly accurate DL model requires a huge amount of labeled training data (Nisioti et al. 2018). Furthermore, DL requires assuming that test and training datasets are subject to the same distribution. Therefore, DL approaches only perform well under two main assumptions, i.e., large training datasets and the same data distribution in training and test datasets (Wen et al. 2017). However, in real applications, a distribution discrepancy usually exists between training and testing data, which leads to a significant reduction in DL performance. In particular, in network security, different kinds of attacks, such as zero-day attacks, can appear on a daily basis (Nicolau and McDermott 2018). As a result, practical Industrial IoT test datasets usually differ from the training datasets. One way to solve this problem is to collect a huge amount of labeled training data from multiple Industrial IoT devices under all the different possible working conditions. However, manually labeling large amounts of data is a very expensive and time-consuming process (Buczak and Guven 2015). Consequently, this restricts the real-world application of DL algorithms to error, failure and attack detection in IoT devices under different scenarios.

3 Log analysis and process mining

Recall that Sect. 2 introduced the systematic approach Chandola et al. (2009) used for defining an Anomaly Detection technique. The previous sections showed how existing research narrows the scope for the types of anomalies to discover, what characterizes these anomalies, and which techniques from the ML field can be employed to discover them. In this section, we focus on detecting behavioral anomalies, e.g., security, software or hardware errors/failures, by employing techniques from Process Mining. We first consider how Events can be logged from the operating IoT system. Then we focus on the available Process Mining activities that enable discovering process models and behavioral anomalies from Event Logs.

3.1 Logging

When possible, the code of the application can be instrumented by introducing explicit logging rules, which get triggered when specific events happen. In IoT systems, however, most of the time the logged data are low-level sensory data collected by the deployed IoT devices. In this work, we are interested in analyzing the activities performed by the communicating IoT devices. We will, therefore, first provide a brief review of the literature proposing frameworks for extracting activities from low-level sensory data. Afterward, supposing it is possible to explicitly log Events from the application, we will provide an overview of available techniques for instrumenting the code to log Events.

3.1.1 Low-level sensory data connection to high-level activities

Connecting low-level sensory data to activities performed in the system is not a trivial task. However, a lot of research work has been done on detecting behavioral anomalies by inspecting Event Logs, i.e., logs that explicitly collect the activities performed by the system at run-time. Therefore, it is worthwhile to address the problem of connecting low-level sensory data to high-level activities for further analyses.

Seiger et al. (2020) try to provide a framework for connecting low-level sensory data retrieved from a Smart Factory IoT application to high-level activities tied to high-level behavioral models.

Bakar et al. (2016) also provided, among other things, a framework for extracting activities from low-level sensory data. This work explores the pipeline raw data go through for extracting activity labels. It starts with the collection of raw data as a collection of samples. These samples are pre-processed and segmented to scope the search for activities to a limited time-window. Significant features are then extracted from each time-windowed sample. Finally, activities are extracted. The authors review a number of techniques for extracting activities from data; these techniques can extract activities either using supervised approaches, which require pre-existing labeled sample data, or using unsupervised approaches.
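
The segmentation and feature-extraction steps of such a pipeline can be sketched as follows; the fixed-size time windows and the mean/peak features are simplifying assumptions for illustration, not the exact choices of Bakar et al. (2016).

```python
def segment(samples, window):
    """Group (timestamp, value) samples into fixed-size time windows."""
    windows = {}
    for t, v in samples:
        windows.setdefault(t // window, []).append(v)
    return [vals for _, vals in sorted(windows.items())]

def extract_features(window_vals):
    """Simple per-window features: mean and peak value."""
    return (sum(window_vals) / len(window_vals), max(window_vals))

# Toy raw sensory data: (timestamp in seconds, sensor reading)
raw = [(0, 1.0), (3, 2.0), (11, 10.0), (14, 12.0)]
features = [extract_features(w) for w in segment(raw, window=10)]
print(features)  # -> [(1.5, 2.0), (11.0, 12.0)]
```

A supervised activity-extraction step would then map each feature tuple to an activity label using pre-existing labeled samples, as the surveyed techniques do.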

Suryadevara and Mukhopadhyay (2012) consider sensor networks attached to monitors that determine whether a certain appliance is turned on or not. The system processes the information and labels the events considering some rules. For example, if pressure is detected on a bed, and the pressure is applied between 9 p.m. and 6 a.m., the system logs this event as the householder being asleep.
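
A rule set in this spirit can be sketched in a few lines; the sensor names, rules and labels below are hypothetical examples, not those of the cited system.

```python
from datetime import time

def label_event(sensor, active, timestamp):
    """Map a raw sensor reading to a high-level activity label using
    hand-written rules, in the spirit of Suryadevara and
    Mukhopadhyay (2012)."""
    if sensor == "bed_pressure" and active:
        # Pressure on the bed during the night -> householder asleep
        if timestamp >= time(21, 0) or timestamp < time(6, 0):
            return "sleeping"
        return "resting"
    if sensor == "kettle_power" and active:
        return "preparing_drink"
    return "unknown"

print(label_event("bed_pressure", True, time(23, 30)))  # -> sleeping
```

The resulting labels can then be written to an Event Log for the Process Mining analyses discussed later.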

Hemmer et al. (2020) propose, among other things, a framework for extracting activities from low-level sensory data. This framework is a pipeline that makes use of data normalization, clustering and time-splitting techniques to identify what the authors regard as the states the system is in during its operation. The states, together with their timestamps, constitute the entries of the resulting Event Log.

3.1.2 Event logging instrumentation

Traditional logging techniques in major applications are not structured, and many of the errors they log do not lead to failures. While these errors might be useful in some cases, most of the time they only decrease the readability of error logs by humans. Furthermore, there are instances where errors that were directly responsible for the failure of a system could not be captured by these logging techniques. Considering that we are only focused on errors that lead to a failure, these techniques yield a lot of false positives and false negatives. Here we focus only on the effective errors that may evolve into failures.

Most logging patterns try to detect errors and log them by placing a line of code at the end of a block of instructions. These techniques may not be able to detect errors such as infinite loops. Therefore, instead of working on the resulting error logs, Cinque et al. (2012) aim to develop a new logging mechanism which addresses these and similar issues, e.g., false positives and false negatives, by monitoring changes in the control flow of the program and placing the logging instructions strategically in the code. Moreover, the proposed logging mechanism aims to detect only those errors which cause failures.

The Rule-Based Logging Mechanism A set of error modes is established, based on the widely accepted taxonomy in the dependability area (Avizienis et al. 2004), such as Service Error (SER: prevents an invoked service from reaching its exit point). Different types of code injections, called Logging Rules (LR), in the form of events, are then introduced into the source code of the system. This helps in logging the events happening in the system, e.g., LR-1 (Service Start), LR-2 (Service End), HTB (Heartbeat).

In Rule-Based logging, these Logging Rules do not write the log file directly but are processed dynamically by a framework known as LogBus. The above logging mechanism can be extended to work within an IoT system by introducing a new entity, which can be in the form of a framework; we name it the LogMonitor (Fig. 2). Contrary to traditional logging mechanisms, where lines of code written within the system update the error logs, the LogMonitor monitors all the events logged by the aforementioned logging rules across the various IoT systems in an environment, and eventually updates the error log. This way, a consistent error log can be generated for the entire system. However, this mechanism assumes that we have access to the source code of the software each entity within the IoT system hosts. Other, more tedious methods, such as parsing the Event Logs of each entity (He et al. 2017; Du and Li 2016), may have to be employed in case internal access is unavailable.
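
The kind of check a LogMonitor-like component could run over rule-based events can be sketched as follows; the event-tuple format and the detection of the SER mode (a service start without a matching end) are illustrative assumptions, not the actual LogBus protocol.

```python
def find_service_errors(events):
    """Flag services that logged a start (LR-1) but no matching
    end (LR-2), i.e. the Service Error (SER) mode: the invoked
    service never reached its exit point."""
    open_services = set()
    for rule, service in events:
        if rule == "LR-1":          # service start
            open_services.add(service)
        elif rule == "LR-2":        # service end
            open_services.discard(service)
    return sorted(open_services)    # services that never terminated

# Hypothetical stream of rule-generated events: (rule, service name)
log = [("LR-1", "auth"), ("LR-2", "auth"),
       ("LR-1", "sync"), ("LR-1", "backup"), ("LR-2", "backup")]
print(find_service_errors(log))  # -> ['sync']
```

A real implementation would additionally use heartbeat (HTB) events and timeouts to distinguish a hung service from one that is merely slow.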

Fig. 2 Conceptual structure of a log-monitor

3.2 Defining event logs

In the previous section, an approach was discussed to make the Event Log of an IoT system consistent. However, there is still a need to define Event Logs and categorize them in such a way that appropriate tools can be developed to work efficiently on each category. Some Event Logs contain timestamps; some contain user-readable text while others do not. Some contain just a few pieces of information per event, while others may contain hundreds of fields in a single line of the Event Log. Although the format of an Event Log depends on the manufacturer, in most standardized cases an Event Log has an Event ID, a Timestamp, one or more primary attributes such as Activity, and other secondary attributes such as Source, Server Name, etc. Thus, for a better understanding, let us assume a general, readily comprehensible Event Log example, as shown in Table 1. The given Event Log can represent most types of Event Logs generated by a system, as it includes some identification (Case ID and Event ID), a Timestamp and, in this case, one primary attribute (Activity) and zero secondary attributes. The number of attributes for an Event Log is not constrained. However, Event Logs are bound by some underlying assumptions (Van Der Aalst 2016). One important assumption is that each event refers to some activity. For example, in Table 1, every event consists of an activity, such as pin verification, connection successful, etc. A system such as the LogMonitor, discussed in Sect. 3.1.2 above, could be programmed in such a way that it generates Event Logs in the above format.

Table 1 A sample event log

As can be observed, the Event Logs in Table 1 are first divided into cases. Each case is further divided into Event IDs. It is worth noting that the events listed under a case should relate to precisely that case. Moreover, all the events in a case should be chronologically ordered, and at least one attribute is mandatory for an Event Log to be valid.
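
These validity assumptions are easy to check mechanically; the following sketch encodes an Event Log as a toy Python structure (an assumed layout, not a standard interchange format such as XES) and verifies chronological ordering and the mandatory activity attribute.

```python
def is_valid_event_log(log):
    """Check the Event Log assumptions of this section: events are
    grouped by case, chronologically ordered within each case, and
    each event carries an activity attribute."""
    for case_id, events in log.items():
        timestamps = [e["timestamp"] for e in events]
        if timestamps != sorted(timestamps):
            return False            # events not chronologically ordered
        if any("activity" not in e for e in events):
            return False            # mandatory attribute missing
    return True

# Toy log in the shape of Table 1: one case with two ordered events
log = {"case-1": [{"timestamp": 1, "activity": "pin verification"},
                  {"timestamp": 2, "activity": "connection successful"}]}
print(is_valid_event_log(log))  # -> True
```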

3.3 Process mining

An IoT system can be expected to generate anywhere from a thousand to hundreds of thousands of Event Log entries in an hour; as a result, it is imperative to extract from them meaningful information that can help troubleshoot errors. One of the ways this can be achieved is Process Mining, an excellent tool for capturing meaningful information from Event Logs. It is mainly used to offer new means to discover, monitor and improve processes in several application domains; however, it can also be used to analyze and rectify errors in a system. It is a comparatively new research discipline that tries to extract knowledge from the Event Logs of real processes (i.e., not assumed processes), which are readily available in today’s information systems. A report from Gartner estimated the market size of Process Mining as approaching $160 million in 2018 (Kerremans 2018). This is expected to grow to $1,421.7 million by 2023, with a Compound Annual Growth Rate of 50.3%. The same report also predicts the growth of Process Mining usage in the IT industry (Fig. 3).

Fig. 3 Projected process mining use case in 2021

A process model generated by Process Mining from Event Logs can be very effective in handling large amounts of data. Furthermore, many commercial and academic systems have implemented Process Mining algorithms. Vossen (2012) claims that an active group of researchers is working on Process Mining and that it has become one of the “hot topics” in Business Process Management (BPM) research. Additionally, more and more software vendors are adding Process Mining functionality to their tools. Notable examples are: Discovery Analyst (StereoLOGIC), ARIS Process Performance Manager (Software AG), and Comprehend (Open Connect). These reasons lead us to believe that Process Mining can be an effective tool for solving the problem of Self-healing in IoT systems. Besides this, as of July 2015, there were 18 PhD students funded by Philips working on analyzing Philips X-ray machines and electronic shaving devices, among other items, to see how these machines are used in the field, when and why they fail, etc. Moreover, most of the X-ray machines manufactured by Philips use Process Mining for fault diagnosis (Van Der Aalst 2016; Vossen 2012).

Since Process Mining primarily deals with Event Logs, it is assumed that the Event Logs used comply with the guidelines mentioned in Sect. 3.2. The Event Log is the starting point and can be used to conduct three types of Process Mining activities, briefly described below (Fig. 4).

  • Process discovery The activities in this class are concerned with discovering process models from Event Logs through the use of Process Discovery algorithms, such as the \(\alpha\)-algorithm or the Inductive Miner (Van Der Aalst 2016).

  • Conformance checking This class of activities is concerned with comparing the activities found in the Event Log with normative process models. These normative process models may have been developed at design-time or may have been discovered through Process Discovery algorithms.

  • Enhancement The activities in this class focus on improving existing process models by recovering insights from available Event Logs.

While Process Discovery remains the most popular type of Process Mining, its share is steadily declining, whereas the usage of the other two, i.e., Conformance Checking and Enhancement, is increasing and is expected to gain even more popularity in the future (Kerremans 2018) (Fig. 5).

Fig. 4
figure 4

Different types of process mining in terms of input and output

3.4 Process discovery

To create a process model in Process Mining, we start from a collection of observed behavior (Event Logs, in our case) and automatically construct the model from the logs. This is because, commonly, there are no appropriate existing models, or the models are flawed or incomplete. Rolland (1998) describes a process model, roughly, as an anticipation of what the process will look like.

Fig. 5
figure 5

Projected adoption of basic Process Mining types

A process model can be represented using various notations, and the fact that many notations are available demonstrates the significance of process modeling. A simple Event Log, as shown in Table 1, may be represented using an intuitive model. However, in the case of an IoT system, or any electronic system whatsoever, the Event Logs are far more complex than the one illustrated. As a result, a legitimate, well-defined notation should be used to represent a process model. Moreover, process models for complicated systems are prone to many errors (Van Der Aalst 2016), some of which are outlined here:

  • Oversimplification of the model Most models have a propensity to focus only on the desirable behavior, overlooking events that are less likely to happen. It is often precisely these “overlooked” events that cause most of the errors, so the model may not be helpful.

  • Inability to satisfactorily capture human behavior Simple and mundane processes involving human engagement are prone to modifications, as human behavior is unpredictable. These changes, although minor, should nevertheless be incorporated in a model, as these non-conformities by and large result in an eventual error.

  • Restricted or redundant abstraction level It is important that an appropriate abstraction level is chosen based on the objective and the input data. A model may be too abstract to answer a detailed question or, conversely, too detailed, and thus redundant, to answer a simple one.

Manually specified process models are susceptible to these (and similar) errors, and a poorly designed model may engender wrong conclusions. To avoid these problems, Process Mining uses Event Logs to create a process model. An Event Log contains the exact steps that the system underwent to complete a task. Additionally, using a large volume of Event Logs for modeling results in the inclusion of the less likely events discussed above. Finally, a process model is capable of providing different views of the same system at different abstraction levels.

Fig. 6
figure 6

A Petri net process model generated using \(\alpha\)-algorithm for the sample event log displayed in Table 1

Algorithms such as the \(\alpha\)-algorithm (Van Der Aalst 2016) can automatically generate process models from Event Logs, as will be discussed in later sections. Notations such as EPCs, BPMN, UML activity diagrams, Transition Systems, Workflow Nets, YAWL, etc. can be used to describe process models. However, unlike transition systems, Petri nets can describe concurrent processes without much effort. Petri nets are among the oldest process modeling languages and, due to their broad usage, have been extensively investigated; as a result, there exist many tools to analyze them\(^{8}\) (Van Der Aalst 2016; Vossen 2012). A Petri net, as suitably described by the authors in Manoj et al. (2012), is “a directed bipartite graph, in which the nodes represent transitions (i.e., events that may occur, signified by bars) and places (i.e., conditions, signified by circles).” Petri nets have a static network structure; tokens flow through the network according to specific firing rules, and the distribution of tokens over places (also referred to as the marking) determines the state of the Petri net. Petri and Reisig (2008) provide a broad understanding of Petri nets. Furthermore, a labeled Petri net extends the basic Petri net with a set of activity labels. Figure 6 is a Petri net model for the sample Event Log shown in Table 1.
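The firing rules just described can be sketched in a few lines of Python. The class below is a minimal illustration, not a full Petri net library: each transition consumes one token from every input place and produces one token on every output place, and the marking is a plain place-to-count dictionary.

```python
class PetriNet:
    """Minimal marked Petri net (an illustrative sketch, not a full
    Petri net library): each transition consumes one token from every
    input place and produces one token on every output place."""

    def __init__(self, transitions, marking):
        # transitions: {name: (input_places, output_places)}
        self.transitions = transitions
        self.marking = dict(marking)  # place -> token count

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= 1 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

# a -> b in sequence: firing both moves the token from 'start' to 'end'
net = PetriNet({"a": (["start"], ["p1"]), "b": (["p1"], ["end"])},
               {"start": 1})
net.fire("a")
net.fire("b")
print(net.marking)  # {'start': 0, 'p1': 0, 'end': 1}
```

Attempting to fire a transition whose input places hold no tokens raises an error, which is exactly the situation token replay later records as a deviation.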

Fig. 7
figure 7

A complex process model using Petri nets, and a zoomed-in version of the shaded region shown in the inset

Various existing algorithms convert Event Logs to a process model as above; one of the most prominent, and rather naïve, is the \(\alpha\)-algorithm (Van Der Aalst 2016). Additionally, Fig. 7Footnote 9 is an example of a seemingly more complicated process model, created using ProM Tools. The inset in Fig. 7 shows a zoomed-in version of the shaded region of the process model. Process models, too, contain detailed information, but the information relevant to a given question may be available only in a particular section.

3.5 Conformance checking

Once a reference (normative) process model is in place, event data stored in an event log can be replayed on top of the model to diagnose whether reality aligns with the normative model’s specifications. Conformance checking, i.e., event log replay, therefore allows linking reality to process models. Specifically, among the opportunities conformance checking opens, we find (Van Der Aalst 2016):

  • Assessment of the quality criteria of the normative model against the events found in the event log;

  • Repair of normative process models exploiting insights provided by event data;

  • Detection of control-flow anomalies due to, e.g., missed or wrongly ordered activity executions.

Concerning the assessment of quality criteria, four are commonly considered:

  • Fitness

  • Simplicity

  • Generalization

  • Precision

In the rest of the section, out of these criteria, we will focus on fitness, showing that it provides useful insights on the process captured through event data.

3.5.1 Fitness assessment

Conformance checking techniques aim at highlighting whether event logs fit a normative process model. This diagnosis is provided through a quantitative metric, termed “fitness”, whose value, a real number between 0 (no fit) and 1 (best possible fit), depends on the specific technique used. Considering that event data are organized in traces, where each trace refers to the set of event log entries sharing the same case ID, techniques can be classified according to the granularity of their fitness analysis. For instance, a trace-level technique may diagnose that an event log does not fit even if only a single event in a trace does not conform to the expected control flow. Other techniques, such as token-based ones, work at a finer granularity, exploiting event-level information rather than whole traces.
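The grouping of event log entries into traces by case ID can be sketched as follows. The field names (`case_id`, `activity`, `timestamp`) are hypothetical; real logs (e.g., XES files) use their own attribute conventions.

```python
from collections import defaultdict

def build_traces(entries):
    """Group event log entries into traces keyed by case ID, ordering
    events by timestamp within each trace."""
    traces = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["timestamp"]):
        traces[e["case_id"]].append(e["activity"])
    return dict(traces)

log = [
    {"case_id": 1, "activity": "A", "timestamp": 1},
    {"case_id": 2, "activity": "A", "timestamp": 2},
    {"case_id": 1, "activity": "C", "timestamp": 3},
    {"case_id": 2, "activity": "B", "timestamp": 4},
]
print(build_traces(log))  # {1: ['A', 'C'], 2: ['A', 'B']}
```

Each resulting activity sequence is one trace \(\sigma\); fitness is then computed per trace or aggregated over the whole log.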

In order to show the potential of finer-grained, fuzzier fitness evaluations, we will first introduce a popular conformance checking technique, Token Replay. Then, the framework used for performing experiments using this technique will be thoroughly described. Finally, a set of experiments will be designed and presented for evaluating the usefulness of finer-grained diagnoses when different kinds of control-flow anomalies are found in event data.

Token Replay Token replay is a technique that replays event data on a normative process model. It is tailored for replaying traces on normative Petri nets, recording whether the transitions found in event data are allowed to fire given the state the process goes through as events are replayed on the model. When a transition, mapped to the activity the replayed event records, is fired without its enabling conditions being met, the situation is marked as a deviation from the nominal behavior. As transitions are fired, four quantities are recorded and updated accordingly:

  • p (produced tokens)

  • c (consumed tokens)

  • m (missing tokens)

  • r (remaining tokens)

Once all traces in the event log get replayed, the event log fitness is computed with the following formula (Van Der Aalst 2016):

$$\begin{aligned}Fitness(L,N) &= \frac{1}{2}\bigg (1-\frac{\sum _{\sigma \in L}L(\sigma )\cdot m_{N,\sigma }}{\sum _{\sigma \in L}L(\sigma )\cdot c_{N,\sigma }}\bigg ) \\ &+ \frac{1}{2}\bigg (1-\frac{\sum _{\sigma \in L}L(\sigma )\cdot r_{N,\sigma }}{\sum _{\sigma \in L}L(\sigma )\cdot p_{N,\sigma }}\bigg ) \end{aligned}$$
(1)

where L is the event log, N is the normative Petri net, and \(\sigma\) ranges over the traces belonging to L. When considering a single trace, the formula reduces to:

$$\begin{aligned} Fitness(\sigma , N)=\frac{1}{2}\bigg (1-\frac{m_{N,\sigma }}{c_{N,\sigma }}\bigg )+\frac{1}{2}\bigg (1-\frac{r_{N,\sigma }}{p_{N,\sigma }}\bigg ) \end{aligned}$$
(2)
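The bookkeeping behind Eq. (2) can be sketched as follows. This is a simplified illustration of token replay on a Petri net represented as a transition map, assuming one token per arc and ignoring silent transitions; production-grade implementations (e.g., in ProM) are considerably more involved.

```python
def token_replay(trace, transitions, start, end):
    """Replay one trace on a Petri net given as {name: (inputs, outputs)},
    returning (p, c, m, r, fitness) with fitness as in Eq. (2).
    Sketch only: one token per arc, no silent transitions."""
    marking = {start: 1}   # the environment produces a token in 'start'
    p, c, m = 1, 0, 0
    for act in trace:
        inputs, outputs = transitions[act]
        for place in inputs:
            if marking.get(place, 0) >= 1:
                marking[place] -= 1
            else:
                m += 1     # token was missing and is added artificially
            c += 1
        for place in outputs:
            marking[place] = marking.get(place, 0) + 1
            p += 1
    # the environment consumes the token from 'end'
    if marking.get(end, 0) >= 1:
        marking[end] -= 1
    else:
        m += 1
    c += 1
    r = sum(marking.values())          # tokens left behind after replay
    fitness = 0.5 * (1 - m / c) + 0.5 * (1 - r / p)
    return p, c, m, r, fitness

seq_net = {"a": (["start"], ["p1"]), "b": (["p1"], ["end"])}
print(token_replay(["a", "b"], seq_net, "start", "end"))  # (3, 3, 0, 0, 1.0)
print(token_replay(["b"], seq_net, "start", "end"))       # (2, 2, 1, 1, 0.5)
```

The fitting trace yields m = 0 and r = 0, hence fitness 1; skipping activity a leaves one missing and one remaining token, halving the fitness.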

4 Simulation and results

In the following, token replay will be shown to provide not only fitness measurements but also other useful information for classifying control-flow anomalies.

4.1 Experiments’ framework

As experiments will be performed comparing event data with normative process models through the token replay technique, we need the following:

  1. A normative process model described as a Petri net;

  2. Event data to be checked against the normative Petri net;

  3. A software tool implementing the token replay technique.

Regarding requirements 1 and 2, we re-used the open-source simulator PLG2 (Burattin 2016), which allows us to simulate business process modeling notation (BPMN) models and, therefore, to generate event data. Considering that two models are trace-equivalent if their sets of execution sequences are equal (Van Der Aalst 2016), a BPMN model can be translated to a trace-equivalent Petri net if the correct translation rules for the control-flow constructs of BPMN models are applied. BPMN-to-Petri net translation rules for precedence, XOR split, and XOR join BPMN activity relations are shown in Fig. 8; we limit ourselves to normative BPMN models with only these activity relations, as this allows us to generate a trace-equivalent Petri net for applying the token replay technique. Traces generated with PLG2 need to be translated from BPMN traces to Petri net traces. Consider, for instance, Fig. 9, which shows a BPMN model (top) and its corresponding Petri net (bottom). The sample trace \({\texttt {<A, C, D, E>}}\) would be translated as \({\texttt {<A, A\_XOR\_C, C, D, D\_XOR\_E, E>}}\).
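The trace translation step can be sketched as below. Here `xor_pairs`, a lookup of activity pairs connected through a XOR gateway in the model, is a hypothetical representation introduced for illustration; the `<pred>_XOR_<succ>` naming mirrors the example in the text.

```python
def translate_trace(bpmn_trace, xor_pairs):
    """Translate a BPMN trace into its Petri net counterpart by inserting
    the silent XOR routing transitions between activity pairs connected
    through a XOR gateway."""
    out = []
    for prev, curr in zip([None] + bpmn_trace, bpmn_trace):
        if prev is not None and (prev, curr) in xor_pairs:
            out.append(f"{prev}_XOR_{curr}")
        out.append(curr)
    return out

# assumed XOR-connected pairs for the model of Fig. 9
xor_pairs = {("A", "C"), ("A", "B"), ("D", "E"), ("D", "F")}
print(translate_trace(["A", "C", "D", "E"], xor_pairs))
# ['A', 'A_XOR_C', 'C', 'D', 'D_XOR_E', 'E']
```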

Fig. 8
figure 8

Ruleset for precedence, XOR split, and XOR join BPMN activities relations

Fig. 9
figure 9

BPMN model and its trace-equivalent Petri net counterpart

Regarding requirement 3, we use an open-source implementation of the token replay technique developed by us and available at Vitale (2022). Each of the experiments presented below uses the normative BPMN model in Fig. 10.

Fig. 10
figure 10

BPMN reference model

4.2 Experiment 1

The goal of this experiment is to evaluate whether different kinds of control-flow anomalies, injected into traces simulated through PLG2 using the reference process model in Fig. 10, result in statistically significant fitness differences.

We consider three types of control-flow anomalies: missed activities, duplicated activities, and exchanged activities. Figure 11 shows instances of such anomalies; considering the sample correct trace \({\texttt {<A, C, E, F, G, L>}}\), each of the three depicted cases shows how the trace changes when the anomaly is injected. Note that multiple anomaly instances can be injected into a single trace. As a starting dataset, we generated 1000 correct traces with PLG2 and split them into batches of 20 traces each, so that each batch is evaluated independently of the others, for a total of 50 batches. The experiment is run three times and, for each run, a certain number of anomaly instances of one of the types defined above is injected into each batch. This process outputs three result tables of 50 rows each; each row holds the batch evaluated, the fitness value computed using the token replay algorithm, and the number of injected anomaly instances (we considered 5 for each experiment run). Table 2 shows a flattened view (limited to 10 entries out of 50) collapsing the fitness values of the three experiments; specifically, columns FMA, FDA, and FEA record the fitness values computed when traces were injected with the missed activity, duplicated activity, and exchanged activity anomaly, respectively.
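The injection step can be sketched as follows; positions are chosen at random, and the function is an assumption about how such anomalies might be injected, not the exact procedure used in the experiments. Multiple instances per trace correspond to repeated calls.

```python
import random

def inject_anomaly(trace, kind, rng=random):
    """Return a copy of the trace with one control-flow anomaly injected.
    kind is 'missed', 'duplicated', or 'exchanged'. This is an assumed
    sketch of the injection step; positions are chosen at random."""
    t = list(trace)
    if kind == "missed":
        t.pop(rng.randrange(len(t)))                 # drop one activity
    elif kind == "duplicated":
        i = rng.randrange(len(t))
        t.insert(i, t[i])                            # repeat one activity
    elif kind == "exchanged":
        i = rng.randrange(len(t) - 1)
        t[i], t[i + 1] = t[i + 1], t[i]              # swap two neighbours
    else:
        raise ValueError(f"unknown anomaly kind: {kind}")
    return t

rng = random.Random(42)
trace = ["A", "C", "E", "F", "G", "L"]
print(inject_anomaly(trace, "missed", rng))     # one activity removed
print(inject_anomaly(trace, "exchanged", rng))  # two neighbours swapped
```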

Fig. 11
figure 11

Sample control-flow anomalies

As the fitness values are computed for the same batches injected with different anomalies, a paired t-test can be performed for each possible pair of observations. Table 3 summarizes the test results, reporting the p-value for each test performed and whether the null hypothesis was rejected. As is clear from the table, the null hypothesis was rejected for both comparisons against exchanged activities, meaning the results can discriminate with reasonable confidence whether the missed (duplicated) activities anomaly type or the exchanged activities anomaly type was injected.
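As an illustration, the paired t statistic can be computed directly from two columns of fitness values. The sketch below uses toy data (not the actual experiment results) and compares the statistic against the two-sided 5% critical value rather than computing a p-value, which would normally be obtained via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t statistic for two matched samples, e.g., fitness values of
    the same batches under two different anomaly types."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# toy fitness columns (NOT the actual experiment data) with a clear
# systematic difference between the two anomaly types
fma = [0.90, 0.88, 0.91, 0.89, 0.92, 0.90, 0.87, 0.91, 0.90, 0.89]
fea = [0.80, 0.79, 0.82, 0.78, 0.81, 0.80, 0.77, 0.82, 0.79, 0.80]
t = paired_t(fma, fea)
T_CRIT = 2.262  # two-sided 5% critical value of Student's t for df = 9
print(abs(t) > T_CRIT)  # True: reject the null hypothesis for this toy data
```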

Table 2 Flattened view showing part of the results of experiment 1
Table 3 Results of the statistical tests comparing fitness results with different injected anomaly types

4.3 Experiment 2

The goal of this experiment is to show that Process Mining can be linked to classical machine learning techniques for classification tasks on the traces handled through the token replay algorithm. Specifically, we show that machine learning techniques can detect, with satisfactory accuracy, whether a specific type of control-flow anomaly is present in the analyzed traces.

We consider the same 1000 traces generated in experiment 1. However, we now split them into 200 batches of 5 traces each and further split these batches into three parts, ending up with three sets of trace batches. Into each set of batches, labeled 0, 1, and 2, a different anomaly type (carrying the same label as the set) is injected. Specifically:

  • Precedence anomalies are injected into each trace of every batch of set 0;

  • XOR split anomalies are injected into each trace of every batch of set 1;

  • XOR join anomalies are injected into each trace of every batch of set 2.

These anomaly types differ slightly from the ones injected in experiment 1, as they target specific activity relations. Basically, each time a trace exhibits a precedence, XOR split, or XOR join relation between a pair of activities, those activities are exchanged in the trace, invalidating the relation and therefore causing a control-flow anomaly. Once anomalies are injected into the batches, they are processed using the token replay algorithm, which provides not only a fitness measurement but also information about activities missing the input tokens needed for firing when replaying the traces of the batches. Table 4 provides some of the results obtained when applying token replay to the anomalous trace batches.
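The relation-targeted exchange can be sketched as below; `relation_pairs`, the set of activity pairs connected by the targeted relation, is a hypothetical representation introduced for illustration.

```python
def inject_relation_anomaly(trace, relation_pairs):
    """Exchange every pair of adjacent activities connected by the targeted
    relation (precedence, XOR split, or XOR join), invalidating the
    relation and thus causing a control-flow anomaly."""
    t = list(trace)
    i = 0
    while i < len(t) - 1:
        if (t[i], t[i + 1]) in relation_pairs:
            t[i], t[i + 1] = t[i + 1], t[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return t

print(inject_relation_anomaly(["A", "C", "D", "E"], {("A", "C"), ("D", "E")}))
# ['C', 'A', 'E', 'D']
```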

Table 4 Part of the results obtained when applying token replay to trace batches injected with anomaly types 0, 1, and 2 defined in experiment 2

Once the full dataset is available, the following routine is run 200 times:

  1. Split the full dataset into randomly sampled train and test sets (75% train, 25% test);

  2. Create two copies of the original sets, removing, in one case, all information about XOR transitions, and, in the other, all information about all transitions, leaving fitness values as the only information;

  3. Apply, for each copy of the (train, test) pairs, two machine learning algorithms for the classification of test instances: K-nearest neighbor (KNN) and C-support vector classification (C-SVC), both set with the default parameters provided by the implementations of these classifiers in the Python 3.8 scikit-learn 1.0.2 package;

  4. Evaluate the accuracy metric for each pair of sets and classifiers used.
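To illustrate the classification step without depending on scikit-learn, the sketch below uses a pure-Python nearest-neighbour classifier (standing in for `KNeighborsClassifier` with k = 1) on hypothetical feature vectors of the kind produced by token replay: fitness plus per-relation missing-token counts.

```python
def nn_classify(train, point):
    """Label of the nearest training point by squared Euclidean distance:
    a pure-Python stand-in for scikit-learn's KNeighborsClassifier (k=1)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda pair: dist2(pair[0], point))[1]

# hypothetical feature vectors: (fitness, missing tokens per relation type),
# with labels 0/1/2 matching the anomaly sets defined above
train = [((0.95, 3, 0, 0), 0), ((0.80, 0, 4, 0), 1), ((0.70, 0, 0, 5), 2)]
tests = [((0.93, 2, 0, 1), 0), ((0.82, 0, 3, 1), 1), ((0.72, 1, 0, 4), 2)]
acc = sum(nn_classify(train, f) == label for f, label in tests) / len(tests)
print(acc)  # 1.0 on this toy data
```

Dropping the per-relation columns would leave only the fitness value, mimicking the "only fitness information" condition of the experiment, where accuracy degrades.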

Part of the results are shown in Table 5. Considering all the values for each column of this table, we find that, on average, the KNN algorithm using the full feature space yields 99.96% accuracy, much higher than the averages with no XOR information (81.34%) and with only fitness information (59.56%). The same goes for C-SVC, with 100% average accuracy using the full feature space (against 82.23% and 62.92% accuracy with no XOR information and only fitness information, respectively).

Table 5 Part of the results obtained when applying the K-nearest neighbor and C-support vector classification machine learning algorithms to the copies of the train and test datasets of experiment 2. FFS_KNN: full feature space with K-nearest neighbor; FFS_SVC: full feature space with support vector classification; NXOR_KNN: no XOR information with K-nearest neighbor; NXOR_SVC: no XOR information with support vector classification; OF_KNN: only fitness information with K-nearest neighbor; OF_SVC: only fitness information with support vector classification

4.4 Experiments’ discussion

The two experiments make clear that, though token replay is one of the most basic finer-grained conformance checking techniques, the information it provides yields useful insights about the traces observed from a system. Specifically, experiment 1 showed that fitness values computed for trace batches injected with different types of control-flow anomalies may exhibit statistically significant differences, whereas experiment 2 showed that fitness values, together with related information about activities triggered without the necessary conditions, support the classification of different anomalies using widely known machine learning algorithms such as KNN and C-SVC.

5 Conclusion

Due to the huge interest in and growth of IoT-based Cyber-Physical Systems in diverse smart-X applications where a certain degree of resilience is essential, in this paper we stressed the importance of methodologies, tools, technologies, procedures and frameworks supporting automatic threat (i.e., fault, error and failure) detection and, possibly, recovery. We investigated the usage of Process Mining based on the analysis of IoT log files as a useful tool to support Self-Diagnostics and Self-Healing. We have shown the potential of those techniques and also highlighted some open issues to be addressed in future research. For instance, dissecting process models requires various protocols to be followed (Van Der Aalst 2016), which are not covered in this paper. However, once a robust process model is built, the following situations can be detected and managed through model analysis:

  1. Something did not happen although it was expected to happen.

  2. Something unexpected happened.

  3. The IoT system is operating according to its specification.

  4. The most frequent path followed by the user, etc.

There are various tools available on the market to create a process model, e.g., ProM, Disco, Celonis, etc. (Celik and Akçetin 2018). However, in order to create a process model for a new system, a plugin may be needed so that the Event Logs can be easily comprehended by the tool. Note that Process Mining may also be capable of addressing the problem of error predictability, where an end-user is warned beforehand when external events and/or user actions have a high probability of causing an error. Such analyses belong to the categories known as prognostics, early warning and situation assessment. In conclusion, generating consistent Event Logs throughout all the IoT devices in one environment, by using a system such as the LogMonitor, can be very effective for analyzing inter-device faults. Furthermore, generating process models using Process Mining, while adhering to the Event Log Rules, is an essential step toward Self-Healing in IoT-based CPS.