The definition of reliability, as discussed in the previous sections, has a strong element of quantification associated with it. Reliability, as defined in Sect. 3, is not a subjective science, and therefore mechanisms aiming to assess reliability should be objective and quantifiable in nature. There is also a heavy focus within reliability engineering on defining and using metrics to assess the reliability of components and systems. Research in the area of IoT reliability has been conducted to enhance reliability at various levels of the IoT architecture. This section summarises the research available in the areas of device reliability, data quality, network reliability and anomaly detection, all of which represent key areas for improving IoT reliability.
Device reliability
Several authors researching IoT device reliability have integrated classical reliability metrics into IoT-centric solutions. Reliability, failure rate, availability, and MTTR were quantified by Zin et al. (2016). The work proposed a probabilistic model for measuring reliability in connected IoT devices, positing that the failure structures of IoT devices adhere to a certain probability distribution. The authors define the reliability measure R(t) as the probability that the device operates correctly throughout the time interval [0, t]. This probabilistic function allows the expected time to failure, availability and reliability of a given IoT device to be estimated. Meanwhile, Mavrogiorgou et al. (2018) included Mean Time to Repair (MTTR), MTTF, MTBF and availability metrics in their work, which proposed a mechanism for capturing the reliability of heterogeneous IoT devices. This mechanism considered both known and unknown device types and sought to differentiate reliable devices from unreliable ones, with the goal of collecting data from the former and discarding data from the latter. The mechanism consisted of four stages: device recognition, specifications classification, reliability estimation and reliability validation. Using this mechanism, the authors were able to build a ranking of connected fitness devices based upon their results for known reliability metrics. Lastly, Kim (2016) used reliability, failure rate and recoverability in a study which proposed a weighted model for quantifying reliability in the IoT. The model consisted of four quality criteria: functionality, reliability, efficiency and portability. Metrics were defined within these criteria and assigned weights so that the model could provide a total score for the quality of the IoT application. The model was then evaluated in a virtual environment and scores were produced for each of the metrics. The model supports weighting; however, each criterion was weighted evenly in this experiment. These classical metrics provide a useful starting point in the quantification of IoT reliability, but they have not yet matured in capability and cannot attest to reliability across all levels of the IoT architecture.
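As an illustration only (a standard constant-failure-rate sketch, not the exact formulation used in the cited works), these classical metrics are related as follows, where λ denotes the device failure rate:

```latex
R(t) = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \frac{1}{\lambda}, \qquad
\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}, \qquad
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
```

Under this assumption, a device with an MTTF of 1000 h and an MTTR of 10 h would have a steady-state availability of roughly 0.99.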
Moving away from the classic well-defined reliability metrics, some non-standard reliability metrics have been designed and implemented in recent studies. Saini (2016) presented a model to evaluate trust factor and reliability over a period of time (ROPT) for IoT systems. Because identical IoT sensors might be deployed in drastically different environments (e.g., exposed to varying levels of humidity, temperature and wind), they might exhibit different expected lifetimes. The author proposed that ROPT be calculated for every individual device and gateway in the IoT system in order to gain a full understanding of how reliable the system is. The author also presented a trust factor rating scale, allowing us to reason about how some IoT applications, e.g. defence systems, require higher levels of trust and therefore higher levels of availability. This study uses only one metric to determine the reliability of the system, and so cannot represent the entire picture of reliability in the IoT. Li et al. (2012) also proposed three non-standard reliability metric definitions to observe the real-time quality of data collected from devices in IoT environments. The study validates the implementation of these metrics by applying them to two real-world open source datasets. The three metrics defined were: currency, availability and validity. Applying the metrics to real-world datasets confirmed that they could be calculated in real time, but this could not attest to their effectiveness in identifying data quality issues in the IoT.
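The exact definitions of currency, availability and validity are those of Li et al. (2012); the minimal sketch below is only one illustrative interpretation (the linear decay model, observation window and value ranges are assumptions, not the authors' definitions):

```python
from datetime import datetime, timedelta

def currency(last_update: datetime, now: datetime, max_age: timedelta) -> float:
    """Illustrative currency score: 1.0 for fresh data, decaying linearly to 0.0 at max_age."""
    age = now - last_update
    return max(0.0, 1.0 - age / max_age)

def availability(received: int, expected: int) -> float:
    """Fraction of expected readings actually received in an observation window."""
    return received / expected if expected else 0.0

def validity(readings, lower, upper) -> float:
    """Fraction of readings falling inside a physically plausible range."""
    if not readings:
        return 0.0
    return sum(lower <= r <= upper for r in readings) / len(readings)

now = datetime.now()
print(currency(now - timedelta(minutes=3), now, timedelta(minutes=10)))  # 0.7
print(availability(received=55, expected=60))                            # ~0.92
print(validity([21.3, 22.0, 85.0], lower=-10, upper=50))                 # ~0.67
```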
A more complete framework for managing quality and reliability is proposed by Sicari et al. (2016) and Sicari et al. (2014). This architecture is designed to quantify the security and quality of individual devices in IoT applications. The model used NOS (Networked Smart Objects) to extract metadata from IoT nodes in a network (Rizzardi et al. 2016). The parameters extracted from a security perspective were confidentiality, integrity, privacy and authentication. The parameters collected for quality were accuracy, precision, timeliness and completeness. Each parameter was assigned an index score ranging from zero to one, reflecting the effectiveness of the node with regard to that parameter. The model was tested using Raspberry Pis and sensors from a meteorological station, and it was successfully able to calculate the specified parameters. This model concerns the data quality characteristics of IoT nodes through the quality metadata, which is not sufficient to describe the holistic reliability of an IoT system. The security metadata provides some insight into how secure a given node is in an IoT system; this could be enhanced by adding anomaly detection.
The research presented in this section is valuable in aiding the understanding of how reliable and prone to failure the devices in our IoT infrastructure are. These pieces of research help form an understanding of how some of this information can be quantified, using metrics like availability, MTBF and MTTR. Nevertheless, the quantification of hardware reliability is only one step in the solution to overall IoT reliability. These research studies are unable to attest to reliability at the network level or make an assessment about the likelihood of the system providing anomalous data or falling victim to a spreading threat.
Network reliability
Beyond being able to reason about the fitness of our IoT devices, we must also be able to attest to the reliability of the network infrastructure that forms the backbone of IoT communication. Generally speaking, two forms of network reliability studies are discussed in this section: studies for enhancing QoS in networks, and studies aimed at quantifying reliability metrics for networks. This section presents the current state-of-the-art research in IoT network reliability.
A novel IoT network QoS metric was proposed by Maalel et al. (2013) in their work, which designed a lightweight and energy-efficient routing protocol to enhance and measure reliability in IoT applications, specifically emergency applications. Emergency applications in the IoT require a rapid response to alarms that have been raised. The work proposed a mechanism called AJIA (Adaptive Joint Protocol based on Implicit ACK) for packet loss and route quality evaluation. The mechanism relies upon the broadcast nature of the protocol, where messages are broadcast to all nearby nodes. The nearby nodes can therefore “overhear” the message being sent. This overhearing function is used instead of traditional ACK messages to ensure the reliability of the message being sent. The links between nodes are then evaluated with a metric called the Link Quality Indicator (LQI), which uses the history of packet loss in the link to determine the reliability of that particular path. Other QoS metrics, such as delay, throughput and packet loss, were quantified by Kamyod (2018). This work employed Riverbed’s Optimized Network Engineering Tools (OPNET) to observe these network reliability parameters in a smart agriculture scenario. These parameters were monitored so that they might provide some information as to how reliable the overall end-to-end IoT system was. The study found that increasing the number of nodes in the network led to longer packet delays, significantly longer transmission times and greater packet loss. Brogi and Forti (2017) proposed a general model for a QoS-aware IoT infrastructure, based on the fog computing paradigm. The model allows IoT applications to generate QoS profiles in order to request certain QoS characteristics from the Things they interact with. Each communication link in the IoT system has an associated QoS profile, which allows the model to determine the potential latency and bandwidth for application-to-Thing communication. The model considers only latency and bandwidth, a limited subset of QoS characteristics that does not fully represent the reliability of the network at a given point in time.
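The paper does not give a closed-form expression for LQI; the sketch below is only one plausible formulation of a packet-loss-history-based link quality score (the exponential weighting and decay factor are assumptions for illustration):

```python
def link_quality(delivery_history, decay=0.9):
    """Exponentially weighted delivery ratio for a link.

    delivery_history: iterable of 1 (packet overheard/delivered) or 0 (lost),
    ordered oldest to newest; recent packets carry more weight via `decay`.
    """
    score, weight_sum, weight = 0.0, 0.0, 1.0
    for delivered in reversed(list(delivery_history)):  # newest first
        score += weight * delivered
        weight_sum += weight
        weight *= decay
    return score / weight_sum if weight_sum else 0.0

print(link_quality([1, 1, 1, 0, 0]))  # recent losses -> lower score (~0.54)
print(link_quality([0, 0, 1, 1, 1]))  # older losses  -> higher score (~0.66)
```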
Further IoT network QoS metrics, embedded in a management framework, were examined in a study by Al-Masri (2018), which presented a microservices QoS management framework (mQoSM) for use in the Industrial IoT (IIoT). This QoS-aware middleware monitors the behaviour of microservices in order to determine the “best” microservice amongst all discovered microservices. This information can then be used by IoT architects to decide whether they wish to integrate the microservice. The framework monitors the following parameters: response time, throughput, availability, reliability and cost. The model presents a useful step towards generating situational awareness of the IoT system with regard to reliability and performance; however, it has not been scaled up beyond microservices in an IoT environment.
An approach to reliability modelling using Generalised Stochastic Petri Nets (GSPN) was proposed by Li and Huang (2017). This approach formulated mathematical models at edge nodes to provide statistics on the performance of IoT devices. The metrics calculated were time consumption, response time, failure rate and repair times. These metrics speak only to the performance of the device-to-edge layer and offer a very limited view of network performance, which does not present a holistic view of IoT reliability. A gateway redundancy model was proposed by Sinche et al. (2018). This work made use of redundancy at both the ISP (Internet Service Provider) level and the gateway (edge node) level. The model was tested in three cases: an IoT infrastructure with no redundancy, one with gateway redundancy, and one with both gateway and ISP redundancy. The model was tested using a physical IoT testbed in which the devices communicated using the I2C bus protocol. RTT (round-trip time) was used as the performance metric to determine the effectiveness of the model. The study found that, under fault conditions, RTT increased by 14% without redundancy, whereas the redundancy models resulted in only a 1% increase. This study considers reliability at the network and cloud level only. Therefore, it does not consider the reliability of the physical devices, or their propensity to fail at any given time. The study also does not consider the heterogeneous nature of IoT communication protocols. Alam (2018) presented a framework to handle reliability issues in the IoT based on the TCP (Transmission Control Protocol). There are three components to the framework: the reliability calculator, the reliability controller and the reliability handler. The framework uses delay to determine the failure state of the IoT system. If high levels of delay are observed by the reliability calculator, the reliability controller will attempt retransmission and the reliability handler will initiate a broadcasting mode and enter a power-saving state. This framework deals only with the delay QoS metric in the IoT, and thus cannot represent the full state of reliability in the network.
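A minimal control-flow sketch of such a delay-triggered scheme is given below; the threshold, retry count and node interface are illustrative assumptions and not Alam's implementation:

```python
DELAY_THRESHOLD_MS = 500   # assumed threshold, not taken from the paper

def reliability_calculator(observed_delays_ms):
    """Flag a failure state when the average observed delay exceeds the threshold."""
    return sum(observed_delays_ms) / len(observed_delays_ms) > DELAY_THRESHOLD_MS

def reliability_controller(send, packet, retries=3):
    """Attempt retransmission a bounded number of times; report success or failure."""
    return any(send(packet) for _ in range(retries))

def reliability_handler(node):
    """On persistent failure, fall back to broadcasting and save power."""
    node.enable_broadcast_mode()   # hypothetical node interface
    node.enter_power_saving()
```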
The research presented in this section shows that while some attempts have been made to enhance reliability in IoT networks, both by enhancing the network’s QoS and by monitoring and quantifying network reliability, there is currently no research approach which successfully combines device and network reliability into one framework.
System reliability
Some research has also been conducted to evaluate IoT reliability at a system level. These approaches operate at a high level of abstraction and do not capture individual detail relating to reliability, such as which devices are responsible for failures, or which parts of the network are responsible for traffic problems.
Behera et al. (2015) proposed a method of modelling reliability in a service-oriented IoT. Specifically, algorithms were proposed to evaluate reliability in a Centralised Heterogeneous IoT Service System (CHISS). The authors proposed that reliability could be measured by modelling the availability of the program required to run the service, the availability of the input required for the service to run, and the service reliability of subsystems associated with the system. The algorithms were tested on a case study of a fire alarm system, which was running under normal operation at the time. The algorithms were able to determine whether the program and file were available for each component in the IoT system. This methodology did not, however, consider the notion that the IoT components could fail at any moment and begin sending anomalous data, or that the network could fall victim to a spreading threat or virus. In order to present a true reflection of reliability, it is necessary to have a mechanism which can alert the user to failures in the system before critical actuations are made.
Kharchenko et al. (2017) proposed the use of a Markov model to predict the reliability requirements of an IoT system. The Markov model considered that the application could be in any of 15 states, ranging from normal condition to complete failure. The probabilistic nature of the Markov model facilitates predicting whether the system will move from one state to the next, and can establish the probability of a failure at a given point in time. This model considers only the states specified in its design and is not capable of reacting to new situations that were not catered for when the model was designed.
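As an illustration of the underlying mechanics only (a toy three-state chain, not the 15-state model of the paper), the failure probability after a number of steps can be obtained by repeatedly applying the transition matrix to the state distribution:

```python
import numpy as np

# Toy state space: 0 = normal, 1 = degraded, 2 = failed (values are illustrative).
P = np.array([
    [0.95, 0.04, 0.01],   # transitions from "normal"
    [0.10, 0.80, 0.10],   # transitions from "degraded"
    [0.00, 0.00, 1.00],   # "failed" is absorbing
])

state = np.array([1.0, 0.0, 0.0])   # system starts in the normal state
for _ in range(24):                 # evolve the distribution over 24 time steps
    state = state @ P

print(f"Probability of being in the failed state within 24 steps: {state[2]:.3f}")
```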
Anomaly detection
With the vulnerable state of IoT networks, given their constrained devices and highly mobile nature, it is essential that any framework which intends to quantify the reliability of an IoT infrastructure has knowledge of the potential presence of anomalous data in its applications. If left undiagnosed, this anomalous data could be sent to the application layer and used in critical actuation situations, with severe consequences. This section presents the current research on IoT anomaly detection. IoT-specific anomaly detection is a challenging area, because the solutions must be lightweight and capable of handling the heterogeneous range of IoT devices.
Spanos et al. (2019) proposed a smart-home anomaly detection method which combines statistical and machine learning techniques according to the network behaviour of the device. During training, features are extracted from the network packet data; these features are then standardised and passed into a clustering algorithm. The resulting cluster labels are passed into ensemble classification methods, which determine the final result by soft voting. The authors were able to detect mechanical exhaustion and physical damage to the devices. Nevertheless, more data and performance metrics are required to determine whether the model works at scale and with a wider set of devices.
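A minimal sketch of this kind of pipeline is shown below; the feature matrix, number of clusters and choice of base classifiers are assumptions for illustration, not those of Spanos et al.:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # stand-in for per-device network features

X_std = StandardScaler().fit_transform(X)      # standardise the extracted features
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_std)  # unsupervised labelling

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",                             # final decision taken by soft voting
)
ensemble.fit(X_std, labels)
print(ensemble.predict(X_std[:5]))             # predicted behaviour class for new traffic
```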
Gonzalez-Vidal et al. (2019) examined methods to detect anomalies in IoT time-series data. Their process consisted of two steps: extracting outliers and abnormal patterns using the individual time-series properties of the data, and then using the features extracted from these models to classify them against the annotated classes. The ARIMA and HOT-SAX frameworks were used for the time-series anomaly detection model, while Random Forest and Association Rule Mining methods were used in the classification component. The authors saw accuracies of up to 90% using their methods. This work is a valuable contribution in the area of sensor data-level anomaly detection; however, it is limited in that it requires time-series data to operate.
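As an illustration of the first step only (the model order, threshold and synthetic series below are assumptions, not the authors' configuration), outliers can be flagged from the residuals of a fitted ARIMA model:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = 20 + np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[150] += 5.0                                # inject an anomalous reading

residuals = ARIMA(series, order=(2, 0, 1)).fit().resid
mu, sigma = residuals.mean(), residuals.std()

# Flag points whose residual deviates by more than three standard deviations.
anomalies = np.where(np.abs(residuals - mu) > 3 * sigma)[0]
print(anomalies)                                  # expected to include index 150
```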
Stiawan et al. (2017) proposed a technique for early anomaly detection using network traffic analysis. This technique used SNMP (Simple Network Management Protocol) to collect traffic from a heterogeneous range of IoT devices. This traffic was then visualised in graphs for further analysis. Thresholds could then be set based upon CPU and memory usage, which can indicate the presence of anomalous communication in the network. This approach is lightweight and suited to the IoT; however, the solution does not include a method to automatically or statistically determine a threshold for failures, which could generate a high volume of false alarms.
Sedjelmaci et al. (2016) proposed an energy-efficient anomaly detection technique which caters for low-resource IoT devices. The technique uses a game-theoretic methodology to reach optimal energy efficiency by combining two known techniques for intrusion detection in the IoT: signature-based detection and anomaly detection. The anomaly detection component learns activity and builds a classification rule, which is then passed to the signature detection component so that the next time the anomaly occurs it can be recognised by its signature rather than having to rerun the classifier to detect it. Game theory was then applied to this hybrid technique to create further energy savings, pitting two “players” against each other: one being the attacker launching new attack signatures, and the other running the algorithm to detect anomalous new signatures. When the game finishes, the historical data can be examined to determine the probability of a new signature, and thus a time can be stated at which anomaly detection should be run to build new rules. The study compared the proposed lightweight game-theoretic technique to other known hybrid techniques in the research literature. The study found that accuracy was reduced in the game-theoretic technique, which was to be expected given the predictive nature of the technique. When comparing energy consumption, however, the study found that it was possible to save up to 6000 mJ of energy when running the lightweight technique, which represents a worthwhile saving given the low-resource nature of the IoT.
Desnitsky et al. (2015) proposed a method for detecting anomalies in IoT applications using domain-specific knowledge to create a list of constraints for the application. For example, the temperature in a home should not exceed 30 degrees Celsius; alternatively, the constraints could be drawn from the history of the data, for example a motion sensor in an office that stops providing data. If one of these constraints is violated, this indicates the presence of an anomalous situation. This model is useful for detecting simple anomalous scenarios; however, it is entirely dependent upon the rule base designed by the domain expert. This limitation means that if an anomaly is not accounted for in the constraints then it will not be detected.
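A minimal sketch of this kind of constraint checking is shown below; the specific thresholds and reporting window are illustrative assumptions rather than the constraints used by Desnitsky et al.:

```python
from datetime import datetime, timedelta

MAX_HOME_TEMP_C = 30.0                       # domain constraint from expert knowledge
MAX_SILENCE = timedelta(minutes=15)          # assumed reporting window for the motion sensor

def temperature_ok(reading_c: float) -> bool:
    return reading_c <= MAX_HOME_TEMP_C

def sensor_alive(last_report: datetime, now: datetime) -> bool:
    return (now - last_report) <= MAX_SILENCE

def is_anomalous(temp_c: float, last_motion_report: datetime, now: datetime) -> bool:
    """An anomaly is raised whenever any domain constraint is violated."""
    return not (temperature_ok(temp_c) and sensor_alive(last_motion_report, now))

now = datetime.now()
print(is_anomalous(34.2, now - timedelta(minutes=5), now))  # True: temperature constraint violated
print(is_anomalous(22.5, now - timedelta(hours=2), now))    # True: motion sensor has gone silent
```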
Abeshu and Chilamkurti (2018) proposed a deep learning approach for detecting attacks based upon the fog-computing paradigm in the IoT. Using the fog-computing paradigm can significantly reduce delays versus the traditional cloud-centric paradigm, which is useful in mission-critical IoT scenarios. The study compared a deep learning model, which used a pre-trained stacked autoencoder for feature engineering and softmax for classification, against a shallow learning model. The study found that the deep model was consistently more accurate than the shallow model; on average, this accuracy gap was 4%, which is a large gap in a mission-critical application. Furthermore, the study revealed that the deep model coped with a scaling number of nodes much more comfortably than the shallow model: when the shallow model was exposed to more than 80 fog nodes, its accuracy fell by 2%.
Thanigaivelan et al. (2016) proposed an anomaly detection system for the IoT in which each node monitors the behaviour of its one-hop neighbours. The proposed system has three main components: the MGSS (Metrics and Grading Subsystem), the RSS (Reporting Subsystem) and the ISS (Isolation Subsystem). The MGSS is responsible for grading the neighbouring nodes, which are graded based upon packet size and data rate. The RSS is responsible for reporting any nodes confirmed to be anomalous, which the ISS will then isolate to remove the threat from the network. Further research is required within this solution to derive a more comprehensive list of network parameters to monitor, and a statistical method is needed to determine whether a node is anomalous or not.
Nomm et al. (2019) proposed a method of detecting botnet attacks in IoT deployments. The method evaluated feature selection techniques to reduce the dimensionality of the data before passing it into a classifier. The dataset used in the experiment was a genuine dataset from a Mirai botnet attack, containing 115 discrete numerical features generated by 9 IoT devices. The features described various network characteristics, such as source and destination IP, jitter and socket information. The authors used three different techniques to reduce the dimensionality of the data: entropy, variance and the Hopkins statistic. Three classifiers were then used to classify the data: LOF (local outlier factor), one-class SVM (support vector machine) and IF (Isolation Forest). The study found that feature reduction by entropy combined with the IF classifier was able to achieve accuracy results of 90% using only 5 features. This feature reduction is well suited to the IoT, given that it is a much greener approach to machine learning than training and testing a classifier on all 115 features. This anomaly detection technique is successfully able to detect anomalies at the network level but does not consider the anomalies that may occur in the payload of the packets being sent by the IoT devices themselves.
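A minimal sketch of entropy-based feature reduction followed by an Isolation Forest is shown below; the synthetic traffic matrix, histogram binning and the choice to retain the highest-entropy features are assumptions for illustration, not the exact procedure of Nomm et al.:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 115))            # stand-in for 115 numeric traffic features

def feature_entropy(column, bins=20):
    """Shannon entropy of a feature, estimated from a histogram of its values."""
    counts, _ = np.histogram(column, bins=bins)
    probs = counts / counts.sum()
    return entropy(probs[probs > 0])

# Rank features by entropy and keep the five most informative ones.
scores = np.array([feature_entropy(X[:, j]) for j in range(X.shape[1])])
top5 = np.argsort(scores)[-5:]

clf = IsolationForest(contamination=0.05, random_state=0).fit(X[:, top5])
flags = clf.predict(X[:, top5])             # -1 marks suspected botnet traffic, 1 normal
print(int(np.sum(flags == -1)))             # number of flows flagged as anomalous
```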
The papers reviewed here with regard to IoT anomaly detection represent a clear drive in the research community to create a more reliable IoT ecosystem. With this in mind, it should be stated that anomaly detection is an extremely large field, with application in IoT, network security and a vast array of other computing disciplines. Within the scope of this work, it is not possible to review all available anomaly detection methods, and as such, only the pertinent IoT examples are reviewed here in detail. A more detailed review of anomaly detection methods can be found within the literature (Zarpelão et al. 2017; Moustafa et al. 2019; Cook et al. 2020; da Costa et al. 2019).
Many methods were discussed in this section which provide accurate and varied mechanisms for detecting anomalies in IoT systems. Nevertheless, further research is required to determine how anomalies actually affect the reliability of an IoT system, given that the presence of an anomaly does not necessarily hinder or prevent IoT services from operating. This being said, the presence of anomalies is a clear indicator that the IoT system is not performing optimally.
Table 1 Summarised literature review and contributions of each work towards reliability assessment

Discussion of surveyed work
The range of research presented in this section demonstrates a growing demand for quantifying reliability in IoT networks. This is not a straightforward task, given that we must be able to assess reliability at both the device and the network level whilst also being able to detect anomalies as they occur in the system. The research studies presented in this paper each tackle only one facet of the problem, as evidenced in Table 1, which summarises the contributions of these works. A complete solution would need to integrate all of this valuable IoT reliability information into one reliability framework. The research reviewed in this paper therefore reveals a clear gap in the knowledge and understanding of the IoT: there is currently no solution available that is capable of assessing the reliability of IoT infrastructure in an end-to-end sense.
The works aimed at quantifying device reliability make several distinct contributions. Some works, such as Mavrogiorgou et al. (2018), Zin et al. (2016) and Kim (2016), use standard reliability metrics to quantify the state of reliability in IoT devices. These standard metrics include MTTF, MTTR, availability, maintainability and failure rate (Fries 2006). Given enough device data, these metrics can be used to reason mathematically about the reliability of IoT devices. Other works, such as Saini (2016) and Li et al. (2012), proposed non-standard metrics, like ROPT, trust factor and maturity. Again, these metrics can provide some view of how reliable an IoT device or set of devices is.
The device reliability metrics, regardless of being standard or non-standard, offer up several opportunities for expansion and further research. Firstly, perhaps these metrics could also be extended to include network infrastructure and communications protocols. Doing so would enable the solution to be a more holistic one and bring it closer to managing reliability for the full end-to-end stack. Secondly, these metrics are able to attest to reliability of IoT devices at a certain point in time—could these metrics then be extended to allow the systems to predict and preempt failure? Doing this would be a valuable step towards a more reliable IoT, especially in scenarios where the IoT is supporting mission critical applications. This leads on to the third area for expansion here—while these metrics are valuable at solving reliability for a given set of sensors in a given environment, there is research required to understand how this generalises into other applications. Importantly, do different thresholds need to be applied when considering one IoT vertical over another? Some research is also required to understand how these reliability metrics might react as new and previously unseen devices are added to the applications. One would expect that new devices may carry a significantly different failure profile, and thus may influence the reliability metrics in different ways. The research on IoT device reliability, therefore, should be extended where possible to include the scenario in which the IoT is capable of handling new and unseen devices, operating over a wide range of communication protocols. Lastly, there is an interplay between IoT device reliability and anomaly detection which was not fully exploited in the works surveyed. Given that we know IoT devices are prone to both spontaneous failure and attack from malicious users, this notion will have a strong influence on the reliability of IoT devices. Therefore, research is required to understand the impact of anomalies on IoT device reliability. For example, some applications may be highly sensitive to noise and anomalies, while other applications may fail completely with the presence of a single anomaly. As such, anomaly detection methods provide a valuable insight into the current state of reliability for IoT devices. A potential research question exists here in trying to understand if reliability information can be synthesised from anomaly detection models.
With regard to the works researching network reliability, again we can observe that some metrics were proposed, both standard (Al-Masri 2018; Li and Huang 2017; Alam 2018) and non-standard (Sinche et al. 2018). We can also observe that some new communication protocols were proposed for enabling a more reliable IoT. Some research was also conducted to help address the need for IoT solutions to be considerate of the various vertical markets, for example emergency IoT applications (Maalel et al. 2013). Methods were also introduced to profile devices before they joined the IoT deployment, using reliability data as the decision factor (Brogi and Forti 2017).
The research conducted on network reliability opens up several areas for future research to enable a more reliable IoT. Firstly, while some research has been conducted to understand the sensitivity of different IoT verticals, there is still a growing need for research in this area to help in understanding the impact that these vertical markets have on reliability engineering in the IoT. Given the large predictions for growth in IoT services, we can only expect demand to increase and diversify in terms of the applications being offered. Therefore, in order to be fully reliable, the IoT must be cognisant of these vertical markets, and measure reliability in a tailored fashion. For example, do faults need to be reported in real-time, such as with emergency applications? Or perhaps we may be able to tolerate faults being reported in larger time windows, such as a day, as with smart home applications.
One of the main issues with the studies aimed at assessing network reliability is that they do not have an awareness of the reliability of the devices themselves. Therefore, it is pertinent that some research is conducted to help tie these two facets together in order to enable reliability across the full IoT stack.
As with device reliability, we can also speculate about the importance of anomalies and intrusions in network traffic. It is important that we understand the impact that these anomalies have on the reliability of a particular application. Moreover, if we are able to leverage intrusion detection methods and anomaly detection methods for networks and use them to ascertain reliability information then this represents a step towards a more reliable IoT. Also similar to the device case, some research would be pertinent to understand if it were possible to predict faults before they occur at the network level. The ability to perform this prediction would enable IoT architects to preemptively manage failure, resulting in a more reliable IoT—especially in the case of mission-critical IoT applications.
The system reliability modelling works reviewed in this paper were not specific to either the device or network component of the IoT architecture. Nevertheless, the methods in these works are at an early stage of development and lack the complexity required to deal with a complex IoT environment.
Referring to the works reviewed for anomaly detection, it is clear that anomaly detection is a growing field within the IoT and computing in general. While the anomaly detection methods reviewed were capable of detecting anomalies, there is still a lack of research and knowledge on how we might leverage this anomaly information to quantify the reliability of an IoT deployment. A key area of future research will be to take these anomaly detection methods and try to synthesise reliability information from them.