1 Introduction

The internet of things (IoT) has emerged as a game-changing technology in recent years, with its ability to connect devices and sensors to the internet. As IoT applications continue to evolve, even more significant changes can be expected across industries, transforming the way people work, communicate, and interact with technology [1]. Sensors are the primary driving force behind the IoT revolution [2, 3]. For example, in manufacturing, IoT-enabled sensors are used to monitor equipment performance, reduce downtime, and improve overall equipment effectiveness [4]. In agriculture, IoT-enabled sensor devices help farmers improve crop yield and quality, reduce water usage, and prevent crop diseases [5]. The health industry is among the sectors most significantly affected by these sensor technologies, which provide real-time remote monitoring of health parameters and management of chronic diseases through IoT-enabled fitness trackers and smartwatches, as well as IoT-enabled medical devices such as pacemakers and insulin pumps. In transportation, sensors in vehicles collect data on vehicle performance, driver behavior, and road conditions, among other things, to enable smarter and safer driving [6].

Sensor devices are the building blocks of IoT infrastructure, as they are capable of observing physical phenomena, processing and transmitting data, making decisions based on the observations, and taking appropriate actions [1,2,3]. IoT sensors are designed to be connected to the internet and provide data to cloud-based applications. They typically use wireless communication protocols such as Wi-Fi, Bluetooth, Zigbee or Z-Wave to transmit data to where it can be analyzed and acted upon [7, 8]. This differentiates IoT sensors from normal or stand-alone sensors, which are usually not connected to the internet and typically provide only a local output signal, such as an analog or digital signal. An IoT sensor is composed of four basic components: a sensing unit, a processing unit, a transceiver unit and a power unit. Individual sensors also come in different types, such as temperature sensors, motion sensors, and security sensors, used for example to monitor and control temperature, lighting, security and other aspects of the home environment. However, these small devices have limited resources [9, 10]: limited memory storage, limited energy supply, and limited computing capability. Thus, they are prone to failure at any time, and the functionality of the IoT system becomes questionable if their future working status is not predicted.

When a sensor fails at sensing or transmitting, it produces corrupted data, erroneous readings or conflicting information. When the IoT system then processes this corrupted data, its overall performance is compromised, making it inaccurate and unreliable and sometimes even leading to total failure [11]. For instance, in a health care system, if a sensor that monitors a patient’s heart rate or blood glucose level fails, it can result in improper medication dosing, leading to adverse health effects [12]. In agriculture, the failure or malfunction of a moisture sensor can result in over- or under-watering of crops, affecting crop growth and yield. In the energy sector, the failure or malfunction of a temperature sensor in a power plant can cause overheating and equipment damage. In logistics, the failure or malfunction of a temperature-controlled shipping container can lead to spoilage of perishable goods. In aerospace, the failure or malfunction of sensors responsible for monitoring and controlling critical systems such as flight control, navigation, and engines can lead to catastrophic consequences [13]. In smart cities, the failure or malfunction of a sensor in a traffic management system, such as a traffic light, can lead to traffic accidents and congestion [14]. In oil and gas pipelines, the failure or malfunction of sensors that monitor pipeline pressure and temperature could cause a leak, which can be an environmental disaster. In summary, sensor failures and malfunctions have severe consequences in different industries, ranging from property damage, system downtime, and increased maintenance costs to loss of revenue. This makes sensor fault prediction an important area of research.

Many existing studies have suggested different approaches, such as model-based, signature-based and data-driven approaches, for detecting sensor faults and potential failures. The model-based approach uses a mathematical model of the system to estimate the true value of a sensor reading and detects sensor faults by comparing the predicted and measured values [15, 16]. The signature-based approach compares the current sensor reading with a pre-defined threshold value; if the reading exceeds the threshold, this is indicative of a fault [17]. The data-driven approach analyzes the statistical properties of the sensor data to learn the underlying patterns and relationships between the sensor readings and the occurrence of faults [18].
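The first two approaches can be illustrated with a minimal sketch. The function names, the sample readings, the threshold of 25.0 and the tolerance of 1.0 are all hypothetical values chosen for illustration; they are not taken from the cited studies.

```python
from typing import List


def signature_based_faults(readings: List[float], threshold: float) -> List[bool]:
    """Signature-based check: flag each reading that exceeds a fixed threshold."""
    return [r > threshold for r in readings]


def model_based_faults(
    measured: List[float], predicted: List[float], tolerance: float
) -> List[bool]:
    """Model-based check: flag readings whose residual |measured - predicted|
    is larger than an allowed tolerance."""
    return [abs(m - p) > tolerance for m, p in zip(measured, predicted)]


# Hypothetical temperature readings and the values a system model expected.
measured = [21.0, 21.4, 29.8, 21.9]
predicted = [21.1, 21.3, 21.6, 21.8]

threshold_flags = signature_based_faults(measured, threshold=25.0)
residual_flags = model_based_faults(measured, predicted, tolerance=1.0)
print(threshold_flags, residual_flags)
```

In this toy case both checks flag only the third reading; in practice the two can disagree, since a threshold ignores what the system model expects at that moment.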

Although much work has been done throughout the literature to identify and address these faults, it has mainly focused on fault detection rather than prediction. For example, Huang et al. [19] detected and classified three types of faults using time-frequency analysis and deep learning. Uppal et al. [20] performed binary classification of faults (i.e., healthy or unhealthy) by analyzing existing (historical) data using machine learning techniques. Jana et al. [21] used a Convolutional Neural Network (CNN) and a Convolutional Autoencoder (CAE) to detect and classify four types of faults. While detecting a fault is useful, it only addresses the problem after it has occurred. Further, the existing works are restricted to the specific faults they consider and cannot readily be generalized to other faults. With regard to predicting multiple faults, very few studies have been conducted, and their performance (such as accuracy) leaves room for improvement. These gaps motivate the present work, which targets a real-world problem: our proposed system predicts potential faults or anomalies before they occur by considering the sensor’s historical data.

The proposed architecture has two models that work in two phases. In the first phase, a forecasting model trained on the historical measurements of the sensors is used to forecast the next N values of each sensor. In the second phase, a classification model trained on the fault-injected dataset is used to classify the forecasted values. The fault-injected dataset has four classes for the four fault types and a fifth class representing the normal (clean) values. The second phase is therefore a multi-class classification task over five classes (normal, bias, drift, polydrift and random).

Hybrid deep learning models have been widely adopted in recent works to achieve better performance [22, 23], since they are able to capture the uncertainty and variability in the data and provide more accurate and trustworthy predictions. Such models can also leverage the benefits of different network architectures and learning paradigms to achieve better results than single models. In this work, features were ranked and only the most pertinent, highest-scoring features in the data set were selected and used. The experimental findings demonstrate that the proposed CNN-LSTM outperformed the other forecasting models with an MAE of 2.0957 for a 45-timestep forecast, and that CNN-MLP outperformed the other classifiers with an average accuracy of 98.21% across the four faults.

Accordingly, the main contribution of the study is the development of a multi-stage approach for sensor fault prediction and classification using hybrid deep learning models. In addition, a comparative analysis of GRU, LSTM, BiLSTM, CNN-GRU, CNN-LSTM and CNN-BiLSTM for multi-step univariate timeseries forecasting was conducted, including an analysis of the impact of feature engineering on the predictive performance of timeseries models. Because proactive maintenance and repair can be carried out when potential sensor faults are identified before they actually occur, the proposed hybrid deep learning approach can improve the reliability and performance of IoT ecosystems by reducing downtime and costs.

2 Related works

Various works aim to address issues related to sensor failures in order to make the whole IoT application effective. There are several approaches to fault categorization; the two major conceptualizations are the data-centric view and the system-centric view, which classify faults from the data point of view and the system point of view, respectively. The data-centric view focuses on the sensor data itself: it considers each sensor output as an independent data stream and looks for anomalies in the data [24]. This is useful for detecting faults that affect the accuracy of the sensor data. The system-centric view takes a broader perspective and considers the system as a whole, including all the components that interact with the sensor. It considers faults that arise from the physical sensors themselves and from their interactions with other components in the system. This view is useful for detecting faults that can affect the reliability or availability of the system [24].

Numerous studies have addressed the issue of sensor faults using model-based, data-driven, signature-based or statistical approaches, or a combination of two or more of them [25,26,27,28]. Manzoni et al. [12] employed model-based fault detection to detect faults of a Continuous Glucose Monitoring (CGM) sensor in an artificial pancreas. A Kalman predictor was used to predict the expected blood glucose level under normal conditions, also accounting for meals, carbohydrate content, and the insulin injected by the CSII pump. This predicted value was compared against the values measured by the CGM. Large inconsistencies between the measured CGM values and the predicted blood glucose levels suggested the presence of faults. Jihani et al. [29] employed a parity space approach to detect and isolate faults in a WSN. A mathematical model was developed based on the instantaneous redundant measurements of two sensor nodes planted in the soil at the same depth, and a residual signal (parity vector) was generated from this model.

In the model-based approach, the developed mathematical model is used to compare the measured sensor data with the values predicted by the model [15, 16]. A significant difference between the predicted and measured values is indicative of a sensor fault. This approach requires a priori knowledge of the system and its behavior to construct the mathematical model, and it is often more accurate when the model is well defined. The model typically consists of a set of equations or rules that capture the dynamics of the system and its interactions with the environment. Hashimoto et al. [30] proposed a multi-modal approach to fault detection and diagnosis of the internal sensors of a mobile robot, handling three failure modes of the sensor (hard, noise and scale failure). The mode probabilities, estimated with a bank of Kalman filters, provide the fault decision, which is made by comparing the model-conditional estimates in the sensor gain. He et al. [31] proposed a model-based method that considers both disturbances and uncertainties for detecting optical fiber sensor faults in an aero-engine linear discrete time-invariant (LDTI) system. The performance was demonstrated by applying the design procedure to a non-linear gas turbine model.

Yan et al. [32] introduced a novel model for detecting minor soft faults of sensors in an air conditioning system. The authors used kernel principal component analysis (KPCA) for feature extraction and dimensionality reduction, and the data was then passed to a double-layer bidirectional long short-term memory (DL-BiLSTM) network. The residual, generated by comparing the output of the DL-BiLSTM with the actual value from the supply air temperature sensor, was used to detect minor sensor faults. According to the authors, their experimental results showed that their method (KPCA-DL-BiLSTM) scored 43% higher than KPCA and 18.33% higher than LSTM under a 10% drift deviation fault.

Alwan et al. [33] proposed a time-series clustering technique to address the limitations of predictive analysis in detecting long-segmental faults in sensor nodes in large-scale cyber-physical systems. They also compared feature-based and shape-based time-series clustering and found that feature-based clustering was the more efficient mechanism for detecting long-segmental outliers. Zhao et al. [34] employed a sliding window approach to detect two incipient faults, i.e., constant bias and precision degradation, in industrial processes. The singular value of each window was calculated, and the control limits were determined empirically using the Mahalanobis distance, where each singular value corresponds to a control limit. Uppal et al. [35] developed an early fault prediction model for an IoT environment that uses IoT-based sensors, the cloud and ML algorithms. The model was evaluated with four algorithms, i.e., decision tree, k-nearest neighbor, Gaussian Naïve Bayes and random forest. According to the authors, random forest showed the highest performance on the given dataset, with a classification accuracy of 94.25%. Liu et al. [36] proposed a deep learning-based sensor fault self-diagnosis scheme for a wind turbine blade. First, a sensor data prediction model was built by mining the inherent relevance between the sensors; then the residual between the predicted and actual values was compared against the control limit to detect the presence of faults. According to the authors, the experimental results showed good prediction performance (RMSE = 0.001154, MAE = 0.000214, and R² = 0.999993) and good fault diagnosis performance (a recall of up to 98.43%, an accuracy of up to 98.58%, and a precision of up to 90.01%).

Wahid et al. [37] proposed a predictive maintenance approach based on CNN and long short-term memory (LSTM) for machine failures in industrial plants and factories. LSTM was used to analyze the relationships among the different timeseries data, and a 1D CNN was used for the effective extraction of high-level features. The authors experimented with CNN, LSTM and CNN-LSTM models and found that CNN-LSTM provided the most reliable and highest prediction accuracy. Uppal et al. [20] proposed an architecture for monitoring every office appliance connected via IoT using machine learning. Data from the appliances is collected through the IoT and sent to the ML algorithms, which classify the faults. Safavi et al. [38] proposed a method for predicting the health status of the electronic sensors of autonomous vehicles. A separate feature extraction block was used for fault identification and isolation, and a multi-class DNN was used to identify the type of fault and recognize the faulty sensors. From these works, it can be observed that each model and study has limitations in several respects. Table 1 summarizes some of the gaps identified in the previous studies discussed above.

Table 1 Comparison of previously proposed schemes in terms of technique, performance measures, and limitations

3 Research methodology

This work aims to anticipate potential sensor data faults by developing a two-step hybrid deep learning approach for the prediction and classification of sensor faults. An effective and robust solution was developed to predict and classify sensor faults with high accuracy. All the necessary preprocessing and feature engineering was performed on the data before using it for training. Finally, this work was compared against a baseline work and showed improved accuracy. The detailed process design is shown in Fig. 1.

Fig. 1 Research design methodology overview

3.1 Dataset description

Wang et al. [40] injected artificial outliers into the normal data of the Intel Berkeley Research Lab dataset (nodes 1, 2, 33, 35 and 37) to train and evaluate their proposed framework. The outliers constituted 3% of the data and were randomly distributed. Emperuman & Chandrasekaran [41] also introduced artificial bias, drift, random and spike faults into the raw data using the algorithm described in [42]. Mishra & Mohanty [43] injected intermittent fault signals into the Intel Lab temperature and light dataset to test the performance of various machine learning classifiers for intermittent fault diagnosis.

The benchmark datasets used in this paper are taken from [42]. The authors of [42] also presented a standardized approach (methods and algorithms) for annotating benchmarks and injecting faults into clean datasets based on a generic fault model. This addresses the inconsistencies between datasets that arise when researchers have to annotate the datasets themselves in order to obtain ground truth. The dataset used in this study contains 10 temperature sensors from the Intel Lab. Each sensor has a clean set annotated with the ground truth and four sets, each injected with one of four artificial faults that the authors surveyed from the literature, using the domain expertise and algorithmic methods described in their paper. The dataset is timeseries data with the features timestamp, mote_id, temperature, and has_fault_type. The has_fault_type feature is encoded as 0 for no fault, 1 for random, 4 for bias, 8 for drift, and 16 for polydrift.
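As a small illustration of this encoding, the sketch below maps has_fault_type codes to class names. The code values come from the description above; the helper name, the label strings and the example rows (including the placeholder timestamps and temperatures) are hypothetical.

```python
# Codes as described for the has_fault_type feature; label strings are our own.
FAULT_LABELS = {0: "normal", 1: "random", 4: "bias", 8: "drift", 16: "polydrift"}


def decode_fault_type(code: int) -> str:
    """Translate a has_fault_type code into a readable class name."""
    if code not in FAULT_LABELS:
        raise ValueError(f"unknown has_fault_type code: {code}")
    return FAULT_LABELS[code]


# Hypothetical rows with the four features named in the text.
rows = [
    {"timestamp": "t0", "mote_id": 1, "temperature": 21.3, "has_fault_type": 0},
    {"timestamp": "t1", "mote_id": 1, "temperature": 24.9, "has_fault_type": 4},
    {"timestamp": "t2", "mote_id": 1, "temperature": 21.5, "has_fault_type": 16},
]

labels = [decode_fault_type(row["has_fault_type"]) for row in rows]
print(labels)
```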

3.2 Data preprocessing

The data preprocessing steps taken in this research are summarized as follows.

3.2.1 Data integration

Data from multiple sources are combined into a single data frame to obtain a comprehensive dataset. Because our data is collected from different sensors, the data from the individual sensors are integrated to train a model.

3.2.2 Data transformation

Data transformation converts the data into a suitable format for analysis; it may include normalizing, scaling, or converting data types. In this work, MinMaxScaler was used to scale the data, and the data type of the timestamp feature was converted from object to datetime format.
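Min-max scaling maps each value x to (x - min) / (max - min), so that the smallest value becomes 0 and the largest becomes 1. The following dependency-free sketch shows the idea; it is not the scikit-learn MinMaxScaler itself, and the function name and sample temperatures are hypothetical.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperatures = [10.0, 15.0, 20.0]
scaled = min_max_scale(temperatures)
print(scaled)
```

When such scaling is used for forecasting, the minimum and maximum are normally taken from the training data only and reused for later data, so that forecasts can be mapped back to the original units.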

3.2.3 Imbalance handling

The dataset was initially highly imbalanced, with the normal class significantly dominating the fault classes. A highly imbalanced dataset can bias the model during training, which can result in poor predictive performance, especially on the minority classes. Under-sampling, over-sampling and SMOTE were therefore used to balance the dataset.
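Of the three balancing techniques, random under-sampling is the simplest to sketch: every class is reduced to the size of the smallest class. The version below is a minimal illustration under that assumption; the function name, the seed and the toy labels are hypothetical, and the exact sampling configuration used in the study is not specified in the text.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_undersample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List, List]:
    """Randomly drop samples so that every class keeps as many samples
    as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 8 normal readings (label 0) and 2 bias readings (label 4).
toy_samples = list(range(10))
toy_labels = [0] * 8 + [4] * 2
balanced_samples, balanced_labels = random_undersample(toy_samples, toy_labels)
print(balanced_labels.count(0), balanced_labels.count(4))
```

Note that shuffling-based resampling discards the temporal order of the readings, which matters less for the classification stage than for forecasting.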

3.2.4 Stationarity

The data initially contained an increasing trend, so a differencing operation was performed to remove it. The augmented Dickey-Fuller (adfuller) test was then conducted to check the stationarity of the data. The test returned a p-value of 0.03, which is less than 0.05, so the null hypothesis of a unit root is rejected at the 5% level, implying that the data is stationary.
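First-order differencing replaces each value with its change from the previous value, which removes a linear trend. The sketch below shows the operation and its inverse on a toy series; the function names and values are hypothetical, and the stationarity test itself (for example the `adfuller` function in statsmodels) is not reproduced here.

```python
from typing import List


def difference(series: List[float]) -> List[float]:
    """First-order differencing: d[t] = x[t] - x[t-1]."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]


def undo_difference(first_value: float, diffs: List[float]) -> List[float]:
    """Rebuild the original series from its first value and the differences."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


# Toy series with a steady upward trend of 0.5 per step.
trend_series = [20.0, 20.5, 21.0, 21.5]
diffs = difference(trend_series)
restored = undo_difference(trend_series[0], diffs)
print(diffs, restored)
```

Keeping the first value is what makes the transformation reversible, so forecasts made on the differenced scale can be converted back to temperatures.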

3.2.5 Feature engineering and selection

Three main tasks were performed under feature engineering. The first was the generation of time-based and rolling-window features, which capture local patterns and enhance the predictive power of the model. The second was selecting, among the generated features, those that contributed most and were most significant. The third was converting the timeseries problem into a supervised learning problem using the Keras TimeseriesGenerator API [43].
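The first and third tasks can be sketched without any library. The rolling mean below is one example of a rolling-window feature, and `make_supervised` builds (input window, next value) pairs, which is similar in spirit to what a windowing utility produces. The function names, window sizes and toy series are hypothetical; the actual features and window length used in the study are not given in the text.

```python
from typing import List, Tuple


def rolling_mean(series: List[float], window: int) -> List[float]:
    """Mean of the last `window` values, starting once a full window exists."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def make_supervised(
    series: List[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (X, y) pairs: n_lags past values -> next value."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags : i])
        y.append(series[i])
    return X, y


toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]
means = rolling_mean(toy_series, window=2)
X, y = make_supervised(toy_series, n_lags=3)
print(means, X, y)
```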

Figure 2 shows the data preprocessing done for the first stage, i.e., forecasting. After timeseries decomposition and detrending by differencing, further preprocessing is performed on the data, including integrating the data from the sensors into one dataframe, resampling the frequency to 2 min, and converting the timestamp feature to a datetime object. Finally, the timeseries data is converted into a supervised learning problem using the Keras TimeseriesGenerator API.

Fig. 2 Data preprocessing for univariate timeseries forecasting

Figure 3 shows the similar preprocessing done for the second stage, i.e., classification. The main difference is that in this case the problem is already a supervised one with defined inputs and output, where the output is the has_fault_type feature.

Fig. 3 Data preprocessing for multi-class classification

3.3 Model selection

Since sensors sense and store timeseries data, the ability to capture temporal dependencies is the major criterion for model selection. The ability of the model to generalize well to unseen data is the second criterion. Based on these criteria, the following candidate models were identified and each of them was evaluated.

For regression: the candidate models are GRU, LSTM, BiLSTM, CNN-GRU, CNN-LSTM and CNN-BiLSTM, because these models can capture the temporal dependencies in the data, including any long-term dependencies that may exist.

For classification: the candidate models for the classification task are MLP, CNN and CNN-MLP.

3.4 Proposed solution architecture

Figure 4 shows the architecture of the proposed system. The architecture consists of a two-stage process. In the first stage, a CNN-LSTM model is trained on the historical sensor data and is then used to forecast future sensor measurements. In the second stage, the forecasted measurements are passed to the second model, a CNN-MLP trained on fault-injected data to identify four types of faults, which categorizes the measurements accordingly.

Fig. 4 The proposed architecture for fault prediction and classification

3.4.1 Multi-step forecasting

Multi-step forecasting is the forecasting of multiple timesteps into the future. There are generally two approaches. The first is the direct multi-step forecast, which forecasts all the future timesteps at once. The second is the recursive multi-step forecast, whereby a single timestep is forecasted, the forecasted value is taken as input, and the next single step is forecasted; this loop continues until the desired number of future timesteps has been forecasted [45, 46]. The main disadvantage of the direct multi-step forecast is that, since all future timesteps are forecasted at once, intermediate events or changes are not taken into account. In the recursive multi-step forecast, since each forecasted timestep is taken as input for the next one, intermediate changes are fed into the model for the forecast of the next timestep.
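The recursive loop described above can be sketched as follows. The one-step predictor here is a deliberately trivial stand-in (linear extrapolation from the last two values) so that the example runs without a trained network; in the study this role is played by the trained forecasting model. The function names, window size and toy history are hypothetical.

```python
from typing import Callable, List


def recursive_forecast(
    history: List[float],
    predict_one: Callable[[List[float]], float],
    n_steps: int,
    window: int,
) -> List[float]:
    """Forecast n_steps ahead by repeatedly predicting one step and
    feeding that prediction back into the input window."""
    if len(history) < window:
        raise ValueError("history is shorter than the input window")
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(n_steps):
        y_hat = predict_one(buffer)
        forecasts.append(y_hat)
        # Slide the window: drop the oldest value, append the new forecast.
        buffer = buffer[1:] + [y_hat]
    return forecasts


def extrapolate(window_values: List[float]) -> float:
    """Stand-in one-step model: continue the last observed change."""
    return window_values[-1] + (window_values[-1] - window_values[-2])


toy_history = [1.0, 2.0, 3.0]
forecasts = recursive_forecast(toy_history, extrapolate, n_steps=3, window=2)
print(forecasts)
```

One consequence of this design is that prediction errors are also fed back, so they can accumulate over long horizons.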

In this study, the recursive multi-step forecast was implemented, and the next 45 timesteps were forecasted. In our design, with measurements resampled to 2-min intervals, the next 45 timesteps represent the measurements for the next 1 h and 30 min. A CNN-LSTM model was used to perform the forecasting. The two are combined to exploit the strengths of each: the CNN module is used for feature extraction, while the LSTM is used for learning the temporal pattern and capturing the long-term dependencies in the data.

3.4.2 Multi-class classification

There are two main approaches to multi-class classification: direct (native) multi-class classification and binary transformation (class binarization). In direct multi-class classification, the classification algorithm is designed to handle multiple classes natively, without any explicit transformation or modification, and assigns a single class label to each input instance directly. Algorithms that inherently support multi-class classification include decision trees, random forests, k-nearest neighbors (KNN), and naive Bayes. Another common example is Softmax regression (also known as multinomial logistic regression), which extends logistic regression to multiple classes by using the Softmax function. Class binarization transforms a multi-class classification problem into multiple binary classification subproblems. Instead of directly predicting the multi-class label, a set of binary classifiers is trained. There are two common strategies for class binarization: one-vs-all (OvA), in which each classifier distinguishes one class from all the rest, and one-vs-one (OvO), in which each classifier distinguishes between a pair of classes.

For the multi-class classification in this study, because the instances in the data are dependent on each other and a clear distinction between the classes is needed, the binary transformation approach was followed: four binary classifiers were trained using CNN-MLP, one for each of the four faults. During classification of new measurements, the class of the classifier that outputs the highest probability is selected as the predicted class.
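The selection rule can be sketched as below, given the fault probability produced by each of the four binary classifiers for one measurement. How the normal class is decided is not spelled out in the text; the fallback used here (predict "normal" when no classifier reaches a 0.5 threshold) is purely an assumption for illustration, as are the function name and the example probabilities.

```python
from typing import Dict


def predict_fault(probabilities: Dict[str, float], threshold: float = 0.5) -> str:
    """Pick the fault whose binary classifier is most confident.

    Assumption (not from the paper): if no classifier reaches `threshold`,
    the measurement is labeled "normal".
    """
    fault, best = max(probabilities.items(), key=lambda item: item[1])
    return fault if best >= threshold else "normal"


# Hypothetical outputs of the four binary classifiers for two measurements.
faulty = {"bias": 0.10, "drift": 0.80, "polydrift": 0.30, "random": 0.20}
clean = {"bias": 0.05, "drift": 0.10, "polydrift": 0.02, "random": 0.08}

print(predict_fault(faulty), predict_fault(clean))
```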

Although CNN is most popular in computer vision and image processing tasks, it has also shown great potential in handling time-series data [47]. The major difference is that computer vision uses an image matrix while time-series uses a 1D array [48], so CNN can be adapted for tabular data by using a 1D convolutional layer. We therefore combine the CNN with the other models, as discussed below, in order to model the timeseries sensor data effectively and to generalize well to unseen data.
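To make the 1D case concrete, the sketch below slides a small kernel along a 1D series and computes a weighted sum at each position, which is the core operation of a 1D convolutional layer (deep learning libraries compute it without flipping the kernel, i.e., as a cross-correlation). A real layer learns many such kernels and adds a bias and an activation; the function name, kernel and signal here are hypothetical.

```python
from typing import List


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Slide `kernel` over `signal` with stride 1 and no padding,
    returning one weighted sum per position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A difference-like kernel responds to changes in the series.
toy_signal = [1.0, 2.0, 3.0, 4.0, 10.0]
edge_kernel = [-1.0, 0.0, 1.0]
feature_map = conv1d_valid(toy_signal, edge_kernel)
print(feature_map)
```

The larger response at the last position reflects the sudden jump in the series, which is the kind of local pattern a learned kernel can pick up.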

3.4.3 Hybrid CNN-LSTM

The hybrid CNN-LSTM is a deep learning architecture that combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The architecture is suitable for processing both spatial and temporal data, making it particularly effective for time-series forecasting problems. The model works in three main steps.

  1. The input time series data is first fed into a CNN layer that extracts the important features from the data. The output of this layer is a set of high-level feature maps.

  2. The output of the CNN layer is then fed into an LSTM layer that learns the temporal dependencies in the data. It takes the feature maps as input and produces a sequence of output vectors.

  3. The output vectors of the LSTM layer are then fed into a fully connected layer that generates the final forecasts.

The architecture allows the model to learn both the spatial and temporal dependencies in the time series data: the CNN layer extracts the spatial features from the input data, while the LSTM layer captures the temporal dependencies. The model is trained using the mean squared error (MSE) loss function, so during training it learns to minimize the MSE between the predicted and actual values. To further improve the model, an attention layer was added and hyperparameter tuning was performed.

3.4.4 Hybrid CNN-MLP

The hybrid CNN-MLP is also a deep learning architecture; it combines a convolutional neural network (CNN) with a multilayer perceptron (MLP). The input data is first fed into a CNN layer that extracts important features from the data and outputs a set of high-level feature maps. The output of the CNN layer is then flattened and fed into an MLP that produces an output probability for each class. The class with the highest probability output by the MLP is taken as the predicted class. The architecture allows the model to learn both the spatial and semantic features in the input data. Four instances of the model are trained on the four sets of data, and the instance that produces the highest probability determines the class. The instances (classifiers) are trained using binary_crossentropy as the loss function.

4 Performance evaluations and discussion

To evaluate the performance of the proposed models, forecasting and classification metrics were employed. In forecasting, the goal is to predict a continuous variable or time-series value by analyzing the past values of the time series [28, 49, 50]. Commonly used metrics for forecasting performance are Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), all of which are applied here. The goal of classification, on the other hand, is to predict the class or category of a given instance based on its features or attributes; for this task, accuracy, precision, recall, and F1 score are used.
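For reference, the three forecasting metrics can be computed as in the following sketch: MSE averages the squared errors, RMSE is its square root (and is therefore in the same units as the measurements), and MAE averages the absolute errors. The function names and the toy values are hypothetical.

```python
import math
from typing import List


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: errors of 3, -1, 1 and -3 around a true value of 0.
actual = [0.0, 0.0, 0.0, 0.0]
forecast = [3.0, -1.0, 1.0, -3.0]
print(mse(actual, forecast), rmse(actual, forecast), mae(actual, forecast))
```

In this toy case the MSE (5.0) is dominated by the two large errors, while the MAE (2.0) weights all four errors equally, which is why the two metrics can rank models differently.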

In the first experiment, after identifying candidate models suitable for multi-step forecasting of sensor values, each of them was evaluated using the regression metrics RMSE, MSE and MAE. This experiment was divided into two sub-experiments.

Experiment (a): In this experiment, three models, i.e., GRU, LSTM and BiLSTM, were evaluated and tested using time-series cross-validation with 2 splits.

Experiment (b): In this experiment, to leverage the strength of the CNN in extracting features, the CNN architecture was incorporated into the three models of experiment (a), yielding three hybrid models, CNN-GRU, CNN-LSTM and CNN-BiLSTM. Each of them was again evaluated using time-series cross-validation with 2 splits.
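Both sub-experiments rely on time-series cross-validation with 2 splits. Unlike ordinary k-fold, the training set in each split contains only observations that precede the test set, so the model is never trained on the future. The sketch below generates such expanding-window splits in the style of scikit-learn's `TimeSeriesSplit`; that this is the exact scheme used in the study is an assumption, and the function name and the sample count of 9 are hypothetical.

```python
from typing import List, Tuple


def time_series_splits(
    n_samples: int, n_splits: int = 2
) -> List[Tuple[List[int], List[int]]]:
    """Expanding-window splits: each test block follows its training data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    splits = []
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


for train_idx, test_idx in time_series_splits(n_samples=9, n_splits=2):
    print("train:", train_idx, "test:", test_idx)
```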

In the first part of the experiment, BiLSTM performed better than the others with an MAE of 2.3324 over the 45 forecasted timesteps, and LSTM came second with an MAE of 2.3862. In the second part, after hybridization, all three models showed improvements, with CNN-LSTM showing the largest improvement and reaching an MAE of 2.0957, followed by CNN-BiLSTM with 2.2442 (Table 2).

Table 2 Performance of forecasting models

Figure 5 shows a graphical representation of the performance of the six regression models, evaluated with RMSE, MSE and MAE.

Fig. 5 Forecasting performance evaluation using plot

Figure 6 shows the training and validation loss curves of the CNN-LSTM model. The model was trained with the MSE loss function and early stopping with a patience of 10. The graph shows that the model neither overfits nor underfits and converges between epochs 50 and 60.

Fig. 6 Training and validation loss of CNN-LSTM

In the second experiment, three candidate models (i.e., CNN, MLP and CNN-MLP) were identified for the sensor fault classification task, and their performance was evaluated using accuracy, precision, recall and F1 score. We also employed cross-validation to avoid overfitting and to assess how well our model generalizes to new data that it has not seen before. The most commonly used numbers of folds for grouping the data are 5 and 10, with one group serving as the test set and the rest as the training set. Since binary transformation is used for the multi-class problem, a classifier is trained for each fault, and the classifier with the highest probability is selected.

The same setting was used for all three models under 5- and 10-fold cross-validation, and their performances are reported in Tables 3 and 4, respectively. CNN-MLP outperformed the other two models in accuracy under both cross-validations. The reported accuracy of CNN-MLP is 96.10% for bias, 99.30% for drift, 98.80% for poly-drift, and 98.60% for random under 5-fold cross-validation, and 96.11% for bias, 99.33% for drift, 98.81% for poly-drift, and 98.61% for random under 10-fold cross-validation. This can be attributed to the fact that it leverages the strengths of both of the other models.

Table 3 5-fold cross-validation results
Table 4 10-fold cross-validation results

An added benefit of the binary transformation approach is that, since it trains a separate classifier for each class, the performance of each classifier can be assessed individually. The performance of each classifier is assessed below; the Bias classifier refers to the classifier trained to identify the bias fault class, and so on.
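The selection step of the binary transformation can be sketched as follows. The per-fault classifiers here are hypothetical stubs that return fixed probabilities; in the actual setting each would be a trained binary model, and all names in the sketch are assumptions made for illustration.

```python
def make_stub(probability):
    """Stand-in for a trained binary classifier; always returns the same probability."""
    return lambda window: probability


def classify(window, classifiers):
    """Score the window with every per-fault classifier and pick the most confident one."""
    probabilities = {name: clf(window) for name, clf in classifiers.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Hypothetical per-fault classifiers with made-up output probabilities.
classifiers = {
    "bias": make_stub(0.12),
    "drift": make_stub(0.81),
    "poly-drift": make_stub(0.40),
    "random": make_stub(0.05),
}

label, scores = classify([20.1, 20.4, 20.9], classifiers)
print(label, scores)
```

Because each classifier is a separate model, its precision, recall and F1 score can be inspected on its own before the outputs are combined.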

Table 3 shows the performance results of the MLP, CNN and CNN-MLP models evaluated using 5-fold cross-validation. It can be observed that the models identified drift, poly-drift and random faults remarkably well, but performed less well than desired on the bias set.

Table 4 shows the performance results of the CNN, MLP and CNN-MLP models evaluated using 10-fold cross-validation.

Accordingly, the CNN-MLP model outperformed both of the other models. This better performance is due to the combination of CNN and MLP, which allowed the model to construct a more effective representation of the data, capturing both its spatial and temporal patterns.

Confusion matrix visualization: the performance of each classifier for the three models is described below using confusion matrices.

Figure 7 illustrates the confusion matrices for the bias, drift, random, poly-drift and malfunction sets obtained with the MLP model. For example, for the bias set, from a total of 59,515 samples, the model correctly classified 25,836 as normal and 27,645 as bias. The model incorrectly classified 3,977 normal samples as bias and 2,057 bias samples as normal. For the drift set, from a total of 60,146 samples, the model correctly classified 30,062 as normal and 29,574 as drift, and incorrectly classified 68 normal samples as drift and 442 drift samples as normal. For the random set, out of a total of 70,137 samples, the model correctly classified all the normal samples and 33,965 random samples. It incorrectly classified 955 random samples as normal. For the poly-drift set, out of a total of 60,137 samples, 30,072 normal samples were correctly classified, with only 55 samples incorrectly classified as poly-drift. 29,314 poly-drift samples were correctly classified, with only 696 samples incorrectly classified as normal.

Fig. 7 Confusion matrix MLP
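To make the link between confusion-matrix counts and the evaluation metrics explicit, the following minimal Python sketch derives accuracy, precision, recall and F1 score from the four cells of a binary confusion matrix, treating the fault as the positive class. The counts are those quoted above for the MLP drift set; the function name is hypothetical, and values computed from a single confusion matrix need not coincide with the cross-validated figures in the tables.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted faults that are real faults
    recall = tp / (tp + fn)     # share of real faults that were detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# MLP drift set as quoted in the text: 30,062 normal and 29,574 drift samples correct,
# 68 normal samples flagged as drift, 442 drift samples missed.
drift_mlp = binary_metrics(tp=29_574, tn=30_062, fp=68, fn=442)

for name, value in drift_mlp.items():
    print(f"{name}: {value:.4f}")
```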

Figure 8 illustrates the performance of the CNN model using confusion matrices. For example, for the bias set, out of a total of 59,515 samples, 25,919 were correctly classified as normal and 28,329 were correctly classified as bias. 3,894 normal samples were incorrectly classified as bias and 1,373 bias samples were incorrectly classified as normal. For the drift set, out of a total of 60,146 samples, 30,052 were correctly classified as normal and 29,685 were correctly classified as drift. 78 normal samples were incorrectly classified as drift and 331 drift samples were incorrectly classified as normal.

Fig. 8 Confusion matrix for CNN

Figure 9 shows the performance of the CNN-MLP model using confusion matrices. For the bias set, out of a total of 59,515 samples, 26,532 were correctly classified as normal and 27,606 were correctly classified as bias. 3,281 normal samples were incorrectly classified as bias and 2,096 bias samples were incorrectly classified as normal. For the drift set, out of a total of 60,146 samples, all normal samples and 29,511 drift samples were correctly classified. 505 drift samples were incorrectly classified as normal. For the random set, out of a total of 70,137 samples, all normal samples were correctly classified and 33,966 samples were correctly classified as random. 954 random samples were incorrectly classified as normal.

Fig. 9 Confusion matrix CNN-MLP

5 CNN-MLP classifiers learning curve

Figures 10, 11, 12, 13, 14, 15, 16 and 17 show plots of the accuracy and loss during training and validation for the CNN-MLP classifiers. The classifiers were trained using EarlyStopping with a patience of 10 to prevent underfitting or overfitting.

Fig. 10 Training and validation loss for bias fault

Fig. 11 Training and validation accuracy for bias fault

Fig. 12 Training and validation loss for drift fault

Fig. 13 Training and validation accuracy for drift fault

Fig. 14 Training and validation loss for random fault

Fig. 15 Training and validation accuracy for random fault

Fig. 16 Training and validation loss for poly-drift fault

Fig. 17 Training and validation accuracy for poly-drift fault

6 Summary of individual classifiers

Table 5 shows the performance of each individual binary classifier for the multi-class classification problem. For simplicity, each classifier is named after the fault it detects, i.e., the Bias classifier refers to the classifier that is trained to identify the bias fault class, and so on.

Table 5 Performance of individual binary classifiers

In multi-class classification, there are three main techniques for assessing the overall performance of a model across all classes: macro-average, micro-average and weighted average. The overall performance of the models is discussed using the macro-average technique because it gives equal importance to all classes.
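The difference between the three averaging techniques can be illustrated with a short Python sketch for precision. The per-class counts below are hypothetical and are chosen only to show how the averages diverge when classes differ in size; the function names are assumptions made for illustration.

```python
def macro_average(values):
    """Unweighted mean of the per-class scores: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean of the per-class scores weighted by the number of true samples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


def micro_precision(tps, fps):
    """Precision computed from counts pooled over all classes."""
    return sum(tps) / (sum(tps) + sum(fps))


# Hypothetical per-class counts (not results from this study).
true_positives = [90, 40, 10]
false_positives = [10, 10, 10]
supports = [100, 60, 40]  # number of true samples in each class

per_class_precision = [tp / (tp + fp) for tp, fp in zip(true_positives, false_positives)]

macro = macro_average(per_class_precision)
weighted = weighted_average(per_class_precision, supports)
micro = micro_precision(true_positives, false_positives)
print(macro, weighted, micro)
```

In this toy case the smallest class has the lowest precision, so the macro-average is pulled down more than the micro and weighted averages, which reflects the equal weight it gives to every class.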

Table 6 shows the overall performance of the models using the macro-average technique.

Table 6 Macro-average performance of classification models

For model comparison, we took the work in [41] as a baseline against which to compare the performance of our proposed CNN-MLP. The baseline work evaluated its model (CDHMM-LVQ) at injection rates of 10%, 20%, 30%, 40% and 50% and classified the faults bias, drift, random and spike. We compared our model against the performance at the 20% rate, since the injection rate of our dataset is similar. The CDHMM-LVQ model achieved accuracies of 98.33% for bias, 96.67% for drift, 96.67% for spike, and 100% for random, whereas our model achieved 96.11% for bias, 99.33% for drift, 98.81% for poly-drift and 98.61% for random. Averaged over the four faults, our CNN-MLP model achieved 98.21% accuracy while the existing model achieved 97.91%, an improvement of 0.3 percentage points in average accuracy over the previous work. These results support the aim of capitalizing on the benefits of IoT technology and harnessing its potential to an even greater extent than at present, while minimizing the impact of system malfunctions or failures.
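The average accuracies in this comparison follow directly from the per-fault values quoted above, as the following short Python sketch shows. The variable and function names are chosen for illustration only.

```python
# Per-fault accuracies (%) as quoted in the text.
# Note that the two fault sets differ: the baseline includes spike, ours includes poly-drift.
baseline = {"bias": 98.33, "drift": 96.67, "spike": 96.67, "random": 100.0}
proposed = {"bias": 96.11, "drift": 99.33, "poly-drift": 98.81, "random": 98.61}


def mean_accuracy(per_fault):
    """Unweighted mean of the per-fault accuracies."""
    return sum(per_fault.values()) / len(per_fault)


baseline_avg = mean_accuracy(baseline)
proposed_avg = mean_accuracy(proposed)
gain = proposed_avg - baseline_avg  # difference in percentage points

print(f"baseline: {baseline_avg:.4f}  proposed: {proposed_avg:.4f}  gain: {gain:.4f}")
```

The difference between the two means is roughly 0.3 percentage points, in line with the improvement reported above.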

7 Conclusions

In this study, a proactive approach has been developed that predicts the potential occurrence of sensor faults using hybrid deep learning models. The solution involves a two-stage procedure. In the first stage, a hybrid CNN-LSTM model is trained on clean, non-faulty historical sensor data to perform multi-step forecasting of future values. In the second stage, a hybrid CNN-MLP model is trained on fault-injected sensor data to learn the patterns of the different fault types and classify new data accordingly. The forecasted values of the first hybrid model are then passed as input to the second hybrid model to determine the future status of the sensor.
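The way the two stages are chained can be sketched as follows. Both stages are replaced by hypothetical toy stand-ins (a linear extrapolation instead of the CNN-LSTM forecaster, and a simple range check instead of the CNN-MLP classifier), so the sketch only illustrates the data flow and not the models themselves; all names and the tolerance value are assumptions.

```python
def run_pipeline(history, forecaster, classifier):
    """Stage 1 forecasts future values; stage 2 classifies the forecast window."""
    forecast = forecaster(history)
    status = classifier(forecast)
    return forecast, status


def linear_forecaster(history, horizon=45):
    """Toy stand-in for the forecasting model: extend the last observed slope."""
    slope = history[-1] - history[-2]
    return [history[-1] + slope * step for step in range(1, horizon + 1)]


def toy_classifier(window, tolerance=5.0):
    """Toy stand-in for the fault classifier: flag a large change over the window."""
    return "drift" if abs(window[-1] - window[0]) > tolerance else "normal"


# Hypothetical sensor histories (illustrative only).
rising_forecast, rising_status = run_pipeline([20.0, 20.5, 21.0], linear_forecaster, toy_classifier)
flat_forecast, flat_status = run_pipeline([20.0, 20.0, 20.0], linear_forecaster, toy_classifier)
print(rising_status, flat_status)
```

Keeping the two stages behind plain callables means either model could be retrained or replaced without changing the surrounding logic.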

For the regression task, we experimented with the forecasting models LSTM, GRU, BiLSTM, CNN-LSTM, CNN-BiLSTM and CNN-GRU, both before and after incorporating rolling window statistical features. We evaluated each of them using RMSE, MSE and MAE and chose CNN-LSTM with rolling window features, as it outperformed the other models with an MAE of 2.0957 for a 45-timestep forecast. For the classification task, we experimented with MLP, CNN and CNN-MLP and evaluated them using accuracy, precision, recall and F1 score under 5-fold and 10-fold cross-validation. In both cases CNN-MLP outperformed the other models, with accuracies of 96.11% for bias, 99.33% for drift, 98.61% for random and 98.81% for poly-drift, an average of 98.21% over the four faults. Finally, this work was compared with the baseline work and achieved a 0.3% increase in average accuracy across the four faults.

The outcomes of this study have implications for various sectors, including manufacturing, agriculture, transportation, and healthcare, where the prediction of potential sensor faults can prevent catastrophic failures, reduce maintenance costs, and optimize system performance. In these and other cases, integrating the model at different levels of the IoT architecture (edge, gateway, cloud) enables the IoT system to take proactive measures, minimize downtime, optimize decision-making, and enhance overall system reliability and efficiency. Furthermore, the proposed hybrid deep learning models can serve as a foundation for future research in the field of predictive maintenance and fault diagnosis.