1 Introduction

The global IoT market is growing rapidly and is projected to rise from $250.72 billion in 2019 to $1.4 trillion by 2027 [2]. Similarly, the smart home market is expected to grow from $55 billion in 2016 to $174 billion by 2025, and there are currently over 175 million smart homes worldwide. Commonly used smart home IoT devices include video-enabled door alarms, remotely accessible locks, device-controlled burglar alarms, face recognition systems, and many more [3]. IoT technology offers the ability to access devices remotely and to automate tasks, which has enhanced the overall experience in homes [1]. However, this improved connectivity carries the risk of security exploits that may expose data to unwanted parties, posing a threat to the environment in which the attacked device is located. The risk is higher when data is collected from multiple devices, since large combined datasets can help attackers learn patterns about users and businesses. The security of these IoT devices and the privacy of their data is therefore a critical issue that needs to be addressed.

Anomaly-based network intrusion detection systems, which distinguish normal from abnormal behavior, are commonly used to detect attacks against IoT devices. A subset of these methods uses deep network models to classify attack and benign traffic. Traditionally, machine learning models have been trained in a centralized framework by collecting and storing data at a central server. However, this approach increases security risks: data containing sensitive information can be compromised by an attack on the central server, and the data is also vulnerable to leaks while in transit from individual devices to the server.

With the increasing attention to data privacy, it is important to identify alternative solutions that can secure IoT devices while protecting data privacy. Federated learning is a promising way to address the issues of training machine learning models in a centralized framework. It eliminates the need to collect data at a central location: each device trains a local model on its own local dataset, and a global model aggregates the locally trained models. Because no raw data is shared across the network for model training, breaches of sensitive data are far less likely [4]. After each aggregation step, an improved version of the global model is downloaded to the devices for further training, so the global model improves without ever being trained on the full dataset directly.

In this paper, we explore and implement federated learning techniques to detect attack traffic in IoT networks. We use MQTTset [5], a public dataset, to train simulated devices in a federated framework. Along with benign network traffic, this dataset contains five types of attacks. We use it to implement supervised and unsupervised deep learning models in a federated learning framework: Deep Neural Networks (DNNs) for supervised learning and Autoencoders for unsupervised learning. We implement three federated global model averaging algorithms - FedSGD, FedAvg, and FedProx - and perform experiments by adjusting model training parameters. To compare the performance of federated learning with centralized learning, we also implement similar models in a centralized framework, using accuracy as the performance metric. The goal of this paper is to evaluate how well a federated learning framework detects attacks in IoT networks using supervised and unsupervised deep learning models, and to determine which global model aggregation algorithm yields the best performance.

2 Our Approach

2.1 Dataset

In this paper, we use a public dataset, MQTTset [5], which contains data from home-based IoT sensors. The dataset is focused on the Message Queue Telemetry Transport (MQTT) protocol, which is widely used in today's IoT networks. To simulate a smart home environment, MQTTset includes IoT devices of different natures, such as motion, humidity, and temperature sensors. The network traffic in MQTTset is generated with a tool called IoT-Flock, which allows networks to be configured based on scenarios and protocol-specific threats. The network consists of eight sensors connected to an MQTT broker, and all communication uses the MQTT protocol. The dataset contains both legitimate and attack traffic, with five types of attacks: Flooding Denial of Service, MQTT Publish Flood, Slow Denial of Service in the Internet of Things environment (SlowITe), Malformed Data, and Brute Force Authentication. We group the five attack types under a single label, attack. In total, the dataset contains 330,926 records and 34 columns.

Data Preprocessing. The data preprocessing steps are common to the centralized and federated implementations. One key difference is that the supervised implementation uses the target column, while the unsupervised implementation drops this column so that the Autoencoder is trained without labels, as expected.

For binary classification, we group and relabel the five threat types as 1 and legitimate traffic as 0, so all benign records are marked 0 and all attack records are marked 1. Next, we perform categorical encoding by converting all columns to the “category” datatype, which consistently maps every data value to an integer code. We use the train_test_split() method from the scikit-learn library [7] to split the dataset into a train set and a test set with a test size of 30%, which results in 231,648 records in the train set and 99,278 records in the test set for supervised learning. Since the Autoencoder is trained only on benign data, we use 115,814 records for unsupervised learning.
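To make the preprocessing concrete, the following is a minimal sketch using pandas and scikit-learn; the file path and the label value “legitimate” are assumptions, as the paper does not list them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path and label value; MQTTset's actual CSV layout may differ.
df = pd.read_csv("mqttset.csv")

# Binary relabeling: benign -> 0, all five attack types -> 1.
df["target"] = (df["target"] != "legitimate").astype(int)

# Categorical encoding: cast every feature column to "category" and keep
# the integer codes, so all values become integers.
for col in df.columns.drop("target"):
    df[col] = df[col].astype("category").cat.codes

X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```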

Dataset Splitting for Federated Learning. In federated learning, the data resides on the devices that run the machine learning models locally. To simulate this setup, we distribute the data among the clients. We split MQTTset in an Independent and Identically Distributed (I.I.D.) manner: since the data samples in this dataset are not dependent on each other, they can be distributed independently, making an I.I.D. split a natural choice. This splitting is used only for our supervised learning experiments. The dataset is split based on the target column, which identifies each record as attack or benign. When data is allotted to a client, it is assigned by the index of the data record. We use a dictionary to store each client's data as a \({<}key, value{>}\) pair, where the client is the key and the value is the set of record indices assigned to that client. This dictionary contains only record indices, not the actual data, so as the next step we assign each client the actual records corresponding to its indices. To do this, we use the PyTorch data loading utility DataLoader together with a custom Dataset class that returns the actual data record for an input index. A sketch of this splitting procedure is shown below.
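This is a minimal sketch of the split, reusing the preprocessed X_train and y_train from the sketch above; the class and function names are ours, not the paper's.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

def iid_split(num_records, num_clients):
    """Assign each client an equal-sized, random, disjoint set of indices."""
    shards = np.array_split(np.random.permutation(num_records), num_clients)
    return {client: shard.tolist() for client, shard in enumerate(shards)}

class MQTTsetSlice(Dataset):
    """Returns the actual (features, label) record for an input index."""
    def __init__(self, features, labels, indices):
        self.features, self.labels, self.indices = features, labels, indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        idx = self.indices[i]
        return self.features[idx], self.labels[idx]

features = torch.tensor(X_train.values, dtype=torch.float32)
labels = torch.tensor(y_train.values, dtype=torch.float32)
client_indices = iid_split(len(features), num_clients=10)
loaders = {c: DataLoader(MQTTsetSlice(features, labels, idxs),
                         batch_size=64, shuffle=True)
           for c, idxs in client_indices.items()}
```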

2.2 Approach Overview

This section provides an overview of our approach for detecting attack traffic in IoT networks. In the federated framework, we implement both supervised and unsupervised learning to compare how attack detection performs on labeled and unlabeled data. 1) For supervised learning, we use a deep neural network. The performance of the global model in a federated framework depends on the model averaging algorithm used to improve it, so we implement three such algorithms - FedSGD, FedAvg, and FedProx - to understand their differences and learn which yields the best results. 2) For unsupervised learning, we use an Autoencoder. Here we use FedAvg with the Adam optimizer, because it produced the best results among the algorithms we implemented for supervised learning.

To compare with the federated framework, we also implement supervised learning and unsupervised learning in a centralized framework. In centralized learning, we also use DNN for labeled data and Autoencoder for unlabeled data. As the global model for centralized learning is trained directly on the centralized data, there are no averaging algorithms required in this framework.

Federated learning differs from traditional centralized learning in that the server and clients interact only during communication rounds, during which the clients exchange models while the data stays local to each client. Figure 1a depicts how we use federated learning in IoT networks. During the initialization phase, each client receives its share of the distributed MQTTset for local training. The global model is initialized but not trained on any data; its structure is the same as that of the local models, and each client's local model is initialized from it. How the clients are trained and how the global model is averaged depend on the specific algorithm used. The following steps are executed during each communication round: 1) each client receives a copy of the global model; 2) each client trains a local model on its local dataset over e epochs and l local batches; 3) at the end of the round, the updated local models are sent back to the central server, where they are aggregated into a new global model that improves on the previous round. Steps \(1-3\) are repeated every round, so the global model learns indirectly from the individual local models without any data being exchanged. A sketch of one such round follows.
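The following high-level sketch shows one communication round implementing steps 1-3; the helper names train_local and average_weights are illustrative placeholders assumed to be defined elsewhere.

```python
import copy

def communication_round(global_model, clients, train_local, average_weights):
    """One round: distribute the global model, train locally, aggregate."""
    local_states = []
    for client in clients:
        local_model = copy.deepcopy(global_model)  # step 1: client gets a copy
        train_local(local_model, client)           # step 2: e local epochs
        local_states.append(local_model.state_dict())
    # step 3: aggregate the local models into an improved global model
    global_model.load_state_dict(average_weights(local_states))
    return global_model
```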

The evaluation phase is depicted in Fig. 1b. In this step, we evaluate the global model on the MQTTset test data. Because the global model is improved after each round by averaging the local client models according to the chosen algorithm, it learns about the entire network through the averaged models. The accuracy of the model is determined from the number of correct predictions.

Using unsupervised Autoencoders rests on the assumption that attack traffic has features that differ from benign traffic. During the training phase, we train the autoencoder only on benign traffic, optimizing it to minimize the reconstruction loss. During the evaluation phase, the test data contains both benign and attack traffic, and the global model computes the reconstruction loss for both. Because the autoencoder was trained only on benign traffic, attack traffic yields a noticeably higher reconstruction loss. We determine a threshold value from the reconstruction loss of the benign traffic and use it to classify traffic as benign or attack.

Fig. 1. Federated training and evaluation phase

2.3 Federated Learning Algorithms

For supervised federated learning, we implemented three federated averaging algorithms - FedSGD, FedAvg, and FedProx - for local model aggregation to improve the global model.

FedSGD. In federated learning, the global model can be improved by averaging the gradients or weights of the local client models. In the FederatedSGD (FedSGD) algorithm [8], the model weights are updated according to Eq. 1 [8],

$$\begin{aligned} w_{new}=w_{old}-(\eta * g), \end{aligned}$$
(1)

where \(w_{new}\) represents the new weights of the model, \(w_{old}\) represents the weights before the update, \(\eta \) represents the learning rate, and g represents the gradients.

Specifically, for the global model, the weights are updated as described in Eq. 2 [8],

$$\begin{aligned} w_{new}=w_{old}-(\eta * (\varSigma _{k=1}^K\frac{n_k}{n} * g_k )), \end{aligned}$$
(2)

where n represents the total number of data samples across all clients, and \(n_k\) and \(g_k\) represent the number of data samples and the gradients on the \(k^{th}\) client, respectively.

Within a single epoch, the model weights are updated while the gradients, which derive from the loss function used during the epoch, accumulate unless they are explicitly cleared. Accumulated gradients can cause two problems: the vanishing gradient problem, where the gradients become too small, and the exploding gradient problem, where they grow too large. It is therefore important to clear the gradients at the beginning of each epoch, as illustrated in the sketch below.
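This is a minimal FedSGD sketch following Eq. 2, with the gradients cleared before each pass. It assumes each loader's batch size equals the client's record count, as the paper specifies for FedSGD, and reuses the loaders dictionary from the splitting sketch above.

```python
import torch

def fedsgd_round(global_model, loaders, criterion, lr, n_total):
    # Accumulate the weighted client gradients of Eq. 2.
    agg = [torch.zeros_like(p) for p in global_model.parameters()]
    for loader in loaders.values():
        X, y = next(iter(loader))               # one full-batch pass
        global_model.zero_grad()                # clear stale gradients
        criterion(global_model(X).squeeze(), y).backward()
        n_k = len(loader.dataset)
        for g, p in zip(agg, global_model.parameters()):
            g += (n_k / n_total) * p.grad       # (n_k / n) * g_k
    with torch.no_grad():                       # w_new = w_old - eta * sum
        for p, g in zip(global_model.parameters(), agg):
            p -= lr * g
    return global_model
```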

FedAvg. In the FederatedAveraging (FedAvg) algorithm [8], the global model is improved by averaging the model weights received from the individual clients. A ClientUpdate function accepts a virtual device and a client dictionary as arguments; each client dictionary contains the model, dataset, criterion, optimizer, and loss. ClientUpdate is run on each client to train its model on the local dataset and to update the model through the client dictionary. During training, the gradients are cleared, the loss between the actual and predicted values is calculated, and the weights are updated. When ClientUpdate returns, the trained model is stored in the client dictionary and sent back to the global server, where a model averaging function averages the weights of the models trained on each client's local dataset. A sketch of both functions is given below.
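This sketch approximates ClientUpdate and the weight averaging step; the client dictionary here holds the model, a data loader, the criterion, and the optimizer, loosely mirroring the paper's model/dataset/criterion/optimizer/loss fields, and everything else is an assumption.

```python
import copy

def client_update(client, epochs=10):
    model, loader = client["model"], client["loader"]
    optimizer, criterion = client["optimizer"], client["criterion"]
    for _ in range(epochs):
        for X, y in loader:
            optimizer.zero_grad()                    # clear gradients
            loss = criterion(model(X).squeeze(), y)  # actual vs. predicted
            loss.backward()
            optimizer.step()                         # update weights
    client["loss"] = loss.item()                     # record final loss
    return client

def average_weights(states):
    """Element-wise mean of the clients' state_dicts (equal-sized shards)."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        for other in states[1:]:
            avg[key] = avg[key] + other[key]
        avg[key] = avg[key] / len(states)
    return avg
```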

FedProx. FedProx [9] is a generalization of the FedAvg algorithm. In FedAvg, not all clients are included in model training during each round of the global model update, so not all client model updates are incorporated. This can reduce the overall accuracy of the global model.

To improve the accuracy of the global model, the FedProx algorithm improves on FedAvg in terms of the clients included in the global model update. Instead of excluding some clients, all clients are trained, but the number of epochs over which they are trained may vary across clients. We classify the clients in each global update round into groups: K, the set of all clients; S, the clients selected during each round of the global model update, where \(S \subseteq K\); A, the active clients that contribute to the model averaging step, where \(A \subseteq S\); and R, the rest of the clients that belong to S but not to A, i.e., \(R = S \setminus A\).

In FedAvg, the global model averages the weights from only the active clients, whereas in FedProx the R clients also contribute to the model averaging step; this is expected to improve the global model's accuracy and to do so in fewer communication rounds. FedProx adds a proximal term that captures the difference between the client model and the global model; we measure against the global model because it is better than any local client model at a given point. The proximal term improves overall accuracy by improving the model updates. In FedProx, the weights are updated as described in Eq. 3 [9],

$$\begin{aligned} w_{new}=w_{old}-(\eta * (g+ \mu (w_k - w_g))), \end{aligned}$$
(3)

where \(\mu \) is a parameter described as a re-parameterization of E [9], and \(w_k\) and \(w_g\) represent the weights of the given client model and of the global model, respectively. A sketch of the corresponding local update follows.
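This is a minimal sketch of a single FedProx local step implementing Eq. 3: the plain gradient g is augmented with the proximal term \(\mu (w_k - w_g)\) before the weight update.

```python
import torch

def fedprox_local_step(local_model, global_model, X, y, criterion, lr, mu):
    local_model.zero_grad()                    # clear stale gradients
    criterion(local_model(X).squeeze(), y).backward()
    with torch.no_grad():
        for w_k, w_g in zip(local_model.parameters(),
                            global_model.parameters()):
            # w_new = w_old - eta * (g + mu * (w_k - w_g))
            w_k -= lr * (w_k.grad + mu * (w_k - w_g))
```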

3 Experiment Setup

We train and evaluate all ML models on Google Colaboratory Pro with a GPU hardware accelerator, storing the dataset on Google Drive. To implement federated learning for both supervised and unsupervised models, we simulate the setup by creating virtual clients with PySyft, a Python library that decouples private data from model training [10]. We use PySyft version 0.2.9 and PyTorch version 1.6. PySyft provides send() and get() methods for exchanging models between the clients and the server, as sketched below.
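A minimal sketch of this simulation with the PySyft 0.2.x API; the worker ids are illustrative, and model is assumed to be a PyTorch nn.Module defined elsewhere.

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)  # extends PyTorch tensors with federated ops
clients = [sy.VirtualWorker(hook, id=f"client_{i}") for i in range(10)]

# Ship a model to a virtual client, then retrieve it after local training.
remote_model = model.send(clients[0])
# ... remote training happens here ...
model = remote_model.get()
```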

For supervised learning, we implement a DNN whose structure is summarized in Fig. 2. The activation function transforms the weighted sum of inputs into the output of a node; we use the rectified linear unit (ReLU) in the hidden layers. For the output layer, we use a Sigmoid activation function, which maps the result to a value between 0 and 1. We also use dropout to randomly ignore neurons during training and so prevent the model from overfitting. A sketch of such a model is shown below.
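The sketch below matches this description; the layer widths and dropout rate are hypothetical, since the exact summary appears in Fig. 2. Note that with BCEWithLogitsLoss (used for FedSGD), the final Sigmoid would be omitted, as that loss applies it internally.

```python
import torch.nn as nn

class SupervisedDNN(nn.Module):
    def __init__(self, n_features=33, hidden=64, p_drop=0.2):  # hypothetical sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output in [0, 1]
        )

    def forward(self, x):
        return self.net(x)
```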

The autoencoder implemented in this paper consists of two fully connected layers with ReLU and Leaky ReLU activation functions. The decoder mirrors the encoder, with the neurons arranged in the opposite order, so that the number of neurons in the encoder's input layer and the decoder's output layer is the same, as expected. A sketch follows.
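A sketch of this structure with hypothetical hidden sizes:

```python
import torch.nn as nn

class FedAutoencoder(nn.Module):
    def __init__(self, n_features=33, hidden=16, code=8):  # hypothetical sizes
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, code), nn.LeakyReLU(),
        )
        # The decoder mirrors the encoder, restoring the input width.
        self.decoder = nn.Sequential(
            nn.Linear(code, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```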

Fig. 2. Model summary for supervised DNN

4 Experiment Results

We present the experiment results in this section, using the following performance metrics for model evaluation. 1) Accuracy is the ratio of correct predictions to all predictions made; since we use a balanced dataset in our experiments, accuracy is a suitable metric. Logarithmic loss is used to penalize false classifications; a low loss corresponds to higher classifier accuracy. 2) A Confusion Matrix (CM) describes the complete model performance. It contains four important values: true positives, where both the actual and predicted classification is “attack”; true negatives, where both the actual and predicted classification is “benign”; false positives, where the actual classification is “benign” but the predicted one is “attack”; and false negatives, where the actual classification is “attack” but the predicted one is “benign”. The diagonal values reflect the accuracy of the model [11]. 3) The Area Under the Curve (AUC) is used for binary classification problems. It indicates the probability that a randomly chosen positive record is ranked higher than a randomly chosen negative record, and it is plotted from the false positive rate and the true positive rate. Greater values indicate a better-performing model. 4) The F1 score measures the accuracy of the test as the harmonic mean of precision and recall, capturing both the preciseness and the robustness of a model. As with AUC, the higher the F1 score, the better the model [11]. A sketch of computing these metrics is given below.
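These metrics can be computed with scikit-learn; in this sketch, y_prob is assumed to be the model's sigmoid output on the test set and y_test the true labels.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

y_pred = (y_prob >= 0.5).astype(int)  # threshold the sigmoid outputs
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Log loss:", log_loss(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
print("F1:", f1_score(y_test, y_pred))
```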

Fig. 3. FedSGD (supervised) with learning rate = 0.0001, epochs = 200

4.1 Federated Supervised Learning

For federated supervised learning, we train each client over local epochs and then send the model parameters to the global model for averaging. In our experiments, we use 10 clients, each trained locally over 10 epochs. Which parameters are averaged at the global model varies by algorithm: some algorithms train clients over multiple local epochs and average the models across communication rounds, while others train clients for a single epoch at a time, so model averaging takes place after every epoch and there is no separate notion of rounds. We discuss these details together with the results for each algorithm in the following sections.

Fig. 4. Accuracy of different federated learning algorithms

FedSGD. In the FedSGD algorithm, the local clients are trained for a single epoch and gradient averaging is done at the global model after every epoch. At the client level, we define parameters such as the learning rate and batch size; for FedSGD, the batch size is the number of records on each client. We conducted a total of 6 experiments to evaluate the model using three learning rates - 0.001, 0.0001, and 0.00001 - training for 100 and 200 epochs in each case (equivalent to 10 and 20 rounds of 10 epochs each) to be comparable with the other algorithms. The highest accuracy of 78% is obtained for the model trained with a learning rate of 0.0001 and 200 epochs. We use the PyTorch SGD optimizer [12] and BCEWithLogitsLoss [13] to calculate the loss. The results of this model are shown in Fig. 3. According to Fig. 3a, the model has a high false negative count, classifying a large number of attack records as benign. The AUC plot in Fig. 3b shows a value of 0.78.

In Fig. 4a, we analyze the performance of the model as the learning rate and number of epochs change. The highest accuracy of 78% is achieved with 200 epochs and a learning rate of 0.0001. With a learning rate of 0.00001, the model does not perform as well as with the other learning rates. We also see that the number of epochs affects the accuracy for learning rates of 0.001 and 0.0001. Based on these results, we conclude that a greater number of epochs yields better accuracy.

FedAvg. In the FedAvg algorithm, the local clients are trained over multiple epochs and the global model is averaged across rounds; each client is trained over 10 epochs. In our experiments, we pair the FedAvg algorithm with different optimizers to compare performance, performing 6 experiments each with the SGD and Adam optimizers and evaluating three learning rates - 0.001, 0.0001, and 0.00001. We observe that the highest accuracy of 95% is obtained for the model trained with a learning rate of 0.0001 over 20 rounds using the Adam optimizer.

Fig. 5. FedAvg SGD optimizer (supervised) with learning rate = 0.001, epochs = 10, rounds = 20

The results of these models are shown in Fig. 5 and Fig. 6. For the FedAvg-SGD optimizer, the classification report in Fig. 5a shows an accuracy of 78%, while the AUC value in Fig. 5b is 0.7844, similar to FedSGD. The training loss in Fig. 5d, shown for a single client, decreases after the first round of model averaging and remains stable over subsequent rounds. Based on the confusion matrix, we do not see a significant difference between the FedSGD and FedAvg-SGD results in terms of accuracy and classification, although a higher number of benign records is correctly classified in this case.

Figure 6 shows the results obtained using the Adam optimizer for model training with the FedAvg averaging algorithm. FedAvg differs from FedSGD in that it averages model weights rather than model gradients. Compared to the previous results, we see a significant increase in model accuracy, and the overall model performance also improves. In Fig. 6a, the classification report shows good F1 scores for both attack and benign records. Similarly, the confusion matrix in Fig. 6c has higher true positive and true negative values. The AUC value of 0.9535 in Fig. 6b indicates that the model has a high probability of correctly classifying attack and benign data. These results show that FedAvg with the Adam optimizer is the better-performing model.

Figure 4b shows that when the SGD optimizer is used with FedAvg, the highest accuracy of 78% is achieved with a learning rate of 0.001 and the model averaged over 20 rounds. The accuracy varies with the learning rate and the number of global averaging rounds. These results are comparable to FedSGD, and accuracy increases as the learning rate and the number of rounds increase; the model does not perform well with a learning rate of 0.00001. In FedSGD, the global model is averaged after every epoch on each client, whereas in FedAvg the averaging occurs after several epochs on each client; in our case, we fix the number of epochs on each client at 10.

Fig. 6. FedAvg Adam optimizer (supervised) with learning rate = 0.001, epochs = 10, rounds = 20

Fig. 7. FedProx (supervised) with learning rate = 0.001, epochs = 20, rounds = 10

In Fig. 4c, we observe a large improvement in accuracy compared to the other models. In contrast to SGD, the FedAvg algorithm with the Adam optimizer performs better at the same learning rate of 0.001, averaged over 20 rounds. The highest accuracy is 95%, which is also the highest accuracy obtained by any supervised model in the federated framework. Clearly, the choice of optimizer during model training makes a difference to model performance. Even with the Adam optimizer, however, the accuracy is only 78.30% at a learning rate of 0.00001 for both 10 and 20 rounds, and the number of rounds over which the global model is averaged does not affect accuracy significantly. We therefore conclude that the choice of optimizer and learning rate chiefly determines the performance of a supervised DNN trained with federated learning and aggregated with the FedAvg algorithm.

FedProx. The FedProx algorithm uses an implementation similar to FedSGD and FedAvg but adds an extra tunable parameter \(\mu \), similar in spirit to the learning rate. We train the model using three learning rates - 0.001, 0.0001, and 0.00001 - with 10 and 20 rounds per experiment, each client training locally for 10 epochs. From this total of 12 experiments we first learn that a learning rate of 0.001 yields the highest accuracy for FedProx. We then perform additional experiments to understand how \(\mu \) affects the algorithm's performance, using \(\mu \) values of 0, 0.5, and 0.9. The results below are for \(\mu = 0.5\), in which half of the clients train for the full 10 epochs while the rest train for a random number of epochs below 10.

The results for FedProx are shown in Fig. 7. Figure 7a shows that the confusion matrix is similar to that of the FedAvg-SGD optimizer. FedAvg and FedProx are driven by similar model averaging logic; the difference lies in the number of clients that participate in each round, and for model training with the SGD optimizer we see a similar result. The additional complexity of FedProx lies in a customized optimizer with client-dependent parameters, since clients in FedProx are trained over a variable number of epochs. Figure 7b shows an AUC value of 0.7607, similar to the FedAvg algorithm with the SGD optimizer.

As shown in Fig. 4d, the global model achieves an accuracy of 76% with a learning rate of 0.001 over 20 rounds of training. As with the other models described earlier, an increased number of rounds improves model accuracy; the accuracy depends on both the learning rate and the number of rounds.

4.2 Federated Unsupervised Learning

We performed 9 experiments to train the unsupervised autoencoder in a federated setting with various parameters, using three learning rates - 0.001, 0.0001, and 0.00001 - and 10, 20, and 30 rounds of training, with each client training locally for 10 epochs. The model trained for 20 rounds with a learning rate of 0.001 achieved the highest accuracy of 80%. We use the Adam optimizer [14] and the MSELoss function [15] from PyTorch to calculate the loss, and the FedAvg algorithm for model averaging.

Since the reconstruction loss for attack traffic is higher than for benign traffic, we use the reconstruction loss for classification. The reconstruction losses for benign and attack traffic are shown in Fig. 8. We classify traffic as benign or attack based on a threshold value, computed from the mean \(\mu \) and standard deviation \(\sigma \) of the benign reconstruction loss as described in Eq. 4 below.

$$\begin{aligned} Threshold = \mu + \sigma \end{aligned}$$
(4)

The results of this model are shown in Fig. 9. Figure 9c shows that, at the selected threshold, the autoencoder correctly classifies a larger number of benign records than attack records. Choosing the right threshold is important for classification: a higher threshold would correctly classify all benign records, but at the cost of misclassifying more attack records. Since the unsupervised model is trained on an unlabeled dataset, the reconstruction loss is what makes it possible to compute the threshold required for classification. We find that determining the threshold with Eq. 4 yields better results than a randomly chosen value; a sketch of this procedure is given below.
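This sketch applies the Eq. 4 threshold and the resulting classification; autoencoder, X_benign, and X_test_t are assumed to be the trained model and feature tensors from the sketches above.

```python
import numpy as np
import torch

def reconstruction_loss(model, X):
    """Per-record mean squared reconstruction error."""
    with torch.no_grad():
        return ((model(X) - X) ** 2).mean(dim=1).numpy()

benign_loss = reconstruction_loss(autoencoder, X_benign)
threshold = benign_loss.mean() + benign_loss.std()  # Eq. 4: mu + sigma

test_loss = reconstruction_loss(autoencoder, X_test_t)
y_pred = (test_loss > threshold).astype(int)        # 1 = attack, 0 = benign
```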

Fig. 8. Reconstruction loss for federated autoencoder (unsupervised)

Fig. 9. Results - federated autoencoder (unsupervised)

4.3 Centralized Supervised Learning

We trained a deep neural network in a centralized setting, where the DNN model is trained on the complete dataset. The highest accuracy of 96% is obtained for the model trained with a learning rate of 0.0001 over 500 epochs. We use the Adam optimizer [14] and the MSELoss function [15] from PyTorch to calculate the loss. The results of this model are shown in Fig. 10. Overall, the centralized DNN has high classification accuracy and a high F1 score for both attack and benign labels.

Fig. 10. Results - centralized DNN (supervised)

Fig. 11. Results - centralized autoencoder (unsupervised)

4.4 Centralized Unsupervised Learning

For unsupervised learning, we train an autoencoder. The model trained with a learning rate of 0.0001 achieves the highest accuracy of 80% after 500 epochs. The results are shown in Fig. 11 and are similar to those of the unsupervised autoencoder trained in the federated framework: the model classifies attack and benign records with an accuracy of 80%.

4.5 Performance Comparison

In this section, we compare the performance of supervised and unsupervised machine learning models trained in centralized and federated frameworks.

Federated Supervised Learning - Averaging Algorithms. For federated supervised learning, we compare the accuracy of the global model averaging algorithms. As shown in Fig. 12, the FedAvg algorithm, with a DNN trained on each client using the Adam optimizer for 10 epochs and a learning rate of 0.001 over 20 rounds, performs best with an accuracy above 95%. Among the algorithms that use the SGD optimizer, FedProx has the lowest accuracy, close to 75%.

Fig. 12. Federated supervised learning - model averaging algorithms

Federated - Supervised vs. Unsupervised Learning. We compare the performance of supervised and unsupervised learning in the federated framework. As shown in Fig. 13a, supervised learning performs better than unsupervised learning, owing to the labeled data available for model training and evaluation: the accuracy of the federated supervised DNN is around 95%, while that of the federated unsupervised autoencoder is close to 80%. For the federated unsupervised autoencoder, we use the FedAvg-Adam results for global model averaging. We conclude that supervised learning outperforms unsupervised learning in a federated framework for IoT attack detection.

Unsupervised - Centralized vs. Federated Learning. We compare the accuracy of the unsupervised autoencoders implemented in the federated and centralized frameworks, again using the FedAvg-Adam results for global model averaging in the federated case. As shown in Fig. 13b, there is no significant difference between the results: both accuracies are close to 80%. Thus, for real-time analysis of unlabeled data, we can use federated learning for IoT attack detection while protecting data privacy.

Comparison - Supervised, Unsupervised, Federated, Centralized Learning. We compare the accuracy of supervised and unsupervised learning in the federated and centralized frameworks in Fig. 14 and summarize our analysis as follows: 1) For supervised learning, federated and centralized frameworks achieve similar accuracy close to 95%; 2) For unsupervised learning, federated and centralized frameworks achieve similar accuracy close to 80%; 3) Among the federated averaging algorithms used for supervised learning, FedAvg using the Adam optimizer achieves the highest accuracy of 95%.

We also see a significant improvement in the accuracy of deep neural networks implemented in a federated framework on MQTTset. Ferrag et al. [6] achieved an accuracy of 82.60% for a federated DNN global model on IID MQTTset after 50 rounds with 10 clients, whereas we achieve an accuracy of 95% after only 20 rounds. We attribute this faster convergence to the Adam optimizer and the other parameters used during model training.

Fig. 13. Performance comparison of different categories

Fig. 14. Comparison - supervised, unsupervised, federated, centralized learning

Since the global model is averaged from the local models, the expectation is that, through federated learning, it learns about the entire network and should therefore achieve results similar to the centralized framework. Our experiments confirm this: the centralized and federated frameworks yield similar accuracy. Our centralized supervised deep neural network also performs well, with an accuracy of 96% compared to the 90.44% achieved by the neural network model trained on a balanced dataset in [5]; we attribute this improvement to our model configuration.

We conclude that in federated and centralized frameworks, supervised DNN and unsupervised autoencoders are effective in detecting and classifying attacks in MQTTset.

5 Conclusion

In this paper, we used a public dataset called MQTTset to implement deep learning classifiers for IoT traffic. We implemented supervised and unsupervised deep learning models in a federated framework and compared their performance with a centralized implementation. For federated learning, we implemented three federated averaging algorithms - FedSGD, FedAvg, and FedProx - and compared their performance to determine which algorithm works best in our experimental setup. Our results show that federated learning with supervised and unsupervised deep learning models is effective at detecting attacks in IoT traffic.