Robust Federated Learning for execution time-based device model identification under label-flipping attack

The explosion in computing device deployment experienced in recent years, motivated by advances in technologies such as Internet-of-Things (IoT) and 5G, has led to a global scenario with increasing cybersecurity risks and threats. Among them, device spoofing and impersonation cyberattacks stand out due to their impact and the usually low complexity required to launch them. To solve this issue, several solutions have emerged to identify device models and types based on the combination of behavioral fingerprinting and Machine/Deep Learning (ML/DL) techniques. However, these solutions are not appropriate for scenarios where data privacy and protection are a must, as they require data centralization for processing. In this context, newer approaches such as Federated Learning (FL) have not been fully explored yet, especially when malicious clients are present in the scenario setup. The present work analyzes and compares the device model identification performance of a centralized DL model with an FL one while using execution time-based events. For experimental purposes, a dataset containing execution-time features of 55 Raspberry Pis belonging to four different models has been collected and published. Using this dataset, the proposed solution achieved 0.9999 accuracy in both setups, centralized and federated, showing no performance decrease while preserving data privacy. Later, the impact of a label-flipping attack during the federated model training is evaluated, using several aggregation mechanisms as countermeasures. Zeno and coordinate-wise median aggregation show the best performance, although their performance greatly degrades when the percentage of fully malicious clients (all training samples poisoned) grows over 50%.


Introduction
Currently, there exist a vast number of devices deployed all over the world, from smart cars, traffic lights, and security systems to smart homes and industries. The IoT market has grown to a total of 31 billion connected devices by 2020, with a forecast of ≈30 billion devices connected to each other by 2023, according to Cisco [7]. One of the main reasons for this growth is the fourth industrial revolution or Industry 4.0, with the explosion of a set of technologies and paradigms such as 5G, machine and deep learning (ML/DL), robotics, and cloud computing.
The emergence of such technologies poses new challenges to be solved in order to ensure a safe and efficient environment [17]. In this sense, there are billions of connected devices, many of them performing critical tasks where failures can be fatal, such as autonomous car driving or industrial operations. In addition, the growing popularity of these technologies makes them a desirable target for cybercriminals. Among the possible security threats affecting resource-constrained devices, device impersonation is one of the most serious problems for large organizations with proprietary hardware, where one device model could be impersonated for malicious purposes such as industrial espionage. In addition, there are a multitude of counterfeit devices on the market, some of which are difficult to differentiate from the original [15].
To solve these issues, device model and type identification based on performance fingerprinting arises as a solution [20]. The main benefit of device model identification is to prevent third-party attacks such as spoofing, as well as to identify malicious or counterfeit devices. Although there are numerous works in the literature exploring the identification of models from different performance characteristics, such as execution time, network connections or system logs, and leveraging ML/DL for data processing, these solutions mostly require data centralization, making them unsuitable for scenarios where data leakage protection and privacy are critical. In this sense, Federated Learning (FL) based techniques have recently gained enormous prominence [24]. In FL approaches, the training data of the ML/DL models remain private while the locally trained models are shared. Later, these models are aggregated (usually by a central party) into a joint model that goes back to the clients for further training, repeating the process in a cyclic fashion. This approach improves both the privacy of the data, as it does not leave the client, and the communication overhead, as sharing only model parameters is usually less resource-consuming than sharing the complete data used for training.
In addition, there are few datasets modeling the performance of IoT devices for identification [20], and none of them is focused on execution time performance or FL-based scenarios. Moreover, most of the current solutions in the literature do not explore the impact of possible adversarial attacks targeting the ML/DL models during their generation and deployment [16]. These attacks may happen when one of the clients participating in the federation acts maliciously, sending corrupted model updates. These problems have additional importance in FL setups, where the clients are no longer under the control of the entity generating the joint ML/DL model.
Therefore, this work explores the following three main areas to improve the completeness of the literature: (i) the identification of device models using centralized Machine Learning (ML) algorithms and execution time data, (ii) the decentralization of this training using Federated Learning (FL) techniques, and (iii) the use of Adversarial Machine Learning (AML) techniques to evaluate and improve the robustness of the generated models. In this sense, its main contributions are:
• An execution time-based performance dataset collected from 55 different Raspberry Pi (RPi) devices of 4 different models, and intended for model identification. This dataset is generated using physical devices under normal functioning, reflecting a real scenario where many devices are operating.
• The comparison between a centralized and a federated Multi-Layer Perceptron (MLP) model with identical configuration, only changing its training approach. It is shown how the federated setup maintains an almost identical model identification accuracy of 0.9999, without losing performance and while improving data leakage protection and privacy.
• The comparison of different aggregation methods as countermeasures for the federated model under a label-flipping attack. Federated averaging, coordinate-wise median, Krum and Zeno aggregation methods are compared, with median and Zeno showing the best results regarding attack resilience.
The remainder of this paper is structured as follows. Section 2 describes the closest works in the literature, motivating this research. Section 3 explains the procedure followed to extract the model identification data. Later, Section 4 compares the performance of a DL-based classifier when it is trained from a centralized and from a federated approach. Section 5 explains the adversarial setup followed to test the solution resilience against attacks. Finally, Section 6 draws the conclusions extracted from the present research and future lines to explore.

Related Work
This section reviews how the problem of device identification has been addressed to date from different approaches and techniques. Likewise, some works in the literature on Federated Learning and Adversarial Machine Learning are analyzed.
Device type and model identification has been widely explored in the literature, with varied data sources and ML/DL-based processing techniques [20]. As one of the closest works to the present one, the authors of [4] proposed a novel challenge-response fingerprinting framework called STOP-AND-FRISK (S&F) to identify classes of Cyber-Physical Systems (CPS) devices and complement traditional CPS security mechanisms based on hardware and OS/kernel. The authors explain that unauthorized and spoofed devices may include manipulated pieces of software or hardware components that may adversely affect CPS operations or collect vital CPS metrics from the network. Another interesting paper showing a fingerprinting technique using hardware performance is [19]. Such a technique is based on the execution times of instruction sequences available in API functions. Due to its simplicity, this method can also be performed remotely. Additionally, the network is the main data source employed in the literature for device model and type identification [13], as its data can be collected from an external gateway.
Regarding the application of FL in device identification, the authors of [8] leveraged FL for device type identification using network-based features. Here, the authors experienced a slightly reduced performance compared to a centralized setup, 0.851 F1-score in the centralized setup and 0.849 in the federated one, but the training process was faster and safer. Additionally, in [14], the authors performed application type classification based on network traffic using FL to build the models. Although the authors of [21] proposed a distributed solution for network-based model identification, data is shared with an aggregator that performs clustering for model inference. Therefore, no privacy is preserved in this solution.
Moreover, datasets available in the literature for device type or model identification are focused on dimensions such as network connections [3] or radio frequency fingerprinting [1]. However, there are no execution time-based datasets modeling device performance for identification, just some benchmark datasets focused on other tasks [22].
Concerning adversarial ML in FL, the authors of [10] highlighted the impossibility for the central server to control the clients of the federated network. A malicious client could send poisoned model updates to the server in order to worsen learning performance. They proposed a new framework for federated learning in which the central server learns to detect and remove malicious model updates using a detection model. Finally, the authors of [18] considered the presence of adversaries in their solution for FL-based network attack detection. However, no model identification experiments were carried out.
In conclusion, each research topic, namely hardware-based model identification, federated learning, and adversarial ML, has been explored separately. However, to the best of our knowledge, and as Table 1 shows, there is no work in the literature analyzing device model identification from a federated learning perspective. Besides, there is no dataset focused on model identification based on execution time features. Furthermore, there is no solution evaluating the impact of adversarial attacks when some clients are malicious, together with the main aggregation-based attack mitigation techniques.

Table 1 entries (work, approach, results):
(Hardware-based) 0.9873 average accuracy using correlation-based algorithms to recognize 11 device classes.
[19] (Hardware-based) +200 computers individually identified based on execution-time statistical comparison.
[8] (Network-based) 0.882 accuracy using a federated LSTM network to identify 10 IoT device types.
[14] (App identification) 0.92 accuracy using a federated CNN to identify user-level applications.
[21] (Network-based) (Distributed) ≈0.97 accuracy for clustering-based IoT device type classification.
This work (Hardware-based) (Label-flipping) 0.9999 accuracy identifying RPi models and adversarial impact analysis.

Scenario and Dataset Creation
This section describes the scenario and the procedure followed to generate the execution time dataset used in the present work. Besides, it provides some insights on the data distribution that can be useful to understand the model identification performance.

Scenario description
In total, a setup of 55 Raspberry Pis of different models but with identical software images is employed for data collection, running Raspbian 10 (buster) 32 bits as OS and Linux kernel 5.4.83. The generated dataset is composed of 2,750,000 vectors (55 devices * 50000 vectors per device). Each vector has two labels associated, one regarding the individual device that generated it, and another regarding the model of this device. Data collection was performed under normal device functioning and default frequency and power configuration, where the CPU frequency is automatically adjusted according to the workload. The list of devices contained in the dataset is shown in Table 2.

Dataset creation
The generated dataset has been made publicly available [6] so that other authors can download it and use it in their research. The published data includes identifiers for both the RPi model and the individual device, so new research could be done regarding individual device identification.
For the device performance dataset generation, the CPU performance of the device was leveraged as data source. In this sense, the time to execute a software-based random number generation function was measured in microseconds.
To minimize the impact of noise and other processes running in the device, the monitored function was executed in groups of 1000 runs, for a total of 50000 groups per device. Then, for each 1000-run group, a set of statistical features was calculated, generating a performance fingerprint composed of 50000 vectors per device. In total, 13 statistical features are calculated: maximum, minimum, mean, median, standard deviation, mode, sum, minimum decrease, maximum decrease, decrease summation, minimum increase, maximum increase and increase summation. Decrease and increase values are calculated as the negative or positive difference between two consecutive values in each 1000-run group. Besides, the device model is added as label. Table 3 shows an example of a vector in the dataset belonging to a Raspberry Pi 4 device.
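The feature computation described above can be illustrated with a minimal sketch. It is not the collection code used for the published dataset: the function name `fingerprint_features`, the feature keys, the use of the population standard deviation, and the value 0 for groups without decreases or increases are all assumptions made for illustration.

```python
from collections import Counter
import statistics


def fingerprint_features(times_us):
    """Compute the 13 statistical features for one group of execution times (microseconds)."""
    # Differences between consecutive measurements within the group.
    diffs = [b - a for a, b in zip(times_us, times_us[1:])]
    decreases = [d for d in diffs if d < 0]  # negative differences
    increases = [d for d in diffs if d > 0]  # positive differences

    # Assumption: a group with no decreases (or increases) yields 0 for those features.
    def agg(values, fn):
        return fn(values) if values else 0

    return {
        "max": max(times_us),
        "min": min(times_us),
        "mean": statistics.mean(times_us),
        "median": statistics.median(times_us),
        "std": statistics.pstdev(times_us),  # assumption: population standard deviation
        "mode": Counter(times_us).most_common(1)[0][0],
        "sum": sum(times_us),
        "min_decrease": agg(decreases, min),
        "max_decrease": agg(decreases, max),
        "sum_decrease": agg(decreases, sum),
        "min_increase": agg(increases, min),
        "max_increase": agg(increases, max),
        "sum_increase": agg(increases, sum),
    }


# Toy group of 5 measurements (a real group would contain 1000 runs).
group = [10, 12, 11, 11, 15]
features = fingerprint_features(group)
```

In a real collection loop, this function would be applied once per 1000-run group, producing one 13-dimensional vector each time.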

Data exploration
Figure 1 shows the data distribution for min, max, mean and median features. It can be observed how the values vary according to the model that generated the vector, resulting in a presumably good model identification performance.

Centralized vs Federated Model Identification Performance
This section seeks to evaluate, firstly, the performance of the generated dataset when identifying the different device models in a DL-based centralized setup and, secondly, the performance variation when the model is generated in a distributed manner, following an FL-based approach.

Centralized setup
For the centralized experiment, the dataset described in Section 3 is divided into 80% for training/validation and 20% for testing, without data shuffling. Min-max normalization is then applied, using the training data to set the boundaries.
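As an illustration of this preprocessing, the sketch below performs an unshuffled 80/20 split and fits the min-max boundaries on the training part only. The helper names and the toy rows are hypothetical, and mapping constant features to 0.0 is an assumption.

```python
def split_without_shuffling(rows, train_fraction=0.8):
    """Keep the original order: the first 80% is training data, the rest is test data."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_min_max(train_rows):
    """Per-feature boundaries computed from the training data only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale every feature with the training boundaries (constant features map to 0.0)."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy data with two features per row.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0], [5.0, 50.0]]
train, test = split_without_shuffling(rows)
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
```

Because the boundaries come from the training data alone, test values may fall slightly outside the [0, 1] range, as happens with the last toy row here.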
To measure the centralized classification performance, a Multi-Layer Perceptron (MLP) classifier is implemented. After several iterations testing different numbers of layers and neurons per layer, the chosen MLP architecture is composed of 13 neurons in the input layer (one per feature), two hidden layers with 100 neurons each using ReLU (Rectified Linear Unit) as activation function [2], and 4 neurons in the output softmax layer (one per model class). Adam [9] was used as optimizer with a 0.001 learning rate, and 0.9 and 0.999 as exponential decay rates for the first and second-order moment estimates. Table 4 shows the details of the model. With this setup, the MLP is trained for 100 epochs, using early stopping if no validation accuracy improvement occurs in 20 epochs.
Figure 2 shows the confusion matrix resulting from the evaluation of the test dataset. As can be seen, an almost perfect identification is achieved, with only 15 samples being misclassified out of ≈550000 (0.999972 accuracy). These results are aligned with the expectations, as having different CPUs in each RPi model makes the execution time of the same functions differ between them. However, model identification performance is not the main focus of the present work, where the priority is to prove the effectiveness of a federated setup and to evaluate the impact of adversarial attacks and countermeasures.
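To make the architecture concrete, the following framework-free sketch builds the layer layout with two hidden layers described in this subsection and runs one forward pass. It is only an illustration: the Glorot-uniform initialization, the function names, and the dummy input are assumptions, and the training loop (Adam, early stopping) is omitted.

```python
import math
import random

# Layer layout from the text: 13 inputs, two hidden layers of 100, 4 softmax outputs.
LAYER_SIZES = [13, 100, 100, 4]


def init_mlp(layer_sizes, seed=0):
    """Create (weights, biases) per layer. Glorot-uniform init is an assumption."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        limit = math.sqrt(6.0 / (n_in + n_out))
        weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """ReLU on hidden layers, softmax on the output layer."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in z]
        else:
            m = max(z)  # subtract the maximum for numerical stability
            exps = [math.exp(v - m) for v in z]
            total = sum(exps)
            x = [e / total for e in exps]
    return x


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


mlp = init_mlp(LAYER_SIZES)
probs = forward(mlp, [0.5] * 13)  # dummy normalized feature vector
n_params = count_parameters(mlp)
```

With this layout the model has (13·100 + 100) + (100·100 + 100) + (100·4 + 4) = 11904 trainable parameters, which is also the size of each update exchanged in a federated setup.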

Federated scenario and results
Once the centralized model has been obtained, the decentralized model is implemented using FL to compare the performance of both approaches. The FL approach is based on horizontal FL, where the clients have datasets with the same features but different data samples.
For implementation, the IBM Federated Learning library [11] is used, which incorporates the necessary tools to perform the training in a decentralized manner.

Scenario
For the decentralization of the training phase, a scenario has been created in which the available data are distributed among 5 independent organizations. Each of them has a certain number of devices belonging to different models, but not all of them have information on all models, i.e. there are organizations that only have devices of type 4 model, others that only have devices of type 2 and 3 models, etc. Figure 3 provides the details of the device distribution in each organization. Therefore, the 5 organizations intend to generate a global model capable of identifying all the existing device models among all of them. This setup leads to a scenario of Non-IID (Non-Independent and Identically Distributed) data, which is harder to solve with FL, as model aggregation will be negatively influenced if the aggregated models are very different from each other.

Federated architecture design
In order to test the performance of an FL-based setup, it is first necessary to define the architecture to be implemented. In this sense, Figure 4 shows the organization of the different clients, which hold the data and upload their local models to the aggregator in order to cyclically build a common model capable of making predictions based on the local data of all clients.

Performance Evaluation
In order to fairly compare the models, the MLP architecture to be trained will be the same as the one used in the centralized model (see Table 4), i.e. the layers will have 13, 100, 100, 100, and 4 neurons, from input to output. As aggregation method, Federated Averaging is applied as proposed in [12]. As initialization step, the aggregation server performs two tasks: (1) to initialize the weights of the model that the clients will start to train, so all clients start from the same setup; (2) to retrieve from each client the min-max values of each feature for common dataset normalization, having a min-max normalization for each dataset $x_o$ in organization $o \in O$ defined as:

$$x'_o = \frac{x_o - \min_{o' \in O} \min(x_{o'})}{\max_{o' \in O} \max(x_{o'}) - \min_{o' \in O} \min(x_{o'})}$$

Regarding performance, Figure 6 shows the results of the test dataset evaluation, using the same dataset as in the centralized setup. Here, the results are almost identical, with only 17 errors in ≈550000 test samples and an accuracy of 0.999969. From the previous results, a main conclusion can be extracted: no performance loss has been introduced in the resultant model due to the application of an FL-based approach. Besides, as no data has left each organization in the process, the privacy of the information has been successfully preserved.
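The common normalization step can be sketched as follows: each organization reports only its per-feature minimum and maximum, and the server combines them into global boundaries that every client then applies locally. The function names and the toy values are hypothetical, and this is an illustrative reading of the step rather than the library code used in the experiments.

```python
def global_min_max(client_ranges):
    """Combine per-organization (mins, maxs) pairs into global per-feature boundaries."""
    all_mins = [mins for mins, _ in client_ranges]
    all_maxs = [maxs for _, maxs in client_ranges]
    g_min = [min(col) for col in zip(*all_mins)]
    g_max = [max(col) for col in zip(*all_maxs)]
    return g_min, g_max


def normalize(row, g_min, g_max):
    """Min-max scaling with the shared boundaries (constant features map to 0.0)."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, g_min, g_max)]


# Toy example: two organizations, two features each.
client_ranges = [
    ([2.0, 100.0], [8.0, 300.0]),  # (mins, maxs) reported by the first organization
    ([1.0, 150.0], [6.0, 500.0]),  # (mins, maxs) reported by the second organization
]
g_min, g_max = global_min_max(client_ranges)
scaled = normalize([4.5, 300.0], g_min, g_max)
```

Only the boundary values are exchanged in this scheme, so the raw samples never leave the organization that owns them.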

Adversarial Attack and Robust Aggregation
After testing the effectiveness of Federated Learning, its robustness will be tested using adversarial attacks, specifically the label-flipping technique, using different aggregation algorithms in order to see which one best fits the proposed scenario in the presence of attacks.

Label flipping attack
The label-flipping adversarial technique is applied during the training process, using the same scenario described above with the difference that this time part of the data will be poisoned.
In this sense, the federated training is carried out by poisoning 25, 50, 75 and 100% of the data of 1, 2 and 3 different organizations, representing 20%, 40% and 60% malicious clients, respectively. These configurations are used because potential malicious clients may poison only a portion of their data, in order to go undetected and make their activity more difficult to identify. So, a total of 12 adversarial scenarios have been created (4 poisoning percentages * 3 possible numbers of malicious organizations). This setup is generated by modifying the labels of the training data, changing the value of each label to a random value between 1 and 4 that is not the value of the original label. The poisoned organizations are ORG1, ORG2 and ORG4 (in that order for 1, 2, 3 malicious clients).
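A minimal sketch of this poisoning procedure is shown below. The function name `flip_labels`, the random choice of which samples are poisoned, and the fixed seed are assumptions for illustration; the text only specifies the poisoned percentage and that each flipped label takes a different random value between 1 and 4.

```python
import random


def flip_labels(labels, poison_fraction, classes=(1, 2, 3, 4), seed=0):
    """Return a copy of labels where a fraction of them is replaced by a different class."""
    rng = random.Random(seed)
    n_poison = int(len(labels) * poison_fraction)
    # Assumption: the poisoned samples are chosen uniformly at random.
    poisoned_idx = rng.sample(range(len(labels)), n_poison)
    flipped = list(labels)
    for i in poisoned_idx:
        # Pick any class except the original one.
        flipped[i] = rng.choice([c for c in classes if c != labels[i]])
    return flipped


labels = [1, 2, 3, 4] * 25  # toy local training labels of one organization
half_poisoned = flip_labels(labels, 0.5)
fully_poisoned = flip_labels(labels, 1.0)
```

Applying `flip_labels` with fractions 0.25, 0.5, 0.75 and 1.0 to the training labels of one, two or three organizations would reproduce the structure of the 12 scenarios.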
Figure 7 shows the results when FedAvg is applied as aggregation algorithm in the 12 previous adversarial scenarios (as well as when no label-flipping attack is applied).
As can be seen, aggregation by averaging offers good performance up to 50% poisoning, maintaining the accuracy over 0.9. However, accuracy drops rapidly to values close to 0% when the poisoning is 75% or higher. Therefore, FedAvg cannot be considered a robust aggregation method in the presence of the label-flipping attack. Next, three other aggregation methods are analyzed in order to check which one offers better performance.

Robust aggregation methods
Next, several aggregation methods focused on improving the model resilience to malicious clients will be evaluated and compared to the default FedAvg algorithm.

Coordinate-wise median aggregation
Coordinate-wise median [25] follows the scheme of aggregation by averaging, with the difference that the weights are combined by calculating the median of each weight across the local models. In short, following Algorithm 1, the averaging aggregation step is substituted by a median operation.
Next, Figure 8 depicts the accuracy results for the different attack setups when using median aggregation. Coordinate-wise median follows a similar pattern to FedAvg aggregation, dropping from a 50% poisoning rate onwards. However, it has performed better, especially when there is only one poisoned organization (20% malicious clients). While FedAvg dropped to 0.20-0.40, the median has remained around 0.9.
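The aggregation step can be sketched in a few lines, treating each local model as a flat list of weights. The function name and the toy values are illustrative assumptions, not the implementation used in the experiments.

```python
import statistics


def coordinate_wise_median(client_weights):
    """Aggregate local models by taking the median of each weight across clients."""
    return [statistics.median(coord) for coord in zip(*client_weights)]


# Toy example: five clients with two weights each; the last client is an outlier.
clients = [[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.2, 1.0], [9.0, -7.0]]
median_model = coordinate_wise_median(clients)
mean_model = [sum(coord) / len(coord) for coord in zip(*clients)]
```

In this toy case the median stays at the values shared by the four consistent clients, whereas the plain mean is pulled towards the outlier.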

Krum aggregation
The idea behind Krum [5] is to select as the global model the one of the m local models that is most similar to the rest. The rationale is that, even if the selected model is poisoned, the impact would be limited, since it would be similar to other models that are probably not poisoned. The aggregator calculates the sum of the distances between each model and its closest local models, and Krum selects the local model with the smallest sum of distances as the global model. Figure 9 shows the results when Krum is applied as aggregation algorithm. As can be seen, Krum has remained constant for all configurations with an accuracy of 0.6896. This is because this aggregation method chooses a single local model as the global model and discards the information from the rest of the local models. Therefore, it always chooses the same local model, which belongs to an organization that has not been poisoned, so the accuracy remains constant. Figure 10 shows that the resulting global model only recognizes device models of types 0 and 2.
Comparing this with the data distribution in Figure 3, it can be concluded that Krum is selecting the local model of organization 3 in all scenarios, losing the information regarding the classes not seen in this organization. Since this organization has not been poisoned, the resulting global model, and hence its performance, is identical regardless of the poisoning percentage.
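The selection rule can be sketched as follows, using the usual Krum formulation in which each model is scored by the sum of squared distances to its n − f − 2 nearest neighbours, with f the assumed number of malicious clients. The function names, the flat-list weight representation, and the toy values are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum(client_weights, n_malicious):
    """Return (index, weights) of the local model closest to its nearest neighbours."""
    n = len(client_weights)
    n_closest = n - n_malicious - 2  # neighbours considered per model
    if n_closest < 1:
        raise ValueError("Krum needs n - f - 2 >= 1")
    scores = []
    for i, w_i in enumerate(client_weights):
        dists = sorted(
            squared_distance(w_i, w_j)
            for j, w_j in enumerate(client_weights)
            if j != i
        )
        scores.append(sum(dists[:n_closest]))
    best = min(range(n), key=scores.__getitem__)
    return best, client_weights[best]


# Toy example: five clients, one assumed malicious; the last one is an outlier.
clients = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [5.0, 5.0]]
best_index, selected = krum(clients, n_malicious=1)
```

Because a single local model becomes the global one, everything learned by the other clients is discarded, which is consistent with the constant accuracy observed above.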

Zeno aggregation
Zeno [23] is suspicious of potentially malicious organizations and uses a ranking-based preference mechanism. The number of malicious organizations can be arbitrarily large, and the only assumption is that at least one 'clean' organization exists. Each organization is ranked based on the estimated descent of the loss function, and the algorithm then aggregates the organizations with the highest scores. The score roughly indicates the reliability of each organization. In this sense, Zeno can be seen as a combination of the Krum and averaging aggregation mechanisms.
Figure 11 shows the results when Zeno is applied as aggregation algorithm. In this case, Zeno has outperformed median and Krum aggregation when only one client is malicious (20% of the total), achieving 0.9072 accuracy. When there is only one poisoned organization, Zeno accuracy remains constant, unaltered by the attack. Figure 12 shows the confusion matrix of Zeno when one client is malicious. It can be appreciated how the performance decrease comes from the impossibility of classifying the second class, which is underrepresented in the scenario, as there are only 5 RPi2 in the dataset. When there are 2 or 3 poisoned organizations, Zeno performance drops once the poisoning rate reaches 75% and 100%, but it still manages to maintain an acceptable performance above 0.50, considering the degree of the attack.
Figure 13 compares the performance evolution of Zeno and coordinate-wise median with different numbers of poisoned organizations. As can be appreciated, median performance is higher in all scenarios until the poisoning percentage goes above 50%. After that, Zeno shows a better or equal performance in all cases, with the greatest difference when three organizations are completely malicious (60% malicious clients).
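The ranking idea can be illustrated with a simplified sketch. Following the general form of the Zeno score, each candidate is rated by how much it decreases a loss estimated by the aggregator, minus a penalty on the size of the update, and only the best-rated candidates are averaged. Adapting the score to local models instead of gradients, the names `zeno_aggregate`, `gamma`, `rho`, `n_suspected`, and the toy loss are all assumptions made for illustration.

```python
def zeno_aggregate(global_w, client_ws, loss_fn, n_suspected, gamma=1.0, rho=0.001):
    """Average the local models with the highest estimated loss descent."""
    base_loss = loss_fn(global_w)
    scored = []
    for idx, w in enumerate(client_ws):
        delta = [wk - g for wk, g in zip(w, global_w)]  # update proposed by the client
        candidate = [g + gamma * d for g, d in zip(global_w, delta)]
        # Score: estimated loss decrease minus a penalty on the update magnitude.
        score = base_loss - loss_fn(candidate) - rho * sum(d * d for d in delta)
        scored.append((score, idx))
    scored.sort(reverse=True)
    # Discard the n_suspected lowest-scoring clients and average the rest.
    kept = sorted(idx for _, idx in scored[: len(client_ws) - n_suspected])
    aggregated = [
        sum(client_ws[i][j] for i in kept) / len(kept) for j in range(len(global_w))
    ]
    return aggregated, kept


def toy_loss(w):
    """Hypothetical loss estimate: squared distance to an optimum at (1, 1)."""
    return sum((x - 1.0) ** 2 for x in w)


global_w = [0.0, 0.0]
client_ws = [[0.5, 0.5], [0.6, 0.4], [0.4, 0.6], [-2.0, -2.0], [0.5, 0.6]]
aggregated, kept = zeno_aggregate(global_w, client_ws, toy_loss, n_suspected=1)
```

Unlike Krum, the information of all retained clients is combined, while the client whose update increases the estimated loss is left out of the average.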

Conclusions and Future Work
In the present work, it has been demonstrated that it is possible to identify device models using only statistical data concerning the CPU execution time of the device. An MLP model has been obtained capable of identifying four RPi device models with a 99.99% accuracy rate. Besides, the effectiveness of the Federated Learning technique has been tested against centralized learning. For this setup, a scenario has been proposed where a total of 5 organizations aim to create a model capable of identifying the device models without sharing the actual data with each other. The resulting model has obtained identical performance in both cases, centralized and distributed, thus taking advantage of the benefits offered by Federated Learning: training a model that preserves data privacy and security, while maintaining the performance of the model obtained through a traditional approach. On the other hand, different aggregation algorithms have been tested in order to check which one best fits the proposed scenario when facing a label-flipping attack. Zeno has turned out to be the best performing aggregation method in the presence of attacks due to combining the Krum and mean aggregation methods. By selecting the m best models and aggregating them using mean aggregation, less information is lost than with Krum, while certain organizations that are considered malicious are ignored. Finally, the data collected for the previous experimentation has been made publicly available due to the lack of performance fingerprinting datasets focused on device identification and prepared for FL-based setups.
As future work, the efforts will be focused on experimentation with more types of device models and more complex scenarios, such as making each device a single client instead of grouping devices into organizations. On device identification, it is planned to focus on identifying individual devices with high accuracy and not just identifying device models, as well as testing other modes of identification by collecting data from hardware elements other than the CPU. It would also be interesting to poison the local model weights instead of the local data (model poisoning) or to experiment with other adversarial attack techniques such as evasion attacks, where the goal is to trick the model once it is trained and not to poison the training process.

Figure 1: Min, Max, Mean and Median feature distributions

Figure 3: Data division in the federated learning scenario.

Algorithm 1 defines the iterative training process for the model generation, assuming previous dataset normalization. Each client performs local updates of the model and returns them to the server for aggregation, and the process is then repeated for the desired number of rounds.

Algorithm 1: FederatedAveraging. The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, and η is the learning rate; w are the model weights, and each client k has its own local dataset [12].

The training was executed for 90 federated rounds, with one epoch per round. Figure 5 shows the evolution of the local validation accuracy for each one of the clients during the training process. It can be appreciated how the maximum performance is reached around round 50, and then the accuracy scores for each client keep oscillating between 0.95 and 1 until round 90.
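The round structure of federated averaging can be sketched as follows, with models represented as flat lists of weights and local models weighted by their number of training samples. The `local_update` function is a hypothetical stand-in for local training (a single toy gradient step), so the snippet only illustrates the aggregation logic, not the MLP training used in the experiments.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of local models, with weights proportional to local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def local_update(weights, data, lr=0.5):
    """Hypothetical local training: one gradient step towards the mean of the local data."""
    target = sum(data) / len(data)
    return [w - lr * 2.0 * (w - target) for w in weights]


def run_federated_training(initial_weights, client_data, rounds):
    """Each round: clients train locally from the global model, the server averages them."""
    global_w = list(initial_weights)
    for _ in range(rounds):
        local_models = [local_update(global_w, data) for data in client_data]
        sizes = [len(data) for data in client_data]
        global_w = federated_average(local_models, sizes)
    return global_w


# Toy example: two clients with 3 and 1 samples respectively.
client_data = [[1.0, 1.0, 1.0], [5.0]]
final_w = run_federated_training([0.0], client_data, rounds=3)
simple_avg = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

The weighting by dataset size means that organizations with more samples have more influence on the global model, which is also why heavily poisoned large clients can dominate plain averaging.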

Figure 12: Zeno confusion matrix with one malicious client.

Figure 13: Zeno and coordinate-wise median aggregation with different number of poisoned clients (X axis depicts poisoning percentage, Y axis depicts accuracy).

Table 1
Comparison of the most relevant model identification literature works.

Table 2
Devices employed in data collection.

Table 3
Dataset vector example for an RPi4 device. Values represent the time required to execute a function, expressed in microseconds.

Table 4
MLP architecture for model identification.