1 Introduction

Communication networks are crucial components of the underlying digital infrastructure in any smart city setup, and the identification of anomalies and intrusions in these networks is of paramount importance for providing the intended services to various stakeholders, such as public departments, enterprises, and citizens. These networks are heterogeneous, with large numbers of diverse sensing nodes collecting periodic data and transmitting them to various coordination and decision-making services. Integrating the data created in these networks into diverse platforms, providing stakeholder-specific views based on access rights, and arriving at intelligent conclusions based on the data are the key activities of any smart city framework.

The widespread usage of computer networks also brings many cyber security concerns, and every organization has to implement preventive measures to avoid compromising its valuable data and assets. In the growing landscape of cyber security threats, both organized and amateur attempts to access and jeopardize smart city infrastructure have become a serious concern for public authorities [1, 2]. One of the necessary protective mechanisms is the Intrusion Detection System (IDS), and this research area has received a lot of attention during the past decade [3]. An IDS is a software or hardware system that monitors the events occurring in a computer system or network and analyzes those events for signs of intrusion or violations of security policies [4].

In recent years, artificial intelligence techniques have been widely used in the field of network security, especially for Intrusion Detection (ID). Machine learning (ML) algorithms can learn from data how to distinguish between normal and abnormal activities, and this ability has proven very effective for the development of reliable IDSs [5,6,7]. We have been exploring efficient ML techniques for ID as part of two large EU projects, InSecTT and DAIS: the former provided us with a strong understanding of the applicability of various ML methods for ID, and in the latter we plan to apply them in a smart city application. During discussions with our industrial partners, we identified the following aspects as being of primary importance for any chosen approach:

  • Accuracy

  • Efficiency, in terms of the memory, computation and communication requirements

  • Privacy-preserving ability

Many studies that compared the performance of different ML algorithms on different benchmark datasets concluded that the Random Forest (RF) algorithm has the highest accuracy [8,9,10,11,12]. Like any other ML algorithm, RF requires a lot of data for training, so one of the main obstacles is ensuring the security and privacy of the provided data. Centralizing the locally collected data can raise various privacy and security concerns; these can be overcome by implementing a collaborative learning approach that does not require data sharing, called federated learning (FL) [13, 14]. FL is a decentralized learning technique that trains models locally on clients and transfers only the models to a centralized server [15, 16]. FL has three main alternatives, which differ in how data are distributed among the clients:

  • Vertical Federated Learning (VFL): Each client uses the same instances but has access to different features.

  • Horizontal Federated Learning (HFL): Each client uses the same features but has access to different instances.

  • Transfer Federated Learning (TFL): A combination of VFL and HFL where each client has limited access to both features and instances.

In the context of this paper, one instance is one network reading (e.g., a packet sent through the network), while one feature is a specific piece of information about that instance (e.g., protocol, duration).

Another concern is that the model itself can be attacked and sensitive data can be extracted from it. This can be addressed by using Differential Privacy (DP), a mechanism that provides a quantifiable measure of data anonymization by adding random noise during the training process [17]. In this way, an attacker cannot recover individual data records by inspecting the model.

In this paper, we use an FL framework based on RF that was previously proposed in [18]. The framework employs the HFL approach, and its main idea is to train independent RFs on the clients using local data, merge the independent models into a global one on the server, and send it back to the clients for further use. The framework was evaluated for attack detection on the most commonly used ID datasets (KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017). Its novelty lies in the provision of different alternatives for creating the global RF on the server for subsequent distribution to the different clients. This paper extends the framework by including DP in the different RFs. Additionally, the evaluation of the framework is extended by:

  • Evaluating different data division approaches.

  • Evaluating the framework performance for attack classification.

  • Evaluating the framework performance when using RF with DP for both attack detection and attack classification.

This paper is organized as follows. Section 2 presents the state of the art in federated learning, random forest in a federated learning setup, and differential privacy. Section 3 presents the main components of the used framework. The details of the datasets and preprocessing techniques used, as well as the experimental setup, are given in Section 4. In Section 5, we present the results of the conducted experiments, followed by the conclusion and plans for future work in Section 6.

2 Related work

2.1 Federated learning

Federated Learning (FL) [19] is a ML setting where computer devices learn a task in a collaborative way without sharing data with a centralised server. For example, ML algorithms can be trained across multiple devices and servers with decentralised data over multiple iterations [20]. FL is an iterative process of training a global ML model by aggregating a set of local ML models trained on multiple devices. In each training round, a set of devices is selected to receive the current version of the global model from the server. Then, each device trains a local version of the model on the locally present data and sends the updated version back to the server. The server aggregates all the local versions and repeats the process for another round until the target performance is attained. Various ML algorithms have been adapted to the FL setting [19, 21,22,23] in areas including industrial engineering, healthcare, computer vision, and finance.
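
To make the round structure concrete, the following minimal sketch implements this generic loop with a toy scalar "model"; all names are illustrative, not part of any specific framework. Real deployments train, e.g., neural networks or forests on the local data, and the aggregation rule is model-specific (weight averaging for neural networks, tree merging for forests as in Section 3).

```python
from statistics import mean

# Toy client whose "model" is a single scalar, used only to make the round
# structure concrete; a real client would train a forest or a neural network.
class Client:
    def __init__(self, data):
        self.data = data  # local data, never sent to the server

    def train_locally(self, global_model):
        # The client updates the received global model on local data and
        # returns only the updated model (here: the local mean).
        return mean(self.data)

def federated_training(clients, n_rounds, aggregate=mean):
    global_model = 0.0
    for _ in range(n_rounds):
        # 1. The server sends the current global model to the selected clients.
        local_models = [c.train_locally(global_model) for c in clients]
        # 2. The server aggregates the local models into a new global model.
        global_model = aggregate(local_models)
    return global_model

clients = [Client([1.0, 2.0]), Client([3.0, 5.0])]
print(federated_training(clients, n_rounds=2))  # -> 2.75
```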

The FL setting is a natural candidate for training and deploying network intrusion detection systems, as it reduces the computational load on the central server, reduces the required communication bandwidth (data remains local), and supports data privacy. Surveys of the various ML methods using FL for related network intrusion detection tasks are presented in [15, 24]. Most of the works presented in these two surveys use deep learning techniques such as Neural Networks [25], Convolutional Neural Networks [26, 27], Generative Adversarial Networks [28], or Recurrent Neural Networks [29].

In contrast to such gradient-based neural approaches, [30] demonstrated the implementation and efficiency of Gradient Boosted Decision Trees (GBDT) in an FL setting. In the same vein, [31] used a federated GBDT approach to solve a network intrusion detection task. According to the authors, one advantage of their algorithm over deep learning approaches is that GBDT is more interpretable while reaching similar predictive performance. For the same reasons, interpretability and scalability, we also opted for a tree-based approach in this work. Moreover, because of their inherent parallelism and high predictive performance, we consider Random Forests a strong candidate for FL in solving a network intrusion detection task.

In [32] and [33], federated versions of RF are applied to healthcare-related applications. In [32], the global RF performs a weighted combination of the forests trained locally, using the Matthews correlation coefficient (MCC) to boost local models with high classification performance in the final combination. In [33], the authors compared local models trained on incomplete information with a federated RF. Both studies showed better performance of the federated RF compared to local RF models.

2.2 Federated learning and privacy-preserving techniques

FL preserves data privacy by design, since raw data are never exposed, but private information can still be reconstructed from the local models that are sent to the server [34]. Several approaches have been proposed to handle this threat in FL by adding privacy techniques such as k-anonymity [35], data encryption [36], or differential privacy [37].

The goal of privacy techniques is to control what can be learned from the data, and several works have enhanced RF with a privacy layer. For example, [38] proposed a version of RF that uses anonymization methods on the data to preserve data privacy. Furthermore, [39] opted for the k-anonymity approach in RF for FL settings. Data encryption techniques offer a security layer by controlling and protecting access to the data. In [40], the authors incorporated a homomorphic encryption mechanism on the data in their implementation of an RF in FL, ensuring a data security property. In [41], the authors proposed a way to create a decentralized federated forest in an ID scenario based on the blockchain. Similar to these approaches, Homomorphic Encryption and Secure Multi-Party Computation mechanisms are used for privacy by [42] in the context of a federated version of Gradient Boosted Decision Trees. However, homomorphic encryption techniques increase the algorithm complexity and are computationally expensive.

Differential privacy offers a rigorous definition of privacy. Consider an algorithm that queries or analyzes data and computes statistics about them; Differential Privacy is applied to the algorithm's output. If, by looking at the output, one cannot tell whether any individual's private data were included in the original dataset, then the algorithm is differentially private [37]. This definition guarantees that private information about individuals in a dataset will not be leaked.

Differential Privacy provides a formal notion of privacy by adding calibrated noise to the parameters of the ML algorithm [43]. In [44], the authors surveyed how differential privacy can be added to different parts of ML algorithms, such as their inputs, outputs, and objective functions. In our work, we focus on applying Differential Privacy to tree-based methods, and in particular to Random Forests in the FL setting.

Fig. 1 Overview of a Decision Tree. The decision nodes perform a choice according to a data attribute; the leaf nodes output the number of data samples belonging to each class

2.3 Random forest with differential privacy

The survey by Fletcher [45] provides an overall understanding of tree-based approaches, such as Random Forest, with the Differential Privacy property. This work mainly focuses on how to design a Decision Tree that preserves privacy without decreasing its classification capabilities. In a Random Forest, and especially when a Decision Tree is built, the data is queried by the algorithm either to split a node (partitioning the data based on the best attribute) or to predict the class label of the data records in the leaves. Patil et al. [46] were the first to adapt the definition of Differential Privacy from a greedy Decision Tree to a Random Forest. Adding noise to Decision Tree outputs generally reduces the algorithm's accuracy; to overcome this, the authors proposed a hybrid Decision Tree algorithm that balances the privacy and the classification accuracy of the Random Forest based on Differential Privacy. Fletcher and Islam [47] focused on the Gini index, which is used while building Decision Trees, and based on it defined the quantity of noise added to make the forest differentially private. Precise control of the added noise allowed them to limit the accuracy loss implied by the Differential Privacy definition. A similar approach was used in [48], where the quantity of noise on the outputs and the tuning of the algorithm's parameters depend on the theory of Signal-to-Noise Ratios. All these approaches use the Laplacian method as the Differential Privacy mechanism.

An alternative way of applying Differential Privacy is to use the Exponential mechanism, as in [17]. Fletcher and Islam [17] proposed that the Random Forest's leaves return the majority class label instead of the class counts. A negative aspect of such an approach is that it reduces the learning ability; however, it makes the algorithm more private by design. The authors experimentally validated this approach, showing that the Random Forest remained accurate even after adding Differential Privacy.

Other approaches have been designed for making decision trees private, such as [49,50,51,52]. In [49], the authors proposed the use of permute-and-flip, which randomly chooses a value from a set of options given a weight and a privacy parameter; in expectation, this approach never performs worse than the Exponential mechanism. Sun et al. [50] combined several mechanisms for building the trees: the Exponential mechanism selects the split nodes and the Laplace mechanism adds noise to leaf nodes, resulting in a tighter use of the privacy budget. In the same vein, the authors of [52] combined an Exponential and a Laplace mechanism during tree construction: the Exponential mechanism protects the sensitive features, which are given as inputs to the Laplace mechanism that ensures the protection of the leaf nodes. However, they opted for a Gradient Boosted Decision Trees approach instead of the Random Forests preferred in our work. Li et al. [51] used Out-of-Bag Estimation, which perturbs the true number of data records used for building the tree.

Among the presented related works, the closest contributions to this paper are [32] and [33], which focus on healthcare applications and explore different merging approaches. In our work, we evaluate various approaches for combining local RF models for cybersecurity applications. In addition, the differential privacy property is added to our model, making it private by design and providing a countermeasure to potential data poisoning attacks [53]. To achieve this, we follow the work of [17], where differential privacy is achieved with the Exponential mechanism, giving strong privacy guarantees and high accuracy in practice. We extend their work from the centralised setting to a federated learning one.

Fig. 2 Overview of a Random Forest

3 Random Forest with differential privacy in a federated learning framework

3.1 Random forest

Random Forest (RF) combines the predictions of different Decision Tree (DT) algorithms into a final prediction [7]. DT is a ML algorithm used for classification [54,55,56] and/or regression [57]. In this paper, we focus only on classification, since that is the main objective of the intrusion detection research area. An example of a DT is presented in Fig. 1. As shown, a DT is formed by decision nodes and leaf nodes. A decision node takes the most relevant feature from the dataset that has not been used before and uses it as a condition to divide the dataset into subsets. If a node does not undergo further divisions, it is a leaf node that contains a final prediction. There are different methods to select the most relevant feature [58]; we use the following two:

  • gini - attempts to find and isolate the largest homogeneous class from the rest of the data. For this purpose, the Gini Index (GI) is calculated for all the different features. The GI for a feature F (denoted by GI(D|F)) is calculated as follows:

    $$\begin{aligned} GI(D|F) = \sum _{f \in F} \left( \frac{|D|F = f|}{|D|} \times GI(D|F = f)\right) \end{aligned}$$
    (1)
    $$\begin{aligned} GI(D|F = f) = 1 - \sum _{c \in C} P(C = c, F = f)^2 \end{aligned}$$
    (2)

    where D is the entire dataset, F is a certain feature, and f is a value that the feature takes. \(|D|F = f|\) is the number of instances within the dataset that take f as the value for feature F, while |D| is the number of instances in the entire dataset. Finally, \(P(C = c, F = f)\) is the probability of selecting the class c out of all classes C within the dataset D|F when selecting f as the value for F. After calculating the Gini Index of all the features, the one with the lowest value is selected as the parent node. Then, further divisions are performed following the same principle.

  • entropy - attempts to minimize the within-group diversity. For this purpose, this method calculates the information gain (IG) obtained by splitting the dataset into subsets using a certain feature F. This is done using entropy (E), which is calculated for all the different features. The information gain for a specific feature F (denoted by IG(D,F)) is calculated as follows:

    $$\begin{aligned} IG(D,F) = E(D) - E(D|F) \end{aligned}$$
    (3)
    $$\begin{aligned} E(D|F) = \sum _{f \in F} \left( \frac{|D|F = f|}{|D|} \times E(D|F = f)\right) \end{aligned}$$
    (4)
    $$\begin{aligned} E(D|F = f) = -\sum _{c \in C} \left( P(C = c, F = f) \times \log _{2} (P(C = c, F = f))\right) \end{aligned}$$
    (5)

    where D is the entire dataset, D|F stands for the dataset after splitting it by a certain feature F, f is a value that F takes, c is one class of all possible classes (C), and \(P(C = c, F = f)\) is the probability of selecting the class c out of all classes (C) within the dataset D|F when selecting f as the value for F. Finally, \(|D|F = f|\) represents the number of instances left after assigning the value f to the selected feature, while |D| is the size of the entire dataset. The feature with the highest IG is then selected as the parent node, and further divisions are made. A sketch illustrating both splitting criteria follows this list.
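
As an illustration of Equations (1)-(5), the following minimal sketch computes the Gini index and the information gain of a categorical feature on toy data; all names and the toy values are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    # Equation (2): 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def entropy(labels):
    # Equation (5): -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((cnt / n) * math.log2(cnt / n)
                for cnt in Counter(labels).values())

def split_score(values, labels, impurity):
    # Equations (1)/(4): impurity after splitting on a feature, weighted by
    # the relative size |D|F=f| / |D| of each resulting subset
    n = len(labels)
    score = 0.0
    for f in set(values):
        subset = [c for v, c in zip(values, labels) if v == f]
        score += len(subset) / n * impurity(subset)
    return score

# Toy data: a 'protocol' feature versus a normal/attack label
proto = ["tcp", "tcp", "udp", "udp", "icmp"]
label = ["normal", "attack", "normal", "normal", "attack"]
print(split_score(proto, label, gini))                     # GI(D|F), Eq. (1)
print(entropy(label) - split_score(proto, label, entropy)) # IG(D,F), Eq. (3)
```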

Fig. 3 Overview of a Decision Tree with Differential Privacy, where the leaf nodes output the majority class instead of the class counts. In that case, looking at the output, one cannot tell whether any private data of an individual was included in the original dataset, making the algorithm private

Multiple DTs are used to build an RF, as shown in Fig. 2. The number of DTs is one of the hyper-parameters of RF. Different subsets of data must be used to create these DTs, since using the same data would produce the exact same DT. The division into subsets is performed randomly, using a certain percentage of the entire dataset. In addition, the different DTs in an RF would normally use different features; however, we decided to keep all features in all trees.

The final step is to ensemble, or aggregate, the predictions of the different DTs into the final prediction given by the RF. In this paper, we use two different ensemble methods (a sketch of both follows the list):

  • Simple Voting (SV) - takes a majority vote as a predicted class.

  • Weighted Voting (WV) - takes a majority vote as the predicted class, but each DT's vote is weighted by its accuracy for the predicted class multiplied by the average of its accuracy over all classes.
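
To make the two schemes concrete, the following minimal sketch implements SV and one reading of the WV rule above, assuming the per-class validation accuracy of each DT is available; all names and the toy values are illustrative.

```python
from collections import defaultdict

def simple_vote(predictions):
    # SV: plain majority vote over the trees' predictions for one instance
    votes = defaultdict(float)
    for pred in predictions:
        votes[pred] += 1.0
    return max(votes, key=votes.get)

def weighted_vote(predictions, class_acc):
    # WV: each tree's vote is weighted by its accuracy for the predicted
    # class multiplied by its average accuracy over all classes
    votes = defaultdict(float)
    for tree_id, pred in enumerate(predictions):
        acc = class_acc[tree_id]  # dict: class -> validation accuracy
        avg_acc = sum(acc.values()) / len(acc)
        votes[pred] += acc[pred] * avg_acc
    return max(votes, key=votes.get)

preds = ["attack", "normal", "attack"]
accs = [{"normal": 0.9, "attack": 0.6},
        {"normal": 0.8, "attack": 0.7},
        {"normal": 0.5, "attack": 0.9}]
print(simple_vote(preds), weighted_vote(preds, accs))  # attack attack
```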

3.2 Differential privacy

Differential Privacy (DP) guarantees the privacy of individuals in datasets and prevents unauthorized extraction of private and sensitive information [37]. There are many ways of applying DP to ML models, often called mechanisms. Usually, a mechanism adds probabilistic noise to the output to make it differentially private. Several mechanisms focus on adding noise to the numerical outputs of the algorithms; the Laplacian mechanism, proposed by [59], is one of them. For example, in a decision tree, a leaf node predicts the class label of the data by returning class counts. The Laplacian mechanism ensures DP by altering these counts with noise sampled from the Laplace distribution, which is the case for the differentially private decision trees proposed in [46,47,48, 60].

An alternative is to use the Exponential mechanism [61], which provides an approximation of the best element from a set. For example, in tree-based algorithms, instead of returning class counts and adding noise to them, the goal is to return the approximate majority class, i.e., one of the classes with the highest counts. The probability of selecting an output z, denoted \(Pr(f(x) = z)\), is given in (6).

$$\begin{aligned} Pr(f(x) = z) \propto \exp \left( \frac{\epsilon \times u(z,x)}{2 \times \Delta (u)}\right) \end{aligned}$$
(6)

where \(u(z,x)\) represents the scoring function of the output z with respect to the data x, \(\Delta (u)\) refers to the sensitivity of u [59], which shows how much u can deviate when different inputs are used, and \(\epsilon \) is a parameter that controls the strength of the privacy guarantee.

In this paper, we followed the work done in [17], where the authors integrated the differential privacy property into RF using the Exponential mechanism with a smooth sensitivity function from [62]. Different values of \(\epsilon \) were evaluated, and it was concluded that a smaller value of \(\epsilon \) ensures higher privacy but drastically impacts the prediction performance. The same approach is applied in this paper, but in an FL-based setting; a sketch of the mechanism follows. Figure 3 shows a DT with DP.
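
A minimal sketch of this mechanism applied to a single leaf is given below, using the class count as the utility function u with (global) sensitivity 1; the smooth-sensitivity refinement of [17, 62] is omitted for brevity, and all names are illustrative.

```python
import math
import random

def exp_mechanism_leaf(class_counts, epsilon, sensitivity=1.0):
    # Equation (6): select a leaf label z with probability proportional to
    # exp(eps * u(z, x) / (2 * sensitivity)), where u(z, x) is the count of
    # class z in the leaf. Changing one record shifts a count by at most 1,
    # hence global sensitivity 1 ([17] uses a smooth sensitivity instead).
    labels = list(class_counts)
    weights = [math.exp(epsilon * class_counts[z] / (2 * sensitivity))
               for z in labels]
    return random.choices(labels, weights=weights, k=1)[0]

random.seed(0)
counts = {"normal": 40, "attack": 10}
print(exp_mechanism_leaf(counts, epsilon=1.0))   # almost surely "normal"
print(exp_mechanism_leaf(counts, epsilon=0.01))  # close to a coin flip
```

The leaf thus returns a (randomised) majority class rather than the class counts, which is exactly the behaviour depicted in Fig. 3.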

Fig. 4 Architecture of a Random Forest in a Federated Learning setup with differential privacy. There are N clients that train RFs locally and transfer them to the centralized server. The independent RFs are merged on the server to create a global RF, which is returned to the clients for further use. The green leaf nodes in the DTs represent leaf nodes with Differential Privacy, where the output is the majority class

3.3 Random Forest with differential privacy in a federated learning framework

In this paper, RF is used in an FL setup where each client receives data that are not available to the others. Independent RFs are trained on these clients and sent to a server, where a global RF is created as a combination of DTs from the clients. The decisions made by individual DTs are preserved and integrated into the global RF: each DT included in the federation contributes to the final decision based on its own classification outcome. With this setup, we use a horizontal approach [63], since the data from different clients have the same structure (the same number of features). The novelty of this framework lies in the inclusion of DP in the different RFs, as well as in the provision of different alternatives for creating the global RF on the server, to be later distributed to the different clients. An overview of the proposed framework can be found in Fig. 4.

In order to select the DTs to be included in the global RF, the performance of each DT is evaluated using two different methods:

  • Accuracy (A) - general accuracy of the DT on the validation set

  • Weighted accuracy (WA) - general accuracy of the DT (on the validation set) multiplied by the average accuracy of the same DT over all classes in the validation set. In this way, DTs that perform well across a larger number of classes are prioritized.

To perform this combination, different approaches were used to decide which DTs will be merged to create the global RF:

  • Global RF created by Sorting DTs per RF based on Accuracy (RF_S_DTs_A) - the DTs of each RF are sorted by accuracy, and the best ones from each RF are selected

  • Global RF created by Sorting DTs per RF based on Weighted Accuracy (RF_S_DTs_WA) - the DTs of each RF are sorted by weighted accuracy, and the best ones from each RF are selected

  • Global RF created by Sorting All DTs based on Accuracy (RF_S_DTs_A_All) - all DTs are pooled, sorted by accuracy, and the best ones are selected

  • Global RF created by Sorting All DTs based on Weighted Accuracy (RF_S_DTs_WA_All) - all DTs are pooled, sorted by weighted accuracy, and the best ones are selected

The maximum number of DTs (MaxDTs) that can be used for generating the global RF is the number of DTs per RF multiplied by the number of clients. The number of best DTs included in the global RF is a hyper-parameter that may vary from 1 to MaxDTs for RF_S_DTs_A_All and RF_S_DTs_WA_All, and from the number of clients to MaxDTs for RF_S_DTs_A and RF_S_DTs_WA. A sketch of these selection strategies is given below.
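
The following minimal sketch illustrates the four merging strategies under simplifying assumptions: each client reports its DTs together with their validation accuracy (A) and weighted accuracy (WA), and for the per-RF variants the requested number of trees is an exact multiple of the number of clients. All names are illustrative, not part of a specific library.

```python
def merge_global_rf(client_forests, n_trees, score="A", per_client=True):
    # client_forests: one list per client, of tuples (tree, acc, weighted_acc)
    key = (lambda t: t[1]) if score == "A" else (lambda t: t[2])
    if per_client:
        # RF_S_DTs_A / RF_S_DTs_WA: take the best n_trees // n_clients DTs
        # from each client's forest
        per_rf = n_trees // len(client_forests)
        chosen = []
        for forest in client_forests:
            chosen += sorted(forest, key=key, reverse=True)[:per_rf]
    else:
        # RF_S_DTs_A_All / RF_S_DTs_WA_All: pool all DTs and keep the
        # globally best n_trees
        pooled = [t for forest in client_forests for t in forest]
        chosen = sorted(pooled, key=key, reverse=True)[:n_trees]
    return [tree for tree, _, _ in chosen]

c1 = [("dt1", 0.90, 0.80), ("dt2", 0.85, 0.88)]
c2 = [("dt3", 0.95, 0.70), ("dt4", 0.60, 0.65)]
print(merge_global_rf([c1, c2], n_trees=2, score="A", per_client=False))
# -> ['dt3', 'dt1']
```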

The global RF created on the server is returned to the clients for future use, which provides the clients with more knowledge without the need to share data.

Table 1 Basic information about KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 datasets

4 Experiments

4.1 Datasets

The experiments were conducted on four publicly available datasets that are among the most frequently used in the ID research area [5]. To create these datasets, network traffic was recorded during normal behavior and during different simulated network attacks. The traffic was recorded in the form of network packets and pre-processed to create the features. Each packet is characterized by certain features and labeled as normal or as some type of network attack. More information about each dataset is provided in Table 1.

4.2 Datasets pre-processing

From each of the datasets described in the previous section, we selected a certain part for the experiments (the last column in Table 1), and for CIC-IDS-2017 we removed the instances belonging to classes with fewer than 800 cases. All features from the original datasets were used, except for CIC-IDS-2017, where two features were removed (flow of bytes and flow of packets per second). The original features were pre-processed depending on their type: numeric features were normalized to a range between 0 and 1 using the min/max approach, categorical features were one-hot encoded, and binary features were left unchanged. The output label was encoded into numerical values for attack classification, while for attack detection normal instances were labeled with 0 and all others with 1. A sketch of this pre-processing is given below.
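
The following is a minimal sketch of the described pre-processing with pandas and scikit-learn; the column names and toy values are illustrative, and in a full pipeline the scaler would be fitted on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df, numeric_cols, categorical_cols, label_col):
    df = df.copy()
    # Numeric features: min/max normalisation to [0, 1]
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
    # Categorical features: one-hot encoding (binary features stay as-is)
    df = pd.get_dummies(df, columns=categorical_cols)
    # Attack detection label: 0 for normal traffic, 1 for any attack
    df["binary_label"] = (df[label_col] != "normal").astype(int)
    # Attack classification label: one integer code per traffic class
    df["class_label"] = df[label_col].astype("category").cat.codes
    return df.drop(columns=[label_col])

toy = pd.DataFrame({"duration": [0.0, 2.0, 8.0],
                    "protocol": ["tcp", "udp", "tcp"],
                    "label": ["normal", "dos", "normal"]})
print(preprocess(toy, ["duration"], ["protocol"], "label"))
```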

The datasets were divided into training, validation, and testing sets with a 70%-10%-20% distribution, and then split into subsets. One feature was used as the division criterion for the HFL setup: "protocol" for KDD and NSL-KDD, "service" for UNSW-NB15, and "destination port" for CIC-IDS-2017. Subsets with fewer than 50 normal or malicious instances were not used. This process resulted in 3, 3, 6, and 14 subsets for the KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 datasets, respectively; a sketch of the division follows.
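
A minimal sketch of this feature-based division is given below, assuming a pandas DataFrame like the one produced by the pre-processing sketch above (with a binary_label column and the division feature among the columns); the function name is illustrative.

```python
import pandas as pd

def split_into_clients(df, division_feature, min_instances=50):
    # One candidate subset per distinct value of the division feature
    # (e.g. "protocol" for KDD/NSL-KDD, "service" for UNSW-NB15,
    # "destination port" for CIC-IDS-2017)
    subsets = []
    for _, subset in df.groupby(division_feature):
        n_normal = int((subset["binary_label"] == 0).sum())
        n_malicious = int((subset["binary_label"] == 1).sum())
        # Subsets with fewer than 50 normal or malicious instances are dropped
        if n_normal >= min_instances and n_malicious >= min_instances:
            subsets.append(subset)
    return subsets  # one DataFrame per simulated client
```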

Summary information after performing HFL division and pre-processing is given in Table 2.

Table 2 Information and distribution of the used instances for KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 datasets after performing HFL division and pre-processing

4.3 Experimental setup

Four different experiments were performed on the pre-processed data, for two different problems: Attack Detection (AD) and Attack Classification (AC). The algorithms were implemented in Python, using the scikit-learn [68] implementation of the DT classifier. Differential privacy was implemented using the IBM differential privacy library, diffprivlib [69]. The code is publicly available on GitHub.

An explanation of each experiment (EXP) is given below.

  • EXP 1 - Selection of RF hyper-parameters: This experiment was conducted before splitting the datasets into subsets, with the goal of finding the best combination of RF hyper-parameters for each dataset and problem. The tested hyper-parameters include the number of DTs (odd numbers between 1 and 100), the splitting rule (gini or entropy), and the ensemble method (SV or WV). The best combination discovered in this experiment was used as the RF setup in all subsequent experiments for the specific problem on the specific dataset.

  • EXP 2 - Evaluation of independent RFs on different clients: For each client, an independent RF was trained on data from its subset, using the best combination of hyper-parameters from EXP 1. The number of clients corresponds to the number of subsets in each dataset, shown in column S. of Table 2. Different methods of obtaining the subsets were tested in this experiment:

    • EXP 2.1 - Subsets obtained using a specific feature as the division criterion, as explained in Section 4.2.

    • EXP 2.2 - Subsets obtained using random division of data among clients, such that each client gets the same amount of data.

    • EXP 2.3 - Subsets obtained using random division of data among clients, such that each client gets the same amount of data as in EXP 2.1.

    For EXP 2.1, two testing options were considered: the RFs were tested on the data from their own subsets, and the RFs were tested on the entire testing set. For EXP 2.2 and EXP 2.3, the RFs were tested on the entire testing set.

  • EXP 3 - Global RF based on Federated Learning: Independent RFs were combined into a global one using four different merging methods (RF_S_DTs_A, RF_S_DTs_WA, RF_S_DTs_A_All, RF_S_DTs_WA_All) and a varying number of DTs. The global RF was tested on the entire testing set, and its performance was compared with that of the independent RFs on the entire testing set.

  • EXP 4 - Global RF with differential privacy based on Federated Learning: An independent RF with differential privacy was trained for each client on data from its subset (with respect to the division criterion) and tested on the entire testing set. Four different values of the \(\epsilon \) parameter were tested: 0.1, 0.5, 1, and 5. After that, the independent RFs were combined into a global one using the combination of merging method and number of DTs that performed best in EXP 3 for the specific problem on the specific dataset. The global RF was tested on the entire testing set, and its performance was compared with that of the independent RFs with differential privacy.

The performance of the ML algorithms was measured using two metrics: accuracy and F1 score [70]. The sketch below illustrates the per-client DP training and evaluation loop.
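
As a minimal, hedged sketch of the per-client training in EXP 4, the following uses diffprivlib's differentially private random forest on synthetic stand-in data; the exact constructor arguments may differ between library versions, and the real experiments use the pre-processed ID datasets plus the merging step of Section 3.3.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from diffprivlib.models import RandomForestClassifier  # IBM diffprivlib

# Synthetic stand-in for one client's subset (the real experiments use the
# pre-processed ID datasets described in Section 4.2)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for epsilon in (0.1, 0.5, 1, 5):  # privacy budgets tested in EXP 4
    # Without explicit data bounds, diffprivlib infers them from the data
    # and emits a privacy warning; fixed bounds should be used in production
    clf = RandomForestClassifier(n_estimators=25, epsilon=epsilon)
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    print(epsilon, accuracy_score(y_te, y_pred),
          f1_score(y_te, y_pred, average="weighted"))
```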

Fig. 5 EXP 1 - Selection of RF hyper-parameters: Accuracy of RF for AD on the validation set of (a) KDD, (b) NSL-KDD, (c) UNSW-NB15, and (d) CIC-IDS-2017, for different combinations of hyper-parameters. Note that the Y-axis ranges from the minimum to the maximum accuracy on each dataset

Fig. 6 EXP 1 - Selection of RF hyper-parameters: Accuracy of RF for AC on the validation set of (a) KDD, (b) NSL-KDD, (c) UNSW-NB15, and (d) CIC-IDS-2017, for different combinations of hyper-parameters. Note that the Y-axis ranges from the minimum to the maximum accuracy on each dataset

5 Results

As explained in Section 4.3, the results are divided into four sections. The selection of the hyper-parameters is given in Section 5.1. Section 5.2 explores different divisions of the datasets among clients and the performance of the independent RFs. In Section 5.3, a global RF is formed by combining trees from the RFs trained on the clients and compared with the individual RFs. Lastly, in Section 5.4, we repeat the experiments of Sections 5.2 and 5.3, but with DP added to the different DTs.

5.1 EXP 1 - Selection of RF hyper-parameters

As stated in Section 3, there is one important hyper-parameter in DT (the splitting rule) and two in RF (the number of trees and the ensemble method). The results for combinations of the three can be found in Fig. 5 for AD and in Fig. 6 for AC. The first thing to note is that the differences between the methods are minimal: the biggest difference is 0.7 percentage points, on UNSW-NB15, between the worst and the best combination, independent of the problem being solved.

In the case of AD (Fig. 5), the curves show that entropy is the best splitting method for KDD, UNSW-NB15, and CIC-IDS-2017, while gini is better on NSL-KDD. With respect to the ensemble method, there are no big differences, as the curves with the same splitting rule cross constantly; there are just two exceptions, gini_WV on KDD and entropy_WV, where the results are worse. If AC is considered (Fig. 6), entropy is clearly better on all datasets. With respect to the ensemble method, the same holds as for AD: there is no clear advantage of one method over the other. Finally, with respect to the number of trees, there is a clear improvement up to 15-30 trees (depending on the dataset and problem type), but the performance does not improve much beyond that.

A summary of the best combinations of hyper-parameters selected for each dataset and problem type is given in Table 3. These values are used in the rest of the experiments.

Table 3 EXP 1 - Selection of RF hyper-parameters: Best combination of hyper-parameters in RF, per dataset

5.2 EXP 2 - Evaluation of independent RFs on different clients

In this subsection, we divide the data among the clients in three different ways: according to a specific feature, as explained in Table 2 (EXP 2.1); randomly, with the same number of instances per client (EXP 2.2); and randomly, but with the same number of instances per client as in EXP 2.1 (EXP 2.3).

Table 4 EXP 2.1 - Evaluation of independent RFs on different clients using a specific feature as the division criterion: Performance of independent RFs for AD and AC on different subsets of the KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 datasets
Table 5 EXP 2.2 and EXP 2.3 - Evaluation of independent RFs on different clients using randomly generated subsets: Performance of independent RFs for AC on different subsets of the KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 datasets
Fig. 7 EXP 3 - Global RF based on Federated Learning: AD accuracy of the global RF on the testing set of (a) KDD, (b) NSL-KDD, (c) UNSW-NB15, and (d) CIC-IDS-2017 using four different merging methods. Note that the Y-axis ranges from the minimum to the maximum accuracy on each dataset

Fig. 8 EXP 3 - Global RF based on Federated Learning: AC accuracy and F1 score of the global RF on the testing set of (a)(b) KDD and (c)(d) NSL-KDD using four different merging methods. Note that the Y-axis ranges from the minimum to the maximum value of both accuracy and F1 score on each dataset

In EXP 2.1, one RF was trained per client using different data, as explained above. These RFs were then tested on two different types of testing sets: (1) a testing set containing only information from the specific subset, or (2) the entire testing set, independent of the subset. The results of EXP 2.1 can be found in Table 4. It can be observed that, regardless of whether it is AD or AC, and regardless of the dataset, the accuracy of the different RFs is higher when they are tested on testing data belonging to the same subset they were trained on than when they are tested on the entire testing set. This means that there is information the RF is missing and will not be able to classify.

The above statement is corroborated by EXP 2.2 and EXP 2.3. In these experiments, the data is divided randomly, meaning that no specific feature value is used to divide the dataset among the clients. The performance of the independent RFs is shown in Table 5. We can observe that the performance of RF is very high in both experiments, for both AC and AD, and independent of the dataset. This happens because the different clients had access to the whole range of data and did not miss any information. This strengthens our case for creating a global RF on the server, where the information of the different clients is shared without compromising the data by sending it to the server through the network.

5.3 Experiment 3 - Global RF based on federated learning

The goal of this experiment was to find the best combination of hyper-parameters for the global RF built on the server. We evaluated the four merging methods explained in Section 3.3 (RF_S_DTs_A, RF_S_DTs_WA, RF_S_DTs_A_All, RF_S_DTs_WA_All) and different numbers of DTs. The evaluated numbers of DTs include every number from 1 to MaxDTs for the two methods that pool all DTs, and multiples of the number of clients up to MaxDTs for the two per-RF methods. The only exception is the CIC-IDS-2017 dataset, where the maximum number of evaluated DTs was 500. The performance of the global RF was tested for AD and AC on the entire testing set for all four datasets.

Figure 7 presents a comparison of the accuracy of the global RF for AD on all four datasets. We can see that the methods that pool all DTs before selecting the best ones for the global RF achieve higher accuracy on three datasets (KDD, NSL-KDD, UNSW-NB15); only on the CIC-IDS-2017 dataset do the methods that select the best DTs from each RF perform better. Regarding the sorting measure, there is no big difference between A and WA, except on KDD, where using A gives a considerable improvement (around 30 percentage points).

Fig. 9 EXP 3 - Global RF based on Federated Learning: AC accuracy and F1 score of the global RF on the testing set of (a)(b) UNSW-NB15 and (c)(d) CIC-IDS-2017 using four different merging methods. Note that the Y-axis ranges from the minimum to the maximum value of both accuracy and F1 score on each dataset

Table 6 EXP 3 - Global RF based on Federated Learning: The best combination of parameters for global RF for AD for KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 dataset

Figures 8 and 9 show the accuracy and F1 score of the global RF for AC on all four datasets. For KDD, RF_S_DTs_WA_All obtains considerably better results than the other three methods when a lower number of DTs is used. An interesting observation for NSL-KDD is that RF_S_DTs_A_All generally performs better than RF_S_DTs_WA_All, but RF_S_DTs_WA_All has one peak where it outperforms all the others, beating RF_S_DTs_A_All by around 5 percentage points. For UNSW-NB15, RF_S_DTs_WA_All outperforms all the others, while for CIC-IDS-2017 all methods have similar accuracy. Regarding the F1 score, it is not considerably lower than the accuracy on any dataset except CIC-IDS-2017, where the difference is around 10 percentage points.

The best combination of the number of DTs in the global RF and the merging method per dataset, together with the accuracy and F1 score achieved with this combination, are given in Table 6 for AD and Table 7 for AC. If more than one combination resulted in the same accuracy, the following criteria were applied to select the best one:

  1. the one with the highest F1 score was selected;

  2. if the F1 score was also the same, the one that achieved this performance with the fewest DTs was selected;

  3. if the number of DTs was also the same, the fastest method was selected.

Table 7 EXP 3 - Global RF based on Federated Learning: The best combination of parameters for global RF for AC for KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017 dataset
Table 8 EXP 3 - Global RF based on Federated Learning: Comparison of maximum, minimum, and average accuracy of independent RFs against the accuracy of global RF on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AD
Table 9 EXP 3 - Global RF based on Federated Learning: Comparison of maximum, minimum, and average accuracy of independent RFs against the accuracy of global RF on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AC
Table 10 EXP 3 - Global RF based on Federated Learning: Comparison of maximum, minimum, and average F1 score of independent RFs against the F1 score of global RF on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AC
Table 11 EXP 4 - Global RF with differential privacy based on Federated Learning: Comparison of maximum, minimum, and average accuracy of independent RFs against the accuracy of global RF, both options with DP, on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AD

The global RFs were compared with the independent RFs on the entire testing set; the results are presented in Table 8 for AD and Table 9 for AC. As performance measures for the independent RFs, we use the maximum, average, and minimum accuracy over all independent RFs. For AD, we can see that the global RF improves on the maximum accuracy of the individual RFs for KDD and NSL-KDD, and comes very close for UNSW-NB15. For CIC-IDS-2017 it falls behind the maximum, but performs better than the average accuracy. For AC, the global RF improves on the maximum accuracy of the individual RFs for three out of four datasets; only for CIC-IDS-2017 does it fall behind the maximum, though it is very close to the average. The same conclusions hold if the F1 score is considered (Table 10).

Table 12 EXP 4 - Global RF with differential privacy based on Federated Learning: Comparison of maximum, minimum, and average accuracy of independent RFs against the accuracy of global RF, both options with DP, on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AC
Table 13 EXP 4 - Global RF with differential privacy based on Federated Learning: Comparison of maximum, minimum, and average F1 score of independent RFs against the F1 score of global RF, both options with DP, on the entire testing set of KDD, NSL-KDD, UNSW-NB15 and CIC-IDS-2017 dataset for AC

5.4 Experiment 4 - Global RF with differential privacy based on federated learning

In this section, we test how our proposed setup is affected by adding DP to the different DTs. The results can be found in Tables 11 and 12, for AD and AC, respectively. Firstly, we test how the hyper-parameter \(\epsilon \) affects the performance of the algorithm. With the tested values, \(\epsilon \) does not have a big impact on the results, except for NSL-KDD, where larger differences can be noticed for both AD and AC.

Secondly, when the trees are combined into a global RF, the performance of the global RF is better than the performance of the best RF on the clients, except for CIC-IDS-2017 and NSL-KDD in AC, where the performance is nevertheless close. In addition, for CIC-IDS-2017 in AD, the results of the best independent RF and the global one are the same or very similar. If the F1 score is considered for AC (Table 13), the same conclusions hold, except for KDD, where the performance of the global RF with DP does not exceed the maximum but comes close.

If, instead of comparing with the best performance, we compare with the average, the global RF is better than the individual ones in all cases except CIC-IDS-2017 in AC, where the results are very close to each other.

Lastly, comparing the results without DP (Tables 8 and 9) and with DP (Tables 11 and 12), we can see that adding DP decreases the performance of RF on all datasets for AC. For AD, the same happens on NSL-KDD and UNSW-NB15, while on KDD and CIC-IDS-2017 the RF performance improves when DP is added. These two datasets have the most instances, which may indicate that adding noise results in a more general tree, which is useful in this case.

6 Conclusion

This paper extends the evaluation of the previously proposed federated learning framework based on random forest by adding differential privacy to the random forest, as well as by performing experiments for both attack detection and attack classification. The experiments were conducted on four well-known intrusion detection datasets: KDD, NSL-KDD, UNSW-NB15, and CIC-IDS-2017.

The results have shown that combining independent RFs into a global one on the server outperforms the average accuracy of the RFs on the clients for both AD and AC. Additionally, we conclude that adding differential privacy to random forest can penalize the performance considerably in some cases. However, comparing the global random forest on the server with the independent random forests on the clients, the accuracy can be improved even when differential privacy is used.

The proposed framework is recommended for applications where the data cannot be centralized and the goal is to apply AI while protecting the data as much as possible. It has also been shown that the framework is beneficial in cases where the model can be attacked or accessed without authorization, so that differential privacy has to be implemented as an additional protection mechanism to prevent the extraction of data from the model. An example of such an application is AI-based healthcare solutions that use patients' personal medical data to identify global outbreaks of emerging pandemics. If the anonymity of local models can be ensured, more individuals and regions might be willing to share their data, which can greatly support faster diagnosis, detection, and control of the spread of such diseases. Collaborative manufacturing, smart cities, and intelligent systems of systems from multiple (even mutually competing) vendors also demand privacy preservation and selective sharing of local models.

The main challenge in practical implementations lies in minimizing the overheads associated with differential privacy. Communication overhead and scalability issues may also arise, particularly in scenarios involving a large number of participating entities with complex internal organisation and demanding data privacy policies. Additionally, the limited computational resources of the entities can influence the selection of hyper-parameters, which in turn affects the model performance, and hence the feasibility and applicability of the framework in a given context.

As future work, we plan to evaluate the proposed framework in different real-world scenarios where decentralized learning is required. This evaluation will provide insights into its practical applicability and scalability. Additionally, we aim to extend the framework to support vertical federated learning, which will enhance its applicability in scenarios where different feature spaces are used across different entities. Furthermore, we plan to add a combination of the vertical and horizontal approaches to address data access limitations, ensuring applicability in scenarios with diverse data privacy concerns. Finally, we aim to evaluate the framework on actual devices to measure memory requirements, response time, and other performance metrics, providing insights into its practical effectiveness and areas for optimisation.