1 Introduction

Many automated decision-making systems have been proposed to supplement humans in critical, morally consequential application areas, including fraud detection, criminal re-conviction assessment, credit risk assessment, disease diagnosis, and recruitment (Dobbe et al., 2021). However, the practical application of such Machine Learning (ML) methods has raised many concerns regarding their fairness, auditability, privacy preservation, and transparency (Emelianov et al., 2022).

Due to the escalating interest of the research community in issues of fairness and trustworthiness of learning algorithms, a substantial body of work already exists in this domain (Calders et al., 2009; Chakraborty et al., 2021; Hajian et al., 2015; Iosifidis & Ntoutsi, 2020; Kamiran & Calders, 2009, 2012; Kamiran et al., 2012; Zhang et al., 2019). However, real-world applications such as stock market platforms, e-commerce websites, and telemedicine web platforms rely on real-time distributed data streams. These data streams evolve continuously, and the statistical dependencies within the data also change over time (concept drift) (Liu et al., 2017). Concept drift, if not tackled properly, degrades the predictive performance of the model. Massive collections of streaming data therefore necessitate fair, efficient, and concept drift-aware data mining algorithms that generate non-discriminatory and high-quality predictions. Recent years have witnessed a few studies that focus on detecting and mitigating discrimination embedded in streaming data in a centralized environment (Iosifidis et al., 2019; Iosifidis & Ntoutsi, 2020; Zhang et al., 2019). However, centralized access to large volumes of continuously arriving data is a prerequisite for training such conventional stream learning models.

With the ubiquitous use of computing devices, data is growing exponentially and in a distributed fashion. Collecting such large volumes of heterogeneous data on a centralized server raises many challenging concerns, such as limited communication bandwidth, network connectivity issues, and substantial storage costs (Zhang et al., 2020). Furthermore, recent legal developments, such as the General Data Protection Regulation (GDPR) (Commission et al., 2016), have made societies more privacy-oriented, rendering data aggregation techniques non-viable (Misselhorn, 2020). For example, automatic diagnosis-based telemedicine web platforms enable monitoring of remote patients’ vital signs with real-time data streams. Each patient’s local data can be useful for better diagnosis of other patients with similar conditions. However, a patient’s diagnostic data cannot be shared with other medical professionals or patients because of privacy concerns (Commission et al., 2016). Under the new normal of such pervasive data privacy concerns and continuously growing decentralized data silos, a viable alternative to traditional online ML methods is to design their federated adaptations. Federated Learning (FL) is an emerging decentralized ML paradigm that provides privacy guarantees by offloading model training to the distributed devices (clients) that own the original data. FL enables a multitude of distributed devices to collaboratively train a single shared ML model by exchanging model parameters without revealing their private information.

A plethora of research in the field of FL systems focuses exclusively on improving server-side performance, for example, protecting the FL system from adversarial attacks (Mothukuri et al., 2021), adapting the FL system to process non-independent and identically distributed (non-IID) data (Ma et al., 2022), and reducing the communication costs involved in the FL system’s optimization (Mills et al., 2019). There are also some works that ensure fairness in FL systems, including fairness in client selection procedures (Yang et al., 2021) and incentive distribution (Yu et al., 2020). However, little to no attention has been paid to ensuring fairness in the predictions of an FL system while improving or maintaining predictive performance in a stream learning environment.

In this work, we propose federated adaptation for fairness and concept drift aware stream classification. The key contributions of our work are as follows:

  • We propose a novel adaptation of the Federated Learning framework that mitigates discrimination while simultaneously handling concept drifts and improving predictive performance in a stream learning environment.

  • We propose a novel adaptive data augmentation technique for discrimination mitigation.

  • In an FL setup, data is not available on a centralized server and can therefore be non-independent and identically distributed (non-IID). Using real-world datasets [Bank (Bache & Lichman, 2013), Default (Bache & Lichman, 2013), Adult Census (Bache & Lichman, 2013), Law School (Wightman, 1998)], we show that even with non-IID data, FAC-Fed converges within a reasonable number of communication rounds.

  • We scrutinize the effectiveness of our proposed model by performing extensive experiments with a range of publicly available datasets: Bank Marketing (Bache & Lichman, 2013), Default (Bache & Lichman, 2013), Adult Census (Bache & Lichman, 2013), and Law School (Wightman, 1998). To the best of our knowledge, this is the first attempt towards fairness and concept drift-aware federated adaptation for stream classification; therefore, we demonstrate the superiority of our proposed framework by comparing the results of the centralized version of FAC-Fed with a range of centralized state-of-the-art stream classification baselines: FABBOO (Iosifidis & Ntoutsi, 2020), FAHT (Zhang et al., 2019), and CSMOTE (Bernardo et al., 2020).

2 Related work

Our literature review covers four research domains: fairness-aware learning, fairness-aware stream learning, federated learning, and fairness-aware federated learning.

2.1 Fairness-aware learning

Recently, ML-based methods for identifying and subsequently eliminating bias and discrimination have gained considerable attention. These techniques can be categorized into pre-processing, in-processing, and post-processing approaches.

2.1.1 Pre-processing techniques

Learner outcomes are significantly influenced by training data. There is a substantial likelihood that the learner will make biased predictions if the training data is biased. The literature contains a number of pre-processing methods that aim to provide solutions to fairness issues by manipulating the training data. The most basic pre-processing techniques include massaging (Kamiran & Calders, 2009), reweighting (Calders et al., 2009), preferential sampling (Kamiran & Calders, 2012), and Synthetic Minority Oversampling Technique (SMOTE) inspired fairness-aware upsampling (Chakraborty et al., 2021). However, completely unbiased training data can sometimes lead to biased predictions of the learner because the pre-processing approaches are not able to account for the bias introduced by the learner itself (Zhang et al., 2018).

2.1.2 In-processing techniques

These techniques tailor the classification model to generate fair outcomes. For example, Zhang et al. (2018) proposed an adversarial network to mitigate bias, in which the adversary tries to identify the relationship between a sensitive attribute and the predictor’s outcome, while the predictor aims to optimize performance while deceiving the adversary. Furthermore, Zafar et al. (2019) and Padala and Gujar (2020) incorporated fairness constraints into the learner’s objective function to achieve fairness. Another strategy, based on adaptive reweighting of training instances, was introduced by Iosifidis and Ntoutsi (2019).

2.1.3 Post-processing techniques

These methods tweak the classifier’s decisions to mitigate bias. For example, Kamiran et al. (2010) ameliorated discrimination by relabeling the leaves of a decision tree model, and Kamiran et al. (2012) proposed decision-theory-based solutions for discrimination-free classification. Another post-processing method removed discrimination by processing the fair patterns with k-anonymity (Hajian et al., 2015).

2.2 Fairness-aware stream learning

These types of learning techniques provide solutions to fairness issues in a stream learning environment. Iosifidis et al. (2019) proposed a chunk-based pre-processing technique to achieve fairness goals. A decision tree-based technique, FAHT (Fairness Aware Hoeffding Tree) (Zhang et al., 2019), resolves fairness issues in data streams by considering fairness gain along with information gain in the splitting criterion of the decision tree. FABBOO (Iosifidis & Ntoutsi, 2020) is another decision tree-based method, which adjusts the decision boundaries to achieve fairness. However, FABBOO and FAHT fix the role of the sensitive group across the whole stream; therefore, they cannot deal with reverse discrimination, i.e., discrimination towards the privileged group.

All of these methods for reducing discrimination in stationary and non-stationary environments are based on the standard ML assumption that the learner has access to the complete training data. However, this assumption does not hold in FL settings.

2.3 Federated learning

Federated Learning (FL) (McMahan et al., 2017) was proposed as a decentralized solution in which clients share model updates, in the form of weights or gradients, during the optimization process instead of their local data, thereby protecting the clients’ privacy rights. This paradigm of ML brings many challenges, such as privacy leakage, limited communication bandwidth, handling non-IID data among distributed clients, and improving the clients’ personalization experience. Several research works have been presented to overcome these challenges. For example, Bonawitz et al. (2017) and Papernot et al. (2016) proposed methods to avoid privacy leakage in FL systems by either encrypting the clients’ training parameters or adding differential privacy noise to the exchanged training parameters.

Mills et al. (2019) proposed a distributed form of the Adam optimization algorithm to reduce the number of communication rounds, achieving optimal accuracy in fewer rounds. There are also other research works that deal with the problem of limited communication bandwidth in a federated setup (Abdellatif et al., 2022; Paragliola, 2022).

Zhu et al. (2021) investigated the impact of non-IID data on the classification performance of FL clients and found that accuracy drops significantly with non-IID data. To overcome this problem, several works have been presented (Fisichella et al., 2022; Singh et al., 2023; Wei et al., 2022; Yang et al., 2021; Younis & Fisichella, 2022).

Liu et al. (2021) proposed a method that improves personalization in an FL framework through cooperation among similar clients. A similar federated adaptation for improving the personalization experience of clients was proposed by Wu et al. (2021).

2.4 Fairness-aware federated learning

Only a few studies have been conducted in this research area. Some works in the literature address fairness issues in FL; however, they focus exclusively on ensuring fairness in client selection procedures (Huang et al., 2020; Yang et al., 2021) and incentive distribution (Zeng et al., 2020; Zhang et al., 2020, 2022; Yu et al., 2020). Ensuring fairness in the outcomes of an FL framework remains underexplored. Among the few exceptions, FairFL (Zhang et al., 2020) provides a deep reinforcement learning framework to reduce demographic bias (statistical parity) while respecting the clients’ privacy constraints. A gradient-based approach is presented by Cui et al. (2021) that provides fairness guarantees along with a consistent Pareto utility distribution across all clients. Agnostic-Fair (Du et al., 2021) is another fairness-aware FL framework which reduces discrimination by adding regularization terms to the learning model that reweight the training samples. All of these works focus solely on mitigating discrimination in a static learning environment.

To the best of our knowledge, our work is the first attempt towards a federated adaptation of fairness and concept drift-aware stream classification. We propose an in-processing technique to mitigate discrimination by adaptively augmenting each client’s local data within a defined window of instances in a streaming environment. Our proposed method not only reduces the biases embedded in the clients’ data, but also achieves high balanced accuracy without sharing any sensitive information with the server except the clients’ model updates.

Fig. 1 Conceptual model for federated adaptation of online fairness and concept drift-aware stream classification framework

3 Conceptual model

Figure 1 represents the conceptual model underpinning the proposed method. In this model, each client hosts a data stream, a concept drift detector, a local learner (a deep neural network), a discrimination detector, and a discrimination mitigation module. In each communication round, every client trains its local learner and tries to mitigate discrimination embedded in the streaming data while simultaneously accounting for concept drifts in the stream. The updated local learner weights are then shared with the global server, which averages the aggregated local learner weights. The updated global learner weights are then shared with the selected set of clients in the next communication round.

4 Preliminaries

We first define some notation before presenting the proposed methodology. Suppose we have n local clients (\(C_1,C_2, \ldots , C_n\)) in an FL environment and a global server G. Each client has its own local streaming dataset \(d_k\) with feature space X and output space Y. Each instance in the streaming dataset \(d_k\) of client \(C_k\) is defined as \(f^{k}_{j}=\{x_j,y_j\}\). We consider a binary classification problem, i.e., \(Y \in \{0,1\}\), because it is a fundamental and widely applicable problem in many fields where the cost of misclassification is high, such as fraud detection or disease diagnosis. The global server G learns the predictive function between the instances and their respective labels, \(f(x)=y\), through the collaborative training of the local clients (\(C_1,C_2, \ldots , C_n\)). The basic steps involved in FL in a streaming environment are listed below:

  1. The server G initiates the global model and sends the initial parameters \(w_g\) to a random selection of clients.

  2. At round l, the client \(C_k\) receives the global parameters \(w^{l-1}_g\) and uses them to train the local model using its local streaming dataset \(d_k\) to achieve the optimal local parameters \(w^{l}_{k}\).

  3. The server G receives the local parameters \(w^{l}_{1}, w^{l}_{2}, \ldots , w^{l}_{n}\) from the clients (\(C_1,C_2, \ldots ,C_n\)) and updates itself using the average of the received parameters, Eq. (1) (McMahan et al., 2017). The server then sends the updated global parameters \(w^{l}_{g}\) to all the clients.

     $$\begin{aligned} w^{l}_{g}=\frac{1}{n}\sum _{j=1}^{n}w^{l}_{j} \end{aligned}$$
     (1)

  4. Repeat steps 2 and 3 until the end of the stream (a minimal sketch of this loop is given after the list).
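
To make steps 2 and 3 concrete, the following sketch shows one way the streaming FedAvg loop could be wired up in Python. It assumes each client exposes a hypothetical train_on_stream routine that trains on the instances received since the last round and returns the updated parameter vector; all names are illustrative and not part of the actual FAC-Fed implementation.

```python
import numpy as np

def federated_round(global_params, clients):
    """One communication round of the streaming FL loop (steps 2 and 3)."""
    local_params = []
    for client in clients:
        # Step 2: the client starts from the global parameters and trains
        # on the instances that arrived since the last round (hypothetical API).
        local_params.append(client.train_on_stream(global_params.copy()))
    # Step 3: the server averages the received parameters, Eq. (1).
    return np.mean(local_params, axis=0)

def run_stream(global_params, clients, num_rounds):
    # Step 4: repeat until the end of the stream (a fixed round budget here).
    for _ in range(num_rounds):
        global_params = federated_round(global_params, clients)
    return global_params
```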

We assume that the datasets used to train and test the proposed model have a single sensitive attribute (S) with binary values, where (P) and (\({\bar{P}}\)) represent the protected and the non-protected group, respectively. For example, if “race” is the sensitive attribute, the protected group (P) could include all instances with the value “black” for the sensitive attribute and the non-protected group (\({\bar{P}}\)) could include all instances with the value “white”. We gauge the discriminatory behavior of the proposed method by two notions of fairness. There are many definitions of fairness in the literature (Verma & Rubin, 2018); however, there are no clear criteria for choosing a particular notion of fairness for a particular problem. In this work, we select two group fairness notions, statistical parity (Stp) and equal opportunity (Eqop) (Verma & Rubin, 2018), to measure the discrimination score. Stp ensures that each individual has an equal chance of being assigned to the positive class (\(y^{+}\)), irrespective of its membership in the protected or non-protected group, as illustrated in Eq. (2). The positive class is the desired class of the model’s objective function.

$$\begin{aligned} Stp = P(f(x)=y^{+} \mid S \, = \, {\bar{P}}) - P(f(x)=y^{+} \mid S \, =\, P) \end{aligned}$$
(2)

Eqop ensures that individuals belonging to the protected and the non-protected group who truly belong to the positive class receive the positive outcome (\(y^{+}\)) at equal rates, as shown in Eq. (3).

$$\begin{aligned} Eqop = P(f(x) = y^{+}\mid y=y^{+},S={\bar{P}})-P(f(x)=y^{+}\mid y=y^{+},S=P) \end{aligned}$$
(3)
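
Both fairness notions can be computed directly from binary predictions, ground-truth labels, and the sensitive attribute. The snippet below is a minimal NumPy sketch of Eqs. (2) and (3); the function names and the toy arrays are purely illustrative.

```python
import numpy as np

def statistical_parity(y_pred, s, protected=1):
    """Stp (Eq. 2): positive-prediction rate of the non-protected group
    minus that of the protected group."""
    prot, non_prot = (s == protected), (s != protected)
    return y_pred[non_prot].mean() - y_pred[prot].mean()

def equal_opportunity(y_pred, y_true, s, protected=1):
    """Eqop (Eq. 3): true-positive rate of the non-protected group
    minus that of the protected group."""
    pos = (y_true == 1)
    prot, non_prot = (s == protected) & pos, (s != protected) & pos
    return y_pred[non_prot].mean() - y_pred[prot].mean()

# Toy example: binary predictions, labels, and a binary sensitive attribute.
y_pred = np.array([1, 0, 1, 1, 0, 1])
y_true = np.array([1, 0, 1, 1, 1, 0])
s = np.array([1, 1, 0, 0, 1, 0])  # 1 = protected group
print(statistical_parity(y_pred, s), equal_opportunity(y_pred, y_true, s))
```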

5 Proposed methodology

The complete visual illustration of the proposed methodology is shown in Fig. 2. The pseudocode of the overall approach for adapting the FL framework to concept drift detection and subsequent discrimination mitigation in a streaming environment is presented in Algorithm 1. Each client hosts local streaming data and a local online deep neural network (ODNN) model (Fig. 2A). The global server also has the same ODNN model (Fig. 2G). Section 5.1.1 details the ODNN model used in this work. Every client begins its training by initializing the ODNN model parameters with the global server’s ODNN model parameters. Each client trains its local ODNN model with new incoming instances until the stream ends or until the global server requests the client to share the parameters.

Fig. 2 Federated adaptation of online fairness and concept drift-aware stream classification framework: A Local Online Learner, B EDDM–Concept Drift Detection, C Update Window, D Discrimination Detection, E CFSOTE–Discrimination Mitigation, F Global Server Weights Aggregation, G Update Global Online Learner

In this setup, for each new instance, the label is predicted by the learner and the evaluation metrics are updated. We assume that the data stream is infinite and non-stationary, i.e., there is a continuous presence of concept drifts, which may compromise the predictive performance of the learner. Therefore, we employ the concept drift detection mechanism EDDM (Early Drift Detection Method) (Baena-Garcıa et al., 2006) (Fig. 2B). Once EDDM detects a concept drift, the sliding window is cleared and a new window of instances is initiated to store the next instances (Fig. 2C).

Using the prequential evaluation strategy, the discriminatory behavior of the model is quantified (Fig. 2D) by one of the aforementioned fairness notions i.e., Stp or Eqop. If the discrimination score (disc: Stp or Eqop) exceeds a user-defined threshold \(\epsilon\), the proposed continuous fairness-aware synthetic over-sampling technique (CFSOTE) is employed to mitigate the discrimination (Fig. 2E). CFSOTE uses the variable window of instances maintained by EDDM to mitigate discrimination. Then, the local online learner is trained using the newly synthesized instances. The extensive algorithmic details of CFSOTE are elaborated in Sect. 5.3.1.

Algorithm 1 Pseudocode of the overall FAC-Fed approach

The clients share their respective local learning parameters with the global server as soon as they receive the corresponding request from the server. The server then aggregates and averages (Fig. 2F) the clients’ local model parameters. The global ODNN model (Fig. 2G) is updated using the averaged weights of the clients. The detailed methodology is explained in the following subsections.

5.1 Step A: local online learner

Every participating client in the system maintains its own local streaming data and an online deep neural network (ODNN) model (depicted in Fig. 2A). Section 5.1.1 details the ODNN model used in this work. At the start of training, each client initializes its ODNN model parameters using the corresponding parameters of the global server’s ODNN model. Subsequently, the client trains its local ODNN model on newly arrived instances from the data stream, continuing this process until either the stream concludes or the global server requests the client to share its parameters (as outlined in Algorithm 1: lines 1 to 7). The global server periodically requests the clients to share their respective local parameters (Algorithm 1: global_server_parameter_request). Each local ODNN model is trained using a prequential evaluation setup, i.e., test first, then train (Gama, 2010; Zhang et al., 2019) (Algorithm 1: lines 6 to 7). In this configuration, for every incoming instance, the learner predicts its label and updates the evaluation metrics accordingly. Following the prediction phase, the true label of the instance is disclosed to the learner, enabling the model to be updated based on this new information.

5.1.1 Online deep neural network (ODNN) model

Our base model is an online deep neural network inspired by Sahoo et al. (2018). ODNN uses hedge backpropagation to efficiently update the parameters of the DNN in an online environment. The Hedge Backpropagation (HBP) technique extends the backpropagation algorithm to train DNNs in a streaming environment by combining classifiers of different depths with the Hedge algorithm (Freund & Schapire, 1997). ODNN initializes with an overcomplete network and automatically adapts the depth of the network in an online manner. The network is initialized with a maximum of L hidden layers, and each hidden layer is followed by a softmax classification layer. ODNN works on the principle of online learning with expert advice, where the experts are the DNNs with varying depths. The final prediction of the ODNN model is a weighted combination of the classifiers at depths 0, 1, ..., L. The weight of each classifier at depth l (\(\alpha ^{(l)}\)) is learnt during the learning procedure of the ODNN model and is also shared with the global server. The global server aggregates and averages these classifier weights (\(\alpha ^{(l)}\)) along with the weights of the layers of each ODNN model. In the training phase of each ODNN model, we set the binary cross-entropy loss function as the optimization objective. Since most of our datasets are imbalanced, we use the class weighting module when training the ODNN models. When the ratio between the positive and the negative class is 1:p, we force the ODNN model to give p times more importance to the positive class instances than to the negative class instances using the class weighting module.
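
As an illustration of how the per-depth classifiers can be combined, the sketch below implements a hedged ensemble over the outputs of classifiers attached at different depths. The squared-error loss, the discount factor, and the weight smoothing used in the update are assumptions made for this sketch, not necessarily the exact configuration of Sahoo et al. (2018) or of our implementation.

```python
import numpy as np

class HedgedEnsemble:
    """Combine classifiers attached at depths 0..L-1 with Hedge weights alpha."""

    def __init__(self, num_depths, beta=0.99, smoothing=1e-3):
        self.alpha = np.full(num_depths, 1.0 / num_depths)  # classifier weights
        self.beta = beta            # Hedge discount applied to poor classifiers
        self.smoothing = smoothing  # keeps every depth minimally active

    def predict_proba(self, per_depth_probs):
        # Final prediction: alpha-weighted combination of per-depth outputs.
        return float(np.dot(self.alpha, per_depth_probs))

    def update(self, per_depth_probs, y_true):
        # Hedge update: discount each depth's weight according to its loss,
        # then renormalize so the weights sum to one.
        losses = (per_depth_probs - y_true) ** 2
        self.alpha *= self.beta ** losses
        self.alpha = np.maximum(self.alpha, self.smoothing)
        self.alpha /= self.alpha.sum()

# One online step with per-depth positive-class probabilities for an instance.
ens = HedgedEnsemble(num_depths=5)
probs = np.array([0.4, 0.7, 0.6, 0.8, 0.55])
print(ens.predict_proba(probs))
ens.update(probs, y_true=1)
```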

5.2 Step B–C: drift detection

The Early Drift Detection Method (EDDM) (Baena-Garcıa et al., 2006) (Fig. 2B) maintains a sliding window of variable length to store the most recent instances of the data stream, and is able to automatically detect and adjust the size of the window according to the current rate of change. EDDM keeps track of the average distance between two classification errors (\(e_{j}\)), its standard deviation (\(sd_{j}\)), the maximum average error distance (\(e\_max_{j}\)), and the maximum standard deviation (\(sd\_max_{j}\)). The average error distance at the \(j^{th}\) error (\(e_{j}\)) is the average number of examples between two classification errors, as presented in Eq. (4), where \(dis_{i}\) is the number of examples between the current error and the previous error, \(e_{i-1}\) is the average error distance calculated when the previous error occurred, and \(n_{ei}\) is the number of classification errors seen so far. The standard deviation of the average error distance (\(sd_{j}\)) is calculated using Eq. (5), in which \(var_{j}\) is the running variance of the average error distance. The drift detection method defines the threshold \(\eta\) shown in Eq. (6) to ensure the detection of concept drifts: when the left-hand side of this relation falls below the pre-defined threshold \(\eta\), EDDM declares that a concept drift has occurred.

$$\begin{aligned}{} & {} e_{j}= \sum _{i=0}^{j} \frac{dis_{i}-e_{i-1}}{n_{ei}} \end{aligned}$$
(4)
$$\begin{aligned}{} & {} sd_{j} = \sqrt{\frac{var_{j}}{n_{ej}}}\,\,\,\, and \,\,\,\, var_{j} = \sum _{i=0}^{j} (dis_{i}-e_{i})*(dis_{i}-e_{i-1}) \end{aligned}$$
(5)
$$\begin{aligned}{} & {} \frac{e_{j} + 2 * sd_{j}}{e\_max_{j} + 2 * sd\_max_{j}} < \eta \end{aligned}$$
(6)

When EDDM identifies a concept drift, it triggers the clearing of the sliding window. Subsequently, a new window is initialized to store the upcoming instances (as illustrated in Algorithm 1: lines 8 to 12) (shown in Fig. 2C).
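
For concreteness, the following sketch restates Eqs. (4)-(6) as an incremental detector. The warm-up of 30 errors before a drift can be signalled and the class interface are illustrative assumptions; only \(\eta\) comes from Eq. (6).

```python
class EDDM:
    """Early Drift Detection Method: monitors the distance between
    consecutive classification errors (Eqs. 4-6, sketched)."""

    def __init__(self, eta=0.9, warmup_errors=30):
        self.eta = eta                      # drift threshold of Eq. (6)
        self.warmup_errors = warmup_errors  # assumed minimum number of errors
        self.reset()

    def reset(self):
        self.num_errors = 0
        self.steps_since_error = 0
        self.e = 0.0       # running mean distance between errors, e_j
        self.var = 0.0     # running variance of that distance
        self.e_max = 0.0
        self.sd_max = 0.0

    def add_result(self, correct):
        """Feed one prediction outcome; return True when a drift is signalled."""
        self.steps_since_error += 1
        if correct:
            return False
        # A new classification error: update e_j and sd_j incrementally.
        dist = self.steps_since_error
        self.steps_since_error = 0
        self.num_errors += 1
        prev_e = self.e
        self.e += (dist - prev_e) / self.num_errors    # Eq. (4)
        self.var += (dist - self.e) * (dist - prev_e)  # running form of Eq. (5)
        sd = (self.var / self.num_errors) ** 0.5
        score = self.e + 2.0 * sd
        if score > self.e_max + 2.0 * self.sd_max:
            self.e_max, self.sd_max = self.e, sd       # best state seen so far
        if self.num_errors < self.warmup_errors:
            return False
        # Eq. (6): drift when the current state degrades below eta times the best state.
        return score / (self.e_max + 2.0 * self.sd_max) < self.eta
```

When `add_result` returns True, the caller clears the sliding window, starts a new one (Fig. 2C), and resets the detector.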

5.3 Step D–E: discrimination detection and mitigation

By employing the prequential evaluation strategy, the model’s discriminatory behavior is measured (depicted in Fig. 2D, as described in Algorithm 1: line 13) using one of the fairness notions mentioned earlier, such as Stp or Eqop.

We hypothesize that discrimination is often deeply rooted in the training data, due to non-trustworthy labelling or selection bias. Therefore, we propose a data augmentation-based strategy, the Continuous Fairness-aware Synthetic Oversampling Technique (CFSOTE), to mitigate discrimination. It is an adaptation of the Continuous Synthetic Minority Oversampling Technique (CSMOTE) (Bernardo et al., 2020). The proposed method performs data augmentation using the sliding window of instances of each client maintained by the concept drift detector EDDM (Algorithm 1: lines 14 to 27). The local online learner is then trained using the newly synthesized instances (\(X\_syn, Y\_syn\)) (Algorithm 1: lines 28 to 29). For data augmentation, we divide each client’s training dataset based on the output class (positive class: \(C_{+}\), negative class: \(C_{-}\)) and the sensitive attribute (\(P, {\bar{P}}\)) into four groups: \(N(C_{-},P)\), \(N(C_{-},{\bar{P}})\), \(N(C_{+},P)\), \(N(C_{+},{\bar{P}})\).

Real-world datasets often suffer from the inherent problem of class imbalance. Most fairness-aware learning methods disregard class imbalance and attempt to mitigate discrimination at the cost of the true-positive rate of the minority class, resulting in poor balanced accuracy. We use the class weighting module to address this issue. Furthermore, our discrimination mitigation strategy itself is able to improve and maintain balanced accuracy.
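
A minimal sketch of the class weighting step follows; the dictionary of per-class loss weights is an illustrative interface, not the exact module used in our implementation.

```python
import numpy as np

def class_weights(y_seen):
    """Weight the positive class p times more when the observed
    positive:negative ratio is 1:p (recomputed as the stream evolves)."""
    pos = max(int(np.sum(y_seen == 1)), 1)
    neg = max(int(np.sum(y_seen == 0)), 1)
    return {0: 1.0, 1: neg / pos}  # e.g. passed as per-class loss weights

print(class_weights(np.array([0, 0, 0, 0, 1])))  # {0: 1.0, 1: 4.0}
```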

For each client, in every communication round, we use prequential evaluation to train the local ODNN model. Through prequential evaluation, we keep track of the discrimination score (disc: Stp or Eqop) over the stream. If the discrimination score exceeds the user-defined threshold \(\epsilon\), we up-sample certain groups (\(N(C_{+},P)\), \(N(C_{-},{\bar{P}})\)) of data contained in the local sliding window maintained by EDDM to reduce the discrimination embedded in the dataset. The groups are chosen for upsampling based on the total number of positive predictions and the total number of positive labels in the data stream. If the number of positive predictions is less than or equal to the total number of positive labels in the data stream, we upsample the positive protected group \(N(C_{+},P)\) by a proportion (\(\lambda\)) of the negative non-protected group (\(N(C_{-},{\bar{P}})\)) using CFSOTE. Otherwise, we increase the number of samples in the negative non-protected group \(N(C_{-},{\bar{P}})\) by a proportion (\(\lambda\)) of the positive protected group (\(N(C_{+},P)\)) using CFSOTE. The algorithmic details of CFSOTE are given in Sect. 5.3.1. The up-sampling proportion \(\lambda\) is calculated through the formula given in Eq. (7), where disc is the discrimination score (Stp or Eqop) measured through prequential evaluation of the local ODNN model; \(\lambda _{initial}\) and \(disc_{tol}\) are hyperparameters. The parameter \(disc_{tol}\) controls the effect of disc on \(\lambda\): the higher the value of \(disc_{tol}\), the smaller the effect of disc on \(\lambda\), and vice versa.

$$\begin{aligned} \lambda = \lambda _{initial}*(1+(disc/disc_{tol})) \end{aligned}$$
(7)

FAC-Fed handles positive discrimination (discrimination towards the protected group) as well as negative discrimination (discrimination towards the non-protected group). To handle negative discrimination, we swap the roles of the protected and non-protected groups, and the rest of the algorithm remains the same.
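
The choice of which group to up-sample, together with the computation of \(\lambda\) in Eq. (7), can be sketched as follows; the function name, the group labels, and the return convention are illustrative.

```python
def oversampling_plan(disc, eps, num_pos_preds, num_pos_labels,
                      lam_initial=0.05, disc_tol=0.2):
    """Decide which group to up-sample and by which proportion (Eq. 7).

    Returns None when the discrimination score is within tolerance.
    Negative discrimination is handled by swapping the group roles
    before calling this function, as described above.
    """
    if disc <= eps:
        return None
    lam = lam_initial * (1.0 + disc / disc_tol)  # Eq. (7)
    if num_pos_preds <= num_pos_labels:
        # Too few positives predicted: grow the positive protected group
        # by lambda times the size of the negative non-protected group.
        return ("N(C+, P)", lam)
    # Otherwise grow the negative non-protected group instead.
    return ("N(C-, P_bar)", lam)

print(oversampling_plan(disc=0.12, eps=0.01, num_pos_preds=40, num_pos_labels=55))
```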

Algorithm 2 Pseudocode of the Continuous Fairness-aware Synthetic Oversampling Technique (CFSOTE)

5.3.1 Continuous fairness-aware synthetic oversampling technique (CFSOTE)

CFSOTE is an adaptation of the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002). We propose this algorithm to upsample a selected group (\(N(C_{+},P)\) or \(N(C_{-},{\bar{P}})\)) from the local sliding window of instances maintained by EDDM. Algorithm 2 explains the procedure we follow for upsampling a selected group. In contrast to the traditional SMOTE algorithm, we do not select all samples of the selected group for up-sampling, but only z samples from the group, where z is a hyperparameter (Algorithm 2: line 1). For each selected sample, we generate m/z, i.e., b (Algorithm 2: line 2), new samples by linear interpolation between the selected sample and its k nearest neighbors, where m is computed in Algorithm 1; k and z are hyperparameters. Nearest neighbors are sought using the K-Nearest Neighbors (KNN) algorithm (Piegl & Tiller, 2002) (Algorithm 2: line 4). KNN calculates the distance between the queried sample and the other data samples using the Euclidean distance metric, sorts the data samples in ascending order of their distance from the queried sample, and returns the first k samples.

The predictions of the classification model should be independent of the sensitive attribute, which ultimately leads the model to the goal of achieving fairness in its decisions. Therefore, we assume that the samples belonging to the positive protected group \(N(C_{+},P)\) and to the positive non-protected group \(N(C_{+},{\bar{P}})\) are in close proximity to each other, differing only in the sensitive attribute. Consequently, if we need to up-sample the positive protected group \(N(C_{+},P)\), we find the nearest neighbors in a search space that includes the positive protected group \(N(C_{+},P)\) as well as the positive non-protected group \(N(C_{+},{\bar{P}})\) (Algorithm 1: instances\(\_\)pool2 = (\(N(C_{+},P)\) & \(N(C_{+},{\bar{P}})\))). We assign the protected value to the sensitive attribute of the newly synthesized instances. Figure 3 illustrates the CFSOTE method proposed for up-sampling the positive protected group (\(N(C_{+},P)\)). However, if we want to up-sample the negative non-protected group \(N(C_{-},{\bar{P}})\), we select the nearest neighbors from the group itself (Algorithm 1: instances\(\_\)pool2 = \(N(C_{-},{\bar{P}})\)). We do not include the negative protected group \(N(C_{-},P)\) in this search space because the datasets may have non-trustworthy labelling; therefore, there is a possibility that many samples belonging to the negative protected group \(N(C_{-},P)\) are labelled as negatives due to bias.

Fig. 3 An illustration of the Continuous Fairness-aware Synthetic Over Sampling Technique (CFSOTE) for up-sampling \(N(C_{+},P)\); KNN algorithm finds K nearest neighbors of the randomly selected sample \(r\_sample_{i}\) from the groups \(N(C_{+},P)\) and \(N(C_{+},{\bar{P}})\); m/k newly synthesized samples with sensitive attribute as P added in the group \(N(C_{+},P)\)

Once the nearest neighbors are found, we perform linear interpolation between the queried sample and its nearest neighbors to synthesize new samples and assign the protected value to the sensitive attribute of all the newly synthesized instances (Algorithm 2: lines 5 to 8).
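
A compact sketch of this interpolation step is given below, using scikit-learn's NearestNeighbors for the KNN search. When up-sampling \(N(C_{+},P)\), neighbour_pool would be the union of \(N(C_{+},P)\) and \(N(C_{+},{\bar{P}})\); when up-sampling \(N(C_{-},{\bar{P}})\), it would be the group itself. The function signature and the random-number handling are illustrative, not the exact Algorithm 2 implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cfsote(group, neighbour_pool, m, sens_index, protected_value,
           k=5, z=5, seed=0):
    """Synthesize roughly m new instances for `group` by interpolating between
    z randomly selected samples and their k nearest neighbours in `neighbour_pool`."""
    rng = np.random.default_rng(seed)
    z = min(z, len(group))
    k = min(k, len(neighbour_pool))
    b = max(m // z, 1)  # new samples generated per selected instance
    knn = NearestNeighbors(n_neighbors=k).fit(neighbour_pool)
    seeds = group[rng.choice(len(group), size=z, replace=False)]
    synthetic = []
    for s in seeds:
        _, idx = knn.kneighbors(s.reshape(1, -1))
        for _ in range(b):
            neighbour = neighbour_pool[rng.choice(idx[0])]
            new = s + rng.random() * (neighbour - s)  # linear interpolation
            new[sens_index] = protected_value         # force the sensitive value
            synthetic.append(new)
    return np.array(synthetic)
```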

5.4 Step F–G: global server

Upon receiving a request from the global server, the clients promptly share their individual local learning parameters. Subsequently, the server performs parameter aggregation and averaging (as depicted in Fig. 2F) using Eq. (1). The resulting averaged weights of the clients are then used to update the global ODNN model (shown in Fig. 2G). The updated global parameters (\(w^{l+1}_{g}\)) are transmitted to the selected clients for the subsequent communication round.

6 Experimental setup

6.1 Hyperparameters selection

For concept drift detection, we chose the value of \(\eta\) as 0.9 for Eq. (6), as suggested by Baena-Garcıa et al. (2006). For the ODNN model, we performed a grid search and initialized each model with a maximum of \(L=5\) hidden layers and 40 neurons per layer. Increasing these values leaves the performance of ODNN unchanged, whereas decreasing them degrades performance. For CFSOTE, we performed a grid search for each dataset and chose the value 5 for both k and z. Since we upsample based on a window of instances, k and z are bounded by the current size of the instance group to be upsampled. If we decrease the values of k and z, the newly synthesized instances will most likely be near duplicates of the randomly selected samples; if we increase these values, the performance of the framework remains comparable. For Eq. (7), we chose the value 0.2 for \(disc_{tol}\) and 0.05 for \(\lambda _{initial}\). These values of \(disc_{tol}\) and \(\lambda _{initial}\) keep the effect of the discrimination score on \(\lambda\) in a moderate range, avoiding the undesirable synthesis of a large number of instances, which can lead to a high reverse discrimination score.
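
For reference, the selected values can be collected in a single configuration; the dictionary itself is only an illustrative way of grouping the hyperparameters reported above.

```python
# Hyperparameter values reported in Sect. 6.1 (the dictionary structure is
# illustrative and not part of the FAC-Fed code base).
FAC_FED_CONFIG = {
    "eddm_eta": 0.9,         # drift threshold in Eq. (6)
    "odnn_max_layers": 5,    # maximum number of hidden layers L
    "odnn_neurons": 40,      # neurons per hidden layer
    "cfsote_k": 5,           # nearest neighbours
    "cfsote_z": 5,           # seed samples per up-sampling call
    "disc_tol": 0.2,         # Eq. (7)
    "lambda_initial": 0.05,  # Eq. (7)
}
```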

Table 1 Description of datasets

6.2 Datasets

We evaluate the proposed methodology using a range of real-world datasets, including Bank Marketing (Bank M.) (Bache & Lichman, 2013), Law School (Law S.) (Wightman, 1998), Default (Bache & Lichman, 2013), and Adult Census (Adult C.) (Bache & Lichman, 2013). These datasets vary in their number of instances (#Inst.), number of attributes (#Attr.), sensitive attribute (Sen. att.), and imbalance ratio (Im. ratio); the details are presented in Table 1. To adapt the datasets to the FL environment, we randomly split each dataset among 3 and 5 clients. Most of the datasets used in this work (except Bank M.) are static; therefore, to ensure reliability, we report the average results over experiments performed on 10 random shuffles of each static dataset. To demonstrate the ability of FAC-Fed to handle non-IID data, we also distribute each dataset among three clients based on a particular attribute. We choose the ‘age’ attribute for splitting the Bank M., Default, and Adult C. datasets and the ‘income’ attribute for splitting the Law S. dataset. These attribute choices are deliberate, as they ensure that each client hosts a distinct data distribution, thus establishing the non-IID nature of the data.
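
One possible way of producing such attribute-based non-IID client partitions is sketched below with pandas; the quantile-based bin edges are an illustrative assumption, and the exact edges used in our experiments are not reproduced here.

```python
import pandas as pd

def attribute_based_split(df, attribute, num_clients=3):
    """Partition a dataset into shards with distinct distributions
    by binning on a single attribute (e.g. 'age' or 'income')."""
    shards = pd.qcut(df[attribute], q=num_clients, labels=False, duplicates="drop")
    return [df[shards == c].reset_index(drop=True) for c in range(num_clients)]

# Toy frame for illustration; the experiments use Bank M., Default, Adult C., Law S.
toy = pd.DataFrame({"age": [23, 35, 47, 59, 31, 64], "y": [0, 1, 0, 1, 1, 0]})
clients = attribute_based_split(toy, "age", num_clients=3)
print([len(c) for c in clients])
```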

6.3 Baselines

This section explains the details of the baseline methods employed for comparison with our proposed approach. To the best of our knowledge, our work is the first attempt towards a federated adaptation for fairness and concept drift-aware stream classification. Therefore, there are no fairness-aware federated baselines for streaming data against which to compare our results. Nonetheless, we compare the centralized version of our methodology with state-of-the-art centralized stream classification methods. This enables us to assess the performance and efficacy of our approach in a centralized setting.

  • CSMOTE Bernardo et al. (2020) is not fairness-aware, but it is designed to handle class imbalance in a non-stationary environment by re-sampling the minority class in a defined window of instances.

  • Fairness Aware Hoeffding Tree (FAHT) Zhang et al. (2019) is a fairness-aware adaptation of the Hoeffding tree. It incorporates the fairness gain (Stp score) along with the information gain into the partitioning criterion of the decision tree. This model cannot deal with class imbalance or concept drifts and is not agnostic with respect to the fairness notion; therefore, we report its results only for the case of Stp-based optimization.

  • FABBOO Iosifidis and Ntoutsi (2020) is an online boosting approach that handles class imbalance by monitoring class ratios in an online fashion. It employs boundary adjustment methods to handle discrimination.

  • AC-Fed is the proposed federated adaptation for concept drift-aware stream classification. This method does not handle fairness issues.

  • FAC-Fed is the proposed fairness and concept drift-aware federated adaptation for stream classification.

6.4 Evaluation metrics

We evaluate our proposed method with respect to both utility and fairness. Since almost all datasets used in this study are imbalanced, we use balanced accuracy to measure the utility of the proposed model. We also use the geometric mean (gmean) to measure the effectiveness of the proposed method. To gauge the discriminatory behavior of FAC-Fed, we use two fairness notions: statistical parity (Stp) and equal opportunity (Eqop). The details of these fairness notions are explained in Sect. 4.
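
Both utility metrics can be computed from the true-positive and true-negative rates; the short sketch below is purely illustrative.

```python
import numpy as np

def balanced_accuracy_and_gmean(y_true, y_pred):
    """Balanced accuracy = (TPR + TNR) / 2; gmean = sqrt(TPR * TNR)."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return (tpr + tnr) / 2.0, float(np.sqrt(tpr * tnr))

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(balanced_accuracy_and_gmean(y_true, y_pred))
```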

Fig. 4 Comparison of Balanced accuracy (BA) and Statistical parity (Stp) achieved by FAC-Fed and AC-Fed through all communication rounds for Bank M., Law S., Default, and Adult C. datasets with R3C data split

Fig. 5 Comparison of Balanced accuracy (BA) and Equal Opportunity (Eqop) achieved by FAC-Fed and AC-Fed through all communication rounds for Bank M., Law S., Default, and Adult C. datasets with R3C data split

Table 2 Performance measures obtained by proposed method FAC-Fed for Statistical Parity (Stp). Note that RnC implies random split of dataset among n clients and Attr3C denotes attribute-based distribution of data among 3 clients
Table 3 Performance measures obtained by proposed method FAC-Fed for Equal Opportunity (Eqop). Note that RnC implies random split of dataset among n clients and Attr3C denotes attribute-based distribution of data among 3 clients

7 Results and discussion

Table 4 Comparison of performance measures obtained by proposed method FAC-Fed and the baseline methods in a centralized environment for statistical parity, with best and second best values shown in bold and italic
Table 5 Comparison of performance measures obtained by proposed method FAC-Fed and the baseline methods in a centralized environment for equal opportunity, with best and second best values shown in bold and italic

We perform experiments on a set of real-world datasets. For each dataset, we present the results for the random distribution of data among 3 and 5 clients (R3C, R5C). All the evaluation metrics obtained by FAC-Fed with \(disc = Stp\) and \(disc = Eqop\) are presented in Tables 2 and 3, respectively. From Table 2, we can see that FAC-Fed obtains high balanced accuracy and gmean while keeping the Stp score between 0.002 and 0.008 for both the R3C and R5C data splits of all datasets. Similarly, from Table 3, we can observe that FAC-Fed achieves high balanced accuracy and gmean while keeping the Eqop score under 0.007 for all datasets and all data splits. From Tables 2 and 3, we can deduce that FAC-Fed is agnostic with respect to the notion of fairness used for optimization, since it achieves similar balanced accuracy and gmean while maintaining very low discrimination scores whether Stp or Eqop is used as the optimization criterion. We assess the efficacy of the proposed framework on non-IID data by distributing the data among three clients based on a specific attribute. From Tables 2 and 3, we observe that FAC-Fed maintains its superior performance in terms of both utility and discrimination mitigation when the data is distributed among the clients based on a specific attribute. This highlights the framework’s capability to effectively handle non-IID data.

Figure 4 shows a comparison of the balanced accuracy and Stp score achieved by FAC-Fed and AC-Fed for the R3C split of all datasets. From this figure, we can see that FAC-Fed achieves balanced accuracy comparable to AC-Fed for all datasets and maintains it across all communication rounds, while the Stp score achieved by FAC-Fed is much lower than that of AC-Fed. Similarly, Fig. 5 shows a comparison of the Eqop score and balanced accuracy achieved by FAC-Fed and AC-Fed for the R3C split of all datasets: FAC-Fed again achieves balanced accuracy comparable to AC-Fed for all datasets and maintains it across all communication rounds, whereas the Eqop score achieved by FAC-Fed is much lower than that of AC-Fed. This demonstrates that the proposed discrimination mitigation strategy has minimal impact on the utility of the proposed federated framework.

To the best of our knowledge, this is the first attempt towards fairness and concept drift-aware federated stream classification. Therefore, we compare the performance measures achieved by the centralized version of FAC-Fed with three centralized stream classification models (FABBOO, FAHT, CSMOTE). The results obtained by prequential evaluation of centralized FAC-Fed and the competing baselines with Stp and Eqop as the optimization criteria are shown in Tables 4 and 5, respectively. From Table 4, we can see that FAC-Fed achieves the best Stp score, balanced accuracy, and gmean for all datasets except the Bank Marketing dataset. For the Bank Marketing dataset, the centralized version of FAC-Fed trails CSMOTE in terms of balanced accuracy and gmean by only \(0.45\%\) and \(0.43\%\), respectively, while the Stp score achieved by FAC-Fed (\(-0.0009\)) is much lower than that of CSMOTE (0.0829). Similarly, in Table 5, with Eqop as the optimization criterion, we can observe that FAC-Fed achieves the best balanced accuracy, gmean, and Eqop score compared to all baselines for all datasets except the Bank Marketing dataset. For the Bank Marketing dataset, FAC-Fed achieves balanced accuracy and gmean comparable to those achieved by CSMOTE, while the Eqop score of FAC-Fed (\(-0.0021\)) is much lower than that of CSMOTE (0.0229). For the Bank dataset, FABBOO achieves the best Eqop score (0.0012) and FAC-Fed follows it by a close margin; nevertheless, FAC-Fed achieves \(6.49\%\) higher balanced accuracy than FABBOO. With the Default dataset, FABBOO achieves the best Eqop score (0.0014) and FAC-Fed follows it by a narrow margin (\(-0.0081\)), while its balanced accuracy and gmean are \(2.95\%\) and \(4.74\%\) higher than those of FABBOO, respectively. The difference between the balanced accuracy and the gmean achieved by FABBOO is large for most datasets, suggesting that FABBOO achieves a lower discrimination score at the expense of either the true-positive rate or the true-negative rate. In contrast, FAC-Fed achieves much lower discrimination scores (Stp, Eqop) than FABBOO, while the balanced accuracy and gmean reported by FAC-Fed remain close to each other.

Fig. 6 Comparison of Balanced accuracy (BA) and Statistical parity (Stp) achieved by the centralized version of FAC-Fed and FABBOO with prequential evaluation throughout the stream for the Bank M., Law S., Default, and Adult C. datasets

Fig. 7 Comparison of Balanced accuracy (BA) and Equal Opportunity (Eqop) achieved by the centralized version of FAC-Fed and FABBOO with prequential evaluation throughout the stream for the Bank M., Law S., Default, and Adult C. datasets

Figures 6 and 7 show a comparison of the performance measures obtained by FABBOO and centralized FAC-Fed with prequential evaluation over the entire data stream for all datasets. From these plots, we can observe that although the fairness performance of FABBOO and FAC-Fed is quite similar, FAC-Fed achieves higher balanced accuracy than FABBOO. The results show that FAC-Fed achieves high balanced accuracy and low Stp and Eqop scores even in the centralized environment, although it is designed for a federated environment. If we compare the results of the federated and centralized versions of FAC-Fed, we observe that the difference in performance measures is not substantial. For instance, in Table 2, for the Bank M. dataset, the federated FAC-Fed achieved balanced accuracies of \(82.84\%\) and \(82.51\%\), as well as Stp scores of \(-0.0021\) and 0.007 for the R3C and R5C splits of the dataset, respectively. The centralized version of FAC-Fed achieved a balanced accuracy of \(82.46\%\) (Table 4) and an Stp score of 0.0009, which are very close to the results obtained by the federated version. A similar trend can be observed for the Adult C., Default, and Law S. datasets, indicating that the proposed methodology is robust and reliable in both federated and centralized environments.

8 Conclusion

To the best of our knowledge, FAC-Fed is the first work in the domain of federated stream learning that mitigates the discrimination inherent in the clients’ data while improving the framework’s predictive performance. The experimental results demonstrate the effectiveness of FAC-Fed in terms of predictive performance and fairness and highlight the following key advantages of the proposed framework:

  • FAC-Fed is able to reduce the discrimination score and maintain it over the stream.

  • FAC-Fed is agnostic in nature with respect to the fairness notion used during optimization.

  • For datasets with severe class imbalance, FAC-Fed is able to ensure significantly better predictive performance while maintaining low discrimination scores.

  • FAC-Fed demonstrates consistent predictive and discrimination mitigation performance even with non-IID data.

  • Fairness is ensured for each client.

  • The proposed framework can also be used as a centralized fairness-aware learning framework. For all the datasets, the centralized version of the proposed method ensures significantly better predictive performance than the competing baselines while maintaining low discrimination scores.

With the advances in sensor networks, distributed and heterogeneous data sources generate data regularly and dynamically. A possible extension of the proposed work is to adapt FAC-Fed to asynchronously train a large number of clients on continuously arriving streaming data.