1 Introduction

Federated learning (FL) is a machine learning (ML) technique that allows an artificial neural network (ANN) model to be trained on decentralized edge devices, i.e. workers, that keep data locally without sharing it with the server, i.e. the service provider. A central server broadcasts the ANN model to multiple workers and coordinates transmissions and responses. Each worker fits the ANN locally and sends the updated weights back to the server. FL principles [10] require that the server orchestrating the learning process never receives workers’ data under any circumstances, allowing a neural model to be trained without compromising data privacy. It is thus possible to overcome problems related to the processing and storage of personal data while still obtaining trained predictive models. However, in an FL environment data can be unevenly distributed across workers, leading to under-representation of one or more population subgroups. This can result in unfair predictions, statistical disparity, and inequity [11].

Considering a classical FL approach [10], the service provider has no means to ensure that data is evenly proportioned, or to estimate the impact of the data distribution across the whole set of workers on ML predictions. A zero-knowledge proof (ZKP) [2] can be used to prove a statement while revealing nothing beyond the validity of the statement itself. The proposed approach consists of implementing Schnorr’s ZKP authentication protocol [3, 8], which can be used to infer the data distribution of the remote workers without data exchange. Motivated by experimental results that show unfair treatment for imbalanced data (Sects. 2, 5), the following research question is investigated: to what extent can ZKP-inferred data about the proportions of population groups in a federated learning environment mitigate federated learning bias while complying with the GDPR and EU guidelines for data ethics and trustworthy AI?

By performing differentiated evaluations on an ANN model trained on imbalanced data, it is possible to observe an average increase in disparity that leads to prediction bias, as described in Sect. 5. This outcome drove the design of a self-balancing ZKP FL environment called Z-Fed that supports a fair, privacy-preserving learning process for datasets with multiple sensitive population categories. The technical process followed to achieve Z-Fed is as follows: 1) The federated server generates tokens to authenticate all the possible workers with Schnorr’s ZKP protocol [3]; 2) Workers encrypt their feature labels, i.e. categorical labels, fit the learning model, and send the update to the server; 3) The server can zero-knowledge prove that workers belong to the group identified by a certain feature label by retaining the encrypted version of the workers’ labels, and can count individuals; 4) The server uses a self-balancing queue system to accept updates in a manner that ensures the clients will not compromise the balance.

In this paper, we use an ANN whose weights are updated with the stochastic gradient descent (SGD) algorithm. This model is used for supervised training tasks on the UTK dataset [12] and is trained on images of faces to predict age. We implemented an FL framework and trained the ANN with balanced and imbalanced samples of the dataset in order to select appropriate metrics of comparison.

The main contributions of the present paper are the following: 1) Identification of a set of metrics, i.e. EOD, EPD, and SPD, to measure equality degradation in imbalanced FL training; 2) Design of a self-balancing ZKP FL framework, Z-Fed, implementing zero-knowledge authentication to prevent malicious workers from updating the model; 3) Implementation of ZKP inference of the worker data distribution to allow data augmentation and rejection of imbalanced updates, counteracting bias; 4) Evaluation of the self-balancing framework based on a stochastic gradient descent (SGD) ANN. The experimental results can be summarized as follows: with respect to an imbalanced FL framework, the measured scores of absolute multi-class (Sect. 5) statistical parity difference (SPD), equal opportunity difference (EPD), and equal odds difference (EOD) are considerably improved in the experiments conducted using self-balancing Z-Fed. Detailed results are available in Sect. 5.

2 Background and Motivations

Fairness and equity are general ideas not restricted to AI. Any application that involves decision-making processes can show discriminatory bias towards specific groups and thus must be evaluated in terms of fairness. The EU guidelines for trustworthy AI [6] define disparate treatment as a major concern in AI. In the Fair Credit Reporting Act (FCRA), fairness concerns individual attributes such as gender, race, religion, age, sexual orientation, and more. Unfair or disparate treatment occurs when the outcome of a decision is biased by such factors. While for explainable algorithms it can be easier to identify possible discrimination, this represents a major challenge in FL [11].

This section reviews the main research findings in the area of machine learning in FL environments involving the use of unbalanced data, and discusses how the presented approach contributes in relation to existing efforts. The three main areas of research are 1) fairness and advances in FL, 2) machine learning with unbalanced data, and 3) ZKP methods.

Federated Learning. A central server can collect fitted ANN model weights using synchronous or asynchronous protocols [10]. While recent advancements in FL have led to solutions that deal with the accuracy reduction due to uneven distribution of data using mediators [4] or probabilistic approaches [9], the proposed method focuses on improving the fairness of the predictive model by performing ZKP self-balancing.

Imbalanced Data Machine Learning. Ensemble methods have been proposed to reduce bias in imbalanced data learning [7], but they can suffer from the outliers typical of FL. The main proposed solutions for dealing with imbalanced data are sampling and augmenting data [5]. Over-sampling, often implemented by artificially creating minority classes to counter the effect of disproportions [13], shows promising results, but requires access to data, and hence is not suitable for FL environments.

Zero Knowledge Proof. ZKP can be used to enhance data privacy in online communication [2] and can be implemented using iterative [1] or non-iterative methods [3]. Iterative ZKP is inconvenient in FL, since it considerably increases the communication overhead. Non-iterative implementations of ZKP are often used for authentication [8] without involving the exchange of privacy-protected data, which makes this method suitable for FL. ZKP authentication allows a server to prove that a client knows certain information without revealing it. This is possible through the use of encrypted tokens, i.e. proofs and signatures, that ensure that only authorized clients can be authenticated.

3 Requirements

While performing FL, the federated server requires workers to authenticate via ZKP to prove that 1) they are authorized to contribute to training and 2) they are not holding data that does not belong to any subgroup. While distributing the model for ML, the server is able to count the number of samples of each category used for training.

ZKP Framework Initializer. ZKP authentication is enabled by public data structures, i.e. tokens, that use elliptic curves and generator points to prove that a specific authentication proof comes from the same token that was generated by the server during the registration process. Since there is no exchange of private data in the registration phase, a malicious remote client could force the server to register its features even if they do not belong to the public features dictionary. This is possible because the server would receive only an encrypted version of the label based on the private number n of the ZKP client. In this case, the malicious behavior distorts the count of groups and results in ineffective control of data balance. Moreover, the framework would suffer a higher computational and storage overhead by generating multiple authentication tokens for each worker.

In order to counter the effects of unreliable clients, keep the learning environment trustworthy, and reduce the computational resources needed, it is possible to create a very limited number of authentication tokens during an early server setup phase.

Learning Model. Any machine learning model can be used as long as it exposes the following APIs: 1) initialize the learning model, setting e.g. the learning rate \(\eta \); 2) read the values of the trainable parameters, e.g. weights, together with the configuration settings; 3) fit an input list X given a label list y, e.g. propagate the inputs through the neural layers, measure the loss, and update the training parameters; 4) load external training parameters received e.g. from remote clients; and 5) produce a list of predictions \(y_{pred}\) given a list of inputs X.
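The five APIs above can be sketched as an abstract interface. The following is a minimal illustration, not the paper's implementation; all method names are hypothetical:

```python
from abc import ABC, abstractmethod


class LearningModel(ABC):
    """Hypothetical interface for the five APIs a Z-Fed model must expose."""

    @abstractmethod
    def initialize(self, eta: float) -> None:
        """1) Initialize the model, e.g. setting the learning rate eta."""

    @abstractmethod
    def get_parameters(self) -> dict:
        """2) Read the trainable parameters (weights) and configuration."""

    @abstractmethod
    def fit(self, X: list, y: list) -> float:
        """3) Propagate X, measure the loss against y, update the parameters."""

    @abstractmethod
    def load_parameters(self, params: dict) -> None:
        """4) Load external parameters, e.g. received from remote clients."""

    @abstractmethod
    def predict(self, X: list) -> list:
        """5) Produce predictions y_pred for the inputs X."""


class ConstantModel(LearningModel):
    """Trivial concrete model used only to exercise the interface."""

    def initialize(self, eta):
        self.eta, self.params = eta, {"w": 0.0}

    def get_parameters(self):
        return dict(self.params, eta=self.eta)

    def fit(self, X, y):
        return 0.0  # a real model would return the measured loss

    def load_parameters(self, params):
        self.params = {"w": params["w"]}

    def predict(self, X):
        return [self.params["w"] for _ in X]
```

Any model satisfying this contract, regardless of architecture, can be plugged into the federated loop.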

4 Design of Z-Fed

A ZKP protocol is designed to enable the server to register every possible subgroup within the dataset. To do so, the server must know in advance the possible categorical features that data can have, e.g. the values of ethnic group and gender. For this purpose, it is possible to create a number of client prototypes equal to the number of subgroups present in the data. Client prototypes are not used for ANN training, but only for creating registration tokens. The server can use them to set up encrypted dictionaries that are used to count the number of samples belonging to specific subgroups. A service called the framework initializer is required to generate a private number n that can be transmitted to the workers, and to produce the client prototypes. In the proposed architecture, the federated framework is initialized using the aforementioned service.

ZKP Server. The ZKP server must be initialized with a private password, used for encryption, to prevent the tokens from being forged. The server can use the password to generate ZKP signatures. Having the server countersign a client signature allows the client to prove that the server is legitimate. The server must store a copy of the features data structure. From now on, we will refer to the possible features in the dictionary of the UTK dataset as feature names, e.g. Ethnicity and Gender, and will refer to the possible values as feature labels, e.g. Female, Male, etc.

During the registration phase, the server is able to create tokens for authenticating authorized workers. An authorized ZKP client can send its signature to the server, and the server can then use the client signature to create a token. The registration phase ends when all the possible subgroup combinations (e.g., Gender: Female, Ethnicity: Asian, \(\dots \), Gender: Male, Ethnicity: White) have a server-side token representation. The server can authenticate clients by checking whether they have a proof that is compatible with any of the tokens, meaning that the client belongs to a specific subgroup.

The server retains an encrypted representation of the client subgroup categories in its encrypted dictionary. At any given moment, the server can assess whether the distribution of data is even. Before a worker is requested to contribute to the ML process, the server can check whether the worker would cause an uneven data distribution, and in that case it will reject any update from it. When a worker is not able to train the distributed model because of potential imbalance, the server registers the worker’s identifier so it can possibly connect to it later, once its update would no longer cause imbalance. To optimize the process, the server retains a priority queue data structure. Moreover, if the dataset is highly imbalanced, the server can augment the training mechanism by requesting multiple epochs of training for under-represented groups.

The ZKP server requires a data structure to store the ZKP parameters needed for registration and authentication [2, 3]: the elliptic curve of choice (curve), the public salt value (Salt), the private number n, the hash function of choice (hash), and the curve generator point g.

The ZKP server needs a dictionary structure named groups to count encrypted versions of subgroups. Since every feature label has a token representation on the server, for each of the k feature names in the shared features dictionary a client must show that it has k feature labels compatible with the features structure in order to authenticate. Once the client is authenticated, the server receives k count updates indexed by k hexadecimal hash numbers. The k hashes are summed and used as a dictionary key to manage the FL server queuing protocol.

ZKP Client. A ZKP client is responsible for representing a specific individual tuple (feature name, feature label) in the distributed dataset. Given the number k of feature names in the features dictionary, every worker instantiates k ZKP clients. Every ZKP client stores the \(feature \, name\), the \(feature \, label\), and a ZK data structure analogous to that of the ZKP server. ZKP clients can generate a signature encrypted using a password. In this case, the ZKP client password, i.e. secret, is the hashed value of the \(feature \, label\) joined with the private number n: \( secret = hash(feature \, label | n) \), where | is a string operator, e.g. concatenation. Thanks to the private number n, the ZKP server is not able to decode the client token to read the label. Moreover, the ZKP client can create an \(encrypted \,\, label\) by joining the \(feature \, label\) with the number n in a different way, e.g. \( encrypted \,\, label = hash(n | feature \, label) \). The client can safely publish the value of \(encrypted \,\, label\) without revealing the secret or the \(feature \, label\). The \(encrypted \, label\) value is used server-side to count the subgroups.
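The two derivations above can be sketched in a few lines. This is an illustrative sketch only: SHA-256 stands in for the generic hash of choice, and the value of n is a hypothetical placeholder:

```python
import hashlib


def hash_hex(data: str) -> str:
    # SHA-256 stands in for the generic hash function of choice.
    return hashlib.sha256(data.encode()).hexdigest()


# Hypothetical value: the private number n is obtained from the framework
# initializer and shared by all workers.
n = "4821907"
feature_label = "Female"

# ZKP client password: hash(feature_label | n), where | is concatenation.
secret = hash_hex(feature_label + n)

# Publicly sharable label, joined in the opposite order: hash(n | feature_label).
encrypted_label = hash_hex(n + feature_label)

# Publishing encrypted_label reveals neither n nor the feature label, yet
# equal labels hash equally across workers, which is what lets the server
# count subgroup members without reading the labels.
assert secret != encrypted_label
```

Because every worker uses the same n, two workers with the same feature label produce the same encrypted label, so the server-side count remains consistent.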

Worker Registration and Authentication. A server S, i.e. the verifier, and a client c, i.e. the prover, are such that c can prove to S that a given condition is true while sharing no information beyond the fact that the condition is true. The server chooses a password \(S_{password}\) and the client chooses a secret \(c_{secret}\), e.g. the value of its ethnic group, which it does not want to share with S. Based on [3], S and c respectively choose the following public parameters: an elliptic curve \(S_{curve}, c_{curve}\) with elliptic curve generator points \(S_g, c_g\), a hash function \(S_{hash}, c_{hash}\), and a relatively large random number \(S_{salt}, c_{salt}\). In addition, the ZKP server and client produce random private variables, \(S_n, c_n\) respectively, used to compute a specific point on the elliptic curve. Using these settings, it is possible to create a signature, i.e. a token, of the form \(token = g \times hash(secret | salt) \,\text {mod}\, n\), which can be shared publicly while revealing no information about the secrets. After c sends its signature to S, the latter can subsequently sign the received token, publish the newly signed token, and retain the public client parameters, i.e. registration. This way, the server can verify whether a further token comes from the same client that used the same server signature in the past, i.e. authentication.
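The registration/authentication mechanics can be illustrated with a non-interactive Schnorr proof. The sketch below deliberately substitutes a toy multiplicative group mod a small safe prime for the paper's elliptic curve, so the arithmetic stays inspectable; the parameters are not secure and the function names are our own:

```python
import hashlib
import secrets as rnd

# Toy group parameters (NOT cryptographically secure): safe prime p = 2q + 1,
# with g = 4 generating the order-q subgroup of squares mod p.
p, q, g = 2879, 1439, 4


def h(data: str) -> int:
    return int(hashlib.sha256(data.encode()).hexdigest(), 16)


def make_token(secret: str, salt: str) -> tuple[int, int]:
    """Registration: derive the private exponent x from hash(secret | salt);
    the public token y = g^x mod p reveals nothing about the secret."""
    x = h(secret + "|" + salt) % q
    return x, pow(g, x, p)


def prove(x: int, y: int) -> tuple[int, int]:
    """Authentication: non-interactive Schnorr proof of knowledge of x."""
    r = rnd.randbelow(q - 1) + 1
    t = pow(g, r, p)             # commitment
    c = h(f"{g}|{y}|{t}") % q    # Fiat-Shamir challenge over the transcript
    s = (r + c * x) % q          # response
    return t, s


def verify(y: int, t: int, s: int) -> bool:
    """The verifier checks g^s == t * y^c without ever learning x."""
    c = h(f"{g}|{y}|{t}") % q
    return pow(g, s, p) == (t * pow(y, c, p)) % p


# A client proves knowledge of the secret behind its registered token:
x, y = make_token("Ethnicity:Asian", salt="public-salt")
t, s = prove(x, y)
assert verify(y, t, s)
# A tampered response does not verify:
assert not verify(y, t, (s + 1) % q)
```

A production deployment would use an elliptic-curve group and standardized parameters, as the paper's references [3, 8] prescribe; the verification equation is structurally the same.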

ZKP Framework Initializer. We propose a trusted external service in charge of generating one client prototype for each of the q possible combinations of feature names and values. The proposed framework initializer has the duty of coordinating with the ZKP server to register the client prototypes and generate the q authentication tokens. This can be achieved only if all the clients share the same private number n, which allows them to sign the server tokens to generate proofs. For this reason, all the workers must connect to this service and get the value of n prior to authenticating to the ZKP server. This results in an additional step in the FL process, but the framework allows it to be done asynchronously. The initialization process is described in Fig. 1a.

Federated Server. A self-balancing federated server must be able to discern updates based on the subgroup the client belongs to. After ZKP authentication, the server estimates whether the count of subgroups would result in imbalance and, under this circumstance, rejects the update. The server may identify workers with an identification number \(w_{ID}\). This allows the server to organize rejected workers into queues and efficiently select workers for further balanced updates. For simplicity, the presented model executes synchronous FL, meaning that the server processes updates one at a time. The server retains a dictionary of queues, used to store examples belonging to different subgroups. Since the workers have one or more hexadecimal hashed labels representing the subgroups, these values can be summed to create the index of a hash table used to access the specific subgroup queue. Given the count of subgroups, it is possible to check whether an update will keep the model balanced.
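The queue-and-count mechanism above can be sketched as follows. This is a minimal illustration under our own assumptions: the balancing rule (accept only if the subgroup would not exceed the smallest count by more than a tolerance) and all names are hypothetical, as the paper does not fix a concrete acceptance criterion:

```python
from collections import defaultdict, deque


class SelfBalancingServer:
    """Hypothetical sketch of the self-balancing acceptance rule."""

    def __init__(self, subgroup_keys, tolerance: int = 1):
        # Subgroup keys are known from registration, so counts start at 0.
        self.counts = {k: 0 for k in subgroup_keys}
        self.queues = defaultdict(deque)  # subgroup key -> waiting worker IDs
        self.tolerance = tolerance

    @staticmethod
    def subgroup_key(encrypted_labels) -> int:
        # The k hexadecimal label hashes are summed to index the queue table.
        return sum(int(label, 16) for label in encrypted_labels)

    def try_accept(self, worker_id: int, encrypted_labels) -> bool:
        """Accept the update only if it keeps the subgroup counts balanced;
        otherwise queue the worker to be rescheduled later."""
        key = self.subgroup_key(encrypted_labels)
        if self.counts[key] - min(self.counts.values()) >= self.tolerance:
            self.queues[key].append(worker_id)
            return False
        self.counts[key] += 1
        return True


# Two subgroups identified by their (illustrative) encrypted label hashes:
fem, mal = ["a3f1"], ["b2e9"]
keys = [SelfBalancingServer.subgroup_key(grp) for grp in (fem, mal)]
srv = SelfBalancingServer(keys)
assert srv.try_accept(1, fem)        # first Female update accepted
assert not srv.try_accept(2, fem)    # a second would unbalance: queued
assert srv.try_accept(3, mal)        # Male update restores balance
assert srv.try_accept(4, fem)        # Female can now catch up again
```

Workers queued this way can later be polled from the most under-represented subgroup, which is where the priority queue mentioned above comes in.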

Federated Worker. The federated worker is responsible for training the distributed model and providing weight updates to the server. Workers retain a number k of ZKP clients equal to the number of feature names present in features. Workers maintain a data structure to store the parameters required for model training, and a local copy of the learning model. A worker can retain a list of pairs of training features and ground truth, X and y respectively. Additionally, workers retain a dictionary of the secret feature names and labels for the subgroup count. A generic Z-Fed worker must load the model received from the server, propagate the X, y pairs through the model, calculate the loss and update the weights, and send the updated weights and subgroup counts to the server. The components of Z-Fed are shown in Fig. 1b.
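A worker's training round can be sketched as below. The model method names (load_parameters, fit, get_parameters) and the ZKP client attribute are assumptions for illustration, not the paper's API:

```python
class FederatedWorker:
    """Hypothetical sketch of a Z-Fed worker's single training round."""

    def __init__(self, worker_id, model, zkp_clients, X, y):
        self.worker_id = worker_id
        self.model = model                # local copy of the learning model
        self.zkp_clients = zkp_clients    # k clients, one per feature name
        self.X, self.y = X, y             # local training pairs

    def training_round(self, server_params):
        # 1) Load the model state received from the server.
        self.model.load_parameters(server_params)
        # 2) Propagate X through the model, measure the loss against y,
        #    and update the local weights.
        loss = self.model.fit(self.X, self.y)
        # 3) Return the weight update together with the encrypted subgroup
        #    labels the server needs for its ZKP count.
        labels = [c.encrypted_label for c in self.zkp_clients]
        return self.model.get_parameters(), labels, loss
```

Note that only model parameters, encrypted labels, and a loss value leave the worker; the raw (X, y) data never does, consistent with the FL principles in Sect. 1.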

Fig. 1.
figure 1

Design of the Z-Fed framework

The workflow for the initialization, group counting, queue management, data augmentation, and model training of Z-Fed is as follows: 1) The framework initializer generates a random private number n and uses features to create as many client prototypes as there are population subgroups; 2) Asynchronously, the server can instantiate the ML model and prepare the weights; 3) Once the client prototypes are ready, the framework initializer can request the server to produce the required authentication tokens; 4) Workers are initialized and updated using the client prototypes; from this moment on, they can retrieve authentication tokens from the server, authenticate, and receive the ML model for FL; 5) The server authenticates workers and uses their updates to train the global ML model; 6) Rejected workers are organized into a structure of queues to reschedule training efficiently. A diagram of the Z-Fed workflow is shown in Fig. 2.

Fig. 2.
figure 2

Z-Fed workflow diagram: initialization, registration, data augmentation with population proportion analysis, federated model training, and re-balancing of workers by ZKP count of subgroups.

5 Evaluation

In this paper, we focus mainly on: 1) the difference in the rate of favorable outcomes for unprivileged groups with respect to privileged groups, i.e. the statistical parity difference (SPD) across subgroups, 2) the difference in the rate of true positive prediction outcomes between privileged and unprivileged groups, i.e. the equal opportunity difference (EPD) across subgroups, and 3) the difference in the probability of obtaining true positives and false positives between privileged and unprivileged groups, i.e. the equal odds difference (EOD).

Benchmark Measurements. To the best of our ability, no scientific research presenting fairness measurements against imbalanced UTKFace (UTK) datasets [12] could be found for benchmarking. The UTK dataset contains the records of 23,706 persons, providing their age, ethnicity, gender, and a black-and-white picture. An exploratory study was conducted based on the following hypothesis: in a federated environment, it is possible to measure bias using EPD, EOD, and SPD if the training dataset is class imbalanced.

To assess the influence of imbalanced training data in FL, we trained ANNs with an up-sampled UTK dataset and measured fairness metrics afterwards. We used face images to predict four age ranges, i.e. 0 to 9, 10 to 19, 20 to 29, and 30 to 39, considering four ethnic groups, i.e. Asian, Black, Indian, and White, and two gender groups, i.e. Female and Male. Ethnic and gender groups can be combined in all 8 possible ways to form subgroups.

Imbalanced Datasets. A simple way to create class inequity is to define a privileged (PR) class that is over-represented with respect to the other, unprivileged classes. Four different datasets are built, each choosing one ethnic group as privileged, with the class proportion distributed as follows: 85% of the samples belong to the privileged group and the remaining 15% of the samples are equally split among the rest of the unprivileged ethnic groups. All the datasets have the gender and age range features balanced. The datasets described previously will be further identified as ASIAN-PR, BLACK-PR, INDIAN-PR, and WHITE-PR. In addition, an ethnic-gender class-balanced dataset (BAL) is set up for training, evaluation, and comparison.
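The 85/15 construction can be reproduced with a few lines of sampling code. The function below is a hypothetical sketch (our own names and signature), not the paper's dataset builder:

```python
import random


def make_privileged_split(samples, privileged, groups, size, pr_share=0.85):
    """Build an imbalanced dataset: `samples` maps each ethnic group to its
    records; pr_share of the output comes from the privileged group and the
    remaining share is split evenly across the unprivileged groups."""
    n_pr = int(size * pr_share)
    unprivileged = [grp for grp in groups if grp != privileged]
    n_each = int(size * (1 - pr_share)) // len(unprivileged)
    data = random.sample(samples[privileged], n_pr)
    for grp in unprivileged:
        data += random.sample(samples[grp], n_each)
    random.shuffle(data)
    return data


# Example: a synthetic stand-in for UTK records builds ASIAN-PR at size 400,
# giving 340 Asian samples and 20 samples for each unprivileged group.
random.seed(0)
groups = ["Asian", "Black", "Indian", "White"]
samples = {g: [(g, i) for i in range(1000)] for g in groups}
asian_pr = make_privileged_split(samples, "Asian", groups, size=400)
```

The same helper, called once per ethnic group, yields the four PR datasets; omitting the privileged share and sampling evenly yields BAL.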

ML Model Architecture. The ANN has the following fully-connected layer (FCL) structure: \(2304 \times 96 \times 4\) neurons plus one bias neuron per FCL, and uses a sigmoid activation function and a mean square error (MSE) loss function. The ANN model has a total of 884,736 training parameters, i.e. weights \(\textbf{w}\), and achieves an average accuracy of 50.8% with a variance of 1.07% after fitting 16,000 samples in one epoch with \(\eta = 0.025\) on UTK.

FL Settings. Every worker holds a sample of size one, and the FL framework is set up to compute one epoch per training cycle. In these settings, the model showed an average SPD of 1.86%, EPD of 4.6%, and EOD of 1.84% on BAL, and we measured an SPD of 15.02%, EPD of 15.01%, and EOD of 5.17% on ASIAN-PR. In addition, the model shows a negligible difference in average absolute EOD on both BAL and WHITE-PR, while showing a flat slope on BAL and significant growth of inequity on ASIAN-PR during the model update rounds, as shown in Fig. 3. Moreover, we tested the accuracy of the ANN against each specific ethnic group, measuring the variance among subgroups. Compared with an accuracy variance of 0.09% on BAL, the ANN shows a subgroup accuracy variance of 3.22% on ASIAN-PR, meaning that different treatment is more likely in the case of imbalanced data. Figure 3 shows the equality scores of the ANN during training.

Fig. 3.
figure 3

Measure of equality in terms of SPD, EPD, and EOD on different balanced and imbalanced datasets.

The settings of the experiments performed involve multiple unprivileged classes and one privileged class. This requires calculating the SPD, EPD, and EOD metrics once for each unprivileged class, with respect to the privileged class. Since the purpose of Z-Fed is to mitigate the effect of imbalanced data in an FL environment, we decided to treat both kinds of discriminatory behavior, i.e. favoring privileged groups and favoring unprivileged groups, with the same importance. For a privileged class PR and l unprivileged classes \(UPR_{i}\), with \(i = 1, \dots , l\), we calculate the SPD, EPD, and EOD values l times with respect to PR to obtain a fine-grained measurement of equity. These evaluations can be expensive and difficult to interpret in the presence of a high number of different subgroups, considering e.g. the possible combinations of ethnicity, gender, age, etc. For this reason, we consider a more convenient absolute value of equity |m| such that \(0 \le |m| \le 1\) and present the average of all the values obtained from the l unprivileged subgroups. To summarize, for each measurement m, across l unprivileged groups, the absolute multi-class equity score is: \( \sum _{i=1}^{l} |m_i|/l \). In this paper, we refer to the absolute multi-class measurements of statistical parity difference, equal opportunity difference, and equal odds difference as SPD, EPD, and EOD, respectively.
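The absolute multi-class score is a one-line computation; the example values below are illustrative, not measurements from the paper:

```python
def multi_class_score(measurements):
    """Absolute multi-class equity score: the mean of the absolute per-group
    differences |m_i| over the l unprivileged groups (see the formula above)."""
    return sum(abs(m) for m in measurements) / len(measurements)


# Example: per-group SPD of l = 3 unprivileged subgroups, each measured
# against the privileged class. Signs record whether the privileged or the
# unprivileged group is favored; the score weighs both directions equally.
spd_per_group = [0.12, -0.05, 0.04]
score = multi_class_score(spd_per_group)  # (0.12 + 0.05 + 0.04) / 3 = 0.07
```

Taking absolute values before averaging prevents opposite-signed disparities from cancelling out, which is exactly the "same importance" treatment described above.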

We used the datasets and the results of the experiment described in Sect. 2 as a baseline. In the Z-Fed framework, support for self-balancing learning can be arbitrarily disabled for testing purposes. We used the imbalanced datasets described above and multiple instances of the same learning model for each experiment, and ran training sessions on Z-Fed in order to obtain: 1) the Z-Fed model trained with highly imbalanced classes and self-balancing mode disabled, denoted as imbalanced, and 2) the Z-Fed model trained with the same highly class-imbalanced dataset and self-balancing mode enabled, denoted as rebalanced. The self-balancing Z-Fed is set up to perform multiple training epochs on under-represented groups to counteract their relatively small number of examples in the dataset. As in the FL experiment in Sect. 2, we tested the multiple instances of the same learning model on Z-Fed, using the face images as training features and predicting the age ranges. It is important to point out that the feature used for creating imbalanced data, i.e. ethnicity, is not a training feature, meaning that its influence on predictions is indirect. The age range chosen as the feature to predict is balanced in each dataset, meaning that the four age ranges 0–9, 10–19, 20–29, and 30–39 each have a proportion of 25% ± 1% in every experiment. The test sets used to measure the equity scores in the imbalanced Z-Fed experiments were sampled maintaining the original class proportions of the privileged and unprivileged subgroups. To test the performance of the rebalanced experiments we used a class-balanced test set; this decision was taken to respect the proportions of the balanced training set. We measured the SPD, EOD, and EPD for each of the four experiments; the results are presented in Table 1.
By analyzing the proportions of the population groups, Z-Fed is able to request more training epochs from workers belonging to under-represented classes. This results in a larger number of training updates for the rebalanced experiments. In terms of SPD, Z-Fed successfully reduces the class bias in ASIAN-PR, INDIAN-PR, and WHITE-PR by 79.3%, 77.79%, and 32.95% respectively. Z-Fed produces a small SPD increment of 5.63% in BLACK-PR, meaning that the overall accuracy of the rebalanced learning model tends to favor either privileged or unprivileged groups in this particular experiment. The EPD measures show a considerable improvement in fairness in all the experiments, ASIAN-PR, BLACK-PR, INDIAN-PR, and WHITE-PR, with a reduction in opportunity disparity of 80.8%, 16.89%, 80.14%, and 36.34% respectively. The proportions of true positive and false positive predictions improve considerably with the use of Z-Fed. The EOD measurements also show a notable improvement in fairness across all the experiments. In ASIAN-PR, BLACK-PR, INDIAN-PR, and WHITE-PR, the odds disparity was reduced by 81.46%, 23.02%, 81.2%, and 39.97% respectively. The true positive rate within privileged and unprivileged groups is considerably improved by the use of Z-Fed.

Table 1. Z-Fed measurements of SPD, EPD, and EOD

6 Conclusions

FL is a promising ML method that supports data privacy. However, we show how imbalanced data leads to disparity (unfairness) on the UTK dataset. The proposed Z-Fed framework is able to mitigate FL bias by reducing disparities without compromising privacy. We show that ZKP enables counting the number of population samples while keeping track of the proportions of subgroups, e.g. ethnicity and gender. The subgroup proportions can be used to rebalance the FL samples and augment ML data, achieving an increase in fairness in terms of three measures: statistical parity difference, equal odds difference, and equal opportunity difference. On average, Z-Fed improves the EPD by 53.54%, the EOD by 56.41%, and the SPD by 46.1% on imbalanced UTK datasets.