To Be Forgotten or To Be Fair: Unveiling Fairness Implications of Machine Unlearning Methods

The right to be forgotten (RTBF) is motivated by the desire of people not to be perpetually disadvantaged by their past deeds. To this end, data deletion needs to be deep and permanent: the data must be removed not only from storage but also from trained machine learning models. Researchers have proposed machine unlearning algorithms which aim to erase specific data from trained models more efficiently. However, these methods modify how data is fed into the model and how training is done, which may subsequently compromise AI ethics from the fairness perspective. To help software engineers make responsible decisions when adopting these unlearning methods, we present the first study on machine unlearning methods to reveal their fairness implications. We designed and conducted experiments on two typical machine unlearning methods (SISA and AmnesiacML) along with a retraining method (ORTR) as baseline, using three fairness datasets under three different deletion strategies. Experimental results show that under non-uniform data deletion, SISA leads to better fairness compared with ORTR and AmnesiacML, while initial training and uniform data deletion do not necessarily affect the fairness of all three methods. These findings expose an important research problem in software engineering, and can help practitioners better understand the potential trade-offs on fairness when considering solutions for the RTBF.


Introduction
Machine learning (ML) systems play an important role in high-stakes domains. For example, ML is used to identify human faces in images and videos [1], recommend products to customers [2], and recognize criminals accurately [3]. ML has been called software 2.0 because its behaviours are not written explicitly by programmers, but instead are learned from large datasets [4].
When ML software learns about individuals, it uses datasets collected about them. This data contains a broad range of information that may be used to identify individuals, such as personal emails, credit card numbers, and employee records. Governments or data subjects may sometimes ask ML service providers to remove sensitive information from their datasets for security or privacy purposes or for regulatory requirements. For example, Clearview AI, a facial recognition company owning more than 20 billion images, was requested by France's Commission Nationale Informatique et Libertés to delete data due to a data protection law. In 2014, the Court of Justice of the European Union ordered Google, a multinational technology company, to remove links to sensitive personal data from its internet search results. Later on, Europol, the European Union Agency for Law Enforcement Cooperation, was asked to delete individuals' data having no criminal activity. Such demands are expected to grow in the future as regulation and privacy awareness increase.
The "right to be forgotten" (RTBF) is covered in legislation in different regions, such as the General Data Protection Regulation (GDPR) in the European Union [5], the California Consumer Privacy Act (CCPA) in the United States [6], and the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada [7]. These give data subjects, i.e., service users, the right to request the deletion of their personal data and, to some extent, get rid of their past [8]. When ML service providers receive such requests, they have to remove the personal data from the training set as well as update ML models to satisfy legislative purposes. Moreover, the data deletion is supposed to be deep and permanent due to the prime purpose of this right, exposing a key research challenge in various ML applications [9].
Researchers have proposed machine unlearning approaches to enable the RTBF to be efficiently implemented when constructing ML models. Specifically, machine unlearning is the problem of making a trained ML model forget the impact of one or multiple data points in the training data. As ML models capture the knowledge learned from data, it is necessary to erase what they have learned from the deleted data to fulfill the RTBF requirements. A naïve strategy is to retrain ML models from scratch by excluding the deleted data from the training data. However, this process may incur significant computational costs and may be practically infeasible [10]. Machine unlearning aims to avoid the large computational cost of fully retraining ML models from scratch and attempts to update ML models to enable the RTBF.
In recent years, machine unlearning has been extensively investigated to address these problems [11][12][13][14][15][16]. There are two main types of machine unlearning approaches: exact machine unlearning and approximate machine unlearning. The exact machine unlearning approach ensures that the deleted data has no impact on the updated ML model by totally excluding it from the training set, while the approximate machine unlearning approach attempts to update the trained ML model's weights to remove the deleted data's contribution from the trained ML model.
Current machine unlearning research focuses on efficiency and RTBF satisfaction, but overlooks many other critical AI properties, such as AI fairness. AI fairness is a non-functional property of ML software. It concerns algorithmic bias in ML models and whether they are biased toward any protected attribute classes, such as race, gender, or familial status. There is a rich literature on AI fairness [17][18][19][20][21][22][23][24]. For example, Biswas and Rajan [22] conducted an empirical study, employing 40 models collected from Kaggle, to evaluate the fairness of ML models. The results help AI practitioners improve fairness when building ML software applications. Zhang and Harman [21] later presented another empirical study on the influence of feature size and training data size on the fairness of ML models. It suggests that when the feature size is insufficient, ML models trained on a large training dataset exhibit more unfairness than those trained on a small training dataset. This work also assists practitioners in ensuring ML models' fairness in practice.
To the best of our knowledge, there is no prior work studying the fairness implications of machine unlearning methods. However, ignoring fairness in the construction of machine unlearning systems will adversely affect people in protected attribute groups such as race, gender, or familial status. For this reason, ML systems built on these machine unlearning methods may violate anti-discrimination legislation, such as the Civil Rights Act [25]. In this paper, we conduct an empirical study to evaluate the fairness of machine unlearning models to help AI practitioners understand how to build fair ML systems that satisfy the RTBF requirements. We aim to answer the following research questions.
RQ1: (Initial training) What are the impacts of machine unlearning methods on fairness before the "right to be forgotten" requests arrive?
RQ2: (Uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data has a uniform distribution?
RQ3: (Non-uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data has a non-uniform distribution?
To conduct the empirical study, we employ two popular machine unlearning methods, i.e., SISA and AmnesiacML, on three AI fairness datasets. SISA (Sharded, Isolated, Sliced, and Aggregated) [13] and AmnesiacML [16] are an exact machine unlearning method and an approximate machine unlearning method, respectively. The three datasets, i.e., Adult, Bank, and COMPAS, have been widely used to evaluate the fairness of machine learning systems on different tasks, i.e., income prediction, customer churn prediction, and criminal detection. We use four different evaluation metrics, i.e., disparate impact, statistical parity difference, average odds difference, and equal opportunity difference, to measure the fairness of machine unlearning methods. We then analyze the results to answer the research questions.
The main contributions of our paper are as follows:
• We designed and conducted an empirical study to evaluate the impacts of machine unlearning on fairness. Specifically, we employed two well-recognized machine unlearning methods on three AI fairness datasets and adopted four evaluation metrics to measure the fairness of machine unlearning systems.
• Our results show that adopting machine unlearning methods does not necessarily affect fairness during initial training. When the data deletion is uniform, the fairness of the resulting model is hardly affected. When the data deletion is non-uniform, SISA leads to better fairness than other methods. Through these findings, we shed light on the fairness implications of machine unlearning and provide knowledge for software engineers about the potential trade-offs when selecting solutions for the RTBF.

Background
This section provides the background knowledge, including machine unlearning methods and AI fairness metrics.

Machine Unlearning Methods
The classification problem is a type of task that many machine learning systems aim to solve and in which machine unlearning can be leveraged. Given a dataset of input-output pairs D = {(x, y)} ⊆ X × Y, we aim to construct a prediction function F_D : X → Y that maps these inputs to outputs. The prediction function F_D is often learned by minimizing the following objective function:

min_{F_D} Σ_{(x,y)∈D} L(F_D(x), y) + λ Ω(F_D),

where L(·), Ω(F_D), and λ are the empirical loss function, the regularization function, and the trade-off value, respectively. Let D_r and D_u represent the retained dataset and the deleted dataset, respectively. D_r and D_u are mutually exclusive, i.e., D_r ∩ D_u = Ø and D_r ∪ D_u = D. When the "right to be forgotten" (RTBF) requests arrive, a machine unlearning system needs to remove D_u from D and update the prediction function F_D. Machine unlearning attempts to achieve a model F_{D_r}, as if trained only on the retained dataset D_r, without incurring a significant computational cost. Hence, the model F_{D_r} is often used to evaluate the performance of machine unlearning methods.
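As a minimal sketch of the setup above (variable names are illustrative, not from the paper), the retain/forget partition behind the naïve retraining baseline can be expressed as:

```python
import numpy as np

def split_retain_forget(X, y, forget_idx):
    """Partition a dataset D into the deleted set D_u (indices listed
    in forget_idx) and the retained set D_r (all remaining instances),
    so that D_r and D_u are disjoint and their union is D."""
    mask = np.zeros(len(X), dtype=bool)
    mask[forget_idx] = True
    D_u = (X[mask], y[mask])
    D_r = (X[~mask], y[~mask])
    return D_r, D_u

# Naive RTBF handling would retrain a model from scratch on D_r only.
X = np.arange(10).reshape(10, 1)
y = np.arange(10) % 2
(D_r_X, D_r_y), (D_u_X, D_u_y) = split_retain_forget(X, y, forget_idx=[2, 7])
```

The cost of this baseline is a full retraining pass over D_r, which is exactly what machine unlearning methods try to avoid.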
There are two main types of machine unlearning approaches: exact machine unlearning and approximate machine unlearning. We present a typical method for each approach. Specifically, SISA and AmnesiacML are selected to represent the exact machine unlearning approach and the approximate machine unlearning approach, respectively. These methods, adapted for deep learning models, are efficient and effective in dealing with RTBF requests. We briefly describe them in the following subsections.

SISA [13]
This is an exact machine unlearning method aiming to reduce the computational cost of the retraining process by employing a data partitioning technique. Figure 1 illustrates the training process: the training data is divided into multiple shards, each shard is further divided into slices, and a separate DL model is trained on each shard by gradually increasing the number of slices. Note that the parameters of each DL model are cached in storage after each slice. After finishing the training process, SISA contains multiple DL models. Finally, the output results are collected by employing a voting mechanism on the list of outputs of these DL models. When RTBF requests arrive, SISA automatically locates the shards and the slices containing the deleted data D_u. SISA then retrains the DL models of these shards from the particular cached stage, i.e., from the checkpoint saved before the slices containing the deleted data were put into the DL models. The prior probability is the probability of an event happening when we have a limited number of possible outcomes that occur equally [26]. Machine unlearning methods can easily improve their performance when the prior probability of data deletion from different groups is known. For example, wealthy families prefer to keep their privacy for safety purposes, so they are more likely to send RTBF requests than other people [27]. Another example is that people with a higher educational background are more likely to remove their personal information from the public domain [28].
There are two strategies for SISA to leverage the prior probability to speed up the training process, hence reducing the computational cost. The first strategy is to allocate the instances with a higher deletion probability into the same shards. This means the retraining process would happen on fewer shards compared with randomly allocating the instances. The second strategy is to allocate the instances with a higher deletion probability to the last slices. In this case, the retraining process would happen on fewer slices compared with randomly allocating the instances. Figure 2a and Figure 2b briefly describe the first and second strategies, respectively.
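The shard/slice bookkeeping described above can be sketched as follows (a simplified illustration with hypothetical helper names; the real SISA implementation additionally caches model checkpoints per slice):

```python
import numpy as np

def make_shards(indices, n_shards, n_slices):
    """Partition instance indices into shards, then each shard into
    slices, mirroring SISA's Sharded/Sliced data layout."""
    shards = np.array_split(indices, n_shards)
    return [np.array_split(shard, n_slices) for shard in shards]

def affected_units(layout, deleted):
    """Locate the (shard, slice) pairs that contain deleted indices.
    Only those shards must be retrained, and only from the checkpoint
    cached just before the earliest affected slice."""
    deleted = set(deleted)
    hits = []
    for si, shard in enumerate(layout):
        for ci, sl in enumerate(shard):
            if deleted & set(sl.tolist()):
                hits.append((si, ci))
    return hits

# 1,000 instances in a 5h5c layout: 5 shards of 200, slices of 40 each.
layout = make_shards(np.arange(1000), n_shards=5, n_slices=5)
hits = affected_units(layout, deleted=[3])  # instance 3 sits in shard 0, slice 0
```

Because only the affected shard retrains, and only from the affected slice onward, deletion cost shrinks as the prior probability of deletion is used to co-locate likely-deleted instances.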

AmnesiacML [16]
This is an approximate machine unlearning method. AmnesiacML makes use of the characteristics of batch training in neural networks. During the training process, the parameter updates of a DL model for each batch are recorded and kept in storage. The training process is expressed as follows:

θ_final = θ_initial + Σ_{e=1}^{E} Σ_{b=1}^{B} Δθ_{e,b},

where θ_initial denotes the initial parameters of the DL model, and E and B represent the total number of epochs and the total number of batches in each epoch, respectively. The parameter updates are stored as Δθ_{e,b}. When we receive the RTBF requests, AmnesiacML automatically locates the batches containing the instances that need to be deleted. After that, the DL model's parameters are rolled back to remove the impact of the deleted data on the trained DL model as follows:

θ_unlearned = θ_final − Σ_{(e,b)∈SB} Δθ_{e,b},

where SB is the set of batches containing the deleted data. A strategy for AmnesiacML is easily adopted when we know the prior probability of deletion for different groups. For example, instances with a higher prior probability of being removed can be placed into the same batches; hence, the process of rolling back parameters in the DL model will require less computational cost.
Similar to SISA, AmnesiacML shows its efficiency and effectiveness on machine unlearning problems. However, it does not guarantee that the impact of the deleted data is completely removed from the updated DL model. The open-source repository of AmnesiacML can be found at https://github.com/lmgraves/AmnesiacML
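The record-and-rollback mechanism can be illustrated with the following sketch (plain NumPy rather than PyTorch for self-containment; the random "gradient steps" and the choice of affected batches are stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(3)   # theta_initial for a toy 3-parameter model
batch_deltas = {}     # (epoch, batch) -> stored parameter update

# Training: after each batch, record the parameter update Delta-theta
# so that its contribution can be undone later.
for e in range(2):        # E = 2 epochs
    for b in range(4):    # B = 4 batches per epoch
        delta = rng.normal(size=3) * 0.01  # stand-in for a gradient step
        theta = theta + delta
        batch_deltas[(e, b)] = delta

# Unlearning: subtract the stored updates of every batch that
# contained deleted instances (hypothetically, batch 1 of each epoch).
affected = [(0, 1), (1, 1)]
theta_unlearned = theta - sum(batch_deltas[k] for k in affected)
```

The storage cost is one parameter-sized delta per batch, which is the price AmnesiacML pays to avoid retraining; note the rollback only approximates a model never trained on the deleted data, since later updates were still computed from parameters influenced by it.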

AI Fairness Metrics
The goal of AI fairness is to correct machine learning (ML) models under the assumption that models should not be biased between any protected classes, e.g., race, sex, or familial status. Each protected class partitions a population into different groups, such as a privileged group and an unprivileged group. In this section, we employ four different fairness metrics, i.e., disparate impact, statistical parity difference, average odds difference, and equal opportunity difference, to evaluate the impact of machine unlearning methods on fairness. These metrics are widely adopted for measuring the fairness of ML systems [17][18][19][20][21][22][23][24].
Let x_s ∈ {0, 1} indicate the binary label of a protected class (x_s = 1 for the privileged group). Let ŷ ∈ {0, 1} be the predicted outcome of an ML classification model (ŷ = 1 for the favourable decision). Let y ∈ {0, 1} be the binary classification label (y = 1 is favourable). We present the four fairness evaluation metrics as follows. Disparate impact (DI) [32] measures the ratio of the favourable outcome rate of the unprivileged group (x_s = 0) to that of the privileged group (x_s = 1).
Statistical parity difference (SPD) [33] is the difference in the favourable outcome rate between the unprivileged group (x_s = 0) and the privileged group (x_s = 1).
Average odds difference (AOD) [34] calculates the average of the differences in true positive rate and false positive rate between the unprivileged and privileged groups.
Equal opportunity difference (EOD) [34] evaluates the difference in true positive rate between the unprivileged and privileged groups.
The difference-based metrics, i.e., SPD, AOD, and EOD, range from -1 to 1 and attain the greatest fairness of the classification model when their values are 0. DI is a non-negative ratio and achieves the greatest fairness when it equals 1.
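Under the definitions above, the four metrics can be computed as in the following sketch (the toy labels and function names are illustrative; in practice libraries such as AIF360 provide these metrics):

```python
import numpy as np

def fairness_metrics(y_true, y_pred, x_s):
    """Compute DI, SPD, AOD, and EOD from binary labels, binary
    predictions, and a protected attribute (x_s == 1 is privileged)."""
    priv, unpriv = (x_s == 1), (x_s == 0)

    def rate(mask):                      # P(y_hat = 1 | group)
        return y_pred[mask].mean()

    def tpr(mask):                       # true positive rate in a group
        return y_pred[mask & (y_true == 1)].mean()

    def fpr(mask):                       # false positive rate in a group
        return y_pred[mask & (y_true == 0)].mean()

    di  = rate(unpriv) / rate(priv)
    spd = rate(unpriv) - rate(priv)
    aod = 0.5 * ((fpr(unpriv) - fpr(priv)) + (tpr(unpriv) - tpr(priv)))
    eod = tpr(unpriv) - tpr(priv)
    return di, spd, aod, eod

# A perfectly fair toy case: both groups receive identical treatment.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
x_s    = np.array([1, 1, 1, 1, 0, 0, 0, 0])
di, spd, aod, eod = fairness_metrics(y_true, y_pred, x_s)
```

In the perfectly fair case, DI evaluates to 1 and the three difference-based metrics evaluate to 0, matching the optima described above.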

Methodology
This section first describes our experimental design and setup.Then we briefly present the datasets, the data deletion strategies, and our evaluation metrics.

Experiment Design
Our empirical study starts by first collecting the benchmark fairness datasets. For each dataset, we preprocess and split it into training and testing datasets. The training dataset is then employed to train machine unlearning models. We use six evaluation metrics to measure the performance and fairness of these models. Figure 3 briefly presents an overview of our experimental design.
To identify the fairness datasets, we first refer to the work on fairness testing for machine learning models that employed six datasets, i.e., German Credit, Adult, Bank, US Executions, Fraud Detection, and Raw Car Rentals [35]. Among these datasets, only German Credit, Adult, and Bank are available. We also collect the Heart Disease dataset [36], concerning the presence of heart disease in patients, and the COMPAS dataset [37], aiming to predict the probability of criminals reoffending. In total, we acquire five datasets, i.e., German Credit, Adult, Bank, Heart Disease, and COMPAS, across various domains. As machine unlearning methods are efficient and effective on large datasets [13,16], we remove datasets that have fewer than 1,000 instances, namely German Credit and Heart Disease. Hence, three datasets, i.e., Adult, Bank, and COMPAS, are employed to evaluate the impacts of machine unlearning methods in our experiments.
We apply the same data preprocessing approach for all three datasets. Specifically, we employ the AI Fairness 360 toolkit [38], an open-source library for fairness metrics, to clean up invalid or missing values, transform categorical values into a one-hot encoding, and convert non-numerical binary values into a binary label (e.g., male: 1, female: 0). We further preprocess the datasets to employ them for fairness evaluation. Specifically, we specify the favourable label for the predicted outcome of our model. We also identify sensitive features (or protected classes) for the privileged and unprivileged groups. For example, in the Adult dataset, the prediction label, indicating whether a person has a high annual salary, is the favourable label. We define sex as a sensitive feature. We assume that a male often has a higher annual salary than a female; hence, regarding the sensitive feature sex, the male is put in the privileged group while the female is in the unprivileged group.
For each dataset, we shuffle and split it into the training dataset (80%) and the testing dataset (20%). We then feed the training dataset into our models.
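A minimal sketch of this shuffle-and-split step (the function name and seed are illustrative; the paper's pipeline additionally applies 5-fold cross-validation):

```python
import numpy as np

def shuffle_split(X, y, train_frac=0.8, seed=0):
    """Shuffle a dataset and split it into train/test partitions
    with the given training fraction (80/20 by default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    tr, te = idx[:cut], idx[cut:]
    return (X[tr], y[tr]), (X[te], y[te])

X = np.arange(100).reshape(100, 1)
y = np.arange(100) % 2
(X_tr, y_tr), (X_te, y_te) = shuffle_split(X, y)
```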
To conduct our experiments, we employ a multilayer perceptron (MLP), a simple feedforward network [39]. The MLP model includes an input layer, a hidden layer, and an output layer. We train the MLP model by optimizing a cross-entropy loss function [40]. The two machine unlearning methods, i.e., SISA and AmnesiacML, are built on top of the MLP model. A naïve approach of original training and retraining (denoted as ORTR) is also built on the MLP model as the baseline. We consider two experimental scenarios.
• Scenario 1: Before any "right to be forgotten" (RTBF) requests, what are the impacts of machine unlearning methods on fairness? In this setting, the training dataset is put into three different models, i.e., ORTR, SISA, and AmnesiacML (see Figure 3), to train these models. We then employ the testing dataset to evaluate the performance and fairness of these trained models.
• Scenario 2: When the RTBF requests arrive, what are the impacts of machine unlearning methods on fairness? In this setting, we employ data deletion strategies (see Figure 3) to remove instances from the training dataset. For each data deletion strategy, we compare the performance and fairness of ORTR with the two machine unlearning methods, i.e., SISA and AmnesiacML.
For each dataset, we apply 5-fold cross-validation and take the mean of the results. We conducted our experiments using an Nvidia T4 GPU with 16 GB of memory and an Intel Xeon Silver 4114 CPU with 12 GB of RAM. The OS is Debian 10.10 LTS 64-bit, and the machine learning framework is PyTorch.


Datasets
We conduct our experiments by employing three widely-used fairness datasets to evaluate the impacts of machine unlearning methods on fairness.These datasets are briefly described as follows.
• Adult [41]. This dataset is extracted from the 1994 Census Bureau database (https://www.census.gov/programs-surveys/ahs/data/1994.html). Its task is to predict whether a person earns over $50,000 USD per year. The dataset includes 48,842 instances and 14 features. The sensitive features for this dataset are sex and race.
• Bank [42]. The dataset is collected from marketing campaigns of a Portuguese banking institution. Its task is to predict whether a client will subscribe to a bank term deposit. The dataset contains 45,211 instances and 17 features. We use age as the sensitive feature for this dataset.
• COMPAS [37]. The dataset contains recidivism records, which are used to build a prediction system to forecast the possibility of a criminal defendant reoffending. The dataset has 7,215 instances and seven features. The sensitive features are defined as sex and race.
All the sensitive features are selected by following the previous work [21,22,24].

Data Deletion Strategies
To simulate the "right to be forgotten" (RTBF) requests, we adopt two data deletion strategies. Each strategy has several settings, presented as follows.
Uniform distribution. For this strategy, we assume that the deleted data has a uniform distribution, i.e., each instance has an equal probability of being removed from the training dataset. To select a range of proportions of deleted data, we leverage the work of Bertram et al. [43]. Specifically, we randomly remove 1%, 5%, 10%, and 20% of the training data.
Non-uniform distribution. For this strategy, we assume that the deleted data has a non-uniform distribution, i.e., instances have different probabilities of being removed from the training dataset. Some people have a higher probability of sending RTBF requests than others. For example, people who are from wealthy families or have a high educational background prefer to keep their sensitive information private for security and privacy purposes [27,28]. As these personal details are unavailable in our datasets, to better understand the fairness implications of non-uniformly deleted data, we consider two cases: one in which the people who request the RTBF are predominantly from privileged groups, and another in which they are predominantly from unprivileged groups.
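The two deletion strategies can be sketched as follows (the toy protected attribute and function names are hypothetical; the non-uniform case mirrors the simplified scenario of removing 50% of one group, described in the experiment design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
privileged = rng.integers(0, 2, size=n).astype(bool)  # toy protected attribute

def uniform_delete(n, fraction, rng):
    """Uniform strategy: every instance is equally likely to be removed."""
    k = int(fraction * n)
    return rng.choice(n, size=k, replace=False)

def group_delete(group_mask, fraction, rng):
    """Non-uniform strategy: remove a fraction of one group only
    (here, the privileged group)."""
    members = np.flatnonzero(group_mask)
    k = int(fraction * len(members))
    return rng.choice(members, size=k, replace=False)

del_uniform = uniform_delete(n, 0.05, rng)     # 5% of all instances
del_priv = group_delete(privileged, 0.5, rng)  # 50% of the privileged group
```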

Evaluation Metrics
We consider two types of evaluation metrics in our experiments: performance and fairness.
Performance measure. Before evaluating the fairness of models, we calculate their performance in terms of accuracy and F1 score.
• Accuracy: The ratio of correct predictions to the total number of predictions [44].
• F1 score: The harmonic mean of precision and recall.
Fairness measure. To measure the fairness of models, we adopt the four fairness metrics, i.e., disparate impact (DI), statistical parity difference, average odds difference, and equal opportunity difference, briefly described in Section 2.2. For simplicity of presentation and observation, we convert all the fairness metric values into their absolute values. As the disparate impact (DI) value differs in scale from the other fairness metrics, we use |1 - DI| to evaluate the fairness of our models. In this way, all four fairness metrics achieve the greatest fairness when their values equal 0.

Experiments
In this section, we provide results and insights from the experiments to answer our research questions.

RQ1: (Initial training)
What are the impacts of machine unlearning methods on fairness before the "right to be forgotten" requests arrive? Exact machine unlearning methods, such as SISA, modify how data is fed into machine learning models, which may affect the fairness of these models before any RTBF request is made, i.e., during initial training. This research question aims to understand the impact of machine unlearning methods on fairness during initial training. Specifically, we compare SISA with ORTR, the naïve approach built on an MLP model. Note that approximate machine unlearning methods, such as AmnesiacML, only update the ML models' parameters without modifying their architecture or data feeding. We therefore exclude AmnesiacML from this research question.
We evaluate the impact of SISA and ORTR on fairness across different numbers of shards (5, 10, 15, 20) and numbers of slices (1, 5). We execute the experiments on three different datasets, i.e., Adult, Bank, and COMPAS. For ease of observation, we denote Adult, Bank, and COMPAS as A, B, and C, respectively. The sensitive features Sex, Race, and Age are represented as S, R, and Y, respectively. The number of shards and the number of slices are represented as h and c, respectively. For example, given 1,000 instances, 5h5c means these instances are split into five shards, and each shard is then further split into five slices. In the end, each shard contains 200 instances and each slice includes 40 instances.
Figure 4a shows the fairness evaluation results of SISA initial training with 5/10/15/20 shards and one slice. The baseline is ORTR. We can see that for some datasets and features, the |1 − DI| value gets better as the number of shards increases, including B-Y and A-R, while for C-S and A-S the value gets worse as the number of shards increases. Similarly, for other metrics, the trends are not always one way as the number of shards increases. Although there are some tendencies within each dataset, overall across all datasets, we cannot identify any consistent fairness impact from the SISA method or its number of shards.
In terms of performance, there is degradation of less than 10% in accuracy for the Adult and Bank datasets, while there is no apparent degradation for the COMPAS dataset. This could be because the COMPAS dataset is much smaller than the Adult and Bank datasets and has fewer useful features, making it easier to converge and less likely to experience performance degradation from data partitioning.
The fairness evaluation results of SISA at initial training with five slices are shown in Figure 4c. Comparing the fairness between one slice and five slices, we find no noticeable difference across all fairness metrics. We make a similar observation on the performance metrics shown in Figure 4b and Figure 4d, and this is consistent with what was reported in the SISA paper [13].
During initial training, no significant fairness impacts are observed from using machine unlearning methods such as SISA. In addition, compared with ORTR, SISA suffers performance degradation on larger datasets.

RQ2: (Uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data has uniform distribution?
A uniform data deletion strategy assumes that every instance has an equal probability of being removed from trained models. In this research question, we want to explore how much these machine unlearning methods impact fairness when the deleted data is uniformly distributed.
For this research question, we employ a range of deletion rates from small to large (1%, 5%, 10%, 20%) chosen from the statistics [43]. For SISA, we apply its default setting (i.e., 5h1c). For AmnesiacML, we train its model according to the requirements in the paper [16].
Figure 5 presents the fairness results under various deletion proportions. We see that there is no clear trend indicating which method achieves better results across all datasets (i.e., Adult, Bank, and COMPAS) and sensitive features (Sex, Race, and Age). Figure 6 shows the difference in fairness before and after applying the deletion strategy. It indicates that AmnesiacML is the most prone to fairness loss under this deletion strategy, while SISA is the most robust. However, the difference in fairness between before and after data deletion is unclear on the Adult and COMPAS datasets. The main reason is that the deleted data is uniformly distributed, i.e., each instance has an equal probability of being removed from trained models, leading to similar fairness results in this setting. We also see that all methods have a relatively large variation in fairness on the Bank dataset. The reason is that this dataset is highly imbalanced compared to the other datasets, i.e., Adult and COMPAS. Specifically, among the 45,211 instances in the Bank dataset, only 963 instances (2.13%) are labeled as negative instances.
Figure 7 illustrates the performance under this data deletion strategy. It shows that the deleted data has minimal impact on the performance of trained models. As the deletion proportion is 1% - 20%, we believe the deleted data might be insufficient to cause non-trivial performance degradation.
Under uniform data deletion, fairness is not clearly affected by machine unlearning methods, while ORTR outperforms SISA and AmnesiacML on performance metrics.
RQ3: (Non-uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data has non-uniform distribution?
People from different groups have an equal right to send RTBF requests to remove their sensitive information, but they may exercise it with varied probabilities [28]. In this research question, we aim to understand the impacts of machine unlearning methods on fairness when the deleted data has a non-uniform distribution. The simplest way to conduct the experiments would be to delete data so that it follows a distribution similar to the percentage of each group (privileged or unprivileged) for each sensitive feature in the whole dataset. As our datasets are imbalanced on some features, this RTBF simulation strategy would very likely lead to empty groups. To overcome this problem, we simplify our scenario by removing data only from either the privileged group or the unprivileged group. Specifically, we remove 50% of the data of each group, making the potential impact on fairness more apparent. Note that we assume the prior deletion probability of a certain group (the privileged group or the unprivileged group) is known.
Figure 8a and Figure 8b present the results on data deletion from the privileged group and the unprivileged group, respectively. From the charts we can see that SISA with a sharding strategy (see Figure 2a) achieves the best |1 − DI| values for nine out of ten combinations. Figure 9 shows that SISA with a sharding strategy may also have fairness improvements after data deletion. The extent of the improvements varies across datasets, i.e., Adult, Bank, and COMPAS, and across sensitive features, i.e., Sex, Race, and Age. Furthermore, we plot the differences between SISA with and without the sharding strategy in Figure 10. Overall, fairness is likely to be improved across all metrics when the sharding strategy is applied. Moreover, such improvements are likely to happen on the datasets and sensitive features with more imbalanced distributions.
SISA with a slicing strategy (see Figure 2b) is also likely to outperform ORTR on |1 − DI|. However, it achieves lower performance compared to SISA with a sharding strategy. For ORTR, we observe that the fairness changes between before and after retraining are small. Similarly, AmnesiacML tends to be close to ORTR across all indicators.
Performance-wise, we observed no significant difference between before and after the data deletion, or between methods with and without the strategies applied. The changes in performance indicators are always less than 5%. The performance differences between methods are likely inherited from the methods themselves rather than escalated by unlearning strategies or distribution settings.
Under data deletion with a non-uniform distribution, SISA with a sharding strategy achieves better fairness. Performance shows no significant degradation after deletion with machine unlearning methods.

Discussion
Our research explores the fairness implications of machine unlearning methods and yields empirical observations about fairness under initial training, data deletion with a uniform distribution, and data deletion with a non-uniform distribution. We discuss these observations below.
Before any "right to be forgotten" requests arrive, we see no significant impact of machine unlearning methods, such as SISA, on fairness. The observations also indicate that SISA achieves lower performance on large datasets, such as Adult and Bank, in this setting.
When the deleted data follows a uniform distribution, there is no clear impact of machine unlearning methods on fairness. The observations also show that ORTR, a naive approach that retrains a model from scratch, outperforms SISA and AmnesiacML in terms of accuracy and F1 score on the large datasets, i.e., Adult and Bank.
When the deleted data follows a non-uniform distribution, SISA with a sharding strategy (see Figure 2a) is more likely to achieve better fairness than the other methods. Moreover, we see no significant performance difference between before and after the data deletion for the machine unlearning methods.
Threats to Validity

Internal validity
To perform our empirical study, we employed two machine unlearning methods on three AI fairness datasets. For the machine unlearning algorithms, we reused existing implementations from their open-source repositories. All three datasets are well-known and widely used by AI fairness researchers. We employed the AIF360 library to preprocess the datasets for fairness evaluation. We have carefully checked the code and data, but some errors may remain. Although some randomness was involved in the experiments, we tried to minimize this threat by conducting the experiments multiple times (5-fold cross-validation).

External validity
Threats to external validity concern the generalizability of the study. In our experiments, we used only three AI fairness datasets, collected from three tasks, i.e., income prediction, customer churn prediction, and criminal detection, with a total of five protected classes, two machine unlearning methods, and two data deletion strategies. These datasets, tasks, methods, and deletion strategies may not generalize beyond our study. However, as the datasets and methods are widely adopted in the AI fairness and machine unlearning research fields respectively, we believe this threat is limited. In the future, we plan to investigate more machine unlearning methods and AI fairness datasets.

Construct validity
Threats to construct validity concern whether the evaluation metrics measure what they are intended to measure. To minimize this threat, we employed multiple evaluation metrics that are widely used to measure fairness in machine learning models.

Related Work
This section introduces the work related to machine unlearning and AI fairness.

Machine Unlearning
Machine unlearning was first presented by Cao and Yang [11]. Its objective is to build a system that can remove the impact of a data point in the training data. Early works on machine unlearning focused on traditional machine learning (ML) models, i.e., support vector machines, linear classification, logistic regression, etc., by facilitating incremental and decremental learning techniques to efficiently retrain ML models after adding or removing multiple data points from the training set [46][47][48][49]. Since then, machine unlearning has been extensively studied to reduce the computational cost of retraining deep learning (DL) models [11][12][13][14][15][16]. Specifically, there are two main research approaches for employing machine unlearning in deep neural networks, i.e., exact machine unlearning and approximate machine unlearning.
The exact machine unlearning approach requires a new model to be trained from scratch after removing the deleted data from the training set. This approach ensures that the deleted data has no impact on the new model, as it is excluded from the training set. To make the retraining process more efficient, previous works [12,13] divided the training data into multiple disjoint shards and trained a DL model on each shard. Hence, when a request to remove data points arrives, only the models trained on the shards containing the removed data points need to be retrained. The exact machine unlearning approach necessitates changes to the DL architecture, making testing and maintaining the DL system challenging.
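The shard-and-retrain idea can be sketched as follows. This is a deliberately tiny illustration in the spirit of SISA [12], using a nearest-centroid classifier as the per-shard model; the class and method names, and the modulo shard assignment, are our own assumptions rather than the original implementation.

```python
import numpy as np

def _fit_centroids(X, y):
    # A deliberately tiny per-shard "model": one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

def _predict_centroids(centroids, X):
    d0 = np.linalg.norm(X - centroids[0], axis=1)
    d1 = np.linalg.norm(X - centroids[1], axis=1)
    return (d1 < d0).astype(int)

class ShardedEnsemble:
    """Sketch of shard-based exact unlearning: each constituent model sees
    only one disjoint shard, so a deletion retrains a single shard."""

    def __init__(self, n_shards=4):
        self.n_shards = n_shards

    def fit(self, X, y):
        # Assign each sample to one disjoint shard; train one model per shard.
        self.X, self.y = X.copy(), y.copy()
        self.assign = np.arange(len(y)) % self.n_shards
        self.models = [_fit_centroids(X[self.assign == s], y[self.assign == s])
                       for s in range(self.n_shards)]
        return self

    def forget(self, idx):
        # Exact unlearning: drop the point and retrain only its shard.
        s = self.assign[idx]
        keep = np.arange(len(self.y)) != idx
        self.X, self.y, self.assign = self.X[keep], self.y[keep], self.assign[keep]
        mask = self.assign == s
        self.models[s] = _fit_centroids(self.X[mask], self.y[mask])

    def predict(self, X):
        # Voting-based aggregation over the shard models.
        votes = np.stack([_predict_centroids(m, X) for m in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```

The design choice is the key point: because shards are disjoint, `forget` touches exactly one model, which is why the retraining cost drops roughly by the number of shards.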
The approximate machine unlearning approach starts with the trained DL model and attempts to update its weights so that the model is no longer affected by the data points removed from the training data. Izzo et al. [50] showed that linear time complexity can be achieved in machine unlearning by updating a projective residual of the trained DL models. Guo et al. [51] and Golatkar et al. [52] employed a Newton step on the model weights to eliminate the influence of removed data points. Graves et al. [16] later proposed an amnesiac unlearning method that stores a list of batches and their weight updates; hence, the DL model only needs to undo the weight updates of the batches containing the removed data points. The approximate machine unlearning approach is more computationally efficient than the exact machine unlearning approach. However, it is unclear whether the removed data points have been completely forgotten by the trained model.
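The amnesiac idea of recording and undoing per-batch weight updates can be sketched on a linear model trained with SGD. This is our own toy illustration, not the code of Graves et al. [16]; all names are assumptions.

```python
import numpy as np

class AmnesiacLinear:
    """Toy amnesiac unlearning: record the exact weight delta contributed
    by every training batch, and forget a batch by subtracting its delta."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.lr = lr
        self.batch_updates = {}  # batch id -> weight delta it contributed

    def train_batch(self, batch_id, X, y):
        # One least-squares SGD step; remember the exact delta applied.
        grad = X.T @ (X @ self.w - y) / len(y)
        delta = -self.lr * grad
        self.w += delta
        self.batch_updates[batch_id] = delta

    def forget_batch(self, batch_id):
        # Undo the stored update of the batch containing the removed data.
        self.w -= self.batch_updates.pop(batch_id)
```

Note the caveat that motivates the "approximate" label: undoing the most recent batch restores the earlier weights exactly, but undoing an older batch only removes its direct contribution, since every later gradient was computed from weights that batch had already influenced.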
Although machine unlearning methods have been comprehensively studied, their fairness implications have not been investigated in the process of building machine unlearning systems. To fill this gap, we perform an extensive study on the two machine unlearning approaches, i.e., exact and approximate, to reveal their fairness implications.

AI Fairness
AI fairness, or machine learning (ML) fairness, has been deeply investigated during the last decade [17][18][19][20][21][22][23]. Its basic idea is that the prediction model should not be biased between different individuals or groups defined by a protected attribute (e.g., race, sex, familial status, etc.). There are two major types of AI fairness, i.e., group fairness and individual fairness [19,23].
Group fairness requires the prediction model to produce similar predictive results across the different groups of a protected attribute. Several studies proposed a specific kind of utility-maximization decision function to satisfy a fairness constraint and derive optimal fairness decisions [53][54][55][56]. Hardt et al. [54] employed the Bayes optimal non-discriminant to derive fairness in a classification model. Corbett-Davies et al. [53] treated AI fairness as a constrained optimization problem that maximizes accuracy while satisfying group fairness constraints. Menon and Williamson [55] investigated the trade-off between accuracy and group fairness in AI models and proposed a threshold function for the fairness problem. Group fairness often ignores the individual characteristics of the group, which can permit unfairness when training ML models [57].
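As a concrete group-fairness criterion in the sense of Hardt et al. [54], the equalized-odds gap can be computed as below; a gap of 0 means both groups have identical true-positive and false-positive rates. The function name and conventions are ours.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, privileged_mask):
    """Largest difference in TPR or FPR between the privileged and
    unprivileged groups; 0 means equalized odds is satisfied."""
    def rates(mask):
        tpr = y_pred[mask & (y_true == 1)].mean()  # true-positive rate
        fpr = y_pred[mask & (y_true == 0)].mean()  # false-positive rate
        return tpr, fpr
    tpr_p, fpr_p = rates(privileged_mask)
    tpr_u, fpr_u = rates(~privileged_mask)
    return max(abs(tpr_p - tpr_u), abs(fpr_p - fpr_u))
```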
Individual fairness, on the other hand, expects the prediction model to produce similar predictive results for similar individuals who differ only in protected attributes. Udeshi et al. [58] presented Aequitas, a fully automated and directed test generation framework, to generate test inputs and improve the individual fairness of ML models. Aggarwal et al. [35] employed symbolic execution together with local explainability to identify the factors driving decisions and then generate test inputs. Sun et al. [59] combined input mutation and metamorphic relations to improve the fairness of machine translation.
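The core check behind such test-generation approaches can be sketched as a simple metamorphic test: flip only the protected attribute of each input and count how often the prediction changes. This is a generic sketch of the idea, not the algorithm of any of the cited tools; the function name and binary-attribute assumption are ours.

```python
import numpy as np

def individual_fairness_violations(model, X, protected_col):
    """Count inputs whose prediction changes when only the (binary)
    protected attribute is flipped; each change is a potential
    individual-fairness violation."""
    X_flipped = X.copy()
    X_flipped[:, protected_col] = 1 - X_flipped[:, protected_col]
    return int((model.predict(X) != model.predict(X_flipped)).sum())
```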
Other works explore the effectiveness and efficiency of existing ML methods for software fairness [21,22,24]. Specifically, researchers focus on improving fairness in ML systems by leveraging mitigation techniques [22], removing biased instances from the training data [24], or improving the quality of the features in the datasets [21].
Even though AI fairness has been widely studied, its implications for machine unlearning have not been explored. We perform an empirical study on three AI fairness datasets, i.e., Adult, Bank, and COMPAS, to understand the impacts of machine unlearning methods on fairness.

Conclusion and Future Work
Machine unlearning emerged from the need to implement the "right to be forgotten" (RTBF) efficiently, yet existing studies overlook its impact on fairness. To the best of our knowledge, we are the first to perform an empirical study on the impacts of machine unlearning methods on fairness. We designed and conducted experiments on two typical machine unlearning methods (SISA and AmnesiacML) along with a retraining method (ORTR) using three fairness datasets under three different deletion strategies. We found that under non-uniform data deletion, SISA leads to better fairness compared with AmnesiacML and ORTR, while initial training and uniform data deletion do not necessarily affect the fairness of all three methods. Our research has shed light on the fairness implications of machine unlearning and provides software engineers with knowledge about the trade-offs when considering machine unlearning methods as a solution for RTBF. In the future, more research efforts are needed to broaden the understanding of fairness implications to other machine unlearning methods, as well as to investigate the underlying causes of their impact on fairness.

Figure 1 :
Figure 1: An overview of the SISA framework. The dataset is first partitioned into multiple shards, and each shard is further sliced into multiple slices. A deep learning model is trained on each shard by gradually increasing the number of slices. The outputs of the DL models are combined using voting-based aggregation.

Figure 2 :
Figure 2: SISA's strategies aim to reduce the computational cost of the retraining process.

Figure 3 :
Figure 3: Experiments to evaluate the performance and fairness of machine unlearning methods under different scenarios.

Figure 5 :
Figure 5: Fairness (the smaller, the better) evaluation results of different training methods after uniform data deletion under various deletion proportions.

Figure 6 :
Figure 6: Difference in fairness between before and after the deletion. A value of 0 indicates no fairness change, while positive and negative values indicate worsened and improved fairness, respectively.

Figure 7 :
Figure 7: Performance (the higher, the better) results of different training methods after uniform data deletion under various deletion proportions.

Figure 8 :
Figure 8: Fairness (the smaller, the better) evaluation results of non-uniform deletion.The results are shown as the distances from the ORTR results (baseline).

Figure 9 :
Figure 9: Fairness change after deletion using SISA with sharding strategy.

Figure 10 :
Figure 10: Fairness difference between SISA with and without applying sharding strategy.