1 Introduction

Machine learning (ML) systems play an important role in high-stakes domains. For example, ML is used to identify human faces in images and videos [40], recommend products to customers [30], and recognize criminals [39]. ML has been called "Software 2.0" because its behavior is not written explicitly by programmers but is instead learned from large datasets [36].

When ML software learns about individuals, it uses datasets collected about them. These data contain a broad range of information that may be used to identify individuals, such as personal emails, credit card numbers, and employee records. The right to be forgotten (RTBF) is legislated in some regions, such as by the General Data Protection Regulation (GDPR) in the European Union [22], the California Consumer Privacy Act (CCPA) in the United States [12], and the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada [4]. These regulations give data subjects, i.e., service users, the right to request the deletion of their personal data and service history [38]. When ML service providers receive such requests, they may have to remove the personal data from the training set and update their ML models. There have already been examples, including cases related to Clearview AI, Google, and Europol. Such demands are expected to grow as regulation and privacy awareness increase. Moreover, data deletion may need to be deep and permanent, exposing a key research challenge in various ML applications [41].

Researchers have proposed machine unlearning approaches to enable the RTBF to be efficiently implemented when constructing ML models. Specifically, machine unlearning is the problem of making a trained ML model forget the impact of one or multiple data points in the training data. As ML models capture the knowledge learned from data, it is necessary to erase what they have learned from the deleted data to fulfill the RTBF requirements. A naïve strategy is to retrain ML models from scratch by excluding the deleted data from the training data. However, this process may incur significant computational costs and be practically infeasible [43]. In recent years, machine unlearning has been extensively investigated to address these problems [10, 13, 24]. These methods aim to avoid the large computational cost of fully retraining ML models from scratch and attempt to update ML models to enable the RTBF.

Current machine unlearning research focuses on efficiency and RTBF satisfaction but overlooks other critical properties of AI systems, such as AI fairness. AI fairness is a property of ML software that concerns algorithmic bias in ML models, i.e., whether they are biased with respect to protected attributes such as race, gender, or familial status. There is a rich literature on AI fairness [3, 9, 15, 20, 21, 32, 47, 48], which helps AI practitioners ensure the fairness of ML models in practice.

Compared to traditional machine learning, machine unlearning methods differ in how data are fed into the model and how training is performed, which may in turn affect fairness differently. To the best of our knowledge, no prior work has studied the fairness implications of machine unlearning methods. Ignoring fairness when constructing machine unlearning systems can harm people in protected groups defined by attributes such as race, gender, or familial status. ML systems built on these machine unlearning methods may therefore violate anti-discrimination legislation, such as the Civil Rights Act [34]. In this paper, we conduct an empirical study that evaluates the fairness of machine unlearning methods to help AI practitioners understand how to build fair ML systems that satisfy RTBF requirements.

We employ two popular machine unlearning methods, SISA and AmnesiacML, on three AI fairness datasets. SISA (Sharded, Isolated, Sliced, and Aggregated) [10] is an exact machine unlearning method, and AmnesiacML [24] is an approximate machine unlearning method. The three datasets (Adult, Bank, and COMPAS) have been widely used to evaluate the fairness of ML systems on various tasks, i.e., income prediction, term deposit subscription prediction, and recidivism prediction. We use four evaluation metrics to measure the fairness of machine unlearning methods: disparate impact, statistical parity difference, average odds difference, and equal opportunity difference. We then analyze the results to answer our research questions. The main contributions of our paper are as follows:

  • An empirical study to evaluate the impacts of machine unlearning on fairness. Specifically, we apply two well-recognized machine unlearning methods to three AI fairness datasets and use four evaluation metrics to measure the fairness of machine unlearning systems.

  • Our results show that machine unlearning methods do not necessarily affect fairness during initial training. When data deletion is uniform, the fairness of the resulting model is hardly affected. For non-uniform data deletion, the variant of SISA achieves better fairness than the other methods. These findings shed light on the fairness implications of machine unlearning and inform AI practitioners about potential trade-offs when building solutions for the RTBF.

2 Background

This section provides the background knowledge, including machine unlearning methods and AI fairness metrics.

2.1 Machine unlearning methods

Classification is a type of task that many machine learning systems aim to solve and for which machine unlearning can be used. Given a dataset of input–output pairs \({\mathcal {D}} = \{(x_i, y_i)\} \subset {\mathcal {X}} \times {\mathcal {Y}}\), we aim to construct a prediction function \({\mathcal {F}}_{{\mathcal {D}}}: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) that maps inputs to outputs. The prediction function \({\mathcal {F}}_{{\mathcal {D}}}\) is typically learned by minimizing the following objective function:

$$\begin{aligned} \min _{{\mathcal {F}}_{{\mathcal {D}}}} \sum _{i}{\mathcal {L}}({\mathcal {F}}_{{\mathcal {D}}}(x_i), y_i) + \lambda \Omega ({\mathcal {F}}_{{\mathcal {D}}}), \end{aligned}$$
(1)

where \({\mathcal {L}}(\cdot )\), \(\Omega ({\mathcal {F}}_{{\mathcal {D}}})\), and \(\lambda\) are the empirical loss function, the regularization function, and the trade-off coefficient, respectively. Let \({\mathcal {D}}_{r}\) and \({\mathcal {D}}_{u}\) denote the retained dataset and the deleted dataset, respectively. \({\mathcal {D}}_{r}\) and \({\mathcal {D}}_{u}\) are mutually exclusive, i.e., \({\mathcal {D}}_{r} \cap {\mathcal {D}}_{u} = \emptyset\) and \({\mathcal {D}}_{r} \cup {\mathcal {D}}_{u} = {\mathcal {D}}\). When right to be forgotten (RTBF) requests arrive, a machine unlearning system needs to remove \({\mathcal {D}}_{u}\) from \({\mathcal {D}}\) and update the prediction function \({\mathcal {F}}_{{\mathcal {D}}}\). Machine unlearning attempts to obtain a model \({\mathcal {F}}_{{\mathcal {D}}_{r}}\), equivalent to one trained only on the retained dataset \({\mathcal {D}}_{r}\), without incurring a significant computational cost. Hence, the model \({\mathcal {F}}_{{\mathcal {D}}_{r}}\) is often used to evaluate the performance of machine unlearning methods.
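To make the retraining baseline concrete, the following minimal PyTorch sketch (our own illustration, not the authors' implementation; the hyperparameters and the `make_model` factory are assumptions) shows Eq. (1) being minimized and the naïve unlearning strategy of retraining from scratch on \({\mathcal {D}}_{r}\):

```python
# Minimal sketch (not the authors' code) of Eq. (1) and naive unlearning
# by full retraining on the retained set D_r; hyperparameters are illustrative.
import torch
import torch.nn as nn

def fit(model, X, y, lam=1e-4, epochs=50, lr=1e-2):
    """Minimize the empirical cross-entropy loss plus an L2 regularizer (Eq. 1)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss = loss + lam * sum((p ** 2).sum() for p in model.parameters())
        loss.backward()
        opt.step()
    return model

def naive_unlearn(make_model, X, y, deleted_idx):
    """Naive RTBF handling: retrain a fresh model from scratch on D_r."""
    keep = torch.ones(len(X), dtype=torch.bool)
    keep[deleted_idx] = False                      # mark D_u for removal
    return fit(make_model(), X[keep], y[keep])     # yields F_{D_r}
```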

There are two main types of machine unlearning approaches: exact machine unlearning and approximate machine unlearning. We present a typical method for each approach: SISA represents exact machine unlearning, and AmnesiacML represents approximate machine unlearning. Both methods are designed for deep learning models and are efficient and effective at handling RTBF requests. They are briefly described below.

2.1.1 SISA [10]

This is an exact machine unlearning method that aims to reduce the computational cost of retraining by employing a data partitioning technique. Figure 1 gives an overview of the SISA framework. First, the original data \({\mathcal {D}}\) is split into \(S\) shards such that \(\bigcap _{i=1}^{S} D_i = \emptyset\) and \(\bigcup _{i=1}^{S} D_i = {\mathcal {D}}\). Each shard \(D_i \subset {\mathcal {D}}\) is then further split into K slices, i.e., \(\bigcap _{k=1}^{K} D_{ik} = \emptyset\) and \(\bigcup _{k=1}^{K} D_{ik} = D_i\). A deep learning (DL) model is trained on each shard and updated by gradually adding slices; the model parameters are saved to storage as each slice is added. After training, SISA therefore contains multiple DL models, and the final output is obtained by applying a voting mechanism to the outputs of these models. When RTBF requests arrive, SISA locates the shards and slices containing the deleted data \({\mathcal {D}}_u\) and retrains the DL models of the affected shards from the cached stage, i.e., the checkpoint taken before the slices containing the deleted data were used for training.
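A simplified sketch of this shard/slice training and voting-based aggregation is given below. It is our own illustration of the SISA idea, not the reference implementation; the `fit_fn` training routine and the checkpoint format are assumptions.

```python
# Simplified SISA sketch (illustrative only).
import copy
import numpy as np
import torch

def train_sisa(make_model, X, y, fit_fn, n_shards=5, n_slices=1):
    """Train one model per shard, adding slices incrementally and caching a
    parameter checkpoint after each slice so retraining can resume mid-way."""
    order = np.random.permutation(len(X))
    shards = np.array_split(order, n_shards)          # disjoint shards
    models, checkpoints = [], []
    for shard in shards:
        slices = np.array_split(shard, n_slices)      # disjoint slices
        model, ckpts, seen = make_model(), [], []
        for sl in slices:
            seen.extend(sl.tolist())
            idx = torch.as_tensor(seen)
            model = fit_fn(model, X[idx], y[idx])     # continue training on seen slices
            ckpts.append(copy.deepcopy(model.state_dict()))
        models.append(model)
        checkpoints.append(ckpts)
    return models, checkpoints, shards

def predict_vote(models, X):
    """Aggregate the shard models' predictions with a majority vote."""
    votes = torch.stack([m(X).argmax(dim=1) for m in models])
    return votes.mode(dim=0).values
```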

Fig. 1: Framework of SISA. The dataset is partitioned into multiple shards, and each shard is further sliced into multiple slices. A DL model is trained on each shard by gradually increasing the number of slices, and the outputs of the DL models are combined using voting-based aggregation

Fig. 2: SISA's strategies to reduce the computational cost of the retraining process

SISA offers two strategies that leverage the prior probability of deletion to speed up retraining and reduce the computational cost. The first is to allocate instances with a higher deletion probability to the same shards, so that retraining involves fewer shards than random allocation would. The second is to allocate instances with a higher deletion probability to the last slices, so that retraining involves fewer slices than random allocation would. Figure 2a and b illustrate the first and second strategies, respectively.
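The allocation idea behind both strategies can be sketched as follows. This is an illustrative helper of our own (not part of SISA itself), assuming a per-instance prior deletion probability `p_delete` is available:

```python
# Illustrative allocation helper (assumed, not from the SISA implementation).
import numpy as np

def allocate_by_deletion_prior(p_delete, n_shards=5, n_slices=1):
    """Sort instances by their prior deletion probability so that
    likely-to-be-deleted points end up together: in the same (last) shards,
    and within each shard in the last slices."""
    order = np.argsort(p_delete)                  # least likely to be deleted first
    shards = np.array_split(order, n_shards)      # likely deletions share shards
    return [np.array_split(s, n_slices) for s in shards]  # ...and last slices
```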

2.1.2 AmnesiacML [24]

AmnesiacML is an approximate machine unlearning method that exploits the characteristics of batch training in neural networks. During the training process, the parameter updates of the DL model for each batch are recorded and kept in storage. The training process is expressed as follows:

$$\begin{aligned} \theta _{M} = \theta _{\textrm{initial}} + \sum _{e=1}^{E}\sum _{b=1}^{B}\Delta _{\theta _{e,b}}, \end{aligned}$$
(2)

where \(\theta _{\textrm{initial}}\) denotes the initial parameters of the DL model, and E and B represent the total number of epochs and the number of batches per epoch, respectively. The per-batch updates are stored as \(\{ \gamma _b \mid \gamma _b = \sum _{e=1}^E\Delta _{\theta _{e,b}} , 1 \le b \le B\}\). When RTBF requests are received, AmnesiacML locates the batches containing the instances to be deleted. The DL model's parameters are then rolled back to remove the impact of the deleted data from the trained model as follows, where the \({\hat{B}}\) batches indexed by \({\hat{b}}\) are those containing deleted instances:

$$\begin{aligned} \theta _{M'} = \theta _{M} - \sum _{{\hat{b}}=1}^{{\hat{B}}}\gamma _{{\hat{b}}}. \end{aligned}$$
(3)
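A minimal sketch of this record-and-rollback mechanism is shown below. It is our own illustration of Eqs. (2) and (3), not the reference AmnesiacML code, and it assumes mini-batches are provided as a list of `(indices, inputs, labels)` tuples:

```python
# Illustrative record-and-rollback sketch for Eqs. (2)-(3).
import torch

def train_amnesiac(model, batches, loss_fn, opt, epochs):
    """Record the summed parameter update contributed by each batch index
    (gamma_b in Eq. 2) so it can be subtracted later (Eq. 3)."""
    gamma = [{n: torch.zeros_like(p) for n, p in model.named_parameters()}
             for _ in batches]
    batch_of_example = {}                       # which batch holds which instance
    for _ in range(epochs):
        for b, (idx, Xb, yb) in enumerate(batches):
            before = {n: p.detach().clone() for n, p in model.named_parameters()}
            opt.zero_grad()
            loss_fn(model(Xb), yb).backward()
            opt.step()
            for n, p in model.named_parameters():
                gamma[b][n] += p.detach() - before[n]   # Delta_theta_{e,b}
            for i in idx:
                batch_of_example[int(i)] = b
    return gamma, batch_of_example

def amnesiac_unlearn(model, gamma, batch_of_example, deleted_idx):
    """Roll back the updates of every batch that contained deleted data (Eq. 3)."""
    affected = {batch_of_example[int(i)] for i in deleted_idx}
    with torch.no_grad():
        for b in affected:
            for n, p in model.named_parameters():
                p -= gamma[b][n]
    return model
```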

A similar strategy can easily be adopted for AmnesiacML when the prior probability of deletion for different groups is known. For example, instances with a higher prior probability of being removed can be placed in the same batches, so that updating the DL model's parameters requires less computation.

2.2 AI fairness metrics

The goal of AI fairness is to ensure that ML models are not biased with respect to protected classes such as race, sex, or familial status. Each protected class partitions the population into different groups, some of which may be less privileged. In this section, we present the four fairness metrics (disparate impact, statistical parity difference, average odds difference, and equal opportunity difference) that we use to evaluate the impact of machine unlearning methods on fairness. These metrics are widely used to measure the fairness of ML systems [3, 9, 15, 20, 21, 32, 48].

Let \(x_s \in \{0, 1\}\) indicate the binary label of a protected class (\(x_s = 1\) for the privileged group), let \({\hat{y}} \in \{0, 1\}\) be the predicted outcome of an ML classification model (\({\hat{y}}=1\) for the favorable decision), and let \(y \in \{0, 1\}\) be the ground-truth binary label (\(y=1\) is favorable). The four fairness evaluation metrics are defined as follows.

Disparate impact (DI) Chouldechova [17] measures the ratio of the favorable outcome rate of the unprivileged group (\(x_s=0\)) to that of the privileged group (\(x_s=1\)):

$$\begin{aligned} \frac{P[{\hat{y}} = 1 \mid x_s = 0]}{P[{\hat{y}} = 1 \mid x_s = 1]} \end{aligned}$$
(4)

Statistical parity difference (SPD) Calders and Verwer [11] is the difference between the favorable outcome rates of the unprivileged and privileged groups:

$$\begin{aligned} P[{\hat{y}} = 1 \mid x_s = 0] - P[{\hat{y}} = 1 \mid x_s = 1]. \end{aligned}$$
(5)

Average odds difference (AOD) Hardt et al. [26] is the average of the absolute differences in true-positive rate and false-positive rate between the unprivileged and privileged groups:

$$\begin{aligned} \begin{aligned} \frac{1}{2} (|P[{\hat{y}} = 1|x_s = 0, y=1] - P[{\hat{y}} = 1|x_s = 1, y=1] |\\ +|P[{\hat{y}} = 1|x_s = 0, y=0] - P[{\hat{y}} = 1|x_s = 1, y=0]|) \end{aligned} \end{aligned}$$
(6)

Equal opportunity difference (EOD) Hardt et al. [26] is the difference in true-positive rate between the unprivileged and privileged groups:

$$\begin{aligned} P[{\hat{y}} = 1|x_s = 0, y=1] - P[{\hat{y}} = 1|x_s = 1, y=1]. \end{aligned}$$
(7)

SPD, AOD, and EOD take values in \([-1, 1]\) and attain the greatest fairness at 0, whereas DI is a ratio and attains the greatest fairness when it equals 1.
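As an illustration, the four metrics can be computed directly from binary prediction arrays. The following NumPy sketch is our own illustrative code (assuming 0/1 arrays `y_true`, `y_pred`, and `x_s` as defined above) and mirrors Eqs. (4)–(7):

```python
# Illustrative metric computation for Eqs. (4)-(7); x_s = 1 marks the
# privileged group and label 1 is the favorable outcome.
import numpy as np

def fairness_metrics(y_true, y_pred, x_s):
    y_true, y_pred, x_s = map(np.asarray, (y_true, y_pred, x_s))
    priv, unpriv = (x_s == 1), (x_s == 0)
    p = lambda g: y_pred[g].mean()                      # P[y_hat = 1 | group]
    tpr = lambda g: y_pred[g & (y_true == 1)].mean()    # true-positive rate
    fpr = lambda g: y_pred[g & (y_true == 0)].mean()    # false-positive rate
    di  = p(unpriv) / p(priv)                           # Eq. (4)
    spd = p(unpriv) - p(priv)                           # Eq. (5)
    aod = 0.5 * (abs(tpr(unpriv) - tpr(priv))           # Eq. (6)
                 + abs(fpr(unpriv) - fpr(priv)))
    eod = tpr(unpriv) - tpr(priv)                       # Eq. (7)
    return {"DI": di, "SPD": spd, "AOD": aod, "EOD": eod}
```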

Fig. 3: Experimentation to evaluate the performance and fairness of machine unlearning methods under different scenarios

3 Methodology

This section describes our experimental design and setup. We further present the datasets, the data deletion strategies, and our evaluation metrics.

3.1 Experimental design

Our empirical study starts by collecting the benchmark fairness datasets. Each dataset is preprocessed and split into training and testing sets. The training set is then used to train the machine unlearning models. We use six evaluation metrics (two for performance and four for fairness) to measure these models. Figure 3 presents an overview of our experimental design.

To conduct our experiments, we employ a multilayer perceptron (MLP), a simple feedforward network [35] with an input layer, one hidden layer, and an output layer. We train the MLP by minimizing a cross-entropy loss [31]. The SISA and AmnesiacML machine unlearning methods are built on top of this MLP model, and a naïve approach of original training and retraining (denoted ORTR), built on the same MLP, serves as the baseline. We consider two experimental scenarios; a minimal sketch of the MLP is given after the scenario list below.

  • Scenario 1 Before any right to be forgotten (RTBF) requests arrive, what are the impacts of machine unlearning methods on fairness? In this setting, the training dataset is used to train three different models, namely ORTR, SISA, and AmnesiacML (see Fig. 3). We then use the testing dataset to evaluate the performance and fairness of the trained models.

  • Scenario 2 When RTBF requests arrive, what are the impacts of machine unlearning methods on fairness? In this setting, we apply the data deletion strategies (see Fig. 3) to remove instances from the training dataset. For each data deletion strategy, we compare the performance and fairness of ORTR with those of the two machine unlearning methods, SISA and AmnesiacML.
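The following PyTorch sketch illustrates the kind of one-hidden-layer MLP used as the base model for ORTR, SISA, and AmnesiacML; the hidden width and layer sizes are illustrative assumptions, since the exact configuration is not reproduced here.

```python
# Illustrative base classifier (layer sizes are assumptions).
import torch.nn as nn

class MLP(nn.Module):
    """One-hidden-layer feedforward classifier trained with cross-entropy."""
    def __init__(self, n_features, n_hidden=64, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)   # logits for nn.CrossEntropyLoss
```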

We conducted our experiments on an Nvidia T4 GPU and an Intel Xeon Silver 4114 CPU with 16 GB and 12 GB of memory, respectively. The OS is Debian 10.10 LTS (64-bit), the machine learning framework is PyTorch 1.12 with CUDA 11.3, and the Python version is 3.7.

3.2 Datasets

We conduct our experiments by employing three widely used fairness datasets to evaluate the impacts of machine unlearning methods on fairness. These datasets are briefly described as follows:

  • Adult [1]. This dataset is extracted from the 1994 Census Bureau database. Its task is to predict whether a person can earn over $50,000 USD per year. The dataset includes 48,842 instances and 14 features. The sensitive features for this dataset are sex and race.

  • Bank [5]. The dataset is collected from marketing campaigns of a Portuguese banking institution. Its task is to predict whether a client will subscribe to a bank term deposit. The dataset contains 45,211 instances and 17 features. We use age as the sensitive feature for this dataset.

  • COMPAS [19]. The dataset contains recidivism records, which are used to build a prediction system to forecast the possibility of a criminal defendant reoffending. The dataset has 7,215 instances and seven features. The sensitive features are defined as sex and race.

For each dataset, we employ the AI Fairness 360 toolkit [7], an open-source library for fairness metrics, to clean up invalid or missing values, transform categorical values into a one-hot encoding, and convert non-numerical binary values into a binary label (e.g., male: 1, female: 0). We further preprocess the datasets for fairness evaluation: we specify the favorable label for the predicted outcome of our models and, following previous work [9, 15, 47], identify the sensitive features (protected classes) that define the privileged and unprivileged groups.
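As a hedged illustration of this pipeline, the snippet below uses the AI Fairness 360 toolkit to load the Adult dataset with sex as the protected attribute and to compute the four fairness metrics from a model's predictions. The split ratio and the placeholder predictions are assumptions, not the authors' exact configuration.

```python
# Hedged sketch of fairness evaluation with the AI Fairness 360 toolkit.
import numpy as np
from aif360.datasets import AdultDataset
from aif360.metrics import ClassificationMetric

adult = AdultDataset(protected_attribute_names=['sex'],
                     privileged_classes=[['Male']])      # sex = 1 for the privileged group
train, test = adult.split([0.7], shuffle=True)           # illustrative 70/30 split

# Placeholder for a real model's 0/1 predictions on the test split.
model_predictions = np.random.randint(0, 2, size=len(test.labels))

pred = test.copy(deepcopy=True)
pred.labels = model_predictions.reshape(-1, 1)

metric = ClassificationMetric(
    test, pred,
    unprivileged_groups=[{'sex': 0}],
    privileged_groups=[{'sex': 1}])
print(abs(1 - metric.disparate_impact()),                # |1 - DI|
      abs(metric.statistical_parity_difference()),       # |SPD|
      abs(metric.average_odds_difference()),             # |AOD|
      abs(metric.equal_opportunity_difference()))        # |EOD|
```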

3.3 Data deletion strategies

To simulate right to be forgotten (RTBF) requests, we adopt two data deletion strategies, whose settings are described below.

Uniform distribution For this strategy, we assume that the deleted data follow a uniform distribution, i.e., each instance has an equal probability of being removed from the training dataset. To choose a range of deletion proportions, we follow Bertram et al. [8] and randomly remove 1%, 5%, 10%, and 20% of the training data.

Non-uniform distribution For this strategy, we assume that the deleted data follow a non-uniform distribution, i.e., instances have different probabilities of being removed from the training dataset. Some groups may be more likely to send RTBF requests; for example, people from wealthy families or with a higher educational background may be more likely to act to keep their information private [16, 46]. As indicators of such characteristics are unavailable in our datasets, we consider two cases to better understand the fairness implications of non-uniform deletion: in one, the people requesting the RTBF come predominantly from privileged groups; in the other, they come predominantly from unprivileged groups.
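Both deletion strategies can be simulated with simple index sampling. The sketch below is our own illustration (the helper names are assumptions; the 50% group-deletion setting matches the non-uniform scenario used later in RQ3):

```python
# Illustrative simulation of the two RTBF deletion strategies.
import numpy as np

rng = np.random.default_rng(0)

def uniform_deletion(n, fraction):
    """Uniform strategy: every instance is equally likely to be deleted."""
    k = int(fraction * n)
    return rng.choice(n, size=k, replace=False)

def group_deletion(x_s, privileged=True, fraction=0.5):
    """Non-uniform strategy: delete `fraction` of the instances belonging to
    one group (privileged or unprivileged) of the sensitive feature only."""
    group = np.flatnonzero(x_s == (1 if privileged else 0))
    k = int(fraction * len(group))
    return rng.choice(group, size=k, replace=False)
```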

3.4 Evaluation metrics

We consider two types of evaluation metrics in our experiments: performance and fairness.

Performance metrics Before evaluating the fairness of models, we calculate their performance in terms of accuracy and F1 score.

Fairness metrics To measure the fairness of models, we adopt the four fairness metrics described in Sect. 2.2, i.e., DI, SPD, AOD, and EOD. For simplicity of presentation, we report the absolute values of all fairness metrics. Since the DI value differs in scale from the other metrics, we use \(|1 - \text {DI}|\) instead, so that all four metrics attain the greatest fairness at 0.

Fig. 4: Fairness (the smaller, the better) and performance (the higher, the better) evaluation results of SISA with different numbers of shards (5/10/15/20h) and slices (1/5/10c)

4 Experiments

In this section, we provide results and insights from the experiments to answer our research questions.

RQ1: (Initial training) What are the impacts of machine unlearning methods on fairness before the right to be forgotten requests arrive?

This research question aims to understand the impact of machine unlearning methods on fairness at initial training. Specifically, we compare SISA with ORTR, a naïve approach built on an MLP model. Note that approximate machine unlearning methods such as AmnesiacML only update the ML model's parameters without modifying its architecture, so their initial training does not differ from ORTR; we therefore exclude AmnesiacML from this research question.

We evaluate the impact of SISA and ORTR on fairness across different numbers of shards (5, 10, 15, 20) and numbers of slices (1, 5). We run the experiments on three datasets: Adult, Bank, and COMPAS. For ease of observation, we denote Adult, Bank, and COMPAS as A, B, and C, respectively, and the sensitive features Sex, Race, and Age as S, R, and Y, respectively. The numbers of shards and slices are denoted h and c, respectively; for example, 5h5c means the instances are split into five shards, each of which is further split into five slices.

Fig. 5: Fairness (the smaller, the better) evaluation results after uniform data deletion under various deletion proportions

Fig. 6: Difference in fairness before and after deletion. A value of 0 indicates no fairness change; positive values indicate worsened fairness and negative values indicate improved fairness

Figure 4a shows the fairness evaluation results of SISA at initial training with 5/10/15/20 shards and one slice, with ORTR as the baseline. For some dataset–feature combinations, such as B-Y and A-R, \(|1-\text {DI}|\) improves as the number of shards increases, while for others, such as C-S and A-S, it worsens. Similarly, the trends of the other metrics are not monotonic in the number of shards. Although there are some tendencies within each dataset, across all datasets we cannot identify a consistent fairness impact of SISA or its number of shards. In terms of performance, accuracy degrades by less than 10% on the Adult and Bank datasets, while there is no apparent degradation on the COMPAS dataset. This could be because the COMPAS dataset is much smaller than Adult and Bank and has fewer useful features, making it easier for models to converge and less likely to suffer performance degradation from data partitioning. We make similar observations for the performance metrics shown in Fig. 4b.

The fairness evaluation results of SISA at initial training with five slices are shown in Fig. 4c, d. Comparing one slice with five slices, we find no noticeable difference across all fairness metrics, which is consistent with what was reported in the SISA paper [10].

RQ2: (Uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data have uniform distribution?

Fig. 7: Performance (the higher, the better) results after uniform data deletion under various deletion proportions

A uniform data deletion strategy assumes that every instance has an equal probability of being removed from the trained models. In this research question, we explore how much the machine unlearning methods affect fairness when the deleted data follow such a uniform distribution. We employ a range of deletion proportions from small to large (1%, 5%, 10%, and 20%), chosen based on the statistics in [8]. For SISA, we apply its default setting (i.e., 5h1c). For AmnesiacML, we train its model following the requirements in the original paper [24].

Figure 5 presents the fairness results for various deletion proportions. There is no clear trend indicating which method achieves better results across all datasets (Adult, Bank, and COMPAS) and sensitive features (Sex, Race, and Age). Figure 6 shows the difference in fairness before and after applying the deletion strategy. It indicates that AmnesiacML is the most prone to fairness loss under this deletion strategy, while SISA is the most robust. However, the difference in fairness before and after data deletion is not pronounced on the Adult and COMPAS datasets. The main reason is that the deleted data follow a uniform distribution, i.e., each instance has an equal probability of being removed, leading to similar fairness results in this setting. We also observe that all methods have a relatively large variation in fairness on the Bank dataset, which is highly imbalanced compared with Adult and COMPAS: among its 45,211 instances, only 963 (2.13%) are labeled as negative. Figure 7 illustrates the performance under this data deletion strategy and shows that the deleted data have minimal impact on the performance of the trained models. As the deletion proportion is only 1–20%, we believe the deleted data might be insufficient to cause non-trivial performance degradation.

RQ3: (Non-uniform distribution) What are the impacts of machine unlearning methods on fairness when the deleted data has non-uniform distribution?

In this research question, we aim to understand the impacts of machine unlearning methods on fairness when the deleted data follow a non-uniform distribution. The simplest way to conduct the experiments would be to delete data whose distribution matches the proportion of each group (privileged or unprivileged) of each sensitive feature in the whole dataset. However, as our datasets are imbalanced on some features, this RTBF simulation strategy would very likely produce empty groups. To overcome this problem, we simplify the scenario by removing data from either the privileged group or the unprivileged group only. Specifically, we remove 50% of the data of the chosen group, which makes the potential impact on fairness more apparent. Note that we assume the prior probability of deletion for the chosen group (privileged or unprivileged) is known.

Fig. 8: Fairness (the smaller, the better) evaluation results for non-uniform deletion, shown as distances from the ORTR (baseline) results

Fig. 9: Fairness change after deletion using SISA with the sharding strategy

Fig. 10: Fairness difference between SISA with and without the sharding strategy

Figure 8a, b presents the results of data deletion from the privileged group and the unprivileged group, respectively. The charts show that SISA with the sharding strategy achieves the best \(|1-\text {DI}|\) values in nine out of ten combinations. Figure 9 shows that SISA with the sharding strategy may also improve fairness after data deletion; the extent of improvement varies across datasets (Adult, Bank, and COMPAS) and sensitive features (Sex, Race, and Age). Furthermore, we plot the differences between SISA with and without the sharding strategy in Fig. 10. Overall, fairness is likely to improve across all metrics when the sharding strategy is applied, and such improvements are more likely on datasets and sensitive features with more imbalanced distributions.

Performance-wise, we observe no significant difference before and after data deletion, or between methods with and without the allocation strategies applied. The changes in the performance indicators are always less than 5%. The performance differences between methods are likely inherited from the methods themselves rather than caused by the unlearning strategies or the distribution settings.

5 Related work

We present research works related to machine unlearning and AI fairness.

5.1 Machine unlearning

The objective of machine unlearning is to build a system that can remove the impact of a data point in the training data on the trained model; the concept was introduced by Cao and Yang [13]. Early works [14, 29, 37, 44] focused on traditional ML models, such as support vector machines and linear classifiers. Later work on neural networks can be categorized into two main research directions: exact machine unlearning and approximate machine unlearning.

The exact machine unlearning approach trains a new model from scratch after removing the deleted data from the training set, which ensures that the deleted data have no impact on the new model. To make retraining more efficient, previous works [6, 10] divided the training data into multiple disjoint shards and trained a DL model on each shard. When a request to remove data points from the training set arrives, only the models whose shards contain the removed data points need to be retrained. However, the exact machine unlearning approach requires changes to the DL architecture, making the DL system harder to test and maintain.

The approximate machine unlearning approach starts from the trained DL model and updates its weights so that the model is no longer affected by the data points removed from the training data. Specifically, the DL model can be efficiently updated via its projective residual [28], a Newton step [23, 25], or amnesiac unlearning [24], removing the impact of the deleted data points without incurring a significant computational cost. The approximate approach is more computationally efficient than the exact approach, but it is unclear whether the removed data points have been completely forgotten by the trained model.

5.2 AI fairness

AI fairness, or machine learning (ML) fairness, has been investigated in depth over the last decade [3, 9, 20, 21, 32, 47, 48]. Its basic idea is that a prediction model should not be biased between different individuals or groups defined by a protected attribute class (e.g., race, sex). There are two major types of AI fairness: group fairness and individual fairness [20, 32].

Group fairness requires the prediction model to produce statistically comparable results for different groups of the protected attribute class. Several studies propose utility-maximizing decision functions that satisfy a fairness constraint and derive optimal fair decisions. Hardt et al. [27] employed the Bayes optimal non-discriminant criterion to derive fairness in a classification model. Corbett-Davies et al. [18] formulated AI fairness as a constrained optimization problem that maximizes accuracy while satisfying group fairness constraints. Menon and Williamson [33] investigated the trade-off between accuracy and group fairness in AI models and proposed a threshold function for the fairness problem.

Individual fairness, on the other hand, requires the prediction model to produce similar predictions for similar individuals who differ only in protected attributes. Udeshi et al. [45] presented Aequitas, a fully automated and directed test generation framework that generates test inputs to improve the individual fairness of ML models. Aggarwal et al. [2] combined symbolic execution with local explainability to identify the factors behind decisions and then generate test inputs. Sun et al. [42] combined input mutation and metamorphic relations to improve the fairness of machine translation.

6 Conclusion and future work

Machine unlearning can support the implementation of the right to be forgotten (RTBF), but prior work has overlooked its impact on fairness. To the best of our knowledge, we are the first to perform an empirical study of the impacts of machine unlearning methods on fairness. We designed and conducted experiments on two typical machine unlearning methods (SISA and AmnesiacML) and a retraining baseline (ORTR), using three fairness datasets under different deletion strategies. We found that, overall, the variant of SISA leads to better fairness than AmnesiacML and ORTR, while initial training and uniform data deletion do not necessarily affect the fairness of the three methods. Our research sheds light on the fairness implications of machine unlearning and informs AI practitioners about the trade-offs involved when considering machine unlearning methods as a solution for the RTBF. Although we selected two representative machine unlearning methods and evaluated their fairness on several datasets to minimize threats to validity, some randomness remains and other machine unlearning methods have not been investigated. In the future, more research is needed to broaden the understanding of fairness implications to other machine unlearning methods and to investigate the underlying causes of their impact on fairness.