1 Introduction

Machine learning algorithms, as useful decision-making tools, are widely used in society. These algorithms are often assumed to be paragons of objectivity. However, many studies show that the decisions made by these models can be biased against certain groups of people. For example, Abid et al. observed that large-scale language models capture undesirable racial bias [1], and Vigdor et al. [2] reported gender bias in Apple Card credit decisions. These events demonstrate that discrimination can arise in machine learning, and one of the most important sources of discrimination is data, including data collection (imbalanced training sets) and data preparation (biased content in the training set) [3]. Given the widespread use of machine learning to support decisions on loan allocation, insurance coverage, and many other basic precursors to equity, fairness in machine learning has become a significantly important issue [4]. Thus, how to design big-data-enabled machine learning algorithms that treat all groups equally is critical.

In recent years, many fairness metrics have been proposed to define what fairness means in machine learning. Popular fairness metrics include statistical fairness [5, 6], individual fairness [7,8,9,10] and causal fairness [11, 12]. Meanwhile, many algorithms have been developed to address fairness issues in both supervised learning settings [6, 13, 14] and unsupervised settings [15,16,17,18]. Generally, these studies have focused on two key issues: how to formalize the concept of fairness in the context of machine learning tasks, and how to design efficient algorithms that strike a desirable trade-off between accuracy and fairness. What is lacking is research that considers semi-supervised learning (SSL) scenarios.

In real-world machine learning tasks, a large amount of training data is necessary, and it is often a combination of labeled and unlabeled data. Therefore, fair SSL is a vital area of development. As in the other learning settings, achieving a balance between accuracy and fairness is a key issue. According to [19], increasing the size of the training set can lead to a better trade-off. This finding sparked the idea that the trade-off might be improved via unlabeled data. Unlabeled data is abundant in the era of big data and, if it could be used as training data, we might be able to make a better compromise between fairness and accuracy. To achieve this goal, two challenges lie ahead: (1) how to achieve fair learning from both labeled and unlabeled data; and (2) how to assign labels to unlabeled data so that learning proceeds in a fair direction.

To address these challenges, we propose two approaches that improve the trade-off with unlabeled data in graph-based SSL, one of the most prominent families of SSL methods. Graph-based SSL first constructs a graph, where nodes represent all samples and weighted edges reflect the similarity between pairs of nodes. The label information of unlabeled samples can then be inferred from the graph based on the manifold assumption. Graph-based SSL mainly includes two lines, graph-based regularization [20,21,22] and graph neural networks (GNNs) [20, 23], and we therefore design two approaches to achieve fairness, one for each line.

Graph-based SSL shares the assumption that smoothness (e.g., the labels of adjacent nodes are likely to be the same) should be present in the local and global graph structure [22]. Regularization methods are used to smooth the predictions or feature representations over local neighborhoods. Our first approach, fair semi-supervised margin classifiers (FSMC), is formulated as an optimization problem whose objective function includes a loss for both the classifier and label propagation, plus fairness constraints over labeled and unlabeled data. The classification loss optimizes the accuracy of the training result; the label propagation loss optimizes the label predictions on unlabeled data; and the fairness constraint steers the optimization in a fair direction. The optimization involves two steps. In the first step, the fairness constraints force the weight update towards a fair direction. This step can be solved as a convex problem when disparate impact is the fairness metric and via convex-concave programming when disparate mistreatment is used. In the second step, the updated weights further direct the labels assigned to unlabeled data in a fair direction via label propagation; labels for unlabeled data can be calculated in closed form. In this way, labeled and unlabeled data are used together to achieve a better trade-off between accuracy and fairness.

GNNs, such as convolutional GNNs and recurrent GNNs [23], have been widely used in supervised and semi-supervised learning tasks. In SSL, GNNs aim to classify the data in a graph using a small subset of labeled data and the features of all data. Adding a large amount of unlabeled data to model training helps exploit the structural and feature information of all data, and thus improves classification accuracy. Our second approach, fair graph neural networks (FGNN), is built on GNNs, where the loss function includes a classification loss and a fairness loss. The classification loss optimizes the classification accuracy over all labeled data, and the fairness loss enforces fairness over labeled and unlabeled data. GNN models combine graph structures and features, and our method allows GNN models to distribute gradient information from the classification loss and the fairness loss. Thus, fair representations of nodes with labeled and unlabeled data can be learned to achieve the desired trade-off between accuracy and fairness.

With the aim of achieving fair graph-based SSL, the contributions of this paper are as follows.

  • First, we study algorithmic fairness in the setting of graph-based SSL, including graph-based regularization and graph neural networks. These approaches enable the use of unlabeled data to achieve a better trade-off between fairness and accuracy.

  • Second, we propose algorithms to solve optimization problems when disparate impact and disparate mistreatment are integrated as fairness metrics in the graph-based regularization.

  • Third, we consider different cases of fairness constraints on labeled and unlabeled data. This helps us understand the impact of unlabeled data on model fairness, and how to control the fairness level in practice.

  • Fourth, we conduct extensive experiments to validate the effectiveness of our proposed methods.

The rest of this paper is organized as follows. Preliminaries are given in Sect. 2. The first proposed method, FSMC, is presented in Sect. 3, and the second proposed method, FGNN, in Sect. 4. The experiments are set out in Sect. 5. Related work appears in Sect. 6, and the conclusion in Sect. 7.

2 Preliminaries

2.1 Notations

Let \( {X}=\{x_{1},\ldots ,x_{k}\}^{T} \in {\mathbb {R}}^{k \times v}\) denote the training data matrix, where k is the number of data points and v is the number of unprotected attributes; \( {\mathbf {z}}= \{z_{1},\ldots ,z_{k}\} \in \{0,1\}^{k}\) denotes the protected attribute, e.g., gender or race. The labeled dataset is denoted as \( {\mathcal {D}}_{l} = \{x_{i},z_{i},y_{l,i}\}_{i=1}^{k_{l}}\) with \( k_{l} \) data points, and \( \mathbf {y_{l}}= \{y_{l,1},\ldots ,y_{l,k_{l}}\}^{T}\in \{0,1\}^{k_l} \) are its labels. The unlabeled dataset is denoted as \( {\mathcal {D}}_{u} = \{x_{i},z_{i}\}_{i=1}^{k_{u}}\) with \( k_{u} \) data points, and \( \mathbf {y_{u}}= \{y_{u,1},\ldots ,y_{u,k_{u}}\}^{T}\in \{0,1\}^{k_u} \) are its predicted labels.

Given the whole dataset, the adjacency matrix is denoted as \( A=(\theta _{ij}) \in {\mathbb {R}}^{k \times k} \), \( \forall i,j \in \{1,\ldots ,k\} \), with \( k=k_{l}+k_{u} \), where \( \theta _{ij} \) is the weight that evaluates the relationship between two data points. The degree matrix D is a diagonal matrix whose i-th diagonal element is \( d_{ii} = \sum _{j=1}^{k} \theta _{ij} \). We use L to denote the Laplacian matrix, calculated as \( {L}={D}-A\). Our objective is to learn a classification model \( f(\cdot ) \) with model parameters \( {\mathbf {w}} \) (or W) and \(\mathbf {y_u}\) over the discriminatory datasets \( {\mathcal {D}}_{l} \) and \( {\mathcal {D}}_{u} \) that delivers high accuracy with low discrimination.

2.2 Fairness metrics

In our framework, we have applied disparate impact and disparate mistreatment as the fairness metrics [6, 24].

2.2.1 Disparate impact

A classification model does not suffer disparate impact if,

$$\begin{aligned} {\text {Pr}}({\hat{y}} = 1 \mid z=1)={\text {Pr}}({\hat{y}} = 1 \mid z=0) \end{aligned}$$
(1)

where \({\hat{y}}\) is the predicted label. When the rate of positive predictions is the same for both groups \( z=1 \) and \( z=0 \), then there is no disparate impact.

2.2.2 Disparate mistreatment

A binary classifier does not suffer disparate mistreatment if the misclassification rates of the groups with different values of the sensitive feature z are the same. Here, three different kinds of disparate mistreatment are adopted to evaluate discrimination, as follows,

  • Overall misclassification rate (OMR):

    $$\begin{aligned} Pr({\hat{y}} \ne y \mid z=1)= Pr({\hat{y}} \ne y \mid z=0) \end{aligned}$$
    (2)
  • False positive rate (FPR):

    $$\begin{aligned} Pr({\hat{y}} \ne y \mid z=1,y=0)= Pr({\hat{y}} \ne y \mid z=0,y=0) \end{aligned}$$
    (3)
  • False negative rate (FNR):

    $$\begin{aligned} Pr({\hat{y}} \ne y \mid z=1,y=1)= Pr({\hat{y}} \ne y \mid z=0,y=1) \end{aligned}$$
    (4)

In most cases, a classifier suffers discrimination in terms of disparate impact or disparate mistreatment. The discrimination level is defined as the differences in rates between different groups.

Definition 1

(Discrimination level) Let \(\gamma _z\) denote the rate of group z under a given fairness metric (e.g., the positive-prediction rate under disparate impact) for a model f trained on a dataset D. The discrimination level \(\Gamma ({\hat{y}})\) of the model f trained on D is measured by the difference between the groups:

$$\begin{aligned} \Gamma ({\hat{y}}) = \gamma _{0}({\hat{y}})-\gamma _{1}({\hat{y}}). \end{aligned}$$
(5)

Taking disparate impact as an example, we have \( \gamma _{1} = Pr({\hat{y}}=1 \mid z=1)\), and the discrimination level is \( \Gamma ({\hat{y}}) = \mid Pr({\hat{y}}=1 \mid z=1)-Pr({\hat{y}}=1 \mid z=0) \mid \).
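To make Definition 1 concrete, the following minimal sketch (our own illustration, not the authors' code) computes the discrimination level under disparate impact from predicted labels and the protected attribute.

```python
import numpy as np

def discrimination_level(y_hat: np.ndarray, z: np.ndarray) -> float:
    """Absolute gap in positive-prediction rates between groups z=1 and z=0."""
    rate_z1 = y_hat[z == 1].mean()   # Pr(y_hat = 1 | z = 1)
    rate_z0 = y_hat[z == 0].mean()   # Pr(y_hat = 1 | z = 0)
    return abs(rate_z1 - rate_z0)

# Example: a classifier that favours group z = 0.
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0])
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(discrimination_level(y_hat, z))  # |0.25 - 0.75| = 0.5
```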

2.3 Fairness constraints

Many fairness constraints [6, 24, 25] have been proposed to enforce various fairness metrics, such as disparate impact and disparate mistreatment, and these fairness constraints can be used in our framework. The basic idea behind these constraints is to use the covariance between the users' sensitive attributes and the signed distance from the feature vectors to the decision boundary to restrict the correlation between sensitive attributes and classification results. This can be described as,

$$\begin{aligned} {\text {Cov}}\left( {\mathbf {z}}, \mathbf {g_{w}}\right)&={\mathbb {E}}\left[ ({\mathbf {z}}-{\bar{z}})\left( \mathbf {g_{w}}-\bar{\mathbf {g_{w}}}\right) \right] \nonumber \\&\approx \frac{1}{k} \mathbf {g_{w}} \left( {\mathbf {z}}-{\bar{z}}\right) \end{aligned}$$
(6)

where \( \mathbf {g_{w}} \in {\mathbb {R}}^{k} \) is a vector of signed distances between the feature vectors and the decision boundary of a classifier, \({\mathbf {z}}\) denotes the vector of the protected attribute, and \({\bar{z}}\) denotes the mean value of the protected attribute. The details of deriving Eq. (6) can be found in [6]. The form of \( \mathbf {g_{w}} \) differs across fairness metrics, and we list the forms below (a short numerical sketch follows the list),

  • Disparate impact

    $$\begin{aligned} \mathbf {g_{w}}= \mathbf {w^{T}}X \end{aligned}$$
    (7)
  • Overall misclassification rate

    $$\begin{aligned} \mathbf {g_{w}}=\min \left( 0, {\mathbf {y}}^{T}{\mathbf {y}} {\mathbf {w}}^{T}X \right) \end{aligned}$$
    (8)
  • False positive rate

    $$\begin{aligned} \mathbf {g_{w}}=\min \left( 0, \frac{{\mathbf {1}}-{\mathbf {y}}^{T}}{2} {\mathbf {y}} {\mathbf {w}}^{T}X \right) \end{aligned}$$
    (9)
  • False negative rate

    $$\begin{aligned} \mathbf {g_{w}}=\min \left( 0, \frac{{\mathbf {1}}+{\mathbf {y}}^{T}}{2} {\mathbf {y}} {\mathbf {w}}^{T}X \right) \end{aligned}$$
    (10)
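As a small illustration of Eq. (6) with the disparate-impact form of \( \mathbf {g_{w}} \) in Eq. (7), the following sketch (with randomly generated, purely illustrative data) evaluates the covariance measure and checks it against a threshold c.

```python
import numpy as np

def covariance_measure(w, X, z):
    """|Cov(z, g_w)| approximated as |(1/k) g_w (z - z_bar)|, with g_w = w^T x per sample."""
    g_w = X @ w                       # signed distance of each sample to the boundary
    return abs(g_w @ (z - z.mean())) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # k = 100 samples, v = 5 unprotected attributes
z = rng.integers(0, 2, size=100)      # protected attribute
w = rng.normal(size=5)                # classifier weights
c = 0.1                               # covariance threshold
print(covariance_measure(w, X, z) <= c)
```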

2.4 Graph-based semi-supervised learning

2.4.1 Graph-based regularization

In graph-based regularization, the goal is to search for a function f on the graph that satisfies two criteria simultaneously: (1) it should be as close to the given labels as possible, and (2) it should be smooth over the entire constructed graph. The graph stores the geometric structure of the data (such as similarity or proximity) and uses this structure as a regularizer to infer the labels of unlabeled data. Generally, graph-based regularization methods adopt the following objective function,

$$\begin{aligned} {\mathcal {J}} = \mathcal {J_{C}} + \alpha \mathcal {J_{L}} \end{aligned}$$
(11)

where \( \mathcal {J_{C}} \) is the classification loss; \( \alpha \) is a balancing parameter; and \( \mathcal {J_{L}}\) is a graph-based regularizer. Different methods can use different variants of the regularizer. In this paper, we consider the Laplacian regularizer, as it is the most commonly used one, calculated as,

$$\begin{aligned} \mathcal {J_{L}} = \sum _{i,j} \theta _{ij}||f(x_{i})-f(x_{j})||^{2}. \end{aligned}$$
(12)

Here, \(\theta _{ij}\) is a graph-based weight. Each edge between a pair of data points i and j is weighted: the smaller their Euclidean distance \( d_{ij} \), the greater the weight \( \theta _{ij} \). In this paper, we choose a Gaussian similarity function to calculate the weights, given as follows:

$$\begin{aligned} \theta _{i j}=\exp \left( -\frac{d_{i j}^{2}}{\sigma ^{2}}\right) =\exp \left( -\frac{\sum _{d}\left( x_{i}^{d}-x_{j}^{d}\right) ^{2}}{\sigma ^{2}}\right) \end{aligned}$$
(13)

where \( \sigma \) is a length scale parameter. This parameter has an impact on the graph structure; hence, the value of \( \sigma \) needs to be selected carefully [21].
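The following sketch (our own, with an assumed default value of \( \sigma \)) builds the weighted graph of Eq. (13) together with the degree matrix D and the Laplacian L = D - A defined in Sect. 2.1.

```python
import numpy as np

def build_graph(X: np.ndarray, sigma: float = 0.5):
    """Return (A, D, L) for a data matrix X of shape (k, v)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise d_ij^2
    A = np.exp(-sq_dists / sigma ** 2)                          # theta_ij, Eq. (13)
    np.fill_diagonal(A, 0.0)   # drop self-similarity (a convention; L = D - A is unaffected)
    D = np.diag(A.sum(axis=1))                                  # d_ii = sum_j theta_ij
    L = D - A                                                   # graph Laplacian
    return A, D, L
```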

2.4.2 GNN-based SSL

Another line of work that has received a lot of attention recently is GNNs [23, 26]. The main idea is that the representation vector of a node can incorporate information from the structure of the graph as well as any associated feature information. A graph neural network aggregates the neighboring nodes' features into a hidden representation for a central node. This aggregation operation can also be applied to the hidden representations to form a deeper neural network. In general, a single aggregation operation can be represented as follows,

$$\begin{aligned} H^{l+1}= \upsilon \left( \Phi W^{l} H^{l}\right) \end{aligned}$$
(14)

where \( H^{l} \) is the hidden representation of the l-th layer; \( W^{l} \) is the trainable weight matrix in layer l; \(\upsilon \) is the activation function; and \( \Phi \) denotes the rule for aggregating neighboring information. Predictions for each node are made on top of the hidden representation of the last layer.
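As a minimal sketch of the aggregation step in Eq. (14), the snippet below implements one layer with a row-normalized adjacency matrix as the aggregation rule \( \Phi \); both the normalization choice and the row-major ordering H W (rather than W H) are our own conventions.

```python
import numpy as np

def aggregate(H: np.ndarray, W: np.ndarray, A: np.ndarray) -> np.ndarray:
    """One GNN layer: H^{l+1} = relu(Phi @ H^{l} @ W^{l}), with Phi = D^{-1}(A + I)."""
    A_hat = A + np.eye(A.shape[0])                   # add self-connections
    Phi = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized aggregation rule
    return np.maximum(Phi @ H @ W, 0.0)              # ReLU activation
```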

3 Fairness constraints in SSL on margin classifiers

In this section, we first present the proposed framework in Sect. 3.1. Then fairness metrics of disparate impact and disparate mistreatment in logistic regression are analyzed in Sect. 3.2, and finally a discussion is given in Sect. 3.3.

3.1 The proposed framework

We formulate the framework of fair SSL as follows, comprising the classification loss, the label propagation loss and the fairness constraints.

$$\begin{aligned}&\min _{{\mathbf {w}},\mathbf {y_{u}}} {\mathcal {J}}_{{\mathcal {C}}}({\mathbf {w}},\mathbf {y_{u}}) +\alpha {\mathcal {J}}_{{\mathcal {L}}}(\mathbf {y_{u}}) \qquad \qquad s.t. s({\mathbf {w}})\le c \end{aligned}$$
(15)

where \( {\mathcal {J}}_{{\mathcal {C}}} \) is the classification loss between predicted labels and true labels; \( {\mathcal {J}}_{{\mathcal {L}}} \) is the loss of label propagation from labeled data to unlabeled data; \( \alpha \) is a parameter to balance the loss; \( s({\mathbf {w}}) \) is the expression of fairness constraints; and c is a threshold.

3.1.1 Classification loss

A classification loss function evaluates how well a specific algorithm models the given dataset. When different algorithms, such as logistic regression or neural networks, are used for training, the corresponding loss function is applied to evaluate the accuracy of the model.

3.1.2 Label propagation loss

According to [22], when the Laplacian regularizer is used, the label propagation loss \( {\mathcal {J}}_{{\mathcal {L}}} \) in SSL can be expressed as,

$$\begin{aligned} {\mathcal {J}}_{{\mathcal {L}}}= \min _{ \mathbf {y_{u}}} {\text {Tr}}( {\mathbf {y}}^{T} L {\mathbf {y}} ) \end{aligned}$$
(16)

where Tr denotes the trace, and the vector \( {\mathbf {y}}=[{\mathbf {y}}_{l};{\mathbf {y}}_{u}] \in {\mathbb {R}}^{k}\) includes labels of labeled and unlabeled data.

3.1.3 Fairness constraints

Adding fairness constraints is a useful way to enforce fair learning with in-processing methods. In SSL, labeled data and unlabeled data have different impacts on discrimination for two reasons: (1) predicting labels for unlabeled data introduces label noise; and (2) labeled data and unlabeled data may have different data distributions. Therefore, the discrimination inherent in unlabeled data differs from the discrimination in labeled data. For these reasons, we impose fairness constraints on labeled and unlabeled data separately to measure discrimination and examine the differing effects of these constraints. We consider four cases of fairness constraints enforced on the training data (a short sketch of how the four cases can be assembled follows the list):

  • 1. Labeled constraint: The fairness constraint is on labeled data.

  • 2. Unlabeled constraint: The fairness constraint is on unlabeled data.

  • 3. Combined constraint: The fairness constraint is on labeled data and unlabeled data separately.

  • 4. Mixed constraint: The fairness constraint is on labeled and unlabeled data together.
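The sketch below (our own illustration) shows how the four cases could be assembled from the covariance measure of Eq. (6): each returned value is the quantity that the corresponding case requires to stay below the threshold c.

```python
import numpy as np

def cov(w, X, z):
    """Covariance measure of Eq. (6) on one subset of the data."""
    return abs((X @ w) @ (z - z.mean())) / X.shape[0]

def constraint_values(case, w, Xl, zl, Xu, zu):
    if case == "labeled":        # case 1: constraint on labeled data only
        return [cov(w, Xl, zl)]
    if case == "unlabeled":      # case 2: constraint on unlabeled data only
        return [cov(w, Xu, zu)]
    if case == "combined":       # case 3: one constraint per subset
        return [cov(w, Xl, zl), cov(w, Xu, zu)]
    if case == "mixed":          # case 4: a single constraint on the pooled data
        X = np.vstack([Xl, Xu])
        z = np.concatenate([zl, zu])
        return [cov(w, X, z)]
    raise ValueError(case)
```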

3.2 Fair SSL of logistic regression

In this section, we propose algorithms to solve the optimization problem (15) with a binary logistic regression (LR) classifier. (Other margin classifiers can also be applied in our method; we give another example with support vector machines in the supplemental material.) The classifier is subject to the fairness metric of disparate impact with mixed labeled and unlabeled data. The objective function of LR is defined as,

$$\begin{aligned} {\mathcal {J}}_{{\mathcal {C}}}^{LR}= -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}}) \end{aligned}$$
(17)

where \({\mathbf {p}}=\frac{1}{1+e^{-{\mathbf {w}}^{T} X}}\) is the probability of mapping X to the class label \( {\mathbf {y}} \), and \( {\mathbf {1}} \) denotes a column vector with all its elements being 1. Given the logistic regression loss, the label propagation loss and the fairness metric, the optimization problem (15) takes the form,

$$\begin{aligned} \begin{aligned}&\min _{{\mathbf {w}}, \mathbf {y_u}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}}) + \alpha {Tr}( {{\mathbf {y}}} ^{T} L {{\mathbf {y}}} ) \\&s.t. \mid \frac{1}{k} \mathbf {g_{w}} \left( {\mathbf {z}}-{\bar{z}}\right) \mid \le c \\ \end{aligned} \end{aligned}$$
(18)

3.2.1 Disparate impact

First, we solve the optimization problem with disparate impact as the fairness metric. The optimization of problem (18) includes two parts: learning the weights \( {\mathbf {w}} \) and the predicted labels of unlabeled data \( \mathbf {y_{u}} \). The basic idea of the solution is that, because of the fairness constraint, the weight \( {\mathbf {w}} \) is updated towards a fair direction, and using the updated \( {\mathbf {w}} \) to update \( \mathbf {y_u}\) also ensures that \( \mathbf {y_u}\) is directed towards fairness. The problem is solved by updating \( {\mathbf {w}} \) and \( \mathbf {y_{u}} \) iteratively as follows.

Solving \( {\mathbf {w}} \) when \( \mathbf {y_{u}} \) is fixed, the problem (18) becomes

$$\begin{aligned} \begin{aligned}&\min _{{\mathbf {w}}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}}) \\&s.t. \mid \frac{1}{k} {\mathbf {w}}^{T} X \left( {\mathbf {z}}-{\bar{z}}\right) \mid \le c \\ \end{aligned} \end{aligned}$$
(19)

Note that problem (19) is a convex problem that can be written as a regularized optimization problem by moving fairness constraints to the objective function. The optimal \( {\mathbf {w}}^* \) can then be calculated by using Karush-Kuhn-Tucker (KKT) conditions.
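A minimal cvxpy sketch of this w-step is given below; it assumes the cvxpy package is available, labels in {0, 1}, and expresses problem (19) directly as a convex program rather than via the KKT conditions.

```python
import cvxpy as cp
import numpy as np

def solve_w_step(X, y, z, c):
    """X: (k, v) features; y: (k,) true and current pseudo-labels in {0, 1};
    z: (k,) protected attribute; c: covariance threshold of problem (19)."""
    k, v = X.shape
    w = cp.Variable(v)
    scores = X @ w
    # -[y ln p + (1 - y) ln(1 - p)] = logistic(x_i^T w) - y_i x_i^T w
    loss = cp.sum(cp.logistic(scores) - cp.multiply(y, scores))
    fairness = [cp.abs((z - z.mean()) @ scores) / k <= c]   # disparate-impact constraint
    cp.Problem(cp.Minimize(loss), fairness).solve()
    return w.value
```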

Solving \( \mathbf {y_{u}} \) when \( {\mathbf {w}} \) is fixed, the problem (18) becomes

$$\begin{aligned} \min _{\mathbf {y_{u}}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}}) + \alpha {Tr}( {\mathbf {y}} ^{T} L {\mathbf {y}} ) \end{aligned}$$
(20)

Given that problem (20) is also convex, the optimal \( \mathbf {y_u} \) can be obtained from the derivative with respect to \(\mathbf {y_{u}} \) in problem (20). To calculate \(\mathbf {y_{u}} \) conveniently, we split the Laplacian matrix L into four blocks after the \( k_{l} \)-th row and the \( k_{l} \)-th column: \(L=\left[ \begin{array}{cc}{L_{l l}} &{} {L_{l u}} \\ {L_{u l}} &{} {L_{u u}}\end{array}\right] \). Taking the derivative of Eq. (20) w.r.t. \( \mathbf {y_{u}} \) and setting it to zero, we have

$$\begin{aligned} \alpha (2\mathbf {y_{u}}L_{uu} + L_{ul}\mathbf {y_{l}} + (\mathbf {y_{l}}L_{lu})^{T}) - [(ln({\mathbf {p}}))^{T}+(ln({\mathbf {1}}-{\mathbf {p}}))^{T}]=0 \end{aligned}$$
(21)

Note that L is a symmetric matrix and, after simplification, the closed updated form of \( \mathbf {y_{u}} \) can be derived from

$$\begin{aligned} \mathbf {y_{u}} = -L_{uu}^{-1} (L_{ul}\mathbf {y_{l}}+\frac{1}{2\alpha }[(ln({\mathbf {p}}))^{T}+(ln({\mathbf {1}}-{\mathbf {p}}))^{T}]) \end{aligned}$$
(22)

Note that the computed optimal \( \mathbf {y_u} \) takes continuous values and cannot be used to update \( {\mathbf {w}} \) directly, because only integer labels are allowed when optimizing \( {\mathbf {w}} \) in the next update. We therefore convert \( \mathbf {y_u} \) from continuous values to integers. Before using \( \mathbf {y_u}\) to update the next \( {\mathbf {w}} \), the value of \( y_{u,i} \in \mathbf {y_u},i=1,\ldots ,k_u \) is set to,

$$\begin{aligned} y_{u,i}=\left\{ \begin{aligned}&1,&y_{u,i} \ge \xi \\&0,&y_{u,i} < \xi \end{aligned} \right. \end{aligned}$$
(23)

where \( \xi \) is the threshold that determines the classification result. The optimization problem (18) can then be solved by optimizing \( {\mathbf {w}} \) and \( \mathbf {y_{u}} \) iteratively. \( {\textbf {Algorithm 1}} \) summarizes the solution of optimization problem (18) with disparate impact.
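The \( \mathbf {y_u} \)-step can be sketched as follows (our own reading of Eqs. (22)-(23), restricting \( {\mathbf {p}} \) to the unlabeled block and following Eq. (22) literally); \( \alpha \) and \( \xi \) are the parameters introduced above.

```python
import numpy as np

def update_yu(L, y_l, X_u, w, alpha=1.0, xi=0.5):
    """Closed-form update of y_u (Eq. 22) followed by thresholding (Eq. 23)."""
    k_l = y_l.shape[0]
    L_uu = L[k_l:, k_l:]
    L_ul = L[k_l:, :k_l]
    p_u = 1.0 / (1.0 + np.exp(-X_u @ w))          # predicted probabilities on unlabeled data
    grad_term = np.log(p_u) + np.log(1.0 - p_u)   # [(ln p)^T + (ln(1 - p))^T]
    y_u = -np.linalg.solve(L_uu, L_ul @ y_l + grad_term / (2.0 * alpha))   # Eq. (22)
    return (y_u >= xi).astype(int)                # Eq. (23)
```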


3.2.2 Disparate mistreatment

Disparate mistreatment metrics include the overall misclassification rate, false positive rate and false negative rate. For simplicity, the overall misclassification rate is used to analyze disparate mistreatment; the false positive rate and false negative rate can be analyzed in the same way, and results for all three disparate mistreatment metrics are presented in the experiments.

With the overall misclassification rate as the fairness metric, the objective function is denoted as,

$$\begin{aligned} \begin{aligned}&\min _{{\mathbf {w}}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}}) + \alpha {Tr}( {{\mathbf {y}}} ^{T} L {{\mathbf {y}}} ) \\&s.t. \mid \frac{1}{k} \mathbf {g_{w}} ({\mathbf {x}}) \left( {\mathbf {z}}-{\bar{z}}\right) \mid \le c \\ \end{aligned} \end{aligned}$$
(24)

Note that the fairness constraints of disparate mistreatment are non-convex, so solving the optimization problem (24) is more challenging than the optimization problem in (18). Next, we convert these constraints into a Disciplined Convex-Concave Program (DCCP); thus, the optimization problem (24) can be solved efficiently with recent advances in convex-concave programming [27].

The fairness constraint of disparate mistreatment can be split into two terms,

$$\begin{aligned} \frac{1}{k} \mid \sum _{ {\mathcal {D}}_{0}}(0-{\bar{z}}) \mathbf {g_{w}}+\sum _{ {\mathcal {D}}_{1}}(1-{\bar{z}}) \mathbf {g_{w}} \mid \le c \end{aligned}$$
(25)

where \( {\mathcal {D}}_0 \) and \( {\mathcal {D}}_1 \) are the subsets of the labeled dataset \( {\mathcal {D}}_{l} \) and unlabeled dataset \( {\mathcal {D}}_{u} \) with values \( z = 0 \) and \( z = 1 \), respectively. \( k_0 \) and \( k_1 \) are the numbers of data points in \( {\mathcal {D}}_0 \) and \( {\mathcal {D}}_1 \), so \( {\bar{z}} \) can be rewritten as \( {\bar{z}}=\frac{0*k_{0}+1*k_{1}}{k}=\frac{k_{1}}{k}\). Substituting this into Eq. (25), the fairness constraint of disparate mistreatment can be rewritten as,

$$\begin{aligned} \frac{1}{k} \mid \frac{k_{0}}{k}\sum _{{\mathcal {D}}_{1}} \mathbf {g_{w}}-\frac{k_{1}}{k}\sum _{ {\mathcal {D}}_{0}} \mathbf {g_{w}} \mid \le c \end{aligned}$$
(26)

Solving \( {\mathbf {w}} \) when \( \mathbf {y_{u}} \) is fixed, the problem (24) becomes

$$\begin{aligned} \begin{aligned}&\min _{{\mathbf {w}}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}})\\&s.t. \frac{1}{k} \mid \frac{k_{0}}{k}\sum _{ {\mathcal {D}}_{1}} \mathbf {g_{w}}-\frac{k_{1}}{k}\sum _{ {\mathcal {D}}_{0}} \mathbf {g_{w}} \mid \le c \\ \end{aligned} \end{aligned}$$
(27)

The optimization problem (27) is a Disciplined Convex-Concave Program (DCCP) for any convex loss, and can be solved with some efficient heuristics [27].
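A sketch of this DCCP step with the cvxpy "dccp" extension is shown below. It is an assumption-laden illustration rather than the exact implementation: labels are mapped to {-1, +1} to evaluate the OMR form of \( \mathbf {g_{w}} \) in Eq. (8), the absolute value in Eq. (26) is split into two one-sided constraints so that each side has known curvature, and the solver keywords tau and mu mirror the DCCP parameters discussed in Sect. 5.

```python
import cvxpy as cp
import dccp   # disciplined convex-concave programming extension for cvxpy
import numpy as np

def solve_w_step_omr(X, y, z, c, tau=0.05, mu=1.2):
    k, v = X.shape
    w = cp.Variable(v)
    scores = X @ w
    loss = cp.sum(cp.logistic(scores) - cp.multiply(y, scores))   # logistic loss
    y_pm = 2 * y - 1                                    # {0,1} labels mapped to {-1,+1}
    g_w = cp.minimum(0, cp.multiply(y_pm, scores))      # OMR signed-distance term, Eq. (8)
    mask0, mask1 = (z == 0).astype(float), (z == 1).astype(float)
    k0, k1 = mask0.sum(), mask1.sum()
    term1 = (k0 / k) * cp.sum(cp.multiply(mask1, g_w))  # (k0/k) * sum over D_1
    term0 = (k1 / k) * cp.sum(cp.multiply(mask0, g_w))  # (k1/k) * sum over D_0
    constraints = [term1 <= term0 + c * k,              # Eq. (26), split into
                   term0 <= term1 + c * k]              # two one-sided constraints
    prob = cp.Problem(cp.Minimize(loss), constraints)
    prob.solve(method="dccp", tau=tau, mu=mu)
    return w.value
```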

Solving \( \mathbf {y_{u}} \) when \( {\mathbf {w}} \) is fixed, the problem (24) becomes

$$\begin{aligned} \min _{\mathbf {y_{u}}} -\ln ({\mathbf {p}}) {\mathbf {y}}- \ln ({\mathbf {1}}-{\mathbf {p}}) ({\mathbf {1}}-{\mathbf {y}})+\alpha {Tr}( {{\mathbf {y}}} ^{T} L {{\mathbf {y}}} ) \end{aligned}$$
(28)

The solution of problem (28) is the same as that of problem (20): the closed form of \( \mathbf {y_u} \) is given by Eq. (22) and then thresholded via Eq. (23). The optimization problem (24) can then be solved by updating \( \mathbf {y_u} \) and \( {\mathbf {w}} \) iteratively. \( {\textbf {Algorithm 2}} \) summarizes this process.


3.3 Discussion

Based on the above analysis, some conclusions can be drawn:

  1. Since unlabeled data do not contain any label information, they do not carry biased label information, so we can take advantage of the unlabeled data to improve the trade-off between accuracy and fairness. In our framework, due to the fairness constraint, the weight \( {\mathbf {w}} \) is updated towards a fair direction. Using the updated \( {\mathbf {w}} \) to update \( \mathbf {y_u}\) also ensures that \( \mathbf {y_u}\) is directed towards fairness. In this way, fairness is enforced on labeled and unlabeled data by updating \( {\mathbf {w}} \) and \( \mathbf {y_u} \) iteratively. Therefore, labels of unlabeled data are calculated in a fair way, which benefits both the accuracy and the fairness of the classifier.

  2. Fairness constraints on labeled data and unlabeled data have different impacts on the training result because labeled and unlabeled data may exhibit different covariances between the sensitive attribute and the signed distance from the feature vectors to the decision boundary.

4 Fairness regularizers in SSL on graph neural networks

In this section, we present the proposed method for achieving fair SSL on GNNs. The main idea is to impose fairness regularizers on GNNs implemented in the SSL setting. In this way, GNN models can allocate gradient information from the classification loss and the fairness loss to ensure fairness. We first introduce a framework for fair SSL on GNNs, and then present a case of a fair graph convolutional network.

4.1 The proposed methods

Our goal is to learn a neural network function f(W) that optimizes two main objectives: classification accuracy and fairness. The loss function of the model is defined as,

$$\begin{aligned} {\mathcal {J}}({\mathcal {D}} ; W)= {\mathcal {J}}_{{\mathcal {C}}}({\mathcal {D}} ; W)+\beta {\mathcal {J}}_{{\mathcal {F}}}({\mathcal {D}} ; W) \end{aligned}$$
(29)

where \( {\mathcal {J}}_{{\mathcal {C}}}({\mathcal {D}} ;W) \) denotes the classification loss, and \( {\mathcal {J}}_{{\mathcal {F}}}({\mathcal {D}} ; W) \) denotes the fairness loss that imposes fairness regularizers on the output of the model. \( \beta \) adjusts the trade-off between the fairness loss and the accuracy loss. Typically, the cross-entropy loss is used as the classification loss.

4.1.1 Fairness constraints

The second term in the loss function exerts fairness on the learning function. Since the fairness constraints in Eqs. (7)–(10) are not differentiable, the fairness regularizers are instead defined according to the literal definitions of the fairness metrics. These regularizers can handle and optimize different fairness definitions, so the appropriate definition can be chosen according to the application.

The fairness regularizer of disparate impact is defined as,

$$\begin{aligned} {\mathcal {J}}^{DI}_{{\mathcal {F}}} = \mid \frac{\sum _{i=1}^{k} p_{i} z_{i}}{\sum _{i=1}^{k} z_{i}} - \frac{\sum _{i=1}^{k} p_{i}\left( 1-z_{i}\right) }{\sum _{i=1}^{k} 1-z_{i}} \mid \end{aligned}$$
(30)

where \( p_i \) denotes the predicted probability of the i-th data point belonging to the positive class, as calculated by the softmax function in the last layer of the network.

The regularizers for disparate mistreatment, including FPR, FNR and OMR, are defined as follows (a short sketch of all four regularizers is given after the equations),

$$\begin{aligned} {\mathcal {J}}^{FPR}_{{\mathcal {F}}}= & {} \mid \frac{\sum _{i=1}^{k} p_{i}\left( 1-y_{i}\right) z_{i}}{\sum _{i=1}^{k} z_{i}}-\frac{\sum _{i=1}^{k} p_{i}\left( 1-y_{i}\right) \left( 1-z_{i}\right) }{\sum _{i=1}^{k} 1-z_{i}} \mid \end{aligned}$$
(31)
$$\begin{aligned} {\mathcal {J}}^{FNR}_{{\mathcal {F}}}= & {} \mid \frac{\sum _{i=1}^{k}\left( 1-p_{i}\right) y_{i} z_{i}}{\sum _{i=1}^{k} z_{i}}-\frac{\sum _{i=1}^{k}\left( 1-p_{i}\right) y_{i}\left( 1-z_{i}\right) }{\sum _{i=1}^{k} 1-z_{i}} \mid \end{aligned}$$
(32)
$$\begin{aligned} {\mathcal {J}}^{OMR}_{{\mathcal {F}}}= & {} {\mathcal {J}}^{FPR}_{{\mathcal {F}}} + {\mathcal {J}}^{FNR}_{{\mathcal {F}}} \end{aligned}$$
(33)

4.2 Fair SSL of convolutional GNN

In this section, we study a case of a fair graph convolutional network (GCN), where a multi-layer graph convolutional network is used to optimize the loss in Eq. (29). We take GCN as an example because it achieves high performance in SSL tasks; our method can also be applied to other GNNs. The GCN model combines the graph structure and vertex features in the convolution, in which the features of unlabeled vertices are mixed with those of neighboring labeled vertices and then propagated over the graph through multiple layers.

The propagation rule of a multi-layer GCN is defined as [26],

$$\begin{aligned} H^{(l+1)}= \upsilon \left( {\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \end{aligned}$$
(34)

where \( {\tilde{A}} = A +I_{N} \) is the adjacency matrix of the undirected graph with added self-connections, and \( {\tilde{D}} \) is the corresponding degree matrix with \( {\tilde{D}}_{ii} = \sum _{j}{\tilde{A}}_{ij}\).

The model used in this paper is a two-layer GCN, and a softmax classifier is applied to the output features,

$$\begin{aligned} S={\text {softmax}}\left( {\hat{A}} {\text {ReLU}}\left( {\hat{A}} X W^{(0)}\right) W^{(1)}\right) \end{aligned}$$
(35)

where \( {\hat{A}} = {\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} \).

The classification loss is defined as the cross-entropy error over all labeled data points,

$$\begin{aligned} {\mathcal {J}}_{C}=-\sum _{l \in {\mathcal {Y}}_{L}} \sum _{f=1}^{F} Y_{l f} \ln S_{l f} \end{aligned}$$
(36)

where \( {\mathcal {Y}}_{L} \) is the set of indices of labeled vertices and F is the number of classes.

Given the GCN loss and the fairness regularizer, Eq. (29) takes the form,

$$\begin{aligned} -\sum _{l \in {\mathcal {Y}}_{L}} \sum _{f=1}^{F} Y_{l f} \ln S_{l f} + \beta {\mathcal {J}}_{{\mathcal {F}}}({\mathcal {D}} ; W) \end{aligned}$$
(37)

The model parameters W can be trained via gradient descent. In this paper, batch gradient descent is used in each training iteration.
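The following PyTorch sketch (our own implementation, assuming a dense adjacency matrix, binary classes and the disparate-impact regularizer of Eq. (30) as \( {\mathcal {J}}_{{\mathcal {F}}} \)) puts Eqs. (34)–(37) together for one training step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FairGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, n_classes, bias=False)

    def forward(self, A_hat, X):
        # S = softmax(A_hat ReLU(A_hat X W0) W1), Eq. (35)
        H = F.relu(A_hat @ self.W0(X))
        return F.softmax(A_hat @ self.W1(H), dim=1)

def normalize_adjacency(A):
    A_tilde = A + torch.eye(A.shape[0])                 # add self-connections
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]   # D^-1/2 A~ D^-1/2

def di_reg(p, z):                                       # Eq. (30)
    return torch.abs((p * z).sum() / z.sum() - (p * (1 - z)).sum() / (1 - z).sum())

def train_step(model, opt, A_hat, X, y, z, labeled_idx, beta=0.5):
    # y: long tensor of class indices; labeled_idx: indices of labeled nodes
    S = model(A_hat, X)
    ce = F.nll_loss(torch.log(S[labeled_idx] + 1e-12), y[labeled_idx])   # Eq. (36)
    fair = di_reg(S[:, 1], z.float())                   # fairness loss over all nodes
    loss = ce + beta * fair                             # Eq. (37)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```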

4.3 Discussion

  1. GCN naturally combines the structure and features of the graph in the convolution, and thus avoids explicit graph Laplacian regularization. Our method allows the GCN model to allocate gradient information from the classification loss and the fairness loss. Therefore, fair representations of nodes with labeled and unlabeled data can be learned to achieve fair SSL.

  2. The parameter \( \beta \) adjusts the discrimination level. A higher \( \beta \) imposes a higher penalty on the fairness loss and thus decreases the discrimination level. However, a very large \( \beta \) may hurt the expressive ability of the model.

5 Experiment

In this section, we first describe the experimental setup, including datasets, baselines, and parameters. Then, we evaluate our method on three real-world datasets under the fairness metrics of disparate impact and disparate mistreatment (including OMR, FNR and FPR). The aim of our experiments is to assess: the effectiveness of our methods in achieving fair semi-supervised learning; the impact of different fairness constraints on fairness; and the extent to which unlabeled data can balance fairness with accuracy.

5.1 Experimental setup

5.1.1 Dataset

Our experiments involve three real-world datasets: the Health dataset, the Titanic dataset and the Bank dataset. When GNN models are used for training, structured datasets need to be processed into graphs. To construct graph-structured data from structured data, we need to build an adjacency matrix to describe the topological relationships. In our experiments, for simplicity, we build the adjacency matrix from the Gaussian similarity of Eq. (13), which is based on the Euclidean distance between data points.

  • The task in the Health dataset is to predict whether people will spend time in the hospital. In order to convert the problem into the binary classification task, we only predict whether people will spend any day in the hospital. After data preprocessing, the dataset contains 27,000 data points with 132 features. We divide patients into two groups based on age (\( \ge \)65 years) and consider ’Age’ to be the sensitive attribute.

  • The Bank dataset contains a total of 41,188 records with 20 attributes and a binary label, which indicates whether the client has subscribed (positive class) or not (negative class) to a term deposit. We consider ’Age’ as sensitive attribute.

  • The Titanic dataset comes from a Kaggle competition where the goal is to analyze which sorts of people were likely to survive the sinking of the Titanic. We consider "Gender" as the sensitive attribute. After data preprocessing, we extract 891 data points with 9 features.

5.1.2 Parameters

The sensitive attributes are excluded from the training set to ensure fairness between groups and are only used to evaluate discrimination in the test phase. The Health, Bank and Titanic datasets are fully labeled. In the Health dataset, we sample 4,000 data points as the labeled dataset, 4,000 data points as the test dataset, and use the rest as the unlabeled dataset; the same split sizes are used in the Bank dataset. In the Titanic dataset, we sample 200 data points as the labeled dataset, 200 data points as the test dataset, and use the rest as the unlabeled dataset. Therefore, \({\mathcal {D}}_{l}\) and \({\mathcal {D}}_{u} \) are drawn from similar data distributions.

In the experiments, the reported results are averaged over 10 runs with randomly sampled labeled, test and unlabeled datasets.

We set \( \alpha =1 \) and \( \xi = 0.5 \) for all datasets. \(\sigma \) is a length-scale parameter that affects the graph structure; we set \( \sigma =0.5 \) in the Health and Bank datasets and \( \sigma =0.1 \) in the Titanic dataset using binary search. \(\tau \) and \(\mu \) are parameters of DCCP. \(\tau \) trades off satisfying the constraints against minimizing the objective, and we set \(\tau =0.05\) and \(\tau =1\) in the Bank and Titanic datasets, respectively, by binary search. \(\mu \) sets the rate at which \(\tau \) increases inside the algorithm, and we use the default value \(\mu =1.2\) in both datasets.

5.1.3 Baseline methods

The methods chosen for comparison are listed as follows. PS, US and FES only apply to the fairness metric of disparate impact, so they are compared with our methods under disparate impact. FC and FMLP are compared with our methods under both disparate impact and disparate mistreatment. It is worth noting that [28] also used unlabeled data for fairness; however, they only applied the equal opportunity metric, which differs from ours, so we do not compare the proposed methods with theirs.

  • Fairness Constraints (FC): Fairness constraints are used to ensure fairness for classifiers [6].

  • Uniform Sampling (US): The number of data points in each group is equalized through oversampling and/or undersampling [29].

  • Preferential Sampling (PS): The number of data points in each group is equalized by taking samples near the borderline data points [29].

  • Fair multilayer perceptron neural networks (FMLP): The method is built on a multilayer perceptron (MLP) neural network for SSL in the in-processing phase, where unlabeled data is assigned labels via pseudo-labeling [30].

  • Fairness-enhanced sampling (FES): A fair SSL framework that includes pseudo-labeling, re-sampling and ensemble learning [31].

Fig. 1 The trade-off between accuracy and discrimination in the proposed method FS-LR (Red), FC (Blue), US (Blue cross), PS (Yellow cross) and FES (Green cross) under the fairness metric of disparate impact with LR in two datasets. As the threshold of covariance c increases, accuracy and discrimination increase. The results demonstrate that our method achieves a better trade-off between accuracy and discrimination than other methods

5.2 Experimental results of disparate impact

5.2.1 Trade-off between accuracy and discrimination

Figure 1 shows how accuracy and discrimination level vary with c for the proposed method fair semi-LR (FS-LR) and the other methods with LR on two datasets. From the results, we observe that our framework provides a better trade-off between accuracy and discrimination: at the same accuracy, discrimination is lower, and at the same discrimination, accuracy is higher. For example, at the same level of accuracy on the Titanic dataset, our method FS-LR has a discrimination level of around 0.08, while the FC method has a discrimination level of 0.11. Similar observations can be made for the PS method (Yellow cross), US method (Blue cross) and FES method (Green cross). Note that the discrimination curve (red line) with LR in the Health dataset does not extend further because discrimination does not increase as c grows.

Figure 2 shows the accuracy and discrimination level of the proposed fair GCN (FGCN) and the baseline method FMLP as \( \beta \) varies. The results show that FGCN achieves a better trade-off between accuracy and discrimination than FMLP. We attribute this to GCN's effective utilization of the structural and feature information of unlabeled data.

5.2.2 Different fairness constraints

Our next set of experiments determines the impact of different fairness constraints. For these tests, the size of the unlabeled dataset is set to 12,000 data points in the Health dataset and 400 data points in the Titanic dataset. Due to space limitations, we only report the results for LR, which appear in Tables 1 and 2. The results show that, when varying the threshold of covariance c, different fairness constraints on labeled and unlabeled data have different impacts on the training results. As the threshold of covariance increases, both accuracy and discrimination level increase before leveling off. In terms of accuracy, this is because a larger c allows a larger space in which to find better weights \( {\mathbf {w}} \) for classification. In terms of discrimination, a larger c tends to introduce more discrimination in noise.

It is also observed that the fairness constraint on mixed data generally gives the best trade-off between accuracy and discrimination. The other three constraints have very similar accuracy and discrimination levels. We attribute this to the assumption that labeled and unlabeled data have similar data distributions, so the mixed fairness constraint on labeled and unlabeled data gives the best description of the covariance between the sensitive attribute and the signed distance from the feature vectors to the decision boundary.

Fig. 2 The trade-off between accuracy and discrimination in the proposed method FGCN (Red), FMLP (Blue), US (Blue cross), PS (Yellow cross) and FES (Green cross) under the fairness metric of disparate impact in two datasets. As the parameter \( \beta \) increases, accuracy and discrimination decrease

Table 1 The impact of fairness constraints on different datasets in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of disparate impact with FS-LR in the Health dataset
Table 2 The impact of fairness constraints on different datasets in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of disparate impact with FS-LR in the Titanic dataset

5.2.3 The impact of unlabeled data

For these experiments, we set the covariance threshold \( c=1 \) for the Health and Titanic datasets, and the parameter \( \beta = 0.5\) in the Health dataset and \( \beta = 0.8 \) in the Titanic dataset. Figure 3 shows how accuracy and discrimination level vary with the amount of unlabeled data for the FS-LR and FGCN methods on both datasets. As shown, accuracy increases as the amount of unlabeled data increases in both datasets before stabilizing at its peak. The discrimination level sharply decreases almost immediately, then stabilizes or continues to decrease. We can explain why unlabeled data helps to reduce discrimination according to [19, 31], where discrimination is decoupled into discrimination in bias, discrimination in variance and discrimination in noise. With an increasing amount of unlabeled data, discrimination in variance decreases, leading to a decrease in overall discrimination.

Fig. 3 The impact of the amount of unlabeled data in the training set on accuracy (Red) and discrimination level (Blue) under the fairness metric of disparate impact with FS-LR and FGCN in two datasets. The X-axis is the size of unlabeled dataset; left y-axis is accuracy; and right y-axis is discrimination level

5.3 Experimental results of disparate mistreatment

5.3.1 Trade-off between accuracy and discrimination

Figures 4, 5 and 6 show that as the threshold of covariance c increases in FS-LR, and as the parameter \( \beta \) increases in FGCN, accuracy and discrimination increase under the fairness metrics of OMR, FPR and FNR. From the results, we observe that the curves of our proposed methods FS-LR and FGCN (Red line) generally lie above and to the left of those of the FC and FMLP methods (Blue line). This indicates that our framework provides a better trade-off between accuracy and discrimination under all three metrics most of the time. For example, at the same level of accuracy (Acc = 0.885) on the Bank dataset under OMR, our method FS-LR has a discrimination level of around 0.045, while the FC method has a discrimination level of 0.06. We also observe that the discrimination level differs considerably across fairness metrics: for example, it can reach 0.17 under FNR, while it only reaches 0.01 under FPR. In addition, we note that accuracy and discrimination level depend on the training model; in the Bank dataset, FGCN generally has lower accuracy and discrimination than FS-LR.

Fig. 4 The trade-off between accuracy and discrimination in the proposed methods FS-LR and FGCN (Red) and the baselines FC and FMLP (Blue) in two datasets under the metric of overall misclassification rate. The results demonstrate that our methods using unlabeled data achieve a better trade-off between accuracy and discrimination

Fig. 5 The trade-off between accuracy and discrimination in the proposed methods FS-LR and FGCN (Red) and the baselines FC and FMLP (Blue) in two datasets under the metric of false negative rate

Fig. 6 The trade-off between accuracy and discrimination in the proposed methods FS-LR and FGCN (Red) and the baselines FC and FMLP (Blue) in two datasets under the metric of false positive rate

5.3.2 Different fairness constraints under OMR

Tables 3 and 4 show that different fairness constraints on labeled and unlabeled data have different impacts on the training results. Due to space limitations, we only report the results for FS-LR under the metric of OMR on the Bank and Titanic datasets. For these tests, the size of the unlabeled dataset is set to 4,000 data points in the Bank dataset and 400 data points in the Titanic dataset. As shown, when varying the threshold of covariance c, different fairness constraints on labeled and unlabeled data differ substantially in their effects on the training results. When the fairness constraint is enforced on labeled data, accuracy and discrimination increase with c in the Titanic dataset. This is because a smaller c enforces a lower discrimination level, which results in lower accuracy.

However, when the fairness constraint is enforced on unlabeled data, accuracy and discrimination can decrease as c increases. This is because the labels of unlabeled data appear in the fairness constraint of disparate mistreatment and are updated during training, which means the distribution of unlabeled data is not described well during training. As a result, the fairness constraint on unlabeled data is not as effective.

Table 3 The impact of fairness constraints on different datasets in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of overall misclassification rate with FS-LR in the Bank dataset
Table 4 The impact of fairness constraints on different datasets in terms of accuracy (Acc) and discrimination level (Dis) under the fairness metric of overall misclassification rate with FS-LR in the Titanic dataset

5.3.3 The impact of unlabeled data under OMR

For these experiments, we show the impact of unlabeled data on OMR. The covariance threshold is set to \( c=1 \) for the Bank and Titanic datasets. Figure 7 shows how accuracy and discrimination level vary with the size of the unlabeled dataset for FS-LR and FGCN on the two datasets. As shown, accuracy increases with the amount of unlabeled data in both datasets before reaching a peak. The discrimination level decreases at the beginning and then stabilizes in the Titanic dataset. These results indicate that discrimination in variance decreases as the amount of unlabeled data in the training set increases.

Fig. 7 The impact of the amount of unlabeled data in the training set on accuracy (Red) and discrimination level (Blue) under the fairness metric of overall misclassification rate with FS-LR and FGCN in two datasets. The X-axis is the size of unlabeled dataset; left y-axis is accuracy; and right y-axis is discrimination level

5.4 Discussion and summary

5.4.1 Discussion

We now compare the two methods to provide some guidance on which to use in practice. 1) FGCN is suitable for training on large datasets, while FSMC may not scale because the DCCP solver struggles to process a large number of data points. 2) FGCN is suitable for multi-class problems, while FSMC cannot be directly applied to multi-class problems. 3) FSMC admits a closed-form solution for label propagation, which makes it attractive in practice with a low computational cost, while FGCN is generally more computationally expensive.

5.4.2 Summary

From these experiments, we can draw some conclusions. 1) The proposed methods, FSMC and FGNN, can make use of unlabeled data to achieve a better trade-off between accuracy and discrimination. 2) In FSMC, the fairness constraint on mixed labeled and unlabeled data generally gives the best trade-off between accuracy and discrimination under disparate impact, while the fairness constraint on labeled data achieves the best trade-off under disparate mistreatment. 3) More unlabeled data generally helps to make a better compromise between accuracy and discrimination. 4) The choice of model affects the trade-off between accuracy and discrimination; our experiments show that FGNN achieves a better trade-off than FSMC.

6 Related work

6.1 Fair supervised learning

Methods for fair supervised learning include pre-processing, in-processing and post-processing methods. In pre-processing, discrimination is eliminated by guiding the distribution of the training data towards a fairer direction [29] or by transforming the training data into a new space [14, 32,33,34]. Subsequent studies extended fair representations to more fairness metrics and more general tasks [35,36,37,38]. The main advantage of pre-processing methods is that they do not require changes to the machine learning algorithm, so they are very simple to use.

In in-processing, discrimination is restricted by fairness constraints or regularizers during the training phase. For example, Kamishima et al. [39] used a regularizer term to penalize discrimination in the learning objective. Konstantinov et al. showed that fairness regularizers applied during training can greatly improve the fairness of rankings [40]. [6, 24, 41] designed a convex fairness constraint, called decision boundary covariance, to achieve fair classification. Some work presented the constrained optimization problem as a two-player game and formalized the definition of fairness as a linear inequality [42,43,44,45]; this approach is more flexible for optimizing different fairness constraints, and solutions using it are considered among the most robust. Recent work extended in-processing methods to more complex cases [46,47,48]. For example, Perrone et al. proposed a general constrained Bayesian optimization framework to optimize model performance [47], and Chikahara et al. studied individual fairness with a path-specific causal-effect constraint [48].

A third approach to achieving fairness is post-processing, where a learned classifier is modified to adjust its decisions to be non-discriminatory across groups [13, 49, 50]. Post-processing does not require changes to the classifier, but it cannot guarantee an optimal classifier. Awasthi et al. further studied equalized-odds post-processing with a perturbed attribute [51]. Putzel et al. worked on the predictions of a black-box machine learning classifier to achieve fairness in a multiclass setting [52].

6.2 Fair unsupervised learning

Chierichetti et al. [15] were the first to study fairness in clustering problems. Their solution, under both the k-center and k-median objectives, required every group to be (approximately) equally represented in each cluster. Many subsequent works have since addressed fair clustering. Among these, Rosner et al. [18] extended fair clustering to more than two groups. Schmidt et al. [53] considered the fair k-means problem in the streaming model, defined fair coresets and showed how to compute them in a streaming setting, resulting in a significant reduction in input size. Bera et al. [54] presented a more generalized approach to fair clustering, providing a tunable notion of fairness in clustering. Li et al. [55] defined a new fairness metric for clustering and incorporated group fairness into the algorithmic centroid clustering problem.

6.3 Comparing with other work

Existing fair learning methods mainly focus on supervised and unsupervised learning and cannot be directly applied to SSL. As far as we know, only [28, 30, 31] have explored fairness in SSL. Chzhen et al. [28] studied the Bayes classifier under the fairness metric of equal opportunity, where labeled data is used to learn the output conditional probability, and unlabeled data is used to calibrate the threshold in the post-processing phase. However, unlabeled data is not fully used to eliminate discrimination, and the proposed method only applies to equal opportunity. In [30], the proposed method is built on neural networks for SSL in the in-processing phase, where unlabeled data is assigned labels via pseudo-labeling. Zhang et al. [31] proposed a pre-processing framework that includes pseudo-labeling, re-sampling and ensemble learning to remove discrimination. Our solution focuses on margin-based classifiers in the in-processing stage, as in-processing methods have demonstrated good flexibility in balancing fairness and supporting multiple classifiers and fairness metrics. A few studies have examined fair graph learning. For example, Rahman et al. studied how to learn fair node representations [56], whereas we focus on fair graph-based SSL. Kang et al. studied individual fairness in graph mining [57], whereas we focus on group fairness in graph-based SSL.

7 Conclusion

In this paper, we study how to improve the trade-off between fairness and accuracy with unlabeled data. We propose two methods of fair graph-based SSL that operate during the in-processing phase. Our first method is formulated as an optimization problem with the goal of finding weights and labels for unlabeled data by minimizing the loss function subject to fairness constraints. We analyze several different cases of fairness constraints for their effects on the optimization problem as well as on the accuracy and discrimination level of the results. The second method is built on GNN models with fairness regularizers, ensuring that fair representations of nodes with labeled and unlabeled data can be learned. Our experiments confirm this analysis, showing that the proposed framework provides high levels of both accuracy and fairness in semi-supervised settings.