Introduction

In recent years, artificial intelligence (AI)-based systems have assisted humans with decision-making from streaming data in various domains such as finance, criminal justice and education [1,2,3]. However, as AI becomes more sophisticated and automated, it is important to provide users with trustworthy, reliable and fair models. For example, an investigation into the fairness of online advertisements on Facebook and Google Ads revealed that online housing advertisements excluded women, families with children and people with disabilities from their target audiences. Such practices clearly disadvantage these groups or individuals, usually called underrepresented groups. Sensitive features such as gender and race should be hidden from, or protected in, the decision-making process to promote a fair society. In addition, with AI’s increasing impact on users’ lives, it becomes essential to comprehend how AI makes decisions and the reasoning behind such decisions. Therefore, fairness-aware machine learning (ML) with explainability has recently become an important topic in the ML research community.

Fairness-aware ML algorithms can be categorised into three approaches: pre-processing, post-processing and in-processing. Pre-processing techniques transform the data to remove discrimination, for example by relabelling instances or assigning different weights to different groups. However, these methods have limited effectiveness as standalone solutions unless combined with carefully designed learners, which makes them ill-suited to data streams [4]. The post-processing approach modifies the output predictions to mitigate discrimination and is used when the learnt models cannot be modified. However, transferring such techniques to online settings is more complex because the decision boundary or predictions can evolve due to the non-stationary distributions of streaming data [4]. Existing fairness-aware algorithms usually follow the in-processing approach [5], which takes fairness into account during model training. The in-processing approach stands out in the streaming learning context [4,5,6,7,8]: it intervenes in the model during learning, allowing it to adapt over time, provide immediate feedback, continuously observe model behaviour and apply adaptive interventions. However, most of these approaches target static problems where the data distribution is assumed unchanged. They are not applicable to data streams, where the distribution, or the underlying concepts and patterns the model is trying to capture, can change at any time; such concept drift can negatively impact the performance of the learnt model. In the data stream context, the adaptive fairness-aware decision tree classifier (FAHT) [7] and its later version 2FAHT [4] have been developed for fairness-aware classification based on the Hoeffding tree (HT) algorithm. HT has shown promise not only for handling concept drift but also for producing self-explaining models in dynamic environments. One significant advantage of HT is that it produces transparent decision tree models, allowing users to understand how decisions are made. This feature helps build trust in the AI system’s decision-making process, which is critical in sensitive applications. Therefore, FAHT and 2FAHT inherit HT’s advantages and successfully mitigate discrimination. However, FAHT and 2FAHT have limited control over the trade-off between accuracy and discrimination.

Since discrimination and accuracy are conflicting objectives, in this study, we propose a multi-objective fairness-aware classification algorithm based on HT and the speed-constrained multi-objective particle swarm optimisation method (SMPSO) [9] to overcome the above-mentioned limitations. SMPSO is chosen because it holds promise for addressing dynamic optimisation problems [10], especially through its speed-constraint mechanism, which improves optimisation speed [9] and makes it well suited to online learning scenarios. Although HT produces transparent models, their explainability degrades if the model grows too large. Therefore, we minimise the model size to improve explainability. In this study, SMPSO is used to evolve feature weights that maximise accuracy, minimise discrimination (for model performance) and reduce the HT model size (for explainability).

In summary, in this paper, we propose:

  • A new incremental classification method called FOMOS (fairness optimisation with a multi-objective swarm) that can learn a decision tree for multi-variable data stream classification and adapt the learnt classifier over time.

  • A modified version of HT that incorporates different weights for different features in the node-splitting criterion.

  • A new formulation, balanced information gain, that allows flexible control of the trade-off between conflicting objectives on each feature over time.

  • A new multi-objective PSO-based algorithm that simultaneously maximises accuracy and minimises discrimination and model complexity.

The rest of the paper is organised as follows. The background and related works are introduced in Sect. 2. The details of our method are presented in Sect. 3. Section 4 describes the datasets and experimental setup used to validate the method and analyses the results. Conclusions and future work are discussed in Sect. 5.

Background

Discrimination measures

Measuring discrimination in ML models is the first step to evaluating and improving fairness in learned models. Various discrimination measures have been proposed, such as statistical parity, equalised odds and the absolute between-ROC area [11]. Statistical parity is one of the most popular measures and is used in many studies [7, 12, 13]. This measure reports the difference between the probabilities of the unprotected and protected groups being assigned the positive class (Y=1).

Let D be a binary classification dataset in which each instance \(x_t\) is represented as (\(A_1\), \(A_2\), \(\ldots \), \(A_n\), \(Y_t\)), where A is the set of features and Y is the class label taking values in {0, 1}. The sets of possible values (domains) of the features and the label are dom(\(A_i\)), \(i \in \{1,\ldots , n\}\), and dom(Y), respectively. \({\hat{Y}} \in \{0, 1\}\) denotes the predicted label. Let \(S \in \{u, p\}\) be the sensitive feature (e.g. gender, race), with u and p denoting the unprotected and protected groups, respectively. Equation (1) shows the intrinsic discrimination level of the dataset, while Eq. (2) shows the discrimination of the classifier results on the dataset.

$$\begin{aligned} \textrm{DataDisc}(D) = P(Y = 1 \mid S = u) - P(Y = 1 \mid S = p), \end{aligned}$$
(1)
$$\begin{aligned} \textrm{Disc}(D) = P({\hat{Y}} = 1 \mid S = u) - P({\hat{Y}} = 1 \mid S = p). \end{aligned}$$
(2)

\(\textrm{Disc}(D)\) and \(\textrm{DataDisc}(D)\) range in [−1, 1], where 0 indicates no discrimination, 1 indicates that the protected group is totally discriminated against and −1 indicates that the unprotected group is discriminated against.
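For concreteness, the following is a minimal Python sketch of how the statistical parity measure in Eqs. (1) and (2) can be computed on a data chunk; the function name and the default group encoding are ours, not part of the proposed method.

```python
import numpy as np

def statistical_parity(y, s, protected="p", unprotected="u"):
    """Eqs. (1)/(2): P(y = 1 | S = u) - P(y = 1 | S = p).

    y : labels (Eq. 1) or classifier predictions (Eq. 2), values in {0, 1}
    s : sensitive-feature values identifying the two groups
    """
    y, s = np.asarray(y), np.asarray(s)
    p_mask, u_mask = (s == protected), (s == unprotected)
    if p_mask.sum() == 0 or u_mask.sum() == 0:
        raise ValueError("one group is empty; the measure is undefined")
    return y[u_mask].mean() - y[p_mask].mean()
```

Passing the true labels gives DataDisc(D), while passing the model's predictions gives Disc(D).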

Fairness-aware machine learning

Fairness-aware machine learning lies at the intersection of computer science and social science [11]. It aims to produce algorithms that build predictive models which mitigate discrimination: groups with similar non-sensitive characteristics should receive similar predictions. Many methods have been proposed to minimise discrimination without reducing predictive performance. Most of them fall into three approaches: pre-processing, in-processing and post-processing.

Pre-processing methods try to neutralise or eliminate discrimination in the data. Massaging [14] and re-weighting [15] are popular techniques in this approach. The former selects instances and directly modifies the data distribution by swapping their labels; the selection and swapping are based on a ranker that minimises information loss while reducing discrimination. The latter assigns weights to instances so that under-represented combinations of sensitive group and class label (e.g. protected-group instances with the positive label) receive higher weights, reducing discrimination. Fairness-enhancing interventions in stream classification (FEI) follow a pre-processing approach that intervenes to neutralise data chunks before sending them to the classifier model [13]. The FEI process includes two stages: correcting data and updating the classifier. First, FEI detects increasing discrimination in incoming data chunks; the data chunks are corrected by chunk-based massaging and chunk-based re-weighting techniques. Second, FEI utilises stream classifiers such as Hoeffding trees [16], accuracy-updated ensembles [17], Naive Bayes [18] and KNN [19] to deal with concept drift in streaming data. However, this method does not specify how to determine which data correction technique is appropriate for each classifier. Another pre-processing technique, FABBOO (fairness-aware batch-based online oversampling), has been developed to address fairness concerns when learning under class imbalance in online environments [20]. As a pre-processing technique, FABBOO performs oversampling on the minority-class instances within each batch. The oversampling is designed to address class imbalance while also considering fairness concerns to reduce bias towards certain sensitive features. However, because FABBOO oversamples minority instances, it increases the data size and potentially affects computational resources and processing time. Besides, FABBOO requires pre-defined parameters, namely the oversampling rate and the discrimination threshold, which can affect its performance.
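As an illustration of the re-weighting idea in [15], the sketch below assigns each (group, label) combination the weight expected-probability / observed-probability, so that under-represented combinations are up-weighted; the exact chunk-based variant used by FEI may differ.

```python
import numpy as np

def reweight(y, s):
    """Kamiran-Calders style re-weighting: each (group, label) combination gets
    weight P(group) * P(label) / P(group, label), so under-represented
    combinations (e.g. protected group with positive label) are up-weighted."""
    y, s = np.asarray(y), np.asarray(s)
    weights = np.empty(len(y))
    for g in np.unique(s):
        for c in np.unique(y):
            mask = (s == g) & (y == c)
            p_expected = (s == g).mean() * (y == c).mean()
            p_observed = mask.mean()
            weights[mask] = p_expected / p_observed if p_observed > 0 else 0.0
    return weights
```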

In-processing approaches modify models by incorporating fairness measures into the objective functions to mitigate discrimination. Since discrimination derives from sensitive features, current methods reduce the effect of sensitive features in their models. For example, the adaptive fairness-aware decision tree classifier (FAHT) [7] and the fairness-enhancing and concept-adapting decision tree classifier (FEAT) [8] are two methods based on the Hoeffding tree (HT). In HT, each internal node corresponds to a feature, and the tree grows based on a splitting criterion: a feature with high information gain becomes a decision node that affects the classification results. FAHT and FEAT introduce a new splitting criterion that considers both information gain and discrimination. The results show that these methods successfully mitigate discrimination. However, the trade-off between accuracy and discrimination is fixed by the algorithm and cannot be adjusted. To overcome this limitation, 2FAHT [4] offers flexible control over the trade-off between accuracy and discrimination: users can choose to focus on accuracy or discrimination via a pre-defined weight in their applications. However, a single weight is applied to all features for the whole stream, whereas the data distribution of streaming data changes over time and different features should have different weights. To control the trade-off between accuracy and discrimination, other techniques modify the objective function or impose constraints. Multi-objective optimisation methods optimise conflicting objectives simultaneously [21]. Researchers have applied them to bias-mitigation problems [22,23,24,25], finding trade-offs between objectives including accuracy and fairness, usually with a preference for learning performance. Notably, the explainability of the learned models is not adequately addressed in these works. Furthermore, these methods focus on static data.

Post-processing approaches involve modifying a model’s decision boundary or altering its prediction labels. In some cases, additional prediction thresholds are introduced to combat discrimination, as seen in [26], while in others the decision boundary of a model such as AdaBoost is adjusted for fairness, as demonstrated in [27]. These approaches focus on the classifier’s output. Adapting these techniques to online learning can be challenging because the boundary or predictions may evolve due to changes in the data distribution of the stream.

Multi-objective optimisation based on particle swarm optimisation

Particle swarm optimisation (PSO) [28] is a well-known population-based algorithm that mimics the social behaviour of bird flocking. In PSO, a swarm consists of many individuals, or particles. The position of a particle represents a candidate solution to the problem, while its velocity indicates the speed and direction in which the particle moves along each dimension. If d is the problem dimension (i.e. the number of features), a particle’s velocity and position are d-dimensional vectors of numerical values. During the search process, particles share their best-found solutions (\(\pmb {pbest}\)), enabling them to locate the best solution found so far (\(\pmb {gbest}\)). To move towards these best solutions and explore fruitful areas, a particle k updates its position and velocity based on Eqs. (3) and (4), respectively [29].

$$\begin{aligned} p_{kj}^{i+1} = p_{kj}^i + v_{kj}^{i+1}, \end{aligned}$$
(3)
$$\begin{aligned} v_{kj}^{i+1} = w * v_{kj}^{i} + c_1 * r_{1} * (pbest_{kj}^i - p_{kj}^{i}) + c_2 * r_{2} * (gbest_{j}^i - p_{kj}^{i}), \end{aligned}$$
(4)

where \(p_{kj}^i\) and \(v_{kj}^i\) are the position and velocity of particle k in dimension j at iteration i, respectively. w is the inertia weight representing the moving momentum of the particles. \(pbest_{kj}\) and \(gbest_j\) are the positions of \(\pmb {pbest}_k\) and \(\pmb {gbest}\) in dimension j. \(c_1\) and \(c_2\) are acceleration constants. \(r_1\) and \(r_2\) are random values in [0, 1], drawn anew at each iteration.
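A minimal numpy sketch of one PSO iteration implementing Eqs. (3) and (4) is given below; the inertia and acceleration values shown are common defaults from the PSO literature, not the settings used in this paper.

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.72, c1=1.49, c2=1.49, rng=None):
    """One velocity/position update per Eqs. (3)-(4) for the whole swarm.

    pos, vel, pbest : arrays of shape (n_particles, d)
    gbest           : array of shape (d,)
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(pos.shape)   # r1, r2 are drawn anew at every iteration
    r2 = rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```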

Multi-objective optimisation methods aim to find a set of solutions that represent the best trade-offs between multiple conflicting objectives [30]. In 1999, Moore and Chapman made the first attempt to extend PSO to multi-objective optimisation [31]. Since then, a large number of multi-objective PSO-based algorithms have been proposed [30]. These algorithms have been successfully applied to different problems, including feature selection for classification [32, 33]. However, they have not previously been used to solve fairness problems in streaming data.

Speed-constrained multi-objective PSO (SMPSO) [9] is one of the well-known multi-objective PSO algorithms and is promising for solving dynamic optimisation problems [10]. SMPSO has been shown to find good approximations of Pareto fronts and to converge towards them quickly [9], which makes it a suitable choice for streaming learning. SMPSO initialises a swarm of particles representing possible solutions and a leader archive recording the non-dominated solutions in the swarm (i.e. those not dominated by any other solution with respect to all objectives). In addition to the normal updates of positions and velocities in each iteration, polynomial mutation [21] is applied to generate a new generation. The resulting particles are evaluated to update the swarm and the leader archive. After the maximum number of iterations, SMPSO returns the non-dominated solutions in the leader archive.
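The "speed-constrained" part of SMPSO combines a constriction coefficient with a per-dimension velocity limit [9]; a sketch of both is given below, with the constriction shown in the standard Clerc-Kennedy form and with function names of our own choosing, so the exact expressions in [9] may differ in detail.

```python
import math
import numpy as np

def constriction(c1, c2):
    """Clerc-Kennedy constriction coefficient applied to the velocity update."""
    phi = c1 + c2
    if phi <= 4:
        return 1.0
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def constrain_velocity(vel, lower, upper):
    """Speed constraint: clamp each velocity component to +/- (upper_j - lower_j) / 2
    so particles cannot overshoot the variable bounds in a single step."""
    delta = (np.asarray(upper) - np.asarray(lower)) / 2.0
    return np.clip(vel, -delta, delta)
```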

Methodology

Overview

In this paper, we develop a new classification algorithm called FOMOS for data streams that can automatically identify changes in the data distribution and incrementally adapt the learnt model to improve accuracy and fairness and to reduce model size. Figure 1 presents a system overview with two main components. The first component is an incrementally adaptive Hoeffding tree classifier. The second component is a multi-objective PSO-based optimiser (SMPSO) for feature weighting. SMPSO acts as a preceding step that evolves HT’s feature weights to maximise accuracy and fairness while minimising the size of the learnt models. When the discrimination of the current classifier on new data chunks increases, SMPSO is triggered to adapt the feature weights to the new data. The evolved feature weights are then used in HT to grow the tree with the new node-splitting criterion. The next subsections present how the discrimination metric in Eq. (2) is integrated into the proposed balanced information gain and how HT and SMPSO work.

Fig. 1 FOMOS system overview

Balanced information gain

In FOMOS’s first component, HT incrementally builds a decision tree from incoming data. Each node of the tree stores sufficient statistics for HT to decide when to grow a leaf node into a new branch as new data arrive. This decision is based on the top two features with the highest information gain (IG). If the difference between their IGs is larger than the Hoeffding bound, a statistical bound indicating whether enough examples have been observed at the node to make a confident decision, the leaf node is split using the top-IG feature [34]. However, growing the tree based on IG and the Hoeffding bound can guarantee classification accuracy but not fairness. Therefore, we propose a balanced information gain (BIG) in Eq. (5), which combines IG and fairness gain (FG) in this decision using a feature weight \(\alpha \) optimised by SMPSO to simultaneously improve the accuracy, fairness and size of the learnt model.

$$\begin{aligned} \textrm{BIG}(D, A) = \textrm{IG}(D, A) + \alpha (A) * \textrm{FG}(D, A), \end{aligned}$$
(5)

where \(\alpha (A)\) is the weight of feature A evolved by SMPSO, \(\textrm{FG}(D, A)\) measures the fairness gain of feature A when it is used to grow the tree in adapting to the coming data chunk D. Equation (6) shows how \(\textrm{FG}(D, A)\) is calculated.

$$\begin{aligned} \textrm{FG}(D, A) = {\left\{ \begin{array}{ll} 0, &{} \text {if all samples are in one group,} \\ |\textrm{Disc}(D)| - \sum _{v\in \textrm{dom}(A)} \frac{|D_v|}{|D|} |\textrm{Disc}(D_v)|, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(6)

where v ranges over the possible values of feature A (for example, Gender has the domain {Male, Female}) and \(D_v\) is the subset of D in which \(A = v\). Disc(D) and Disc\((D_v)\) are calculated using Eq. (2).

When all instances in a data chunk belong to one group (e.g. all Male or all Female), Eq. (2) is undefined because of a division by zero. To avoid this situation, we set FG to 0 when all samples are in one group (protected or unprotected).
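A minimal sketch of Eqs. (5) and (6) follows, assuming the statistical_parity helper shown earlier; the handling of child partitions \(D_v\) that contain only one group (skipped below) is our own assumption, since Eq. (6) only specifies the case where the whole chunk is a single group.

```python
import numpy as np

def fairness_gain(D, A, disc):
    """Eq. (6): reduction in absolute discrimination when splitting chunk D on feature A.

    D    : dict with arrays D['y'] (labels or predictions), D['s'] (sensitive groups)
           and D[A] (values of feature A)
    disc : a function such as statistical_parity above, i.e. Eq. (2)
    """
    s = D["s"]
    if len(np.unique(s)) < 2:          # all samples in one group: Disc is undefined
        return 0.0
    total = abs(disc(D["y"], s))
    weighted = 0.0
    for v in np.unique(D[A]):
        mask = D[A] == v
        if len(np.unique(s[mask])) < 2:
            continue                    # assumption: a single-group child contributes 0
        weighted += mask.mean() * abs(disc(D["y"][mask], s[mask]))
    return total - weighted

def balanced_information_gain(ig, fg, alpha_A):
    """Eq. (5): BIG = IG + alpha(A) * FG, with alpha(A) evolved by SMPSO."""
    return ig + alpha_A * fg
```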

Fig. 2 SMPSO particle for feature weighting in \(FOMOS\_Tree\)

FOMOS Hoeffding tree

FOMOS Hoeffding tree (FOMOS Tree) is a new algorithm extended from HT. The key difference between HT and FOMOS Tree is the integration of the fairness-oriented feature weights optimised by SMPSO into the decision to grow new leaf nodes.

Algorithm 1 shows the pseudocode of the FOMOS Tree algorithm. Initially, the FOMOS Tree starts with an empty root node with an empty data distribution (Lines 3–6). When incoming data samples are routed to the corresponding leaf node (Line 9), the data distribution information of the leaf node is updated (Line 10) and the number of samples reaching the leaf node, \(total\_sample\), is increased (Line 11). Once the number of samples reaches the pre-defined \(grace\_period\), node splitting is considered (Line 12). The BIG of each feature is then calculated (Line 13) using Eq. (5) with the feature weights \(\alpha (A)\) optimised by SMPSO. If the difference between the BIGs of the two top-BIG features is higher than the Hoeffding bound, or the Hoeffding bound is less than the default parameter \(tie\_threshold\), the leaf node is split using the top-BIG feature (Line 18).
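The split test in Lines 12–18 can be summarised by the sketch below, assuming the per-feature BIG values from Eq. (5) have already been computed; the Hoeffding bound formula is the standard one from [16, 34], and the default parameter values shown are common HT settings rather than necessarily those of Table 2.

```python
import math

def hoeffding_bound(metric_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)); R = log2(#classes) for IG, so 1 for binary."""
    return math.sqrt((metric_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(big_scores, n, delta=1e-7, tie_threshold=0.05, metric_range=1.0):
    """Compare the two best BIG values against the Hoeffding bound, or split
    anyway when the bound falls below the tie threshold. Returns the chosen
    feature, or None if no split is made."""
    ranked = sorted(big_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_feat, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    eps = hoeffding_bound(metric_range, delta, n)
    if best - second > eps or eps < tie_threshold:
        return best_feat
    return None
```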

SMPSO for feature weighting

The goal of SMPSO is to evolve a feature weight vector (\(\alpha \)), which is used to decide whether the FOMOS Tree should use a feature to split a node. \(\alpha \) is optimised to balance the trade-off among multiple objectives: the accuracy, fairness and model size of the learnt decision tree.

The position of a particle represents a candidate feature weight solution. Given d is the number of features, the position of particle k is a d-dimension vector \(\alpha _k = [\alpha _{k1}, \alpha _{k2}, \alpha _{k3}, \ldots , \alpha _{kd}]\) of real numbers ranging from 0 to 1. Each particle also maintains an objective value vector. Given h is the number of objectives, the objective value vector of a particle k is \(obj_k = [\textrm{obj}_{k1}, \textrm{obj}_{k2}, \ldots , \textrm{obj}_{kh}]\). For example, \(h=3\) if the objectives of the FOMOS Tree model are classification accuracy, discrimination and model size.

Figure 2 shows an example of how the feature weight vector \(\alpha \) evolved by SMPSO is used in growing the FOMOS Tree. When \(FOMOS\_Tree\) chooses a feature to split on, it calculates the balanced information gain (BIG) of each feature based on its feature weight using Eq. (5). The tree is then grown based on these BIGs and evaluated on the current data chunk.

Algorithm 2 shows how a particle is evaluated to return its objective values. The input is the current data chunk \(D_t\), particle k’s position, i.e. the vector of feature weights \(\alpha \_k= [\alpha _{k1}, \alpha _{k2}, \alpha _{k3}, \ldots , \alpha _{kd}]\), and the current \(FOMOS\_Tree\). The output is the objective vector \(\textrm{obj}= [\textrm{obj}_1, \textrm{obj}_2, \ldots ,\) \(\textrm{obj}_h]\). The algorithm starts by growing the \(FOMOS\_Tree\) on data \(D_t\) using Algorithm 1 with feature weight vector \(\alpha \_k\). The updated \(FOMOS\_Tree\) is then used as a classifier to predict the data in \(D_t\). From the prediction results, the objectives \(\textrm{obj}= [\textrm{obj}_1, \textrm{obj}_2, \ldots , \textrm{obj}_h]\) are calculated and returned.
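Algorithm 2 can be sketched as follows; the tree copy, the partial_fit/predict interface, the feature_weights argument and the n_nodes attribute are hypothetical placeholders for the actual \(FOMOS\_Tree\) implementation, and statistical_parity is the helper defined earlier.

```python
def evaluate_particle(fomos_tree, D_t, alpha_k):
    """Algorithm 2 (sketch): grow the tree on chunk D_t with the particle's
    feature weights alpha_k, then score it on the same chunk.

    Returns [error, |discrimination|, size] so that all objectives are
    minimised (accuracy is maximised through its complement)."""
    tree = fomos_tree.copy()                           # assumption: evaluate a copy so the
                                                       # deployed model stays untouched
    tree.partial_fit(D_t["X"], D_t["y"], feature_weights=alpha_k)   # Algorithm 1 with alpha_k
    y_hat = tree.predict(D_t["X"])
    error = (y_hat != D_t["y"]).mean()
    disc = abs(statistical_parity(y_hat, D_t["s"]))    # Eq. (2) on the chunk
    return [error, disc, tree.n_nodes]
```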

Table 1 Datasets used in the experiments

Finally, Algorithm 3 presents the pseudocode of SMPSO [9]. The algorithm initialises particles with random positions as solutions, and a leader archive, Leaders, records the non-dominated solutions. In each iteration, the algorithm randomly selects two leaders from Leaders, and the one located in a less crowded region of the objective space (i.e. with the larger crowding distance) is chosen as gbest. To generate new particles from the original ones, polynomial mutation [21] is applied to approximately 15% of the particles, enabling them to move towards new areas of the search space. These particles are then evaluated by Algorithm 2. The next step is to update Leaders with the new non-dominated particles. If the maximum size of Leaders is reached, SMPSO removes the worst solution (i.e. one dominated by another solution in the archive); if no dominated solution exists, the oldest solution is removed. After the optimisation steps, the non-dominated solutions are recorded in Leaders. Finally, Kneedle [35], a knee-point detection method, is used to select the best solution from Leaders to return.
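As an illustration of the final selection step, the sketch below applies the Kneedle detector (via the kneed Python package) to a two-objective front such as error versus |Disc|; how FOMOS applies knee-point detection to three objectives may differ, and the function name and data layout are ours.

```python
import numpy as np
from kneed import KneeLocator   # Kneedle implementation [35]

def pick_knee_solution(leaders):
    """Select the knee of a two-objective Pareto front, both objectives minimised.

    leaders : list of (objective_vector, feature_weight_vector) pairs
    """
    front = sorted(leaders, key=lambda item: item[0][0])      # sort by the first objective
    x = np.array([obj[0] for obj, _ in front])
    y = np.array([obj[1] for obj, _ in front])
    knee = KneeLocator(x, y, curve="convex", direction="decreasing").knee
    if knee is None:                                          # no knee found: fall back
        return front[0][1]
    idx = int(np.argmin(np.abs(x - knee)))
    return front[idx][1]                                      # the chosen feature weights
```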

Experiment and result analysis

Datasets and experiment design

To examine the performance of FOMOS in fairness-aware learning from data streams, we compare our method with baseline methods, including both non-fairness-aware methods (HT, HAT [36]) and fairness-aware methods (FAHT, FABBOO). Since SMPSO is a stochastic algorithm, we run FOMOS on each dataset for 30 independent runs with different seeds for SMPSO.

The experiments use six binary-class datasets taken from [12], which have large numbers of instances to simulate data stream environments effectively and a significant degree of discrimination to demonstrate the algorithm’s effectiveness. All compared methods are run on the scikit-multiflow package [37], a commonly used platform for streaming data. All categorical features are binarised before being fed to this package. Table 1 presents the number of instances (N) and the number of features after binarisation (d) of each dataset. Each dataset also has a sensitive feature (SF) with two groups, protected and unprotected (UP). For example, Race is the sensitive feature in the Law dataset, in which Non_White is the unprotected group. These binary datasets are also class-imbalanced. The last two columns show the percentage of class zero (C0) and the discrimination level of the whole dataset (Data Disc.) calculated based on Eq. (1).

Table 2 describes the parameter settings used in the experiments. All methods are run with a \(window\_size\) of 1000, which is commonly used in fairness learning studies for streaming data. Parameters for the Hoeffding tree are set to the defaults of the original study [16]. For SMPSO, the population size is set to 30 and the number of iterations to 10 to suit online learning. Other SMPSO parameters are adopted from the jMetalPy framework [38].
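For reproducibility, a minimal sketch of the prequential (test-then-train) protocol with a window of 1000 using scikit-multiflow is shown below; class names depend on the package version, and X, y stand for the features and labels of one of the binarised datasets in Table 1.

```python
import numpy as np
from skmultiflow.data import DataStream
from skmultiflow.trees import HoeffdingTreeClassifier   # 'HoeffdingTree' in older versions

stream = DataStream(X, y)              # X, y: one binarised dataset from Table 1
model = HoeffdingTreeClassifier()      # default HT parameters [16]
window_size, first_chunk, accs = 1000, True, []
while stream.has_more_samples():
    X_chunk, y_chunk = stream.next_sample(window_size)
    if not first_chunk:                # test on the chunk before training on it
        accs.append((model.predict(X_chunk) == y_chunk).mean())
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])
    first_chunk = False
print("prequential accuracy:", float(np.mean(accs)))
```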

Table 2 Parameter setting

In addition, to provide users with flexibility in selecting suitable objectives for their applications, we run FOMOS with two different sets of objectives:

  • FOMOS_2obj: The goal is to maximise accuracy and minimise discrimination, comparable to baseline methods, which consider accuracy and discrimination.

  • FOMOS_3obj: The goal is to maximise accuracy and minimise discrimination and model size. This strategy aims to examine how minimising the model size as an objective affects FOMOS performance.

Table 3 Accuracy obtained by FOMOS_2obj, FOMOS_3obj and other methods

Comparison with baseline methods

Accuracy

Table 3 shows the best and the average accuracy of FOMOS_2obj, FOMOS_3obj, two non-fairness-aware methods (HT, HAT) and two fairness-aware methods (FAHT and FABBOO). Column T2 presents the results of the Wilcoxon statistical test comparing FOMOS_2obj against each baseline method. A ‘+’ means the accuracy of FOMOS_2obj is significantly better than that of the method in the corresponding row, and vice versa; the more ‘+’ entries, the better the proposed method performs. Similarly, Column T3 compares FOMOS_3obj against each baseline method.
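The ‘+’/‘–’ entries can be produced with a comparison along the following lines; this is a sketch using SciPy, and the paper’s exact test configuration (pairing, significance level and tie handling) is not stated here, so treat those details as assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare(fomos_runs, baseline_runs, alpha=0.05):
    """Return '+' if FOMOS is significantly better, '-' if significantly worse,
    '=' if the Wilcoxon test finds no significant difference."""
    fomos_runs, baseline_runs = np.asarray(fomos_runs), np.asarray(baseline_runs)
    _, p = wilcoxon(fomos_runs, baseline_runs)      # paired signed-rank test
    if p >= alpha:
        return "="
    return "+" if fomos_runs.mean() > baseline_runs.mean() else "-"
```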

As seen from Table 3, the accuracies obtained by FOMOS_2obj and FOMOS_3obj are very similar. Therefore, FOMOS_3obj will be used as the representative method in the analysis below and is called FOMOS for short.

Compared with the non-fairness-aware methods (HT and HAT), FOMOS obtained competitive results with 7 ‘+’ and 5 ‘–’ out of the 12 comparisons. It is noted that HT and HAT only focus on accuracy, while FOMOS considers accuracy, fairness and model size.

FAHT and FABBOO are fairness-aware methods for streaming data. The Wilcoxon test results in Table 3 show that FOMOS outperforms FAHT and FABBOO on all datasets. On the Dutch dataset, FOMOS obtained 8% and 11% higher accuracy than FAHT and FABBOO, respectively. FAHT usually obtained 3–8% higher accuracy than FABBOO; however, it fails on the Law and Bank datasets. These two datasets have very small percentages of data falling into the unprotected group, 16% and 6%, respectively. Therefore, there is a high chance that all samples in one data window are from the protected group. When this happens, FAHT fails to calculate the discrimination measure in Eq. (2) due to the division-by-zero error.

The results show that FOMOS outperforms fairness-aware HT-based algorithms in terms of accuracy while obtaining a higher prediction performance in most comparisons than non-fairness-aware counterparts, which only focus on accuracy.

Table 4 Discrimination obtained by FOMOS_3obj and other methods

Discrimination

Table 4 presents the lowest and average discrimination levels of all the compared methods. Enforcing statistical parity within machine learning methods may lead to reverse discrimination, indicated by negative discrimination values, so the absolute discrimination values are considered in the comparisons. A method with a lower discrimination level is the fairer, or better, one. Similar to Table 3, a ‘+’ means FOMOS is significantly fairer (i.e. less discriminatory) than the method in the corresponding row and vice versa.

It can be seen that FOMOS_2obj and FOMOS_3obj also obtain similar results on discrimination on all datasets. Again, FOMOS_3obj is chosen as the representative method for FOMOS.

Compared with HT and HAT (methods without considering fairness), FOMOS obtained slightly better performance in terms of fairness with 11 ‘+’ and one ‘–’ out of 12 comparisons. In Law and Dutch, FOMOS obtained about 4% and 7% lower discrimination than HT and HAT, respectively. On the other hand, HAT attained 0.15% lower discrimination than FOMOS on the Census.

Compared with the fairness-aware methods (FAHT and FABBOO), FOMOS obtained one ‘+’ and nine ‘–’ out of 10 comparisons because FAHT failed on two datasets. The better discrimination results of FAHT and FABBOO come at the cost of greatly sacrificed accuracy, while FOMOS produces prediction models with a better balance between these two metrics.

The results show that FOMOS algorithms generate fairer classifiers than the non-fairness-aware HT-based algorithms while maintaining a better balance between accuracy and discrimination than fairness-aware methods.

Domination

Since FOMOS is a stochastic algorithm, we executed 30 independent runs with different seeds on each dataset. To check whether FOMOS dominates the other methods in terms of both accuracy and fairness, Table 5 summarises the Wilcoxon statistical test results from Tables 3 and 4. For example, the cell for Law and HT containing (+ +) means FOMOS_3obj outperforms HT in both accuracy and fairness; that is, FOMOS_3obj dominates the corresponding method in both objectives if the cell contains two pluses (+ +).

Table 5 Domination of FOMOS_3obj and other methods

As can be seen from the highlighted cells in blue, FOMOS_3obj demonstrates a marked superiority over the non-fairness-aware methods (HT and HAT) that only focus on accuracy, dominating HT and HAT on three datasets. Compared with the fairness-aware methods FAHT and FABBOO, if the failures of FAHT on the Law and Bank datasets are counted as dominated by FOMOS_3obj, then FOMOS_3obj dominates FAHT on three of the six datasets.

Overall, FOMOS_3obj dominates in 9 out of 24 comparisons while not being dominated by any other method. The results show that the multi-objective strategy in FOMOS works well to improve fairness and accuracy. FOMOS not only avoids using sensitive features for making predictions but also chooses other relevant features based on the evolved feature weights. FOMOS also demonstrates its effectiveness when dealing with datasets where the percentages of groups in sensitive features are imbalanced, such as the Law and Bank datasets, where FAHT fails to perform.

In general, the results show that FOMOS_3obj outperforms the compared methods. FOMOS_3obj significantly dominates in nine out of 24 comparisons and is not dominated by any method. In particular, FOMOS_3obj can work with imbalanced datasets (Law and Bank), which FAHT cannot do. Besides, FOMOS_3obj beats FABBOO in accuracy on all datasets.

While Table 5 illustrates the domination of FOMOS_3obj over the other methods based on the average results of 30 runs, Table 6 provides a comprehensive comparison of the 30 individual runs of FOMOS_3obj with the baseline methods. Table 6 shows the number of FOMOS runs that win (W), draw (D) or lose (L) against the baseline methods based on the three objectives (accuracy, discrimination and model size). In the table, FOMOS_3obj wins against another method when all of its objectives are better, loses when all of its objectives are worse, and the two methods draw when some objectives are better and others worse. For example, FOMOS draws with HT in all runs on the Dutch dataset. Because FAHT and FABBOO are ensemble learners comprising a set of decision trees, their model size cannot be directly compared with FOMOS; their comparison therefore considers only accuracy and discrimination.

Table 6 Comparisons of FOMOS_3obj and other methods in each run regarding accuracy, discrimination and model size

As can be seen from Table 6, compared with the non-fairness-aware methods (HT and HAT), FOMOS_3obj wins against HT in 12 runs on Law, all runs on Credit, 19 runs on Adult and three runs on Bank. In addition, FOMOS_3obj wins against HAT in all 30 runs on both Credit and Adult. Because HT and HAT only focus on accuracy, FOMOS_3obj loses in accuracy but gains lower discrimination and a smaller model size, resulting in drawn outcomes in the other comparisons. Compared with the fairness-aware methods (FAHT and FABBOO), FOMOS_3obj consistently achieves either a win or a draw in all comparisons. Due to FAHT’s failures on the Law and Bank datasets, we concentrate our analysis on the remaining datasets. Across the 30 runs, FOMOS_3obj consistently outperforms FAHT on the Census dataset while drawing on Credit, Adult and Dutch. Regarding FABBOO, FOMOS_3obj draws with FABBOO in all runs across all datasets. The majority of drawn results in the comparison with FAHT and FABBOO stem from discrimination: FAHT and FABBOO demonstrate lower discrimination, a trade-off reflected in their lower accuracy. Overall, most of FOMOS_3obj’s runs win or draw against the other methods.

Model size

The third objective that FOMOS_3obj aims to improve is the model size. To evaluate FOMOS_3obj’s effectiveness on this objective, we compare the average tree sizes of FOMOS_3obj and FOMOS_2obj over 30 runs with that of HT.

As shown in Fig. 3, the tree size obtained by FOMOS_3obj is smaller than that of HT on five datasets, with a significant reduction of up to 20% on Credit, Dutch and Census. Similarly, FOMOS_2obj models are smaller than HT on five out of six datasets. However, the model size achieved by FOMOS_2obj is larger than that of HT on the Law dataset. This is because FOMOS_2obj only balances accuracy and fairness and ignores the model size. This problem is addressed by FOMOS_3obj, which controls the trade-off between accuracy, fairness and model size at once, allowing it to obtain a model size similar to HT on this dataset. In general, FOMOS_3obj obtained an average reduction of 15% across all datasets, while FOMOS_2obj demonstrates a decrease of 12%. The results show that adding the model size as an additional objective helps FOMOS_3obj reduce the complexity of the learnt models, which is essential for enhancing model explainability.

Fig. 3 Tree size obtained by FOMOS and HT methods

Explainability of FOMOS

To see how the transparency of FOMOS models facilitates explainability regarding fairness, we present some of the decision trees generated by FOMOS on the Dutch and Adult datasets. These two datasets were chosen because they have the highest correlation between the sensitive feature (Gender) and the class label, with Pearson’s correlation coefficients of 0.22 and 0.30, respectively. This means a model using Gender can obtain very high accuracy. However, Gender, or its binarised features Male and Female, is a sensitive feature that should not be used for classification. Moreover, since FOMOS is developed based on HT, we compare the models produced by the two methods to see how FOMOS chooses features when growing decision trees to improve fairness and accuracy.

Fig. 4 Decision tree on the Adult dataset created by HT and FOMOS

Adult dataset

The Adult dataset is commonly used in fairness-aware methods [12]. The dataset classifies whether a person’s yearly income exceeds 50 thousand dollars based on gender, education level, economic status, age, etc.

Both HT and FOMOS incrementally build the classifiers. The three models in Fig. 4 are generated in window 10. It can be seen that HT uses Male in the classifier, while FOMOS algorithms do not. Therefore, while the three models obtain similar accuracy, the HT model’s discrimination measure is 2.7% higher than that of the models generated by FOMOS_2obj and FOMOS_3obj. The result shows that FOMOS successfully avoids using sensitive features for decision-making and chooses other features (Age) that can provide a similar accuracy.

Fig. 5 Decision tree on the Dutch dataset created by HT and FOMOS

Dutch dataset

Figure 5 shows the models produced by FOMOS (since FOMOS_2obj and FOMOS_3obj generated the same decision tree, we present a single tree labelled FOMOS) and HT on the Dutch dataset at window 10. These models aim to predict whether a person’s job is prestigious or not, based on Gender, Age, Household position, Education level, Economic status, etc.

The Female feature is used as the root node of the HT model, which obtains 0.11% lower accuracy than the FOMOS models and 7.55% worse discrimination. FOMOS successfully replaced Female with other features, e.g. Economic status (Attendant at educational institutions), while maintaining the prediction accuracy.

The two examples demonstrate that FOMOS is able to produce equitable decisions while maintaining reasonable levels of accuracy. By using transparent models such as decision trees, FOMOS classifiers are explainable, allowing users to understand model decisions and to identify and rectify any biased predictions that are unfair to protected groups. This is crucial for building user trust, particularly when the model’s predictions impact individuals’ lives or decisions, enabling AI and machine learning to be safely adopted for a fairer society.

Conclusions

This paper has proposed the first multi-objective approach to fairness-aware learning for data stream classification. The goal was achieved through a new method called fairness optimisation with a multi-objective swarm (FOMOS), based on the Hoeffding tree and SMPSO. FOMOS uses SMPSO to evolve feature weights to grow more accurate, fairer and simpler Hoeffding tree models. The novelty of the proposed method is an incremental streaming data classifier that maximises accuracy and minimises discrimination and model complexity simultaneously.

The results of FOMOS on six real-world datasets show that FOMOS can generate explainable models that are more accurate, fairer and simpler than those of current fairness-aware and non-fairness-aware methods for streaming data. FOMOS demonstrates an innovative application of swarm intelligence to the challenges of fairness-aware learning on streaming data. Currently, FOMOS and existing fairness-aware methods address discrimination with respect to a single sensitive feature. In reality, biases can involve multiple sensitive features, for example gender, race and age at the same time. Our future direction is to extend FOMOS so that it can consider multiple sensitive features simultaneously.