Search-based fairness testing for regression-based machine learning systems

Machine learning (ML) software systems are permeating many aspects of our life, such as healthcare, transportation, banking, and recruitment. These systems are trained with data that is often biased, resulting in biased behaviour. To address this issue, fairness testing approaches have been proposed to test ML systems for fairness, which predominantly focus on assessing classification-based ML systems. These methods are not applicable to regression-based systems; for example, they do not quantify the magnitude of the disparity in predicted outcomes, which we identify as important in the context of regression-based ML systems. We conduct this study as design science research. We identify the problem instance in the context of emergency department (ED) wait-time prediction. In this paper, we develop an effective and efficient fairness testing approach to evaluate the fairness of regression-based ML systems. We propose fairness degree, which is a new fairness measure for regression-based ML systems, and a novel search-based fairness testing (SBFT) approach for testing regression-based machine learning systems. We apply the proposed solutions to ED wait-time prediction software. We experimentally evaluate the effectiveness and efficiency of the proposed approach with ML systems trained on real observational data from the healthcare domain. We demonstrate that SBFT significantly outperforms existing fairness testing approaches, with up to 111% and 190% increases in effectiveness and efficiency, respectively, compared to the best performing existing approaches. These findings indicate that our novel fairness measure and the new approach for fairness testing of regression-based ML systems can identify the degree of fairness in predictions, which can help software teams to make data-informed decisions about whether such software systems are ready to deploy.
The scientific knowledge gained from our work can be phrased as a technological rule: to measure the fairness of regression-based ML systems in the context of emergency department wait-time prediction, use fairness degree and search-based techniques to approximate it.

is to generate discriminatory inputs, which are the test cases that expose biases (i.e., individual discrimination, group discrimination, and causal discrimination) in the ML-based system under test (SUT). To generate the discriminatory inputs, black-box fairness testing approaches have been proposed, e.g., random test generation (Themis (Galhotra et al. 2017)), directed test generation (Aequitas (Udeshi et al. 2018)), and symbolic generation (Aggarwal et al. 2019), as well as white-box fairness testing approaches. However, these fairness testing approaches are primarily designed for classification-based ML systems, not for regression-based ML systems.
Regression-based ML systems use regression techniques to predict continuous values (e.g., patient wait-time estimation). Since the predicted outcomes of regression-based ML systems differ in nature from those of classification-based ML systems (i.e., continuous vs binary), existing fairness measures and fairness testing approaches may not be directly applicable, and have not yet been shown to be effective and efficient. In particular, existing fairness measures for classification-based ML systems are indicative of the number of test cases that exhibit different outcomes (TRUE → FALSE or FALSE → TRUE) for two inputs that are similar except for the value of a sensitive attribute such as gender. However, for regression-based ML systems, the difference between the predicted outcomes is not binary but continuous, which makes it challenging to determine whether two predicted continuous outcomes differ enough to be considered unfair. Thus, one common approach is to use a threshold to determine if a test input is considered discriminatory, i.e., the difference between two predicted outcomes is greater than the threshold (Udeshi et al. 2018). Nevertheless, existing fairness testing approaches that leverage these fairness measures only indicate the number of discriminatory inputs, without quantifying the magnitude of the difference between the two predicted outcomes.
Recently, Berk et al. (2017) introduced a fairness measure for regression-based ML systems as the average difference of predicted outcomes for two similar inputs which differ in the value of the sensitive attribute, without considering the maximum difference of the two predicted outcomes. Thus, the fairness measure of Berk et al. may underestimate the possible worst case of unfairness of the SUT. Therefore, the extreme worst cases may go unnoticed during testing if the existing fairness measures are used, making the users of regression-based ML systems vulnerable to unfair predictions (e.g., female patients are estimated to wait longer than male patients by an emergency department wait-time prediction model). To the best of our knowledge, none of the existing fairness measures and testing approaches can estimate the possible worst case of unfairness of the ML-based system under test.
In this paper, we propose (1) a new fairness measure called fairness degree, which is defined as the maximum difference in the predictions for two inputs that are identical except for a sensitive attribute (e.g., gender); and (2) a novel Search-Based Fairness Testing (SBFT) approach to estimate the fairness degree of the ML-based system under test. In contrast to existing fairness testing approaches, SBFT is the first search-based approach designed to test fairness in any machine learning system by employing a genetic algorithm. It has an efficient fitness evaluation procedure, which uses an archiving approach for values that are more likely to lead to bias-revealing inputs, and a fast local search procedure that improves the search for high-quality test inputs that reveal biases. We then evaluate the effectiveness and efficiency of our SBFT approach in terms of discovering the fairness degree and compare it with four existing fairness testing approaches (i.e., Aequitas (Udeshi et al. 2018), Themis (Galhotra et al. 2017), symbolic generation (Aggarwal et al. 2019), and random testing).
Our study follows a design science research approach. Fairness metrics for regression-based ML systems are rare, and the existing ones do not quantify the magnitude of the outcome difference. We identify the real problem instance in the context of emergency department (ED) wait-time prediction. We propose a novel fairness metric, i.e., fairness degree, to quantify the degree of bias in a regression-based ML system and a novel search-based fairness testing technique to estimate the fairness degree of an ML system. The proposed solutions are implemented and applied to ED wait-time prediction software. The scientific knowledge gained from our work can be phrased as a technological rule: to measure the fairness of regression-based ML systems in the context of ED wait-time prediction, use fairness degree and search-based techniques to approximate it.
Through an experimental evaluation on 12 emergency department wait-time prediction models based on regression techniques trained from over 1.3 million patient records from 12 hospitals (i.e., the largest emergency department patient datasets to date), we address the following two research questions:
RQ1 How effective is SBFT? SBFT is more effective than the baseline approaches, yielding a statistically significantly higher fairness degree with large effect sizes at 10, 7, and 10 hospitals for the sensitive attributes country of birth, Indigenous status, and gender, respectively.
RQ2 How efficient is SBFT? SBFT is significantly more efficient than the baseline approaches with large effect sizes at 11, 5, and 6 hospitals for the sensitive attributes country of birth, Indigenous status, and gender, respectively.
These results confirm that our SBFT approach outperforms the state-of-the-art fairness testing approaches in terms of finding fairness degree of ML software systems based on regression techniques. Thus, we expect that our SBFT approach can help software teams to identify the degree of fairness in predictions and make data-informed decisions about whether such software systems are ready to deploy.

Novelty & Contributions
The novelty and contributions of this paper are as follows:
1. We propose a fairness measure called fairness degree that describes the worst case behaviour of a regression-based ML system.
2. We propose a search-based fairness testing approach that effectively and efficiently estimates the fairness degree of a regression-based ML system.
3. The results of our experimental evaluation on 12 emergency department wait-time models demonstrate that SBFT is more effective and efficient in terms of finding fairness degree compared to the state-of-the-art fairness testing approaches.

Paper Organisation
The organisation of this paper following Section 1, the introduction, is as follows. Section 2 provides the context of emergency department wait-time prediction software used in this study. Section 3 describes the design science research approach undertaken for this study. Section 4 formally defines the proposed fairness measure, i.e., fairness degree. Section 5 describes SBFT, our proposed approach to estimate the fairness degree of a regression-based ML system. Section 6 experimentally evaluates SBFT. Section 7 positions our study amongst related work. Section 8 outlines the threats to validity of our study. Section 9 concludes the paper.

Context
AI-enabled healthcare software is being adopted by the healthcare industry, aiming for improved efficiency, reduced costs and errors, and improvements to both patient satisfaction and experience. The market value of healthcare software is projected to reach around $29.9 billion by 2023, with a compound annual growth rate (CAGR) of 7.4% from 2018 to 2023. With the wave of the global COVID-19 pandemic, emergency departments (ED) in many countries are currently overwhelmed and overcrowded due to an unpredictable and unscheduled influx of patients. Overcrowded EDs, usually caused by overcrowded hospitals, jeopardise patient safety and are associated with increased death rates (Di Somma et al. 2015). When EDs are overcrowded, this also cascades back to community health, taking ambulances off the road whilst they are unable to offload patients into ED beds. Patients have limited access to information about ED waiting time and hospital capacity. This lack of transparency from healthcare facilities has negative consequences on patient satisfaction and can make a difficult patient journey much harder for consumers of healthcare (Walker et al. 2020). Hospitals throughout the world have started to deploy AI-enabled healthcare software to support EDs in improving patient satisfaction, healthcare management, and clinical practices. According to the World Health Organization (WHO), patient waiting time for healthcare services is one of the key measurements of a responsive healthcare system (Sun et al. 2017), and a well-designed and resourced healthcare system should not have long queues for consultations and treatments. Wait times are often highly politicised, for both acute and elective care.
In this paper, we focus on a regression-based machine learning system, which is used in EDs for patient wait-time prediction (aka. ED Software). ED Software provides patients attending the ED at hospitals with an estimated time that they would have to wait before being seen by a doctor or other provider. ED Software is currently being used in hospitals across Melbourne, Australia to estimate the wait time for new patients seeking care (Walker et al. 2021).
Using machine learning to predict ED patient wait-times is useful to keep patients informed about the expected wait time before being treated by a doctor (Shah et al. 2015). Such wait-time predictions can also be used to raise awareness of the current flow of waiting patients, improve capacity planning, and identify system bottlenecks (Strobel et al. 2021). The accuracy of the patient wait-time predictions is one of the key factors used in hospitals to better manage patients' expectations. An overestimated wait-time can cause a patient to decide to delay seeking critical treatment, whereas an underestimated wait-time can negatively impact patient experience (Soremekun et al. 2011). According to the WHO's global strategy on digital health 2020-2025, AI-enabled healthcare software that is used to support decision-makers must ensure the ethical use of technology. In addition, ED clinicians perceive the role of the ED to be a safety net for healthcare for all patients equally, treating anyone at any time, without judgement. Thus, ED Software should not exhibit any discrimination bias in patient wait-time predictions.
Therefore, ED Software should produce similar wait-time estimations for patients who have the same clinical urgency, arriving at the same time with the same number of patients ahead of them in the queue. This should be the case regardless of individual sensitive attributes (e.g., gender, country of birth, Indigenous status, race, religion). Discrimination bias embedded in the prediction models in the ED Software could have a negative impact on patient satisfaction, regardless of the number of patients affected. Patients and families cope better with expected waits than unexpected waits and are disappointed by longer than expected waits. Hospitals that seek to attract more patients (fee-for-service funding) may find customers choosing alternative facilities if their wait-time is overestimated, and may generate low net promoter scores and ill-will when the wait time is underestimated (also losing customers). Hospitals with block funding also suffer when wait-time predictions are inaccurate, due to reduced patient satisfaction and increased aggression and complaints. Therefore, the use of existing fairness measures (Berk et al. 2017) (e.g., the average difference of predicted outcomes for two similar inputs which differ in the value of the sensitive attribute) may hide the extreme cases of unfairness, leading to a poor understanding of the ultimate worst case of unfairness of an ML-based system under test and sub-optimal decision-making if such unfair systems are deployed. Thus, a novel fairness measure, fairness degree, that indicates the absolute worst case of unfairness of an ML-based system is critically needed.

Research Methodology
We conduct this study as design science research. Runeson et al. (2020) designed a visual abstract template to help identify the design science constructs in software engineering research. We use this template to communicate our research methodology as shown in Fig. 1. According to Runeson et al. (2020), there are three main constructs of design science research: i) the technological rule, ii) its instantiation in terms of a real problem-solution pair, and iii) the empirical or theoretical support for problem conceptualisation and the solution design. The scientific knowledge gained from our work can be phrased as a technological rule: to measure the fairness of regression-based ML systems in the context of ED wait-time prediction, use fairness degree and search-based techniques to approximate it.
We understand through the literature that fairness metrics for regression-based ML systems are rare and that the existing ones do not quantify the magnitude of the outcome difference. The real problem instance is identified in the context of emergency department wait-time prediction, where extreme errors mean patients either wait longer than predicted or leave the ED in the case of over-estimation. We propose a novel fairness metric called fairness degree to measure the maximum difference between the outcomes of two inputs that are identical except for their membership in two different protected groups (Section 4). To find the fairness degree of an ML system, we propose a novel search-based fairness testing technique (Section 5). The proposed solutions are implemented and applied to ED wait-time prediction software.
The relevance of the research, the rigour of the research activities, and the novelty of the technological rule in relation to the underpinning research will be discussed in the Conclusion (Section 9).

Fairness Degree for Regression-Based Machine Learning Systems
A regression-based machine learning system can be formally defined as:
Definition 1 A regression-based machine learning system is a software system that has a machine learning component where the explanatory variables (or instances) are denoted by $x \in X$, the target variables (or labels) by $y \in Y = [-\infty, +\infty]$, and the predicted variables by $\hat{y} \in \hat{Y} = [-\infty, +\infty]$, where $y$ and $\hat{y}$ are continuous.
We consider the case where each instance $x$ contains a sensitive feature $s$ and its value is denoted as $x^s$. An example of a sensitive feature is gender, and its value can be female, male, or non-binary.
In regression-based machine learning systems, the target and predicted values are continuous, and the same value is not likely to occur even twice in the data, which makes it hard to determine whether the system is fair or not. Berk et al. (2017) introduced the first measure for fairness of the regression-based ML systems. The measure is called individual fairness, and estimates how similarly a model treats two similarly labelled instances, which differ in the value of the sensitive feature.
For every cross pair $(x_i, y_i) \in g_1$ and $(x_j, y_j) \in g_2$, where $g_1$ and $g_2$ are groups from the same population that differ in terms of the sensitive attribute $s$ (e.g., women, men), individual fairness measures how differently the machine learning model treats $x_i$ and $x_j$, weighted by a function $d$ of $|y_i - y_j|$. Here, $d$ is a fixed non-negative function decreasing in $|y_i - y_j|$, which takes care of cancellation issues.
More specifically, individual fairness states that the fairness penalty for overestimating several of one group's labels cannot be mitigated by overestimating several of the other group's labels. However, this averaging over the instances has a smoothing effect, and hides the impact of outliers.
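The smoothing effect of averaging can be illustrated with a small sketch. The squared prediction gap and the weight `exp(-(y_i - y_j)^2)` used here are assumptions chosen only because they match the description above (a non-negative weight decreasing in the label distance); the function names and the toy wait-times are hypothetical and are not taken from the study.

```python
import math


def pairwise_average_penalty(preds_g1, labels_g1, preds_g2, labels_g2):
    """Average weighted squared prediction gap over all cross pairs of two groups."""
    total = 0.0
    for p_i, y_i in zip(preds_g1, labels_g1):
        for p_j, y_j in zip(preds_g2, labels_g2):
            # Assumed weight: non-negative and decreasing in |y_i - y_j|.
            weight = math.exp(-((y_i - y_j) ** 2))
            total += weight * (p_i - p_j) ** 2
    return total / (len(preds_g1) * len(preds_g2))


def largest_gap(preds_g1, preds_g2):
    """Largest absolute prediction gap over all cross pairs."""
    return max(abs(p_i - p_j) for p_i in preds_g1 for p_j in preds_g2)


# Toy data: identical labels, one extreme prediction in the second group.
labels = [10, 10, 10, 10]
group_1 = [30, 30, 30, 30]
group_2 = [30, 30, 30, 90]

average = pairwise_average_penalty(group_1, labels, group_2, labels)
print(math.sqrt(average))            # root of the averaged penalty
print(largest_gap(group_1, group_2)) # gap of the single extreme pair
```

In this toy case the averaged penalty corresponds to a 30-minute gap, while one pair of patients is actually 60 minutes apart, which is the kind of outlier an averaged measure can mask.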
Previous research shows that assuring machine learning fairness depends more on how fairness is defined than on how it is implemented (Corbett-Davies and Goel 2018), which explains why the issue that has received the most attention in the ML fairness literature is the definition of fairness (Selbst et al. 2019). Indeed, several definitions of fairness have been introduced, such as conditional statistical parity (Corbett-Davies et al. 2017), equal opportunity (Hardt et al. 2016), counterfactual fairness (Chiappa 2019), and equalised odds (Hardt et al. 2016). Feldman et al. (2015) formalise the Equal Employment Opportunity Commission's 80% rule into a formal measure of fairness called disparate impact, which is used in U.S. law to encode unintentional bias. Disparate impact measures whether a decision-making process has widely different outcomes for different groups, even as it appears to be neutral. In the well-known debate about the COMPAS risk score, the creators argued that COMPAS was fair because the test accuracy was equalised across groups (Dieterich et al. 2016), which was estimated by the disparate impact measure. In contrast, ProPublica journalists demonstrated that COMPAS was not fair because another important measure of fairness, known as equality of error rates, was violated (Angwin et al. 2016).
This example highlights that existing fairness definitions are simplifications that cannot fully capture the range of existing and overlapping notions of fairness in all philosophical, legal, and sociological contexts. Hence, the suitability of a fairness measure is context dependent (Mehrabi et al. 2019), and the measure must be carefully selected.
Without discounting the importance of avoiding a systemic bias on the whole population, outliers, where the predictions deviate by a large margin from reality, deserve special attention. In other words, minimising extreme deviations in predictions is as important as avoiding systemic bias.
In the context of ED Software, this could mean that the wait-time was grossly overestimated, so the patient may decide to delay treatment or go to another clinic, or underestimated, where the patient may decide to wait and not receive timely treatment (Strobel et al. 2021). Deciding to go to another clinic as a result of overestimation may have negative impacts if the selected clinic is not a specialist one. Not receiving timely treatment due to underestimation can also be detrimental to health. Existing fairness measures do not address this issue, hence we propose a new measure of fairness, which we call fairness degree, and a search-based fairness testing (SBFT) approach that serves for approximating the fairness degree of a machine learning system.
Fairness degree describes the worst-case behaviour of a regression-based machine learning system. A higher fairness degree indicates that the machine learning system is more biased (i.e., less fair). Formally,
Definition 2 Given a machine learning system, the fairness degree $D$ is the maximum difference in the values predicted by the machine learning system over all pairs of instances $(x_i, x_j)$ that are identical apart from the sensitive attribute, i.e., $x_i^s \neq x_j^s$:
$$D = \max_{\forall i, j} |\hat{y}_i - \hat{y}_j| \quad \text{subject to} \quad x_i^s \neq x_j^s \ \text{and} \ x_i^k = x_j^k \ \ \forall k \neq s$$
In the case of the ED Software, we would expect the machine learning model to predict similar wait-times for two patients who are identical apart from the sensitive attribute; e.g., it is reasonable to assume that a woman would not need to wait longer than a man to be seen by a doctor, given that all relevant circumstances (urgency, age, etc.) are the same. A larger value of $D$ indicates that the ED Software is more biased. More specifically, this means that there are two patients who are identical except that they are members of two different groups (e.g., male and female), and one of them receives a wait-time estimate that is $D$ minutes longer than the other's.
Identifying the fairness degree of a machine learning system can be computationally expensive for large input spaces. This calls for a more effective method for testing such systems, which provides motivation for our Search-Based Fairness Testing approach introduced in the next section.
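To make the cost concrete, the following minimal sketch computes the fairness degree by brute force over a finite set of inputs. The dictionary-based input encoding, the function names, and the toy model `toy_predict` are illustrative assumptions; the models in the study are only available as opaque prediction objects.

```python
def fairness_degree(predict, inputs, sensitive_key, sensitive_values):
    """Brute-force fairness degree over a finite set of inputs.

    For each input, every sensitive value is substituted while all other
    variables are kept fixed; the largest prediction gap is returned.
    """
    worst = 0.0
    for x in inputs:
        preds = [predict({**x, sensitive_key: v}) for v in sensitive_values]
        # The largest pairwise absolute difference equals max minus min.
        worst = max(worst, max(preds) - min(preds))
    return worst


def toy_predict(x):
    """Hypothetical wait-time model (minutes) with a deliberate bias."""
    bias = 15 if x["gender"] == "F" and x["queue"] > 3 else 0
    return 20 + 5 * x["queue"] + bias


inputs = [{"queue": q, "gender": "M"} for q in range(6)]
print(fairness_degree(toy_predict, inputs, "gender", ["F", "M", "X"]))
```

The number of model calls is the number of inputs times the number of sensitive values, and the number of inputs itself grows multiplicatively with the domain size of every non-sensitive variable, which is why exhaustive evaluation quickly becomes impractical.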

Search-Based Fairness Testing
Search-Based Fairness Testing (SBFT) is the first search-based approach for testing the fairness of regression-based machine learning systems. The key research challenge SBFT addresses is the identification of test inputs that reveal large bias for regression-based machine learning systems in order to estimate the fairness degree of the system in an efficient way. The optimisation problem is formulated as:
$$\max_{x_i, x_j} |\hat{y}_i - \hat{y}_j| \quad \text{subject to} \quad x_i^s \neq x_j^s \ \text{and} \ x_i^k = x_j^k \ \ \forall k \neq s$$
where $x_i^s$ and $x_j^s$ are the values of the sensitive attribute. Existing test generation techniques for fairness use no guidance to find test inputs that can expose the fairness degree of the machine learning system. For example, Aequitas (Udeshi et al. 2018) uses a predefined threshold to identify test inputs that are discriminatory, i.e., if a test input reveals a prediction difference above the threshold, it is considered discriminatory. Aequitas considers all test inputs that are discriminatory as equally important, hence this approach is not suitable for finding inputs that maximise the fairness degree.
The proposed approach SBFT is presented in Algorithm 1. It is based on a genetic algorithm, and introduces new enhancements to make the search more efficient and effective in identifying test inputs that reveal the fairness degree. In the next subsections, we first introduce the solution representation and discuss the fitness evaluation and the caching of sensitive variables, which helps speed up the evaluation of the fairness degree of a machine learning system (Section 5.1). Next, we discuss the search steps and introduce the genetic operators used to create new offspring from parent solutions and to select individuals for the next population (Section 5.2).

Solution Representation and Fitness Evaluation
In SBFT, a solution to the optimisation problem is a test input, which is defined as $x_i = \{x_i^1, x_i^2, \ldots, x_i^N\}$, where each variable $x_i^k$ can be an integer, a categorical variable, or a real number. The variables representing the sensitive attributes are denoted with the superscript letter $s$ (i.e., $x_i^s$). The fitness of a test input $x_i$ is measured by the individual fairness degree, defined as
$$f(x_i) = |y_i - y_j| \qquad (6)$$
where $y_i$ and $y_j$ are the outputs for the inputs $x_i$ and $x_j$, which differ only in terms of their sensitive attributes.
To measure the individual fairness degree of a test input, SBFT changes the value of the variable representing the sensitive attribute, keeping everything else the same (for example, flipping the value of the variable representing gender from female to male). Not all sensitive attributes have only two values. For instance, country of birth can take 293 values in the ED software system. It would be inefficient to exhaustively sample every value of the sensitive attribute to evaluate the fitness, as is done in Aequitas. Therefore, SBFT maintains a cache of the two values of the sensitive attribute that expose the current fairness degree of the system. To evaluate the fitness of a test input, SBFT either uses the cache with a probability of $p_c$ (lines 18-20) or chooses two random values for the sensitive attribute with a probability of $(1 - p_c)$ (lines 21-23). If the fitness of the test input is greater than the current fairness degree, it updates the cache with the new values (lines 25-26).
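The cached fitness evaluation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure in Algorithm 1: the dictionary-based cache, the function signature, and the toy model `toy_predict` are hypothetical.

```python
import random


def evaluate_fitness(predict, x, s_key, s_values, cache, p_c, rng):
    """Individual fairness degree of x using a cached or random value pair.

    `cache` holds the best value pair seen so far and the gap it exposed.
    """
    if cache["values"] is not None and rng.random() < p_c:
        a, b = cache["values"]          # reuse the pair that exposed the best gap
    else:
        a, b = rng.sample(s_values, 2)  # two distinct random sensitive values
    fitness = abs(predict({**x, s_key: a}) - predict({**x, s_key: b}))
    if fitness > cache["best"]:
        cache["best"] = fitness
        cache["values"] = (a, b)
    return fitness


def toy_predict(x):
    """Hypothetical wait-time model (minutes) with a deliberate bias."""
    return 20 + 5 * x["queue"] + (15 if x["gender"] == "F" else 0)


rng = random.Random(0)
cache = {"values": None, "best": 0.0}
test_input = {"queue": 4, "gender": "M"}
print(evaluate_fitness(toy_predict, test_input, "gender", ["F", "M"], cache, 0.8, rng))
print(cache)
```

Each evaluation costs two model calls regardless of how many values the sensitive attribute can take, which is the point of the cache when an attribute such as country of birth has hundreds of values.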

The Search Steps and Genetic Operators
SBFT starts with $M$ random test inputs as the initial population (line 5). Solutions are then evaluated using the fitness function defined by procedure EVALUATE-FITNESS in line 17, and explained in Section 5.1. SBFT applies the crossover, mutation, and selection operators to evolve the population of test cases until a termination criterion is met (lines 9 to 15). Test cases with the highest fitness, as defined in (6), are more likely to reproduce and survive to the next generation.
SBFT generates an offspring population (line 9) by calling the procedure GENERATEOFFSPRING in line 28. Parents are selected using the roulette wheel selection strategy (line 21) (McMinn 2004). This way, test cases with higher fitness are more likely to be selected as parents, and hence to reproduce. Next, the selected parents are crossed over to produce two children with a certain probability $p_{cr}$ (line 32).
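Roulette wheel selection can be sketched with the standard library as below. The function name and the fallback to uniform selection when all fitness values are zero are assumptions added for illustration.

```python
import random


def roulette_wheel_select(population, fitnesses, rng, k=2):
    """Pick k parents with probability proportional to their fitness."""
    if sum(fitnesses) <= 0:
        # Assumed fallback: with no positive fitness, select uniformly.
        return [rng.choice(population) for _ in range(k)]
    return rng.choices(population, weights=fitnesses, k=k)


rng = random.Random(42)
population = ["t1", "t2", "t3"]
fitnesses = [1.0, 3.0, 6.0]  # e.g. prediction gaps in minutes
print(roulette_wheel_select(population, fitnesses, rng))
```

Because the fitness is an absolute difference it is never negative, so it can be used directly as a selection weight.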
Formally, given two parents $x_i = \{x_i^1, \ldots, x_i^N\}$ and $x_j = \{x_j^1, \ldots, x_j^N\}$, the uniform crossover operator may produce two children in which each variable is swapped between the parents with a probability of 0.5, except for the sensitive variables, which remain the same. The crossover operator helps the search focus on high-quality areas of the search space, exploiting existing information in the current population.
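A uniform crossover that leaves the sensitive variables untouched could look like the following sketch. The dictionary encoding of test inputs and the variable names are assumptions; the crossover probability $p_{cr}$ would be applied by the caller before invoking the operator.

```python
import random


def uniform_crossover(parent_a, parent_b, sensitive_keys, rng):
    """Swap each non-sensitive variable between two parents with probability 0.5."""
    child_a, child_b = dict(parent_a), dict(parent_b)
    for key in parent_a:
        if key in sensitive_keys:
            continue  # sensitive variables are never exchanged
        if rng.random() < 0.5:
            child_a[key], child_b[key] = parent_b[key], parent_a[key]
    return child_a, child_b


rng = random.Random(7)
parent_a = {"age": 25, "queue": 3, "triage": 2, "gender": "F"}
parent_b = {"age": 70, "queue": 9, "triage": 4, "gender": "M"}
print(uniform_crossover(parent_a, parent_b, {"gender"}, rng))
```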
SBFT uses a uniform mutation operator. It mutates each variable of one of the two children $x_i$ with a certain probability $p_\mu$ (line 33), by randomly choosing a new value for each variable in the range of all possible values. Given $x_i = \{x_i^1, \ldots, x_i^N\}$, the uniform mutation operator creates a new solution in which each mutated variable $x_i^k$, $\forall k \neq s$, takes a value drawn from its full domain. This mutation operator helps the search algorithm explore new areas of the search space, thus preventing the search from becoming trapped in local optima.
Each variable of the second child $x_j$ is mutated with a probability $p_\mu$ by randomly choosing a new value in the range of the values of its parents (line 34). That is, the new value of each mutated variable $x_j^k$ is drawn from the interval spanned by the corresponding values of the two parents, $\forall k \neq s$. This mutation operator helps the search to further exploit the information of the selected parents, which are already among the best solutions in the current population. Similar to the crossover operator, neither mutation operator changes the sensitive variables.
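The two mutation operators can be contrasted in a short sketch. It assumes integer-coded variables with known `(low, high)` domains and a dictionary encoding of test inputs; these choices, like the function names, are illustrative and not taken from the study.

```python
import random


def mutate_explore(child, domains, sensitive_keys, p_mu, rng):
    """Resample each non-sensitive variable from its full domain with probability p_mu."""
    mutated = dict(child)
    for key, (low, high) in domains.items():
        if key not in sensitive_keys and rng.random() < p_mu:
            mutated[key] = rng.randint(low, high)
    return mutated


def mutate_exploit(child, parent_a, parent_b, sensitive_keys, p_mu, rng):
    """Resample each non-sensitive variable within the range spanned by its parents."""
    mutated = dict(child)
    for key in child:
        if key not in sensitive_keys and rng.random() < p_mu:
            low, high = sorted((parent_a[key], parent_b[key]))
            mutated[key] = rng.randint(low, high)
    return mutated


rng = random.Random(1)
domains = {"age": (0, 100), "queue": (0, 50)}  # assumed variable domains
parent_a = {"age": 20, "queue": 2, "gender": "F"}
parent_b = {"age": 60, "queue": 8, "gender": "F"}
print(mutate_explore(parent_a, domains, {"gender"}, 0.5, rng))
print(mutate_exploit(parent_b, parent_a, parent_b, {"gender"}, 0.5, rng))
```

The first operator can move a child anywhere in the input space, while the second keeps it between its parents, so applying one to each child balances exploration against exploitation.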
In the preliminary experiments, we find that the search is likely to get trapped in local optima, hence produce sub-optimal solutions. Therefore, SBFT generates a small number of random test inputs every iteration to include as a seed in the population (line 36). This is done with the focus on increasing the exploration capabilities of SBFT.
Once the offspring population is generated, SBFT merges it with the current population (line 10) and selects the best $M$ test inputs for the next population (elitism) (line 13). Finally, SBFT performs a local search on the best solution of the selected population by calling the procedure LOCALSEARCH in line 39 to find a better solution in the neighbourhood of the current best solution (line 14). For each variable in the best solution (line 40), the local search changes its value by $\delta \in \{-1, 1\}$ (line 43), and updates the best solution if the fitness of the new solution is better than that of the current best solution (lines 44-46). The local search explores the neighbourhood of the best solution in a more systematic way than the crossover operator to find the local optimum. Thus, it adds another layer of exploitation of the best solutions to the search in SBFT.
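One pass of such a local search might be sketched as follows. The integer-coded variables, the bounds check against assumed domains, and the stand-in fitness function are hypothetical details added so that the example is self-contained.

```python
def local_search(best, fitness_fn, sensitive_keys, domains):
    """One pass of +/-1 moves over the non-sensitive variables of the best solution."""
    best = dict(best)
    best_fit = fitness_fn(best)
    keys = [k for k in best if k not in sensitive_keys]
    for key in keys:
        for delta in (-1, 1):
            candidate = dict(best)
            candidate[key] += delta
            low, high = domains[key]
            if not low <= candidate[key] <= high:
                continue  # assumed: moves outside the variable's domain are skipped
            fit = fitness_fn(candidate)
            if fit > best_fit:
                best, best_fit = candidate, fit
    return best, best_fit


def toy_fitness(x):
    """Hypothetical stand-in for the individual fairness degree of x."""
    return 60 - abs(x["queue"] - 5) - abs(x["age"] - 40)


start = {"queue": 4, "age": 41, "gender": "F"}
domains = {"queue": (0, 50), "age": (0, 100)}
print(local_search(start, toy_fitness, {"gender"}, domains))
```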

Experimental Evaluation
We design a set of experiments to evaluate the effectiveness and efficiency of SBFT. Effectiveness is assessed based on the fairness degree of the system under test (SUT) exposed at the end of the search process, which is the maximum fitness found during the search. The approach that discovers a higher fairness degree of the SUT than the other approaches is considered more effective. Efficiency, on the other hand, is the rate at which an approach discovers the fairness degree of the SUT. The baseline approaches are presented in Section 6.1, details of the ED Software, i.e., the emergency department wait-time prediction models used as the experimental subjects, are given in Section 6.2, the experimental settings are described in Section 6.3, and the results are presented in Section 6.4. In essence, with these experiments, we aim to answer the following research questions:
RQ1 How effective is SBFT? The effectiveness of SBFT in discovering the fairness degree of the SUT is compared to the baseline approaches described in Section 6.1. All approaches are executed against the 12 ED Software models built for 12 hospitals (Section 6.2) for the same amount of execution time, and the final fitness, as measured by (6), is reported for each hospital and sensitive attribute. We perform statistical tests (as described in Section 6.3) to determine whether differences in performance are statistically significant.
RQ2 How efficient is SBFT? Efficiency is measured by the area under the curve (AUC) of the fairness degree discovered over the execution time. A large AUC value indicates that the approach is fast at finding a higher fairness degree. Both the x (execution time) and y (fairness degree) axes of the curve are normalised in the range 0 to 1.
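A normalised AUC of this kind can be computed as in the sketch below. The trapezoidal rule, the normalisation of time by the execution budget, and the normalisation of the fairness degree by a reference maximum are assumptions, since the exact procedure is not spelled out here.

```python
def normalised_auc(times, degrees, budget, reference_max):
    """Area under the best-so-far fairness degree curve, with both axes scaled to [0, 1].

    times: ascending sample times (minutes); degrees: best fairness degree at each time.
    """
    xs = [t / budget for t in times]
    ys = [d / reference_max for d in degrees]
    area = 0.0
    for k in range(1, len(xs)):
        # Trapezoidal rule between consecutive samples.
        area += (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2
    return area


# Hypothetical run: the best degree reaches 20 minutes halfway through a 120-minute budget.
print(normalised_auc([0, 30, 60, 120], [0, 10, 20, 20], budget=120, reference_max=20))
```

An approach that reaches the same final fairness degree earlier in the budget obtains a larger area, which is what makes the AUC a measure of efficiency rather than of final effectiveness.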

Baseline Approaches
We use the state-of-the-art Aequitas fully directed approach (Udeshi et al. 2018), Themis (Galhotra et al. 2017), symbolic generation (SG) (Aggarwal et al. 2019) and random testing as baselines for comparison.
We employ a threshold of 10 minutes for Aequitas, which is a parameter used to determine if a test input is discriminatory. The global search of Aequitas is run until a discriminatory input is found, then a local search is run on that input. These steps are repeated until the allocated execution time runs out.
For Themis, we use the causal discrimination detection technique, as it is appropriate to our context. Both Themis and SG were developed with a focus on classification-based machine learning systems. Therefore, we use a similar method to that of Aequitas to determine if a test input is discriminatory, i.e., a threshold of 10 minutes.
Since the tool for symbolic generation approach is not publicly available, we re-use the tool developed by  for their experimental evaluation, which can be found at https://github.com/pxzhang94/ADF. The authors (Aggarwal et al. 2019) suggest using a random seed in the symbolic generation approach in the absence of training data, which is the case for our ED models. Thus, we allocate the first 10% of the execution time, i.e., 12 minutes, to generate a random seed in every run of SG. In addition, we replace the decision tree classifier with a decision tree regressor to generate the decision tree, as the ED models are regression-based ML systems. For all the other parameters, we use the same values as used in , which are from the best performance setting according to Aggarwal et al. (2019).

ED Patient Wait-Time Estimation Models
In our experiments, we use the twelve ED patient wait-time estimation models provided by Walker et al. (2021). The 12 ED models are built for 12 hospitals. Due to the highly-sensitive nature of the patient datasets, we do not have access to the datasets and the implementation source code. Instead, we only have access to the model objects as Python pickle files. Next, we describe the implementation details as provided by Walker et al. (2021).
A total of 1,930,609 patient records were collected from the 12 hospitals. The patient records for each hospital were sorted by arrival time in chronological order. The 1,388,509 patient records from 2017-18 are used to build the ED models using a random forest regression technique, while the 542,100 patient records from 2019 are used to evaluate the ED models using a time-wise hold-out validation approach. Table 1 presents a summary of the 13 variables based on the Victorian Emergency Minimum Dataset (VEMD) and 6 additional variables that approximate resources and capacity in real time (i.e., age, patients in triage queue, patients awaiting a provider, admitted patients awaiting departure, ambulance offload queue, and average wait-time of the last k patients). The 19 variables in Table 1 are considered as independent variables (X), while the patient wait-time is considered as the dependent variable (Y). For ethical reasons, we anonymise the hospital names, using H1 to H12 to denote the twelve hospitals in our study.
To avoid any administrative data entry errors, Walker et al. (2021) removed the wait-time outliers from the datasets according to the following criteria: (1) the wait-time exceeds the maximum of 360 minutes; and (2) the wait-time exceeds the predefined statistical outlier threshold, defined as 1.5 times the interquartile range (IQR = Q3 - Q1) above Q3 (n = 13,612 (0.7%)).
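The two removal criteria can be sketched in a few lines of Python. This is only an illustration: the function name, the choice of quartile method, and the assumption that a record is dropped as soon as either criterion holds are ours and are not details reported by Walker et al. (2021).

```python
from statistics import quantiles


def remove_wait_time_outliers(wait_times, hard_cap=360.0, k=1.5):
    """Keep wait times (minutes) that pass both outlier criteria.

    Assumption: a record is removed if it exceeds hard_cap OR the
    statistical fence Q3 + k * IQR. The quartile method ("inclusive")
    is also an assumption.
    """
    q1, _, q3 = quantiles(wait_times, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)  # 1.5 * IQR above Q3
    limit = min(hard_cap, upper_fence)
    return [w for w in wait_times if w <= limit]
```

For example, `remove_wait_time_outliers([10, 20, 30, 40, 50, 400])` drops only the 400-minute record, because the fence for that sample is well below 400.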

Experimental Settings
We consider three sensitive attributes: country of birth, Indigenous status, and gender. Country of birth can take 293 values (e.g., Australia), while Indigenous status (e.g., Aboriginal but not Torres Strait Islander origin) and gender (e.g., Male) can take 6 and 4 values, respectively. We evaluate SBFT against the baselines for the three attributes separately. Each of the five approaches is given an execution time of 120 minutes. To account for the randomness of SBFT and the baseline approaches, we repeat the experiments 20 times. Then, we conduct the non-parametric Mann-Whitney U-Test with a significance level (α) of 0.05 (Arcuri and Briand 2014) to check the statistical significance of the differences. If the p-value is below 0.05, the differences are statistically significant.
In addition, we compute Vargha and Delaney's A12 statistic (Vargha and Delaney 2000) to measure the effect size of such differences. The A12 statistic indicates the probability of one algorithm producing a larger value than another algorithm. We consider the effect size small if 0.58 ≤ A12 < 0.65, medium if 0.65 ≤ A12 < 0.75, and large if A12 ≥ 0.75. If A12 = 0.50, the two algorithms are equivalent, and the effect size is negligible if 0.50 < A12 < 0.58 (Panichella et al. 2015).
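As an illustration of the effect-size analysis, the following minimal sketch computes the A12 statistic from two samples and maps it to the categories above. The function names are hypothetical, and mirroring values below 0.50 onto the same thresholds is our assumption, since the thresholds above are stated only for A12 ≥ 0.50.

```python
def vargha_delaney_a12(xs, ys):
    """Probability that a value drawn from xs exceeds one drawn from ys.

    Ties contribute one half, following the usual definition.
    """
    wins = sum(1.0 for x in xs for y in ys if x > y)
    ties = sum(0.5 for x in xs for y in ys if x == y)
    return (wins + ties) / (len(xs) * len(ys))


def effect_size_label(a12):
    """Map A12 to the categories used in the text.

    Assumption: values below 0.50 are mirrored (1 - A12) so that the
    same thresholds apply in both directions.
    """
    magnitude = max(a12, 1.0 - a12)
    if magnitude == 0.5:
        return "equivalent"
    if magnitude < 0.58:
        return "negligible"
    if magnitude < 0.65:
        return "small"
    if magnitude < 0.75:
        return "medium"
    return "large"
```

In the experiments, `xs` and `ys` would hold the 20 fairness degree (or AUC) values obtained by two approaches for one hospital and one sensitive attribute.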
SBFT has various parameters to be configured. We use irace, iterated racing for automatic algorithm configuration, to automatically tune the parameters of SBFT (López-Ibáñez et al. 2016). The following parameter settings produced the best results: population size M = 100, crossover probability p_cr = 0.75, mutation probability p_μ = 0.70, rate of random test insertion r_i = 0.10, and probability of using the cache p_c = 0.50. We implement SBFT, Aequitas, random testing and Themis in a prototype tool in order to evaluate them experimentally. The prototype tool is available in the online appendix, https://github.com/search-based-fairness-testing.

RQ1: How effective is SBFT?
To address RQ1, we use the fairness degree measure (Definition 2) to quantify the effectiveness of SBFT in comparison with the baseline approaches, i.e., Aequitas, random testing, Themis and SG. We compute the fairness degree discovered by each studied approach given an execution time of 2 hours. Since the studied approaches are non-deterministic, we repeat the experiments 20 times before conducting statistical tests to determine whether the differences in results are statistically significant. Table 2 shows a statistical summary and the results of the statistical tests of the fairness degree produced by the studied approaches for the 12 hospitals and the three sensitive attributes. We observe that, overall, SBFT is more effective than the baseline approaches, yielding a statistically higher fairness degree with large effect sizes at 10, 7, and 10 hospitals for country of birth, Indigenous status, and gender, respectively.
In particular, for country of birth, SBFT discovers that the fairness degree of the ED Software at hospital H5 is 44.94 minutes on average, which is 38.01 (+548.5%), 23.64

RQ2: How efficient is SBFT?
To address RQ2, we quantify the efficiency of each studied approach using the area under the curve (AUC) of the fairness degree discovered over time, as described under RQ2 in Section 6. As in RQ1, we allocate 2 hours of execution time to each of the studied approaches. We compute the fairness degree of each studied approach at every second. For each studied approach, we draw a line plot of fairness degree (y-axis) against execution time (x-axis), and compute the area under the curve. A large AUC value indicates that an approach can discover a large fairness degree early. Finally, we use the Mann-Whitney U-Test and the A12 statistic to determine the most efficient studied approach. Table 3 shows a statistical summary and the results of the statistical tests of AUC for the studied approaches for the 12 hospitals and the 3 sensitive attributes. We observe that, overall, SBFT is more efficient than the baseline approaches, yielding statistically higher AUC values with large effect sizes at 11, 5, and 6 hospitals for country of birth, Indigenous status, and gender, respectively.
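The AUC computation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the samples are assumed to be equally spaced in time, the trapezoidal rule is our choice of integration method, and the normalisation constant `y_max` (for instance, the largest fairness degree found by any approach for the same hospital and attribute) is an assumption, since the normalisation reference is not spelled out here.

```python
def normalised_auc(best_so_far, y_max):
    """Area under the fairness-degree-over-time curve, both axes in [0, 1].

    best_so_far[i] is the best fairness degree found up to the i-th
    equally spaced time step (e.g., every second of the run).
    y_max is the value used to scale the y-axis (an assumption).
    """
    n = len(best_so_far)
    if n < 2 or y_max <= 0:
        raise ValueError("need at least two samples and a positive y_max")
    ys = [value / y_max for value in best_so_far]
    dx = 1.0 / (n - 1)  # x-axis normalised to [0, 1]
    # Trapezoidal rule over consecutive samples.
    return sum((ys[i] + ys[i + 1]) / 2.0 * dx for i in range(n - 1))
```

With this definition, a run that reaches its final fairness degree immediately scores close to 1.0, whereas a run that only finds it at the very end scores close to 0.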
In particular, for country of birth, SBFT has an AUC value of 0.90 on average at hospital H5, which is 0.85 (+1700.0%), 0.59 (+190.3%), 0.55 (+157.1%) and 0.36 (+66.7%) higher than symbolic generation, Aequitas, Themis and random testing, respectively. In some cases, SBFT achieves AUC values close to 1.0, e.g., at H3 and H7 for Indigenous status, suggesting that it converges to the final fairness degree of the ED Software very early in the search. We observe that SBFT discovers the highest fairness degree of 48.85 minutes at H7 for Indigenous status. These findings show that SBFT can find the largest fairness degree efficiently, i.e., in the early stage of the search.
Symbolic generation (SG) has the worst performance of all the approaches. In particular, it discovers a fairness degree of the ED Software of at most 7.91 minutes on average for any sensitive attribute at any hospital. We find that SG generates only 8 test inputs on average in an execution time of 2 hours across the three sensitive attributes and the 12 hospitals. This is significantly lower than the number of test inputs generated by the other four approaches; for example, SBFT and random testing generate around 14,000 test inputs. The symbolic generation approach is significantly slowed down by its local explainer. SG uses LIME (Ribeiro et al. 2016) as the local explainer, and LIME produces an execution path for each test input generated. To produce the path, LIME randomly samples a large number of inputs in the neighbourhood of the generated test input (5000 inputs by default), and executes the ED Software for each of these inputs, which is computationally expensive.
Figure 3 shows the fairness degree improvements of the five approaches over execution time at hospital H5 for the sensitive attribute country of birth. We observe that SBFT finds better test cases than the baseline methods early in the search, and keeps this advantage throughout the execution time, with a steady improvement in solution quality. Aequitas, Themis and random testing, on the other hand, have long plateaus where fitness does not improve for many iterations, followed by sudden large increments in the fairness degree. This behaviour is expected for Themis and random testing because both of them are based on random search techniques. The local search procedure in Aequitas considers all the discriminatory inputs as equally important.
We observe that SBFT and random testing generate 14,455 and 14,341 test inputs on average, respectively, in an execution time of 2 hours for the sensitive attribute country of birth. On the other hand, Aequitas generates only 100 test inputs on average.
Such significant differences in the number of generated test inputs are due to the exhaustive nature of the sampling process of Aequitas. The fitness evaluation in Aequitas is slower than in SBFT and random testing because Aequitas samples every possible value of the sensitive variable to evaluate the fitness of a test input. Thus, when a sensitive attribute has many possible values, e.g., country of birth, which has a total of 293 possible values, Aequitas is much slower than SBFT and random testing. We observe the same problem in the fitness evaluation of Themis, which generates only 96 test inputs on average for country of birth. The fitness evaluation in Themis is similar to that of Aequitas in that Themis samples every possible value of the sensitive attribute until the input is determined to be discriminatory, i.e., until the difference of the predicted outputs is greater than the threshold.
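The cost of this style of evaluation can be illustrated with a short sketch. It is not the implementation of Themis or Aequitas: the function name, the dictionary-based input encoding and the callable `model` are hypothetical, and the sketch only mirrors the behaviour described above, i.e., trying sensitive values one by one until the prediction differs from the original by more than the threshold.

```python
def is_discriminatory(model, test_input, attribute, values, threshold=10.0):
    """Threshold-based check with early stopping.

    Returns (flag, model_calls). `model` maps a dict of feature values
    to a predicted wait time in minutes (hypothetical interface).
    """
    base_prediction = model(test_input)
    model_calls = 1
    for value in values:
        if value == test_input[attribute]:
            continue  # the original value has already been evaluated
        variant = dict(test_input)
        variant[attribute] = value
        model_calls += 1
        if abs(model(variant) - base_prediction) > threshold:
            return True, model_calls
    # No variant exceeded the threshold: every value was executed.
    return False, model_calls
```

When no value exceeds the threshold, the model is executed once per possible value of the sensitive attribute, which for an attribute with 293 values means 293 executions for a single test input.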
Further analysis of the results shows that SBFT is not only better at finding a higher fairness degree of the ED Software, but also at generating test cases with a higher individual fairness degree as measured by (6). Figures 4, 5 and 6 show the distributions of the individual fairness degree of the test cases generated by the studied approaches for each hospital as violin plots for the sensitive attributes country of birth, Indigenous status and gender, respectively. We exclude test cases with an individual fairness degree of less than one minute from the analysis. According to the Mann-Whitney U-Test, SBFT generates test cases with a significantly higher individual fairness degree than the baseline approaches at 7, 6 and 3 hospitals for country of birth, Indigenous status and gender, respectively. Descriptive statistics are provided in the online appendix, https://github.com/search-based-fairness-testing. We can also clearly see in the violin plots that SBFT has a higher third quartile and more test cases above the third quartile at hospitals H2, H3, H5 and H7 for country of birth, H3 and H7 for Indigenous status, and H11 for gender. This suggests that the test cases generated by SBFT have a higher individual fairness degree than the ones generated by the baseline approaches.

Discussion
Fig. 6 Individual fairness degree distributions as discovered by each approach for gender
Healthcare delivery is biased in favour of certain population groups. These favoured groups will experience better outcomes for the same diseases when compared to other populations (higher survival rates and less long-term disability). The reasons for this are many, and bias exists at decision points the whole way along some patients' journeys through illness, from recognising the first symptoms to final long-term treatment. This creates biased derivation health datasets. Machine learning algorithms may well learn these biases and produce biased outputs when applied to health questions or scenarios. There is a risk of not only continuing to propagate these endemic biases, but also amplifying them as machine learning algorithms are incorporated into healthcare decision making.
In health, a woman with heart attack symptoms might not be advised to seek immediate medical care (Bairey Merz et al. 2017), whereas a man might be encouraged to call an ambulance and attend a heart hospital for immediate care (McSweeney et al. 2016; Udell et al. 2018). In a time-critical condition such as a heart attack, this will increase the woman's risk of dying or having a larger area of damage to her heart compared to the outcome had she received aggressive, immediate treatment (Stehli et al. 2021).
Favoured groups in healthcare tend to be middle-aged men, who speak the main language of their country, are of the majority racial group, and have high affluence and health literacy (Juergens et al. 2016;Wechkunanukul et al. 2016). Groups who fare less well include women, older people, those with cultural and linguistic diversity compared to the majority of the country, and those with low affluence and low health literacy (Vogel et al. 2021).
In order to understand if machine learning is generating biased outputs prior to deploying them in the real world, it is important to be able to detect the biases. Once the biases are detectable, machine learning could have the potential to mitigate against biases rather than amplifying the inequities.
Fairness is recognised as a critical non-functional attribute of machine learning systems (Horkoff 2019). Recent work has focused on improving ML fairness, finding that removing sensitive features such as gender and ethnicity is not sufficient to ensure fair outcomes (Kamishima et al. 2011), which highlights the importance of fairness testing approaches. Our novel approach for fairness testing in regression-based machine learning systems can help software teams to identify the degree of fairness in predictions and make data-informed decisions about whether such software systems are ready to deploy.
In this study, we demonstrate the use of fairness degree to measure the biases in regression-based ML systems from the healthcare domain, i.e., emergency department wait-time prediction. We can draw some parallels between the new measure of fairness degree and existing work on worst-case analysis pertaining to quality attributes such as reliability (Bishop and Bloomfield 2002) and performance (Puschner and Burns 2000; Cortellessa et al. 2005; Ramamoorthy and Ho 1980). For example, Cortellessa et al. (2005) devise a methodology for estimating performance-based risk factors for software systems, which originate from violations of performance requirements, i.e., performance failures. The work introduces annotated UML diagrams to estimate the performance failure probability, which is combined with the failure severity estimate, thus enabling the determination of risky scenarios. The fairness degree has a similar goal, as it aims at identifying the most risky scenarios with respect to fairness.
Puschner and Burns (2000) present an overview of worst-case execution time analysis for safety-critical real-time systems. Such analysis forms the basis for establishing confidence in the timely operation of a real-time system. When it comes to fairness, the fairness degree can be considered a worst-case analysis, as it aims at determining the highest bias that a system exhibits. Worst-case performance analysis is relevant specifically for safety-critical real-time systems, such as robotics or cars (e.g., the response time of the software component controlling the brake pedal in a car), and may not be as relevant in software systems that do not need to perform safety-critical and real-time actions. Similarly, the fairness degree is an important metric that should be considered when developing software systems that are used in life-critical scenarios, such as the problem domain considered in this paper.
Fairness degree can also be used to measure the biases in other ML systems, such as crime prediction algorithms, e.g., predicting the risk of an individual re-offending (recidivism) (Angwin et al. 2016), personalised price prediction algorithms (Mahdawi 2018), patient risk-score prediction algorithms (Ledford 2019), and face detection algorithms, e.g., to crop a human face in an image (Hern 2020). For example, face detection algorithms detect the position of human faces in an image using a bounding box, i.e., a rectangular box drawn around the human face in the image. The output is a bounding box described by the (x, y) coordinates of the centre of the box along with its height and width. Given a face detection algorithm, it is reasonable to expect that it outputs the same bounding box ((x, y), height and width) for two faces that are identical except for a sensitive attribute such as skin colour (e.g., light skin and dark skin). If it does not, the face detection algorithm is biased against individuals based on their skin colour. The concept of fairness degree can be leveraged to measure the maximum difference in the (x, y) coordinates (and also the height and width) of the detected boxes for two faces that differ only in a sensitive attribute, such as skin colour.
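To make the face detection example concrete, the sketch below compares bounding boxes for pairs of faces that differ only in a sensitive attribute. It is a hypothetical illustration: the function names, the tuple layout (x, y, height, width) and the decision to summarise a pair by its largest per-component difference are our assumptions, and other ways of aggregating the components are equally possible.

```python
def bounding_box_disparity(box_a, box_b):
    """Largest absolute difference between two boxes, component-wise.

    Each box is a tuple (x, y, height, width) in pixels.
    """
    return max(abs(p - q) for p, q in zip(box_a, box_b))


def worst_case_disparity(box_pairs):
    """Maximum disparity over pairs of boxes detected for faces that are
    identical except for the sensitive attribute."""
    return max(bounding_box_disparity(a, b) for a, b in box_pairs)
```

For instance, the boxes (100, 120, 80, 60) and (104, 118, 80, 67) differ by at most 7 pixels in any component, so that pair contributes a disparity of 7.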

Fairness of Machine Learning Systems
"Humans are inscrutable in a way that algorithms are not" (Mullainathan 2019). As software is created by people, it is inevitable that human biases will end up in code, resulting in biased software. Indeed, it is becoming increasingly evident that machine learning systems are vulnerable to bias, which renders their decisions "unfair". In the context of decision-making, fairness is the absence of any favouritism toward an individual or a group based on their inherent or acquired characteristics (Mehrabi et al. 2019). In other words, an unfair machine learning system is one whose decisions are skewed toward a particular group of people (Binns 2018).
Unlike humans, however, software can be tested, creating the potential for new forms of transparency and hence opportunities to detect biases that are otherwise unavailable (Galhotra et al. 2017). For example, in 2016, Amazon.com, Inc. used software to determine the parts of the United States to which it would offer free same-day delivery. The decisions made by the software prevented minority neighbourhoods from benefiting from this offer, often when every surrounding neighbourhood could participate (Ingold and Soper 2016b).

Bias Detection and Mitigation Techniques
The machine learning community has been working on developing techniques for bias detection and mitigation for ML-based systems (Kamiran and Calders 2012;Calmon et al. 2017;Biswas and Rajan 2020). For example, IBM AI Fairness 360 toolkit (Bellamy et al. 2019) provides state-of-the-art bias detection techniques with 71 fairness metrics, such as disparate impact, which is the ratio between the probability of unprivileged group getting a favourable prediction and the probability of privileged group getting a favourable prediction. Similar to IBM AI Fairness 360, fairkit-learn (Johnson et al. 2020) is a toolkit that helps practitioners to reason about and understand fairness. In addition, it can also evaluate multiple models by computing the optimal trade-offs between fairness and performance of models. However, such toolkits only detect bias using the existing datasets, but are not able to detect bias by discovering discriminatory inputs via exploring the search space of all possible inputs.
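As an illustration of the disparate impact metric described above, the following minimal sketch computes the ratio directly from a list of predictions and group labels. It is written in plain Python and does not use or reproduce the API of IBM AI Fairness 360; the function name and arguments are hypothetical.

```python
def disparate_impact(predictions, groups, unprivileged, privileged, favourable=1):
    """Ratio of favourable-prediction rates: unprivileged / privileged."""

    def favourable_rate(group):
        outcomes = [p for p, g in zip(predictions, groups) if g == group]
        if not outcomes:
            raise ValueError(f"no instances for group {group!r}")
        return sum(1 for p in outcomes if p == favourable) / len(outcomes)

    privileged_rate = favourable_rate(privileged)
    if privileged_rate == 0:
        raise ValueError("privileged group has no favourable predictions")
    return favourable_rate(unprivileged) / privileged_rate
```

A value of 1.0 means both groups receive favourable predictions at the same rate. Note that the metric relies on a discrete favourable outcome and on an existing dataset of predictions, which is the limitation discussed above.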
Bias mitigation algorithms can be classified into pre-processing (fix the data), in-processing (fix the classifier), and post-processing (fix the predictions) techniques. Pre-processing techniques do not change the model, but only work on the dataset before training so that models can produce fairer predictions. Examples include a reweighing technique (Kamiran and Calders 2012) and a disparate impact remover technique (Feldman et al. 2015). Zhang and Harman (2021) find that an enlarged feature set substantially improves both model accuracy and fairness, while extending the training dataset when the feature set is small could result in more unfairness in the trained model. In-processing techniques modify the ML model to mitigate the bias in the original model prediction, for example, an adversarial debiasing technique (Zhang et al. 2018). Post-processing techniques aim to modify the prediction results instead of the ML models or the input data. Fairway (Chakraborty et al. 2020) is an example of a technique that combines pre-processing and in-processing bias mitigation techniques. Although various techniques have been proposed in the machine learning community, such techniques are not designed for bias discovery in ML models.

Fairness Testing
Fairness testing of ML-based systems is still at an early stage. The existing fairness testing techniques are primarily focused on classification-based ML systems (Galhotra et al. 2017; Udeshi et al. 2018; Aggarwal et al. 2019). Among those techniques, only Aequitas (Udeshi et al. 2018) is designed to work with regression-based ML systems. In addition, Galhotra et al. (2017) suggest ways of extending Themis to work with continuous variables as outputs. However, both papers limit their experimental evaluations to systems with binary outputs.
Udeshi et al. (2018) proposed Aequitas, a directed test generation technique focusing on the individual fairness of the SUT. Aequitas starts with a global search and then performs a local search on the discriminatory inputs identified in the global search to find further discriminatory inputs. Their approach uses a discrimination threshold on the difference of two outputs to determine discrimination, thereby making it compatible with regression-based ML systems. Aequitas directs the local search only by considering whether the recent inputs are discriminatory or not, and it considers all discriminatory inputs as equally important. This is contrary to the concept of fairness degree, which treats extreme deviations in predictions as important.
Aequitas exploits the inherent robustness property of machine learning models, i.e., the output of the model should have low variation for small perturbations in the input (Udeshi et al. 2018). While this works for its intended task, i.e., maximising the number of discriminatory inputs found, we argue the Aequitas approach is ineffective in terms of discovering fairness degree. The local search in Aequitas does small perturbations on the inputs identified in the global search. According to the robustness property, the inputs identified in the local search should behave similarly to the ones identified in the global search, hence, the local search fails to make use of the inputs provided by the global search in order to maximise the observed fairness degree. In contrast, SBFT employs a genetic algorithm to incrementally maximise the fairness degree. In particular, SBFT evolves a population of test inputs through exploration and exploitation via mutation, crossover and a fast local search procedure.
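To illustrate the general idea of evolving test inputs towards a larger prediction gap, the sketch below implements a small genetic algorithm. It is not the SBFT implementation: the encoding of an individual as one set of non-sensitive feature values plus a pair of sensitive values, the tournament selection, the elitist survivor selection and all names are our assumptions, and the cache and the local search procedure of SBFT are omitted. Only the default parameter values are taken from the tuned settings reported in Section 6.3.

```python
import random


def search_fairness_degree(model, feature_ranges, sensitive_values,
                           generations=30, pop_size=100,
                           p_cr=0.75, p_mu=0.70, r_i=0.10, seed=0):
    """Toy genetic search; returns (best_fitness, (features, a, b))."""
    rng = random.Random(seed)
    names = sorted(feature_ranges)

    def random_individual():
        features = {n: rng.randint(*feature_ranges[n]) for n in names}
        a, b = rng.sample(sensitive_values, 2)  # two distinct sensitive values
        return features, a, b

    def fitness(individual):
        # Prediction gap between two inputs differing only in the sensitive value.
        features, a, b = individual
        return abs(model(features, a) - model(features, b))

    def crossover(p, q):
        # Uniform crossover on features; the sensitive pair comes from one parent.
        features = {n: (p[0][n] if rng.random() < 0.5 else q[0][n]) for n in names}
        donor = p if rng.random() < 0.5 else q
        return features, donor[1], donor[2]

    def mutate(individual):
        # Resample either one feature or the sensitive pair.
        features, a, b = dict(individual[0]), individual[1], individual[2]
        target = rng.choice(names + ["<sensitive pair>"])
        if target == "<sensitive pair>":
            a, b = rng.sample(sensitive_values, 2)
        else:
            features[target] = rng.randint(*feature_ranges[target])
        return features, a, b

    def tournament(scored):
        first, second = rng.choice(scored), rng.choice(scored)
        return first[1] if first[0] >= second[0] else second[1]

    scored = [(fitness(ind), ind)
              for ind in (random_individual() for _ in range(pop_size))]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            if rng.random() < r_i:  # random test insertion
                child = random_individual()
            else:
                parent, mate = tournament(scored), tournament(scored)
                child = crossover(parent, mate) if rng.random() < p_cr else parent
                if rng.random() < p_mu:
                    child = mutate(child)
            offspring.append((fitness(child), child))
        # Elitist survivor selection: keep the best pop_size individuals.
        scored = sorted(scored + offspring, key=lambda s: s[0], reverse=True)[:pop_size]
    return scored[0]


def toy_model(features, group):
    # Hypothetical regression model that penalises group "B" by triage level.
    penalty = 3.0 * features["triage"] if group == "B" else 0.0
    return 20.0 + 0.5 * features["queue"] + penalty
```

Calling `search_fairness_degree(toy_model, {"queue": (0, 50), "triage": (1, 5)}, ["A", "B", "C"])` returns the best gap found together with the input that produces it. Unlike a threshold-based search, the fitness here rewards larger gaps directly, so the search keeps improving after the first discriminatory input is found.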
Themis (Galhotra et al. 2017) is a random test generation technique that targets group and causal discrimination of the SUT. Its causal discrimination detection technique can be used to detect violations of individual fairness and, hence, to find the fairness degree. Unlike SBFT, Themis searches for test inputs randomly, without any guidance to maximise the fairness degree of the SUT.
Aggarwal et al. (2019) propose a black-box testing technique called symbolic generation, which focuses on the individual fairness of classification-based ML systems. Their approach uses symbolic execution, which systematically generates test inputs, together with local explanation, which approximates the path in the model for a corresponding input. The symbolic generation approach is very slow at the local explainer. When LIME (Ribeiro et al. 2016) produces an execution path for a corresponding input in each iteration, it randomly samples a large number of inputs in the neighbourhood, e.g., 5000, and executes the model for each of these inputs. This is a time-consuming task compared to the fast search procedures used in SBFT, which only executes the model to evaluate the fairness degree. As a result, symbolic generation spends a large percentage of its allocated execution time building decision tree paths, while SBFT uses a large part of the execution time to explore the search space for high-quality test inputs that reveal biases.
Adversarial discrimination finder (ADF) is a white-box testing technique proposed to detect individual discriminatory instances in classification-based ML systems. It was specifically designed for systems built with deep neural networks (DNNs). In contrast, SBFT is a black-box testing technique that does not require the internal details of the model under test, which enables testing of third-party ML systems where the models are not accessible (Angwin et al. 2016). Unlike ADF, the concept of SBFT is not model dependent and can be applied to any regression-based ML system.
FairTest (Tramer et al. 2017) is a testing tool that detects statistically significant associations between sensitive attributes and an output of a model and generates an interpretable bug report. These associations are called unwarranted associations and are characterised as fairness bugs in the paper. FairTest needs an existing dataset as input to generate the bug report.
The datasets used to train and test ML models often contain highly sensitive information, e.g., in healthcare domain, and are rarely available to practitioners at the time of testing. Unlike FairTest, SBFT does not require ground-truth datasets, and it evolves an initial population of random test inputs to maximise the identified fairness degree of the system under test.
Unlike other testing approaches (Galhotra et al. 2017; Udeshi et al. 2018; Aggarwal et al. 2019; Tramer et al. 2017), TILE (Sharma and Wehrheim 2019) tests the fairness of ML algorithms at the learning stage. It applies four metamorphic transformations to the training data: i) permutation of training data instances, ii) permutation of feature ordering, iii) shuffling of feature names and iv) renaming of feature values. It claims the algorithm is fair when the application of the transformations results in equivalent predictors. In contrast to TILE, SBFT is intended for use at the prediction stage, where the models are already built.

Situation Testing
The concept of fairness degree relates to situation testing. Situation testing is a systematic research procedure used in the legal field to analyse discriminatory treatment of an individual (Bendick 2007). In situation testing, pairs of individuals who are similar to each other except for their membership in two different protected groups are sent out to decision makers. Then, the decisions each pair receives are used to analyse the discriminatory behaviour. Fairness degree also checks the difference in outcomes for two individuals who are identical to each other except for their membership in two protected groups. In contrast to situation testing, fairness degree focuses on the maximum difference in the outcomes.
Luong et al. (2011) first used situation testing for discrimination discovery in a dataset in a classification setting. They used a k-NN approach to find similar individuals in a dataset. Zhang et al. (2016) followed a similar approach for discrimination discovery, using Causal Bayesian Networks to find similar individuals. Chakraborty et al. (2021) proposed to train a logistic regression model on the dataset and then flip the value of the sensitive attribute for every data point to find out whether the outcome changes. The original data input was considered biased if the outcome changed. As opposed to these works, SBFT does not require a dataset and explores the search space of all valid test inputs to find the fairness degree of the ML system under test. In contrast to Luong et al. (2011) and Zhang et al. (2016), SBFT considers two individuals similar only if they are identical except for a sensitive attribute. These discrimination discovery techniques based on situation testing are designed for classification-based ML systems (Luong et al. 2011; Zhang et al. 2016; Chakraborty et al. 2021). However, it is difficult to determine whether two outcomes are different enough to be considered biased in a regression setting. We address this problem by introducing a novel fairness measure, fairness degree.

Fairness Measures for Machine Learning Systems
Underpinning the efforts for fair ML systems are an ever increasing array of fairness measures which aim to quantify fairness. Any measure, however, will undoubtedly have limitations. Indeed, the implication of "measurement" is problematic as it implies a straightforward process (Barocas et al. 2018).
Several studies have developed and applied fairness measures to evaluate ML systems. The most well-known example is the report by ProPublica (Angwin et al. 2016) on the COMPAS algorithm. COMPAS (Dieterich et al. 2016) was used for recidivism prediction, and it was shown to have higher false positive rates for African American defendants than for European American defendants, which was interpreted as evidence that the tool was biased (Angwin et al. 2016). The false positive metric has also been used to measure the fairness of credit scoring algorithms (Hardt et al. 2016) and child welfare services (Chouldechova et al. 2018). Other studies have focused on Area Under the Curve (AUC) as a measure of fairness (Siegel 2003).
Aside from the philosophical and ethical debates on defining fairness, creating generalised definitions of fairness is difficult. Metrics usually either emphasise individual (e.g., everyone is treated equal), or group fairness. Existing fairness measures can be broadly categorised into group fairness and individual fairness (Caton and Haas 2020), with the following examples standing out in the literature.
- Demographic parity (Dwork et al. 2012), group fairness: the likelihood of a positive outcome should be the same independent of the value of the protected attribute.
All of these measures were developed and applied to classification-based machine learning systems, and cannot be directly applied to regression-based machine learning systems.
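For reference, the demographic parity criterion listed above can be written for a binary classifier as follows. The notation is ours and is not taken from the cited work: the symbol \hat{Y} denotes the predicted outcome, A the protected attribute, and a and b any two of its values.

```latex
P(\hat{Y} = 1 \mid A = a) \;=\; P(\hat{Y} = 1 \mid A = b)
\qquad \text{for all values } a, b \text{ of } A
```

The condition is stated over the probability of a discrete positive outcome, which illustrates why such a measure does not carry over directly to a continuous prediction such as a wait time in minutes.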

Search-Based Software Testing
Search-based software testing techniques have been applied to a plethora of testing problems such as unit testing (e.g., EvoSuite (Fraser and Arcuri 2013), AUSTIN (Lakhotia et al. 2013)), end-to-end testing of Android apps (e.g., Sapienz (Mao et al. 2016)), functional testing (Wegener and Bühler 2004), and security testing (Del Grosso et al. 2005). EvoSuite is a state-of-the-art unit test generation tool that generates JUnit test suites for Java programs to optimise code coverage by employing metaheuristic search methods such as genetic algorithms (Panichella et al. 2015). EvoSuite has been shown to be effective not only at achieving high code coverage (Panichella et al. 2018), but also at finding bugs (Perera et al. 2020). Sapienz is an automated Android testing tool that optimises test sequences, maximising coverage and fault revelation, and it has been successfully deployed in production at Facebook (Alshahwan et al. 2018). Prior to our work, search-based methods had not been applied in the area of fairness testing. SBFT is the first approach that generates bias-revealing test inputs to optimise the observed fairness degree of a machine learning system by employing a search-based technique.

Construct Validity
The new measure introduced in this paper tackles a particular context of regression-based machine learning systems. While we do not claim that the fairness degree covers all fairness concerns of such systems, a threat to the validity of the results is the appropriateness of inferences made on the basis of such measure.
The SBFT approach is an approximate method. It effectively and efficiently searches the search space of potential test inputs via exploration and exploitation techniques to find the fairness degree of the SUT. However, it does not provide a guarantee that the final identified fairness degree is the actual fairness degree, i.e., absolute worst case, of the SUT, which is always greater than or equal to the fairness degree discovered by SBFT. Our experimental evaluation on an ML system in the healthcare domain demonstrates that SBFT is more effective and efficient than the existing approaches at finding fairness degree.
Another construct threat to validity relates to the definition of our fairness degree. Our proposed fairness degree measures the maximum difference in the values predicted by the machine learning system, while the existing fairness measure by Berk et al. (2017) considers the average difference. Each measure has its own advantages and disadvantages. The maximum difference highlights the worst case of unfairness of an ML-based system under test, but it may be skewed by outlier data. This threat is mitigated in our experimental evaluation, since the outliers in the patient datasets were removed prior to building the ED wait-time models (as mentioned in Section 6.2). The average difference (Berk et al. 2017), on the other hand, is less sensitive to outliers, but it does not reveal the worst case of unfairness of the ML-based system under test, which is the primary focus of our paper.
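The trade-off between the two summaries can be seen on a small, made-up example. The sketch below assumes paired predictions for inputs that are identical apart from the sensitive attribute; the numbers are invented for illustration, and the plain mean stands in for an average-based measure rather than reproducing the exact formulation of Berk et al. (2017).

```python
from statistics import mean


def disparity_summary(pred_a, pred_b):
    """Return (maximum, mean) absolute difference over paired predictions."""
    diffs = [abs(a - b) for a, b in zip(pred_a, pred_b)]
    return max(diffs), mean(diffs)


# Hypothetical predicted wait times (minutes) for four paired inputs.
group_a = [30.0, 42.0, 55.0, 61.0]
group_b = [31.0, 41.0, 56.0, 101.0]

worst, average = disparity_summary(group_a, group_b)
print(worst, average)  # 40.0 10.75
```

Three of the four pairs differ by one minute, so the mean (10.75) hides the single 40-minute gap that the maximum exposes. The same pair would also dominate the maximum if it were an outlier, which is the sensitivity discussed above.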

Internal Validity
To account for the non-deterministic behaviour of the five algorithms, we repeat each experiment 20 times and carry out rigorous statistical analysis, i.e., the two-tailed non-parametric Mann-Whitney U-Test (Arcuri and Briand 2014) and Vargha and Delaney's Â12 statistic (Vargha and Delaney 2000), to draw conclusions from the results.
All algorithms are implemented in the same tool. The baseline approaches, Aequitas, Themis and symbolic generation, are implemented as described in the original papers (Galhotra et al. 2017; Udeshi et al. 2018; Aggarwal et al. 2019). Thus, any confounding effects due to different implementations or use of tools are mitigated in our experimental evaluation.
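For readers unfamiliar with the effect size, the following is a minimal sketch of the Â12 statistic and the related U statistic, computed directly from their pairwise definitions. The function names and sample values are illustrative and not taken from our tool; the p-value of the two-tailed test would normally come from a statistics library (for example, `scipy.stats.mannwhitneyu` with `alternative="two-sided"`).

```python
def a12(x, y):
    """Vargha-Delaney effect size: probability that a value drawn from x
    exceeds a value drawn from y, counting ties as one half."""
    greater = sum(1 for xi in x for yj in y if xi > yj)
    ties = sum(1 for xi in x for yj in y if xi == yj)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def mann_whitney_u(x, y):
    """U statistic for sample x, using the identity U = A12 * m * n."""
    return a12(x, y) * len(x) * len(y)


# Hypothetical fairness degrees found by two approaches over three runs each.
approach_x = [5, 6, 7]
approach_y = [1, 2, 6]

print(a12(approach_x, approach_y))             # 7.5 / 9, about 0.83
print(mann_whitney_u(approach_x, approach_y))  # 7.5
```

A value of 0.5 indicates no difference between the two samples, while values close to 1 indicate that the first approach tends to produce larger values than the second.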

External Validity
We evaluate SBFT using 12 ED wait-time prediction models for 12 hospitals with three different sensitive attributes. While the results cannot be generalised to all regression-based machine learning systems, the concept behind our proposed SBFT approach does not depend on the ED Software used in our experimental evaluation. Thus, future work can explore whether SBFT is effective and efficient for other regression-based machine learning systems.
SBFT supports three variable types for the test inputs: integer, real and categorical. While other input types, such as strings, images, sounds or videos are not directly handled in the current version of SBFT, the core concepts of our algorithm are independent of the input variable types. SBFT will be able to handle these types of variables by incorporating additional techniques to create valid inputs.

Conclusion
We present a new approach for evaluating the fairness of regression-based machine learning systems, which includes a new measure called fairness degree and a new automated testing approach for testing regression-based machine learning systems for fairness, called Search-Based Fairness Testing (SBFT). Our approach is motivated by a machine learning system from the healthcare domain, which is used to predict wait times in an emergency department (ED Software). Existing fairness testing approaches are not designed for regression-based machine learning systems, and SBFT is the first approach that tackles this important problem. Such systems are being used in life-critical situations such as the prediction of wait times in emergency departments. In our experimental evaluation, we observed differences of up to 48 minutes in the predicted wait times for patients that are identical apart from the sensitive attribute. Such errors in predictions may result in patients delaying critical treatment when the wait time is overestimated; hence, it is important that we test such machine learning systems for the fairness degree, which analyses the worst-case behaviour.
Our study follows a design science research approach. Runeson et al. (2020) proposed that the contributions of design science research be assessed with respect to relevance, rigour and novelty. In regard to the relevance of the research, we observe the problem in a real-world regression-based ML system, namely ED wait-time prediction software, and evaluate the proposed solutions on 12 ML systems from the healthcare domain. In Section 6.5, we discussed other regression-based ML systems to which our proposed solutions can be applied. In regard to the rigour of the research activities, we evaluate the proposed solutions on 12 ED Software systems from the healthcare domain, which were built using nearly 1.9 million patient records from 12 hospitals. Each ED Software system was evaluated for three sensitive attributes. The proposed SBFT technique was compared against four alternative solutions. We use the two-tailed non-parametric Mann-Whitney U-Test and Vargha and Delaney's Â12 statistic to draw conclusions from the results. In regard to the novelty of the technological rule, the proposed SBFT technique discovers fairness degrees of up to 48 minutes, which could go unnoticed if an existing fairness measure were used, either because of its smoothing effect (averaging) or because it does not quantify the magnitude of the outcome difference. In the context of EDs, this could mean patients delaying treatment because they did not want to wait. The alternative fairness testing techniques are not as effective and efficient as SBFT at measuring the fairness degree of regression-based ML systems.
The context is subtle and the impacts are nuanced. Overall, the continued exposure to bias adds up and influences health outcomes, more like a thousand cuts than one big event. In this scenario, our approach searches for test cases that expose the maximum difference in wait-time prediction by the ED Software for patients that are identical apart from the sensitive attribute. We demonstrate that SBFT discovers discriminatory inputs with larger margins in terms of fairness degree than the baseline approaches. SBFT is also more efficient than the baseline approaches at discovering discriminatory inputs. In the future, we plan to conduct a new study on ambulance wait-time prediction software and to work on other measures of fairness for regression-based machine learning systems.