Synthesizing Fair Decision Trees via Iterative Constraint Solving

Abstract. Decision trees are increasingly used to make socially sensitive decisions, where they are expected to be both accurate and fair, but it remains a challenging task to optimize the learning algorithm for fairness in a predictable and explainable fashion. To overcome the challenge, we propose an iterative framework for choosing decision attributes, or features, at each level by formulating feature selection as a series of mixed integer optimization problems. Both fairness and accuracy requirements are encoded as numerical constraints and solved by an off-the-shelf constraint solver. As a result, the trade-off between fairness and accuracy is quantifiable. At a high level, our method can be viewed as a generalization of entropy-based greedy search techniques such as CART and C4.5, and of existing fair learning techniques such as IGCS and MIP. Our experimental evaluation on six datasets, for which demographic parity is used as the fairness metric, shows that the method is significantly more effective in reducing bias than other methods while maintaining accuracy. Furthermore, compared to non-iterative constraint solving, our iterative approach is at least 10 times faster.


Introduction
Decision trees are one of the most widely used machine learning models in statistical analysis, data mining and decision making. Compared to other predictive models such as deep neural networks, decision trees have the advantage of being easily understandable by humans, which makes them a favorite building block in systems that require interpretability [34]. However, when they are used to make socially sensitive decisions in business, finance and law enforcement, decision trees may introduce bias against certain groups [16]. In this context, a widely used group fairness metric is demographic parity [11,38], also known as the 80% rule [8]. Bias against demographic groups, in general, comes from two sources. First, historical data used to learn models may be biased. Second, learning algorithms may be biased even if they operate on unbiased data.
State-of-the-art decision tree learning algorithms such as CART and C4.5 [10,29], which are the ones used by popular machine learning toolkits, rely on a greedy search technique that is optimized solely for high learning speed and classification accuracy. Since they do not consider fairness as an optimization requirement at all, they often produce decision trees that are severely biased. To mitigate the bias, modifications have been proposed to make the greedy search discrimination-aware [24] (e.g., IGCS). Unfortunately, these modifications are not always effective as shown by our own experimental evaluation in Sect. 5 and, more importantly, the impact of ad hoc modifications is often unpredictable and difficult to explain.
Meanwhile, there is a line of work in operational research that formulates decision tree learning as a mixed-integer optimization (MIO) problem [7,35]. Given a finite set F of decision attributes, or features, and a maximum tree depth K, the set of all possible decision trees is captured symbolically as a set of numerical constraints, which is then fed to a solver to compute the globally-optimal decision tree. While optimality was defined initially to minimize the tree size and accuracy loss [7,35], later on, fairness was added as a goal of the optimization [1,5]. However, the approach remains largely theoretical due to its limited scalability: since the entire decision tree must be encoded as a monolithic MIO problem, only small training datasets (with sample sizes in the 1000s) and small decision trees (with depths up to 4 or 5) can be handled [2,7].
To overcome the limitations of the existing approaches, we propose an iterative constraint solving technique for synthesizing decision trees in a practically efficient fashion while simultaneously optimizing for fairness and accuracy. Instead of encoding the decision tree as a monolithic MIO formula, we break it down into a series of small steps to avoid the scalability bottleneck. Specifically, starting from the root node, we use constraint solving to conduct a depth-bounded look-ahead search at each level of the decision tree, to compute the best feature. Within the look-ahead search, we encode both fairness and accuracy requirements explicitly as numerical constraints, to make the fairness-accuracy trade-off not only predictable but also easy to explain.
The overall flow of our method, SFTree, is shown in Fig. 1. Given a set of training examples (E), a set of features (F), and a sensitive feature (f_s ∈ F) as input, SFTree returns the synthesized decision tree (T) as output. Internally, SFTree encodes the hierarchical structure of a partial decision tree symbolically, starting from the current node and its training set E and covering a fixed number of tree levels. Then, it uses an MIO solver to compute the optimal feature, f*, that minimizes the bias against the protected group, the classification error, and the tree size. Assuming that f* ∈ {0, 1} is a Boolean predicate, the training set is partitioned into subsets E_{f*} and E_{¬f*}, one for each child node. Our method iteratively partitions the child nodes until the training subset becomes empty, all examples in E belong to the same class, or all features in F have been used.
To demonstrate its effectiveness, we have implemented SFTree and evaluated it on six supervised learning datasets, consisting of three small datasets and three large ones. Since the small datasets can be handled even by the monolithic MIO approach (named MIP [1]) to obtain globally-optimal and fair solutions, we used them to evaluate the quality of decision trees learned by our method. The large datasets, which are out of the reach of MIP, were used to evaluate scalability. For comparison, we also evaluated CART [27], a mainstream decision tree learning algorithm, and IGCS [24], a discrimination-aware learning algorithm.
The experimental results show that, among all methods (CART, IGCS, MIP, and SFTree), SFTree produces the best overall solution in terms of fairness and accuracy. CART, by comparison, produces unfair decision trees in most cases and, while IGCS does well on the small datasets, it produces mostly unfair decision trees for the large datasets. Neither CART nor IGCS is effective in satisfying the well-known 80% Rule [8] for demographic parity [11,38]. In contrast, SFTree satisfies the 80% Rule in all cases. In terms of scalability, MIP fails to handle any of the large datasets, while SFTree handles all of them. In fact, among all four methods, SFTree is the only one that produces fair and accurate decision trees for datasets with more than 40,000 training samples.
To sum up, this paper makes the following contributions:
- We propose an iterative constraint-solving method for synthesizing fair decision trees:
  • By formulating feature selection as a series of mixed integer optimization subproblems, we make the constraints efficiently solvable.
  • By encoding fairness and accuracy explicitly as symbolic constraints, we make the trade-off quantifiable and easy to explain.
- We demonstrate the advantages of SFTree over existing approaches (CART, IGCS, and MIP) using six popular datasets in the fairness literature.
The remainder of this paper is organized as follows. In Sect. 2, we review the basics of decision tree learning and group fairness. In Sect. 3, we present our method. In Sect. 4, we present generalization and performance enhancement techniques. In Sect. 5, we present our experimental results. After reviewing the related work in Sect. 6, we give our conclusions in Sect. 7.

Training Dataset E
The training dataset is a finite set of examples E = {(x_i, y_i)}, where each input x_i = (f_1, . . . , f_k) is a vector of features, and each output y_i is a class label. Let F be the set of all features. For ease of comprehension, let us assume for now that all input features and the output class label are Boolean. In this case, every input x_i ∈ {0, 1}^k is a k-bit vector in the feature space, the output y_i ∈ {0, 1} is a bit, and a decision tree trained using E is a k-input Boolean function. To make the presentation clear, we may also use y_i ∈ {−, +} instead of y_i ∈ {0, 1} as the output, where − means "no" and + means "yes". Figure 2 shows a training set E, where each row in the table represents an example. The input features are a job candidate's gender (0 = Female, 1 = Male), college rank (0 = Low, 1 = High), experience (0 = No, 1 = Yes), and interview score (0 = Not-Good, 1 = Good), while the output shows whether the job is offered (0 = No, 1 = Yes). At the root of the decision tree, for instance, the input goes to the left branch when (f_4 = 0) and to the right branch when (f_4 = 1). The example illustrates three important notions associated with the training set: (1) partition of E, (2) entropy, and (3) conditional entropy.
Partition. Given a set E and a feature f_j, we can partition E into subsets E_{f_j=0} and E_{f_j=1}, or E_{¬f_j} and E_{f_j} in shorthand notation. Here, E_{¬f_j} consists of examples whose f_j is 0, and E_{f_j} consists of examples whose f_j is 1. By definition, we have E_{f_j} ⊆ E and E_{¬f_j} ⊆ E, E_{f_j} ∩ E_{¬f_j} = ∅, and E_{f_j} ∪ E_{¬f_j} = E.
For our example in Fig. 2, partitioning the dataset by gender (f_1) results in two subsets, E_{¬f_1} and E_{f_1}. Since there are 3 female and 4 male candidates, we have |E_{¬f_1}|/|E| = 3/7 and |E_{f_1}|/|E| = 4/7. Thus, the conditional entropy is the weighted average H(E | f_1) = (3/7) H(E_{¬f_1}) + (4/7) H(E_{f_1}).
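The entropy and conditional-entropy computations above can be sketched in a few lines of Python. The 7-example dataset below is hypothetical (a single Boolean gender feature), chosen only to reproduce the 3/7 vs. 4/7 split of the running example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(E) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(examples, j):
    """H(E | f_j): weighted average entropy after partitioning E by feature j.
    Each example is (x, y) with x a tuple of Boolean features."""
    n = len(examples)
    h = 0.0
    for v in (0, 1):
        part = [y for (x, y) in examples if x[j] == v]
        if part:
            h += (len(part) / n) * entropy(part)
    return h

# Hypothetical 7-example set: 3 female (f_1 = 0), 4 male (f_1 = 1).
E = [((0,), 0), ((0,), 0), ((0,), 1),
     ((1,), 1), ((1,), 1), ((1,), 1), ((1,), 0)]
gain = entropy([y for _, y in E]) - conditional_entropy(E, 0)
```

The information gain of a split, used later by greedy learners such as CART, is exactly `entropy(E) - conditional_entropy(E, j)`.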

Real-Valued Features.
It is important to note that, while the above examples use Boolean features, our method is more general in that it allows all features to have real values, i.e., x_i ∈ [0, 1]^k instead of x_i ∈ {0, 1}^k. We accomplish this by applying one-hot encoding to any categorical feature and normalizing any real-valued feature to the [0, 1] domain. Thus, the branch predicates become (f_j < b_v) and (f_j ≥ b_v), instead of (f_j = 0) and (f_j = 1), where b_v ∈ (0, 1] is a threshold computed by our method. For example, if f_j is the (normalized) salary and b_v = 0.5, the branch predicates are (f_j < 0.5) and (f_j ≥ 0.5).
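The two preprocessing steps, min-max normalization to [0, 1] and one-hot encoding, can be sketched as follows; the salary values are made up for illustration:

```python
def normalize(values):
    """Min-max scale a real-valued feature into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def one_hot(values):
    """One-hot encode a categorical feature into Boolean columns."""
    cats = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in cats}

# Hypothetical salary feature; after normalization, the branch predicate
# (f_j >= b_v) with b_v = 0.5 splits the examples.
salary = [30000, 50000, 90000]
norm = normalize(salary)
split = [int(v >= 0.5) for v in norm]
```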

Decision Tree Learning
A decision tree T is a binary tree consisting of a set of nodes and a set of edges. Let the set of nodes be V ∪ L, where V is the subset of branch nodes (including the root) and L is the subset of leaf nodes. Let E be the set of edges between these nodes. A path in T is a sequence of nodes and edges, denoted v_0, e_1, v_1, . . . , v_n, e_n, l_n, where v_0 is the root, l_n is a leaf node, v_1, . . . , v_n are the internal nodes, and e_1, . . . , e_n are the edges.
Each edge has a branch condition, and the edge is activated only if the condition holds for a given input x. In Fig. 2, for example, the left-most path of the decision tree has the condition f_4(x) = 0 and output offer = 0, while the right-most path has the condition (f_4(x) = 1) ∧ (f_1(x) = M) and output offer = 1.
Given a training set E = {(x_i, y_i)}, where x_i is an input and y_i is the known output, mainstream algorithms aim to learn a decision tree T that minimizes the classification error. They also aim to minimize the tree size which, in general, allows T to generalize well on the test examples.
The Baseline Algorithm. Algorithm 1 shows the top-level procedure of these mainstream algorithms, T = DTL(E, F). It takes the training set E and the feature set F as input, and returns a decision tree (T) as output. These mainstream algorithms use a greedy method to recursively select decision attributes from F and use them to partition the training set E. At each step, the procedure selects the best feature f* using the subroutine FindNextFeature and, for each value i ∈ {0, 1} of f*, recursively constructs the subtree T_i. There are three base cases: if all examples in E share a label l, it returns T = LeafNode(l); if F = ∅, it returns T = LeafNode(l*) where l* = MostCommonLabel(E); and if E = ∅, it returns T = LeafNode(l*) where l* = MostCommonLabel(E.parent).
In CART, for example, FindNextFeature is entropy-based, to maximize the information gain of partitioning E by f as shown in Algorithm 2. While this is fast and often leads to high classification accuracy, it does not consider fairness and thus often produces biased decision trees. In this work, we use iterative constraint solving to overcome the limitation.
After f* is computed by FindNextFeature, Algorithm 1 uses it to partition the training set E and recursively processes the two resulting subsets. The recursion ends when:
- all training examples in the set E have the same class label (Lines 3-4);
- there are no features left in F to split E further (Lines 5-6); or
- the set E is empty (Lines 7-8).
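The recursion of Algorithm 1 can be sketched as below. The `find_next_feature` stand-in here greedily minimizes the misclassification count of a one-level split (CART and C4.5 would use information gain instead); the tree representation and the tiny dataset are our own illustrative choices, not the paper's:

```python
from collections import Counter

def most_common_label(E):
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_next_feature(E, F):
    """Stand-in greedy choice: the feature whose one-level split
    misclassifies the fewest examples (info gain would also work)."""
    def errors(f):
        e = 0
        for v in (0, 1):
            counts = Counter(y for (x, y) in E if x[f] == v)
            e += sum(counts.values()) - max(counts.values(), default=0)
        return e
    return min(F, key=errors)

def dtl(E, F, parent=None):
    """Sketch of Algorithm 1 (DTL) for Boolean features.
    A tree is either ('leaf', label) or ('node', f, left, right)."""
    labels = {y for _, y in E}
    if len(labels) == 1:                  # Lines 3-4: all labels agree
        return ('leaf', labels.pop())
    if not F:                             # Lines 5-6: no features left
        return ('leaf', most_common_label(E))
    if not E:                             # Lines 7-8: empty training set
        return ('leaf', most_common_label(parent))
    f = find_next_feature(E, F)
    E0 = [(x, y) for (x, y) in E if x[f] == 0]   # partition E by f
    E1 = [(x, y) for (x, y) in E if x[f] == 1]
    rest = F - {f}
    return ('node', f, dtl(E0, rest, E), dtl(E1, rest, E))

# Feature 1 perfectly predicts the label in this made-up dataset.
E = [((0, 0), 0), ((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]
tree = dtl(E, {0, 1})
```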

Fairness Metric
Given a training set E and a sensitive feature f_s ∈ F, e.g., race or gender, the goal is to construct a decision tree T that maximizes classification accuracy while minimizing bias. The metric concerned in this work, demographic parity [11,38], comes from the legal guideline in the United States for avoiding employment discrimination. Known as the 80% rule [8], it says that the rate at which the protected group receives the positive outcome must be at least 80% of the rate for the other group. This is formalized using the fairness index, F_s(T, E) = Pr+_{¬f_s} / Pr+_{f_s}, the ratio between the positive-outcome rates of the two groups (Eq. 1). For our running example, the tree fails to satisfy the 80% rule due to bias against female candidates. The bias is explicit in that f_1 is actually used in the edge labels of the right-most two paths of the decision tree. However, even if f_1 is not used in T explicitly, T may still be biased against female candidates, for example, if other non-sensitive features (or their combinations) are statistically correlated with f_1 and, as a result, introduce bias. This is the reason why mitigating bias during decision tree learning is a challenging task.
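The fairness index and the 80% rule can be computed directly on a tree's predictions. A minimal sketch, assuming the protected group is f_s = 0 and that both groups are non-empty (the function names and data are illustrative):

```python
def fairness_index(preds, sensitive):
    """Demographic-parity index F_s: ratio of positive-prediction rates
    between the protected group (f_s = 0) and the other group (f_s = 1).
    preds[i] is T(x_i) in {0, 1}; sensitive[i] is f_s(x_i)."""
    g0 = [p for p, s in zip(preds, sensitive) if s == 0]
    g1 = [p for p, s in zip(preds, sensitive) if s == 1]
    return (sum(g0) / len(g0)) / (sum(g1) / len(g1))

def satisfies_80_rule(preds, sensitive):
    """The 80% rule: the protected group's positive rate must be at
    least 80% of the other group's."""
    return fairness_index(preds, sensitive) >= 0.8
```

For instance, if the protected group is hired at half the rate of the other group, `fairness_index` is 0.5 and the rule is violated.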

Our Method
To minimize the bias and, at the same time, maximize the classification accuracy, we propose to follow the top-level procedure in Algorithm 1, but formulate feature selection as a series of mixed-integer optimization (MIO) subproblems.
As shown in Algorithm 3, each of our MIO subproblems consists of an objective function O and a constraint Φ, and the solution is an assignment of the numerical variables (shared by O and Φ) that minimizes O while satisfying Φ. In the remainder of this section, we present our symbolic encoding of the objective function, O, and the constraint, Φ, respectively.

The Objective Function O
We define the function as O := O_accu + α·O_tree − β·O_fair, consisting of components for accuracy loss (O_accu), tree size (O_tree), and fairness score (O_fair), respectively. The constants α and β are used to make trade-offs. In our implementation, α is fixed to 1/(2^{K+1} − 2), while β is the optimal value in [0, 1] selected using n-fold cross-validation.
Specifically, we test the values 0.02, 0.04, 0.06, . . . , 1.00 and, for each fold of the dataset, we compute the objective function and choose the β with the minimal objective value. In general, a bigger β means more fairness. Our experiments show that, as β gets larger, O_fair initially remains constant and then starts increasing while O_accu remains constant; after that, O_accu starts increasing as well.
Tree Size (O_tree := Σ_{v∈V} p_v). Since the decision tree structure is not known a priori, we encode a complete binary tree while allowing all branch and leaf nodes to be activated or de-activated. Recall that L is the subset of leaf nodes, V is the subset of branch nodes, l ∈ L denotes a leaf node, and v ∈ V denotes a branch node.
To get a valid decision tree, p_v must also be constrained by formula Φ (Sect. 3.2). Assuming the number of p_v variables is |V|, the tree size is the number of p_v variables with value 1.

Accuracy Loss (O_accu := (1/|L|) Σ_{l∈L} L_l).
We assign a variable L_l to each leaf node l ∈ L to represent the misclassification error at l. Since we start with a complete tree, each leaf node corresponds to a distinct path. The actual value of L_l is defined by formula Φ (Sect. 3.3). Assuming the number of L_l variables is |L|, the accuracy loss is measured by averaging the L_l values.
Fairness Score (O_fair := F). We assign a variable F to represent the overall fairness score of the decision tree. The value of F is defined by formula Φ (Sect. 3.4) according to the definition of demographic parity.
Next, we present our encoding of formula Φ := Φ_tree ∧ Φ_accu ∧ Φ_fair, where Φ_tree encodes the hierarchical structure of the tree, Φ_accu encodes the accuracy requirement, and Φ_fair encodes the fairness requirement. They share variables with O_tree, O_accu, and O_fair in the objective function, such as p_v, L_l, and F. Note that, since the constraint will be solved by an off-the-shelf MIO solver, Φ must be encoded as a conjunction of equality/inequality constraints. If logical-or operators are needed, they must be converted to equality/inequality operators.

Encoding of the Decision Tree (Φ_tree)
Given a node, which may be the root of the decision tree under construction or any of its branch nodes, we consider a depth-K complete binary tree rooted at that node. Since it is a complete binary tree, there are precisely T_K = 2^{K+1} − 1 nodes with indices 1 . . . T_K and, for any node n, the left and right child nodes have indices 2n and 2n+1, respectively. Furthermore, the set of leaf nodes is L = {2^K, . . . , 2^{K+1} − 1}.
- Every leaf node l ∈ L has an output class label, and the path from root to l represents a classification rule, which assigns any input x that goes through the path to the output class.
- Every branch node v ∈ V has a vector w_v of bits for selecting the feature.
Thus, at most one bit in w_v is 1. Figure 3 shows a depth-2 binary tree whose branch nodes are colored in teal and leaf nodes are colored in red. The thresholds b_1, b_2, and b_3 may be either 0 or a value in (0, 1]: only when they are non-zero are the corresponding nodes split by features. For instance, when b_2 is set to 1, if the edge condition (w_2^T x < 1) holds, input x goes to the left child, and if (w_2^T x ≥ 1) holds, x goes to the right child. When b_2 is set to 0, however, since the edge condition (w_2^T x < 0) is always false and (w_2^T x ≥ 0) is always true, input x always goes to the right child. In other words, b_2 = 0 disallows splitting at node v = 2.
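The routing semantics of the (w_v, b_v) encoding, including the b_v = 0 case that de-activates a split, can be sketched as follows; the specific depth-2 tree below is a made-up instance of Fig. 3's shape:

```python
def route(x, w, b, K=2):
    """Route input x through a depth-K complete binary tree encoded by
    per-node weight vectors w[v] and thresholds b[v] (node 1 is the root;
    children of node v are 2v and 2v+1). Since b[v] = 0 makes the left
    condition w[v]^T x < 0 always false, it sends every input right."""
    v = 1
    for _ in range(K):
        score = sum(wi * xi for wi, xi in zip(w[v], x))
        v = 2 * v if score < b[v] else 2 * v + 1
    return v  # index of the reached leaf (leaves are 2^K .. 2^(K+1)-1)

# Root splits on f_1 (w = [1, 0]); node 2 splits on f_2;
# node 3 is de-activated via b = 0.
w = {1: [1, 0], 2: [0, 1], 3: [0, 0]}
b = {1: 0.5, 2: 0.5, 3: 0.0}
```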
Symbolic Variables. To model how a feature splits the training set, we first define some symbolic variables. Variable z_it ∈ {0, 1} models whether input x_i is associated with node t, and variable I_t models whether node t has any associated input. The value of I_t is either 0 (no) or 1 (some).
Formula Φ_tree. We define the formula as Φ_tree := Π_split ∧ Π_edge ∧ Π_leaf ∧ Π_branch, where Π_split encodes how features are used to split branch nodes, Π_edge encodes the constraints on edges, Π_leaf encodes the constraints on leaf nodes, and Π_branch encodes the constraints on branch nodes.
Subformula Π_split. We construct Π_split by constraining p_v, w_vj, and b_v:
1. If p_v = 1, meaning v ∈ V is split, we require (Σ_{j∈{1,...,k}} w_vj = 1) to ensure exactly one feature is selected. We also require (b_v > 0) to activate the branch conditions on the outgoing edges.
2. If p_v = 0, meaning v is not split, we require (Σ_{j∈{1,...,k}} w_vj = 0) to ensure no feature is selected, and (b_v = 0) to de-activate the left branch. That is, input x always goes to the right, while the left subtree stops growing.
Subformula Π_edge. We construct Π_edge by constraining the p_v variables: if node v ∈ V stops splitting, its child nodes also stop splitting. That is, when p_v = 0, both p_{2v} and p_{2v+1} must also be 0. Thus, we have Π_edge := ∧_{v∈V} (p_v ≥ p_{2v}) ∧ (p_v ≥ p_{2v+1}).
Subformula Π_leaf. We construct Π_leaf by constraining variables z_it and I_t:
1. For each input x_i, where i ∈ {1, . . . , n} and n = |E|, we require that x_i is associated with exactly one leaf node l ∈ L, i.e., (Σ_{l∈L} z_il = 1).
2. If I_l = 0, meaning no input is associated with l, we require that (z_il = 0) for all i ∈ {1, . . . , n}. This is encoded as ∧_{l∈L} (z_il ≤ I_l).
Subformula Π_branch. We construct Π_branch by constraining w_vj, b_v, and z_it:
1. In a complete binary tree, the depth-d nodes are v ∈ {2^d, . . . , 2^{d+1} − 1}. Since exactly one of them is associated with input x_i, we require that condition Π_br1 := (Σ_{v∈{2^d,...,2^{d+1}−1}} z_iv = 1) holds.
2. At each node v ∈ V, since input x_i is associated with either the left child L = 2v or the right child R = 2v+1, but not both, we require that the three conditions Π_br2, Π_br3, and Π_br4 hold.
Thus, we have Π_branch := ∧_{i∈{1,...,n}} ∧_{d∈{1,...,K−1}} (Π_br1 ∧ Π_br2 ∧ Π_br3 ∧ Π_br4). Explanation of Π_br3 and Π_br4. What we would like to encode in Π_br3 is the fact that the branch condition (Σ_j w_vj x_ij < b_v) may be either TRUE (x_i goes to the left child L when z_iL = 1 and b_v ∈ (0, 1]) or FALSE (x_i goes to the right child R when z_iL = 0 and b_v ∈ (0, 1], or when b_v = 0). However, since off-the-shelf MIO solvers do not support logical-or operators, we have to encode these different scenarios in a single inequality constraint. This is accomplished by adding a slack value, −γ_L(1 − z_iL), to the branch condition. Similarly, in Π_br4, we add a slack value, γ_R(1 − z_iR), to the branch condition (Σ_j w_vj x_ij ≥ b_v).
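The slack trick can be checked numerically. The sketch below uses illustrative constants (a strict-inequality margin EPS and a big-M slack of 2, valid because features lie in [0, 1]); the exact coefficients in the paper's Π_br3/Π_br4 may differ, but the mechanism is the same: the inequality binds only when the corresponding z variable is 1:

```python
EPS, M = 1e-6, 2.0  # illustrative margin and big-M slack for features in [0, 1]

def br3(wx, b, z_left):
    """Linearized 'x goes left': w.x <= b - EPS + M*(1 - z_left).
    Enforces w.x < b when z_left = 1; vacuous when z_left = 0."""
    return wx <= b - EPS + M * (1 - z_left)

def br4(wx, b, z_right):
    """Linearized 'x goes right': w.x >= b - M*(1 - z_right).
    Enforces w.x >= b when z_right = 1; vacuous when z_right = 0."""
    return wx >= b - M * (1 - z_right)
```

Together, a feasible assignment must set z_iL = 1 exactly when the branch condition holds, without any logical-or in the encoding.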

Encoding of the Accuracy Requirement (Φ_accu)
To minimize the accuracy loss defined in O_accu := (1/|L|) Σ_{l∈L} L_l (Sect. 3.1), we need to constrain the L_l variables in Φ_accu such that L_l models the misclassification error at the leaf node l ∈ L. In the depth-K complete binary tree, there are |L| = 2^K leaf nodes. For each leaf node l, variable L_l represents the number of misclassified examples (x_i, y_i) ∈ E: an example is misclassified if the given output y_i does not match the predicted output T(x_i).
Subformula Φ_p. For each input x_i, where i ∈ {1, . . . , n} and n = |E|, and for each output value m ∈ {0, 1}, we use p_im to model whether (y_i = m). The value of p_im, which is either −1 or 1, is const_im := (y_i = m) ? 1 : −1. Thus, we have Φ_p := ∧_{i=1}^{n} ∧_{m=0}^{1} (p_im = const_im). Subformula Φ_N. We use variable N_l to represent the number of examples associated with leaf node l, and N_lm to represent those with output value m.
Thus, we have Φ_N := ∧_{l∈L} (N_l = Σ_{i=1}^{n} z_il) ∧ (N_lm = (1/2) Σ_{i=1}^{n} z_il (1 + p_im)). Subformula Φ_θ. According to Lines 5-8 of Algorithm 1, each leaf node has an output class label θ_l = argmax_{m∈{0,1}} N_lm. Since argmax cannot be directly encoded, we define a matrix of θ_lm variables in {0, 1}, where θ_lm = 1 means the output label of node l is m. By definition, only one θ_lm variable can be 1.

Subformula Φ_loss. Assume that m is the output label predicted by the leaf node l. The misclassification error, L_l, is equal to the number of examples associated with l, denoted N_l, minus the number of examples that have the most common label m, denoted max_{m∈{0,1}} N_lm. To avoid max/min in L_l = N_l − max_{m∈{0,1}} N_lm = min_{m∈{0,1}} (N_l − N_lm), we use the θ_lm variables and the constant n = |E| to rewrite the constraint. Thus, we have Φ_loss := ∧_{l∈L} ((L_l ≥ 0) ∧ ∧_{m∈{0,1}} (L_l ≥ N_l − N_lm − n(1 − θ_lm)) ∧ (L_l ≤ N_l − N_lm + n·θ_lm)).
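This linearization can be sanity-checked by enumerating the feasible L_l interval under the constraints, as we reconstruct them. For the label m with θ_lm = 1, the lower bound becomes active; for every other m, the upper bound becomes active, so a θ choice that does not pick the majority label is infeasible:

```python
def feasible_loss(counts, theta, n):
    """Feasible range of L_l under the linearized Phi_loss constraints:
         L_l >= 0
         L_l >= N_l - N_lm - n*(1 - theta_m)   for each label m
         L_l <= N_l - N_lm + n*theta_m         for each label m
    counts[m] = N_lm; theta is the one-hot label choice; n = |E|.
    Returns (lo, hi), or None if the constraints are infeasible."""
    N = sum(counts)  # N_l
    lo = max([0] + [N - c - n * (1 - t) for c, t in zip(counts, theta)])
    hi = min(N - c + n * t for c, t in zip(counts, theta))
    return (lo, hi) if lo <= hi else None

# A leaf with 3 negative and 5 positive examples, out of n = 8 total.
counts, n = [3, 5], 8
best = feasible_loss(counts, [0, 1], n)  # theta picks the majority label
```

At the optimum, minimizing L_l drives it to the lower end of the interval, i.e., L_l = N_l − max_m N_lm = 3 for this leaf.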

Encoding of the Fairness Requirement (Φ_fair)
Formula Φ_fair := Φ_Fs ∧ Φ_FM has two subformulas. Here, Φ_Fs encodes the fairness index and Φ_FM encodes the constraints on variables used in Φ_Fs.
According to Eq. 1 (Sect. 2.3), the fairness index is defined as F_s = Pr+_{¬f_s} / Pr+_{f_s}, where f_s is a sensitive feature such that f_s(x), for any input x ∈ E, may be 0 or 1 (e.g., female and male), while T(x) = + means the output generated by T is positive (e.g., a job is offered). According to the "80% rule", demographic parity is achieved if F_s is above 80%. In this work, our goal is to find a solution that (1) satisfies (F_s > 0.8) and, at the same time, (2) maximizes the value of F_s.
However, the definition of F_s shown in Eq. 1 has division operators, which are not supported by off-the-shelf MIO solvers. Furthermore, the divisor part of the equation varies even for a fixed set E of examples, which makes the encoding a challenging task. To overcome the challenge, we refine the definition as F_s = (S_0+ / S_0) / (S_1+ / S_1). (2) For each of the four components, we create a symbolic variable. Variable S_0 represents the number of examples whose sensitive feature has the value 0 (e.g., female for the gender feature f_1). Variable S_0+ represents the number of examples counted in S_0 that have the positive output (e.g., a job is offered). Variable S_1 represents the number of examples whose sensitive feature has the value 1 (e.g., male), and variable S_1+ represents the number of examples counted in S_1 that have the positive output.
Subformula Φ_Fs. We use Φ_Fs to enforce the 80% rule. Assuming S_0 > 0, S_0+ > 0, S_1 > 0, and S_1+ > 0, we encode the rule as (S_0+ × S_1 − 0.8 × S_0 × S_1+ ≥ 0). There are two advantages of this encoding. First, the resulting constraint can be solved by off-the-shelf MIO solvers, whereas a direct encoding of Eq. 2 cannot. Second, the value of (S_0+ × S_1 − 0.8 × S_0 × S_1+) increases as F_s increases; therefore, it can be used as part of the objective function, O_fair, to maximize F_s.
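The cross-multiplied form is equivalent to the ratio form whenever the denominators are positive, which is easy to verify numerically (the counts below are made up):

```python
def passes_80_rule(S0, S0p, S1, S1p):
    """Division-free form of F_s >= 0.8 used in Phi_Fs:
       (S0p/S0) / (S1p/S1) >= 0.8  <=>  S0p*S1 - 0.8*S0*S1p >= 0,
    assuming S0, S1, S1p > 0. S0p/S1p count positive outputs in each group."""
    return S0p * S1 - 0.8 * S0 * S1p >= 0

# E.g., 30 of 100 protected applicants hired vs. 35 of 100 others:
# the ratio is (30/100)/(35/100) ~ 0.857, so the rule holds.
```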
Subformula Φ_FM. We use Φ_FM to constrain the variables S_0, S_0+, S_1, and S_1+. Toward this end, we need to define the following variables:
- S_0i: We use variables S_0i ∈ {0, 1}^n to model whether the value of f_s(x_i) is 0. Thus, we require S_0i = 1 when f_s(x_i) = 0, and S_0i = 0 otherwise.
- S_0il+: We use variables S_0il+ ∈ {0, 1}^{n×|L|} to model, at each leaf node l ∈ L, whether x_i ∈ E belongs to the protected group and is given the positive output. Thus, we require S_0il+ = 1 when x_i is associated with l, S_0i = 1, and (θ_lm = 1) with (m = 1), and S_0il+ = 0 otherwise. Here, (θ_lm = 1) means the output label produced by the leaf node l is m, and (m = 1) means m is the positive output ("+").
- S_1i and S_1il+: We define variables S_1i and S_1il+ similarly to S_0i and S_0il+.
Putting It All Together. Recall that, in Sect. 3.3, we constrained the accuracy loss, L_l, used in the objective function O_accu, and in Sect. 3.1 we defined the objective function O_tree, which is used to minimize the tree size and thus reduce over-fitting. As for the objective function O_fair (Sect. 3.1), we define the fairness score as F = (S_0+ × S_1 − 0.8 × S_0 × S_1+). Thus, we have the entire MIO problem: minimize O := O_accu + α·O_tree − β·O_fair subject to Φ := Φ_tree ∧ Φ_accu ∧ Φ_fair.

Generalization and Performance Enhancement
In this section, we first explain how our method relates to various existing algorithms (Sect. 4.1). Next, we present techniques for speeding up constraint solving while maintaining the quality of the solution (Sect. 4.2). Finally, we show that, beyond demographic parity, our method can encode other group fairness metrics, such as equal opportunity and equalized odds (Sect. 4.3).

Relating to Existing Algorithms
Recall that our method performs feature selection by symbolically encoding a depth-K binary tree, to perform a bounded look-ahead search for the optimal feature using the MIO solver. For ease of presentation, let us call the selected feature depth-K optimal, where K ∈ {1, . . . , +∞}.
Depth-1 Optimal. When K " 1, the tree consists of the root node only and, as a result, look-ahead search is disabled. In this case, our method is the same as a purely greedy search method. Depending on whether fairness is encoded, there are two cases.
- Without the fairness component, our method would compute the depth-1 optimal feature that minimizes only the tree size and the accuracy loss. This is similar to mainstream decision tree learning algorithms such as CART.
- With the fairness component, our method would compute the depth-1 optimal feature that minimizes the tree size and the accuracy loss, and maximizes the fairness score. This is similar to IGCS [24], a discrimination-aware technique for learning decision trees.
Our experimental evaluation (in Sect. 5) shows that neither CART nor IGCS is effective in improving fairness, especially for larger datasets, primarily due to their inability to look beyond the current node.

Depth-∞ Optimal.
When K is set to a sufficiently large number, our method is able to find the globally optimal feature for not only the root node, but also all other nodes in the decision tree. Thus, it would compute the entire decision tree in one shot.
- Without the fairness component, our method would act like the technique introduced by Bertsimas and Dunn [7], which laid the groundwork for encoding an optimal classification tree as a monolithic MIO problem.
- With the fairness component, our method would act like MIP, a fair learning technique introduced by Aghaei et al. [1].
Our experimental evaluation (in Sect. 5) shows that the computational overhead of the monolithic MIO approach or MIP is too high to be practically useful. We discuss how to set the value of K in our method in the next subsection.

Performance Enhancement
We propose two techniques for speeding up our method: (1) choosing the look-ahead depth K adaptively and (2) sampling the training examples.

Choosing the K Value Adaptively. There is a trade-off between looking further ahead and reducing the constraint solving time. Given n = |E| training examples and 2^K leaf nodes in a depth-K binary tree, the number of decision variables (such as S_0il+) would be (n × 2^K). Since mixed-integer optimization is NP-hard, the complexity of constraint solving is O(2^{n×2^K}). Empirically, we have found that Gurobi, a state-of-the-art solver, may take 1-2 hours to solve a problem for n = 1000 training examples and tree depth K = 7; this is consistent with prior experimental results, e.g., Bertsimas and Dunn [7]. Unfortunately, supervised learning datasets in practice often bring as many as 50,000 training examples to the root node of a decision tree, although the number decreases gradually and may reach 0 for some leaf nodes. Therefore, setting K to 7, or any predetermined value, would not work well in practice. Instead, we propose to set the K value adaptively. Given a time-out limit (T/O) for learning a decision tree, we start with a relatively small K value, say K = 2, to synthesize a decision tree. Then, we increase the K value to synthesize a better decision tree. We keep increasing the K value as long as the time limit is not yet reached and the quality of the decision tree improves. We measure the quality of the tree using the value of the objective function, O, which consists of the tree size, the accuracy loss, and the fairness score.
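The adaptive-K loop can be sketched as follows. Here `solve_at_depth` stands in for the MIO-based synthesis at a fixed depth and is assumed to return a (tree, objective-value) pair; both the name and the stopping policy details are our illustrative choices:

```python
import time

def synthesize_adaptive(solve_at_depth, time_limit, k0=2):
    """Sketch of the adaptive-K loop: grow the look-ahead depth while the
    time budget holds and the objective value O (smaller is better, since
    O combines accuracy loss, tree size, and negated fairness) improves."""
    start = time.monotonic()
    best_tree, best_obj = solve_at_depth(k0)
    K = k0 + 1
    while time.monotonic() - start < time_limit:
        tree, obj = solve_at_depth(K)
        if obj >= best_obj:      # no improvement: stop growing K
            break
        best_tree, best_obj = tree, obj
        K += 1
    return best_tree, best_obj

# Fake solver for illustration: quality improves up to K = 3, then stalls.
objs = {2: 5.0, 3: 4.0, 4: 4.0}
def fake_solver(K):
    return ('tree@%d' % K, objs.get(K, 10.0))

best = synthesize_adaptive(fake_solver, time_limit=60.0)
```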
Sampling the Training Examples. We propose to reduce the size of the constraints in Φ by sampling the training examples in E before using them to construct the formula Φ. Our experience shows that sampling can reduce the value of n significantly and, at the same time, maintain the quality of the MIO solution. For the adult dataset, which has 48,842 training examples, even with a small K value, the symbolic constraints would take more than 1 hour to solve without sampling. Our sampling method is not directly applicable to the original MIP approach because, if sampled data were used as input, the MIP solving procedure would permanently discard the rest of the data, which would significantly degrade its accuracy. In contrast, sampling in our method only causes the rest of the data to be ignored temporarily (for this particular node); for the child nodes in the subtree, the entire data will still be used in the subsequent computation.

Encoding Other Group Fairness Metrics
Beyond demographic parity, there are two other popular metrics for group fairness: equal opportunity and equalized odds.
Equal opportunity requires the difference in true-positive rates between the two groups, P_{f_s=1, f_c=1} − P_{f_s=0, f_c=1}, to be (close to) zero, where f_c denotes the correct (ground-truth) class. (4) In our method, Eq. 4 may be encoded as Φ_eq := (S_1+ × S_0 − S_0+ × S_1 − ε × S_0 × S_1 ≤ 0), where ε is a small tolerance, to replace Φ_Fs in the fairness requirement Φ_fair := Φ_Fs ∧ Φ_FM. The definitions of the variables S_0, S_0+, S_1, and S_1+ are analogous to those in Sect. 3.4. Similarly, we can define the fairness decision variables S_0i, S_0il+, S_1i, and S_1il+. For example, the value of S_0i is set to 1 if f_s(x_i) = 0 ∧ f_c(x_i) = 1, and is set to 0 otherwise.
Since Eq. 5 can be encoded similarly to Eq. 4, the details are omitted for brevity.
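As a concrete reference point, the equal-opportunity gap of Eq. 4 can be computed directly on a tree's predictions; a minimal sketch, with illustrative names and data (assuming both groups contain at least one positive example):

```python
def equal_opportunity_gap(preds, sensitive, labels):
    """Equal opportunity: |P(T(x)=1 | f_s=1, y=1) - P(T(x)=1 | f_s=0, y=1)|,
    i.e., the gap in true-positive rates between the two groups.
    preds[i] = T(x_i), sensitive[i] = f_s(x_i), labels[i] = y_i."""
    def tpr(group):
        pos = [p for p, s, y in zip(preds, sensitive, labels)
               if s == group and y == 1]
        return sum(pos) / len(pos)
    return abs(tpr(1) - tpr(0))
```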

Experiments
We have implemented our method, SFTree, using Python, Julia 1.5.1 [15], and Gurobi 9.03 [21], where Julia is used to encode the MIO constraints and Gurobi is used to solve them. We compared SFTree with three state-of-the-art techniques: CART, a mainstream algorithm for decision tree learning; IGCS, a discrimination-aware learning algorithm; and MIP, a monolithic MIO approach to learning fair trees. We conducted all experiments on macOS Catalina with a 2.4 GHz 8-core CPU and 64 GB RAM.
Benchmarks. Our evaluation uses six popular benchmarks from the fairness literature. They are divided into three small datasets and three large datasets.
Since the small datasets can be handled by the less-scalable but more-accurate MIP to obtain globally optimal solutions, they are useful in evaluating the quality of our method. The large datasets, in contrast, are out of the reach of MIP and thus useful in evaluating the scalability of our method.
-  During learning, we apply the standard 5-fold cross validation except for Compas, to which we apply 4-fold cross validation to be consistent with prior work.
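The fold assignment can be sketched in plain Python; this is an illustrative helper (not our evaluation harness), with k = 5 for most datasets and k = 4 for Compas:

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds of near-equal
    size; fold i serves as the test set while the remaining k-1 folds
    form the training set."""
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

In practice one would shuffle the indices first; the sketch omits shuffling to keep the partitioning logic visible.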

Results on the Small Benchmarks.
We compare the quality of the decision trees learned by our method and the three existing methods on the small benchmarks. The results are shown in Table 1. Since the datasets are small, MIP is able to compute the best solutions: without violating the 80% Rule, it maximizes accuracy. The results show that, overall, CART has the best accuracy but the worst fairness score. IGCS improves over CART, but still violates the 80% Rule in 5 out of the 15 cases. In contrast, SFTree satisfies the fairness requirement in all 15 cases and, at the same time, achieves high accuracy. Furthermore, it runs more than 10 times faster than MIP.
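The 80% Rule check used in this comparison can be sketched as follows; the helper name and rate-based signature are illustrative, not taken from our implementation:

```python
def satisfies_80_rule(pos_rate_group0, pos_rate_group1):
    """Demographic parity as the 80% Rule: the ratio of the lower
    positive-prediction rate to the higher one must be at least 0.8."""
    lo, hi = sorted([pos_rate_group0, pos_rate_group1])
    if hi == 0:
        return True  # neither group receives positive predictions
    return lo / hi >= 0.8
```

A learned tree passes the fairness check when the positive-prediction rates of the two demographic groups, measured on the test fold, satisfy this ratio.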
Results on the Large Benchmarks. We use these benchmarks to evaluate both the quality and the scalability of our method. Table 2 shows the result of the quality comparison, which has the same format as Table 1. CART has the highest accuracy but fails to satisfy the fairness requirement in all 14 cases. Although IGCS is somewhat effective on the small benchmarks in Table 1, here it fails to satisfy the fairness requirement in 12 of the 14 cases. In contrast, our method is the only one that satisfies the fairness requirement in all cases and, at the same time, has accuracy comparable to CART and IGCS. Table 3 shows the execution time comparison. MIP times out in all 14 cases (T/O = 3 h), while our method finishes each within 1 h. Thus, our method runs more than 10 times faster than MIP. Although CART and IGCS are faster, they are equivalent to depth-1 look-ahead search in our method and, due to this limited ability to look ahead, they almost never satisfy the fairness requirement.

Evaluating the Impact of the K-value. We have also evaluated how the K value affects the quality of the learned decision tree, using the Student Fold1 benchmark. Since the benchmark is small enough, we set K to fixed values 1, . . . , 7 instead of letting it adapt, so we can assess the impact. Figure 4 shows the result, where the x-axis is accuracy and the y-axis is the fairness score; thus, the closer a dot is to the top-right corner, the higher the overall quality. The result shows that the quality of our solution increases dramatically as the K value increases from 1 to 7, due to the increasingly deeper look-ahead search.
Summary of Additional Results. While we have also evaluated the scalability of our method with respect to the dataset size, we omit the results for brevity and instead provide a summary. What we have found is that, as the dataset gets larger, the execution time of our method increases modestly at first, and then stops increasing after a threshold is reached. This is due to the performance enhancement techniques presented in Sect. 4. Thus, our method does not have scalability issues. In fact, among all four methods, SFTree is the only one that consistently produces fair and accurate decision trees for datasets with more than 40,000 training samples.

Related Work
At a high level, our method can be viewed as an in-processing approach to mitigating bias in machine learning models. Broadly speaking, there are three approaches: pre-processing [17,25,31], in-processing [11,19,24,30,33] and post-processing [18,22], depending on whether the focus is on de-biasing the training data, the learning algorithm, or the classification output.
Since the pre-processing approach focuses on de-biasing the training data [17,25,31], it is applicable to any machine learning model; however, it cannot remove bias introduced by the learning algorithms, which is problematic because, even if the training data is not biased, learning algorithms may introduce new bias. While the post-processing approach can remove such bias by modifying the predicted output [18,22], the result is often hard to predict and difficult to explain. In contrast, our method does not have these limitations.
Compared to other in-processing techniques for learning fair decision trees, including IGCS [24] and similar greedy search methods [11,19,30,33], our method has the advantage of being more systematic and quantifiable. This is because we encode both accuracy and fairness requirements explicitly as numerical constraints. Thus, it is easy to explain, at every step, why one feature is chosen over another, and to quantify how much more effective it is in minimizing bias and accuracy loss at the same time. Compared to the monolithic constraint solving approach, including MIP [1] and similar methods [5,35], our method has the advantage of being significantly more scalable.
Our method differs from the recent work of Torfah et al. [32] in that their method uses a small training set sampled from a known distribution and thus does not need techniques such as incremental solving. Furthermore, their method assumes the decision predicates are given, whereas in our method the predicates are synthesized from real-valued features. Finally, our fairness constraint also differs from their explainability constraint.
Besides synthesis, there are techniques for improving fairness by repairing an existing machine learning model [4,9,20,26], and techniques for verifying that an existing machine learning model is indeed fair, e.g., by using probabilistic analysis methods [3,6,28]. While these techniques are related, they differ from our method in that they cannot synthesize new decision trees from training data while ensuring the decision trees are fair by construction.

Conclusion
We have presented a method for synthesizing a fair and accurate decision tree by formulating feature selection as a series of mixed-integer optimization problems and solving them using an off-the-shelf constraint solver. The method is flexible in expressing group fairness metrics including demographic parity, equal opportunity, and equalized odds. On popular datasets, it is able to learn decision trees that satisfy the fairness requirement and, at the same time, achieve a high classification accuracy.