1 Introduction

Decision trees are one of the most widely used machine learning models in statistical analysis, data mining and decision making. Compared to other predictive models such as deep neural networks, decision trees have the advantage of being easily understandable by humans, which makes them a favorite building block in systems that require interpretability [34]. However, when they are used to make socially sensitive decisions in business, finance and law enforcement, decision trees may introduce bias against certain groups [16]. In this context, a widely used group fairness metric is demographic parity [11, 38], also known as the 80% rule [8]. Bias against demographic groups, in general, comes from two sources. First, historical data used to learn models may be biased. Second, learning algorithms may be biased even if they operate on unbiased data.

State-of-the-art decision tree learning algorithms such as CART and C4.5 [10, 29], which are the ones used by popular machine learning toolkits, rely on a greedy search technique that is optimized solely for high learning speed and classification accuracy. Since they do not consider fairness as an optimization requirement at all, they often produce decision trees that are severely biased. To mitigate the bias, modifications have been proposed to make the greedy search discrimination-aware  [24] (e.g., IGCS). Unfortunately, these modifications are not always effective as shown by our own experimental evaluation in Sect. 5 and, more importantly, the impact of ad hoc modifications is often unpredictable and difficult to explain.

Meanwhile, there is a line of work in operational research that formulates decision tree learning as a mixed-integer optimization (\({\texttt {MIO}}\)) problem [7, 35]. Given a finite set \(\mathcal {F}\) of decision attributes, or features, and a maximum tree depth K, the set of all possible decision trees is captured symbolically as a set of numerical constraints, which is then fed to a solver to compute the globally-optimal decision tree. While optimality was defined initially to minimize the tree size and accuracy loss [7, 35], later on, fairness was added as a goal of the optimization [1, 5]. However, the approach remains largely theoretical due to its limited scalability: since the entire decision tree must be encoded as a monolithic \({\texttt {MIO}}\) problem, only small training datasets (with sample sizes in the 1000s) and small decision trees (with depths up to 4 or 5) can be handled [2, 7].

To overcome the limitations of the existing approaches, we propose an iterative constraint solving technique for synthesizing decision trees in a practically efficient fashion while simultaneously optimizing for fairness and accuracy. Instead of encoding the decision tree as a monolithic \({\texttt {MIO}}\) formula, we break it down to a series of small steps to avoid the scalability bottleneck. Specifically, starting from the root node, we use constraint solving to conduct a depth-bounded look-ahead search at each level of the decision tree, to compute the best feature. Within the look-ahead search, we encode both fairness and accuracy requirements explicitly as numerical constraints, to make the fairness-accuracy trade-off not only predictable but also easy to explain.

Fig. 1.
figure 1

SFTree – our symbolic method for synthesizing a fair decision tree.

The overall flow of our method, SFTree, is shown in Fig. 1. Given a set of training examples (\(\mathcal {E}\)), a set of features (\(\mathcal {F}\)), and a sensitive feature (\(f_s\in \mathcal {F}\)) as input, SFTree returns the synthesized decision tree (\(\mathcal {T}\)) as output. Internally, SFTree encodes the hierarchical structure of a partial decision tree symbolically starting from the current node and its training set \(\mathcal {E}\), covering a fixed number of tree levels. Then, it uses an \({\texttt {MIO}}\) solver to compute the optimal feature, \(f^*\), that minimizes the bias against the protected group, the classification error, and the tree size. Assuming that \(f^* \in \{0,1\}\) is a Boolean predicate, the training set is partitioned into subsets \(\mathcal {E}_{f^*}\) and \(\mathcal {E}_{\lnot f^*}\), one for each child node. Our method iteratively partitions the child nodes until the training subset becomes empty, or all examples in \(\mathcal {E}\) belong to the same class, or all features in \(\mathcal {F}\) have been used.

To demonstrate its effectiveness, we have implemented SFTree and evaluated it on six supervised learning datasets, consisting of three small datasets and three large ones. Since the small datasets can be handled even by the monolithic \({\texttt {MIO}}\) approach (named MIP [1]) to obtain globally-optimal and fair solutions, we used them to evaluate the quality of decision trees learned by our method. The large datasets, which are out of the reach of MIP, were used to evaluate scalability. For comparison, we also evaluated CART [27], a mainstream decision tree learning algorithm, and IGCS [24], a discrimination-aware learning algorithm.

The experimental results show that, among all methods (CART, IGCS, MIP, and SFTree), SFTree produces the best overall solution in terms of fairness and accuracy. In contrast, CART produces unfair decision trees in most cases and, while IGCS does well on the small datasets, it produces mostly unfair decision trees for the large datasets. Neither CART nor IGCS is effective in satisfying the well-known 80% Rule [8] for demographic parity [11, 38]. In contrast, SFTree satisfies the 80% Rule in all cases. In terms of scalability, MIP fails to handle any of the large datasets, while SFTree handles all of them. In fact, among all four methods, SFTree is the only one that produces fair and accurate decision trees for datasets with \({>}{40,000}\) training samples.

To sum up, this paper makes the following contributions:

  • We propose an iterative constraint-solving method for synthesizing fair decision trees:

    • By formulating feature selection as a series of mixed integer optimization subproblems, we make the constraints efficiently solvable.

    • By encoding fairness and accuracy explicitly as symbolic constraints, we make the trade-off quantifiable and easy to explain.

  • We demonstrate the advantages of SFTree over existing approaches (CART, IGCS, and MIP) using six popular datasets in the fairness literature.

The remainder of this paper is organized as follows. In Sect. 2, we review the basics of decision tree learning and group fairness. In Sect. 3, we present our method. In Sect. 4, we present generalization and performance enhancement techniques. In Sect. 5, we present our experimental results. After reviewing the related work in Sect. 6, we give our conclusions in Sect. 7.

2 Background

2.1 Training Dataset \(\mathcal {E}\)

The training dataset is a finite set of examples, \(\mathcal {E} = \{(x_i, y_i)\}\), where \(i \in \mathcal {N}\) is the index, input \(x_i = \langle f_1,\ldots ,f_k\rangle \) is a vector of features, and output \(y_i\) is a class label. Let \(\mathcal {F}\) be the set of all features. For ease of comprehension, let us assume for now that all input features and the output class label are Boolean. In this case, every input \(x_i\in \{0,1\}^k\) is a k-bit vector in the feature space, the output \(y_i \in \{0,1\}\) is a bit, and a decision tree trained using \(\mathcal {E}\) is a k-input Boolean function. To make the presentation clear, we may also use \(y_i\in \{-,+\}\) instead of \(y_i\in \{0,1\}\) as the output, where − means “no” and \(+\) means “yes”.

Figure 2 shows a training set \(\mathcal {E}\), where each row in the table represents an example. The input features are a job candidate’s gender (0 = Female, 1 = Male), college rank (0 = Low, 1 = High), experience (0 = No, 1 = Yes), and interview score (0 = Not-Good, 1 = Good), while the output shows whether the job is offered (0 = No, and 1 = Yes). At the root of the decision tree, for instance, the input goes to the left branch when \((f_4=0)\) and to the right branch when \((f_4=1)\). The example illustrates three important notions associated with the training set: (1) partition of \(\mathcal {E}\) (2) entropy, and (3) conditional entropy.

Fig. 2.
figure 2

An example training dataset \(\mathcal {E}\) (left) and the related decision tree \(\mathcal {T}\) (right).

Partition. Given a set \(\mathcal {E}\) and a feature \(f_j\), we can partition \(\mathcal {E}\) into subsets \(\mathcal {E}_{f_j=0}\) and \(\mathcal {E}_{f_j=1}\), or \(\mathcal {E}_{\lnot f_j}\) and \(\mathcal {E}_{f_j}\), respectively, in shorthand notation. Here, \(\mathcal {E}_{\lnot f_j} = \{(x_i,y_i)\in \mathcal {E} \mid f_j(x_i)=0\}\) consists of examples whose \(f_j\) is 0, and \(\mathcal {E}_{f_j} = \{(x_i,y_i)\in \mathcal {E} \mid f_j(x_i)=1\}\) consists of examples whose \(f_j\) is 1. By definition, we have \(\mathcal {E}_{\lnot f_j} \subseteq \mathcal {E}\) and \(\mathcal {E}_{f_j} \subseteq \mathcal {E}\), \(\mathcal {E}_{\lnot f_j}\cap \mathcal {E}_{f_j} = \emptyset \) and \(\mathcal {E}_{\lnot f_j}\cup \mathcal {E}_{f_j}=\mathcal {E}\).

For our example in Fig. 2, partitioning the dataset by gender (\(f_1\)) results in subsets \(\mathcal {E}_{f_1 = F} = \mathcal {E}_{\lnot f_1} = \{(x_1,y_1) (x_4,y_4) (x_7,y_7)\}\) and \(\mathcal {E}_{f_1 = M} = \mathcal {E}_{f_1} = \{(x_2,y_2) (x_3,y_3) (x_5,y_5) (x_6,y_6)\}\).

Entropy. The diversity (or purity) of a set \(\mathcal {E}\) may be measured by Shannon entropy. Let \(|\mathcal {E}^+|\) be the number of examples in \(\mathcal {E}\) with positive output label, and \(|\mathcal {E}^{-}|\) be the number of examples with negative output label. The percentage of positive examples is \(|\mathcal {E}^{+}|/|\mathcal {E}|\), and the percentage of negative examples is \(|\mathcal {E}^{-}|/|\mathcal {E}|\). Thus, the entropy is \(H(\mathcal {E}) = -\frac{|\mathcal {E}^+|}{|\mathcal {E}|}log(\frac{|\mathcal {E}^+|}{|\mathcal {E}|}) -\frac{|\mathcal {E}^-|}{|\mathcal {E}|}log(\frac{|\mathcal {E}^-|}{|\mathcal {E}|})\).

For our example in Fig. 2, since \(|\mathcal {E}^{-}|=3\) and \(|\mathcal {E}^{+}|=4\), the entropy is \(H(\mathcal {E}) = -\frac{3}{7}log(\frac{3}{7}) - \frac{4}{7}log(\frac{4}{7}) \approx 0.985\).

Conditional Entropy. Given a partition of the set \(\mathcal {E}\) by the feature \(f_j\), the entropy of each subset, \(\mathcal {E}_{\lnot f_j}\) or \(\mathcal {E}_{f_j}\), is defined similarly. For our example, since \(\mathcal {E}_{\lnot f_1}\) has 2/3 negative examples and 1/3 positive examples, the entropy is \(H(\mathcal {E}_{\lnot f_1}) = -\frac{2}{3} log(\frac{2}{3}) - \frac{1}{3} log(\frac{1}{3}) =0.918\). Similarly, since \(\mathcal {E}_{f_1}\) has 1/4 negative examples and 3/4 positive examples, the entropy is \(H(\mathcal {E}_{f_1}) = -\frac{1}{4} log(\frac{1}{4}) - \frac{3}{4} log(\frac{3}{4}) =0.811\).

The conditional entropy of \(\mathcal {E}\), with respect to \(f_j\), is defined as follows:

$$ H(\mathcal {E} \mid f_j) = \frac{|\mathcal {E}_{\lnot f_j}|}{|\mathcal {E}|} H(\mathcal {E}_{\lnot f_j}) + \frac{|\mathcal {E}_{ f_j}|}{|\mathcal {E}|} H(\mathcal {E}_{ f_j}) $$

For our running example, since there are 3 female and 4 male candidates, we have \(|\mathcal {E}_{\lnot f_1}|/|\mathcal {E}|=3/7\) and \(|\mathcal {E}_{f_1}|/|\mathcal {E}|=4/7\). Thus, the conditional entropy is \(H(\mathcal {E} \mid f_1) = \frac{3}{7} H(\mathcal {E}_{\lnot f_1}) + \frac{4}{7} H(\mathcal {E}_{f_1})\) \(\approx 0.857\).

The difference between \(H(\mathcal {E})\) and \(H(\mathcal {E} \mid f_j)\) is called the information gain, a metric for evaluating how effective \(f_i\) is in separating positive examples from negative examples in \(\mathcal {E}\). For our example, since \(H(\mathcal {E}) \approx 0.985\) and \(H(\mathcal {E} \mid {f_1})\approx 0.857\), the information gain (of partitioning \(\mathcal {E}\)) by gender (\(f_1\)) is \(0.985-0.857= 0.128\). In contrast, the information gain by interview (\(f_4\)) is \(0.985-0.516= 0.469\). Thus, \(f_4\) is more effective as a decision attribute.

Real-Valued Features. It is important to note that, while the above examples use Boolean features, our method is more general in that it allows all features have real values, i.e., \(x_i \in [0,1]^k\) instead of \(x_i\in \{0,1\}^k\). We accomplish this by applying one-hot encoding to any categorical feature and normalizing any real-valued feature to the [0, 1] domain. Thus, the branch predicates become \((f_j<b_v)\) and \((f_j\geqslant b_v)\), instead of \((f_j=0)\) and \((f_j=1)\), where \(b_v\in (0,1]\) is a threshold computed by our method. For example, if \(f_j\) is the (normalized) salary and \(b_v=0.5\), the branch predicates are \((f_j<0.5)\) and \((f_j\geqslant 0.5)\).

2.2 Decision Tree Learning

A decision tree \(\mathcal {T}\) is a binary tree consisting of a set of nodes and a set of edges. Let the set of nodes be \(\mathcal {V}\cup \mathcal {L}\), where \(\mathcal {V}\) is the subset of branch nodes (including the root) and \(\mathcal {L}\) is the subset of leaf nodes. Let E be the set of edges between these nodes. A path in \(\mathcal {T}\) is a sequence of nodes and edges, denoted \(v_0,e_1,v_1\ldots v_n, e_n, l_n\), where \(v_0\) is the root, \(l_n\) is a leaf node, \(v_1\ldots v_{n}\) are the internal nodes, and \(e_1,\ldots ,e_n\) are the edges.

Each edge has a branch condition. The edge is activated only if the condition holds for a given input x. In Fig. 2, for example, the left-most path of the decision tree has the condition \(f_4(x)=0\) and output offer \(=0\), while the right-most path has the condition \((f_4(x)=1)\wedge (f_1(x)=M)\) and output offer \(=1\).

Given a training set \(\mathcal {E}=\{(x_i,y_i)\}\), where \(x_i\) is an input and \(y_i\) is the known output, mainstream algorithms aim to learn a decision tree \(\mathcal {T}\) that minimizes the classification error. They also aim to minimize the tree size which, in general, allows \(\mathcal {T}\) to generalize well on the test examples.

The Baseline Algorithm. Algorithm 1 shows the top-level procedure of these mainstream algorithms. It takes the training set \(\mathcal {E}\) and the feature set \(\mathcal {F}\) as input, and returns a decision tree (\(\mathcal {T}\)) as output. These mainstream algorithms use a greedy method to recursively select decision attributes from \(\mathcal {F}\) and use them to partition the training set \(\mathcal {E}\). At each step, it selects the best feature \(f^*\) using the subroutine .

figure b

In CART, for example, is entropy-based, to maximize the information gain of partitioning \(\mathcal {E}\) by f as shown in Algorithm 2. While this is fast and often leads to high classification accuracy, it does not consider fairness and thus often produces biased decision trees. In this work, we use iterative constraint solving to overcome the limitation.

After \(f^*\) is computed by , Algorithm 1 uses it to partition the training set \(\mathcal {E}\), and recursively process the two subsets: DTL\((\mathcal {E}_{f^* = 0}, \mathcal {F} \setminus \{ f^*\})\) and DTL\((\mathcal {E}_{f^* = 1}, \mathcal {F} \setminus \{ f^*\})\). The recursion ends when

  • all training examples in the set \(\mathcal {E}\) have the same class label (Lines 3–4);

  • there are no features left in \(\mathcal {F}\) to split \(\mathcal {E}\) further (Lines 5–6); or

  • the set \(\mathcal {E}\) is empty (Lines 7–8).

figure e

2.3 Fairness Metric

Given a training set \(\mathcal {E}\) and a sensitive feature \(f_s\in \mathcal {F}\), e.g., race or gender, the goal is to construct a decision tree \(\mathcal {T}\) that maximizes classification accuracy while minimizing bias. The metric concerned in this work, demographic parity [11, 38], comes from the legal guideline in the United States for avoiding employment discrimination. Known as the 80% rule [8], it says the percentage at which candidates from one protected group are offered jobs should be at least 80% of the percentage at which candidates from another group are offered jobs.

This is formalized using the fairness index, \(F_s(\mathcal {T}, \mathcal {E})\), defined as follows:

$$\begin{aligned} F_{f_s}(\mathcal {T}, \mathcal {E}) = \frac{Pr[\mathcal {T}(x)=+ \mid f_s(x)=0]}{Pr[\mathcal {T}(x)=+ \mid f_s(x)=1]} \end{aligned}$$
(1)

where \(Pr[\mathcal {T}(x)=+ \mid f_s(x)=0]\), or \(Pr_{\lnot f_s}^{+}\) in short, is the probability of positive examples under the condition \(f_s(x)=0\), and \(Pr[\mathcal {T}(x)=+ \mid f_s(x)=1]\), or \(Pr_{fs}^+\) in short, is the probability of positive examples under the condition \(f_s(x)=1\). Thus, we have \(Pr_{\lnot f_s}^+ = \frac{|\{x ~\in ~ \mathcal {E} ~\mid ~ f_s(x)=0 ~\wedge ~ \mathcal {T}(x)=+ \}|}{|\{ x ~\in ~ \mathcal {E} ~\mid ~ f_s(x)=0 \}|}\) and \(Pr_{f_s}^+ = \frac{|\{ x ~\in ~ \mathcal {E} ~\mid ~ f_s(x)=1 ~\wedge ~ \mathcal {T}(x)=+ \}|}{|\{x ~\in ~ \mathcal {E} ~\mid ~ f_s(x)=1\}|}\).

Demographic parity means \(0.8 \leqslant F_{s}(\mathcal {T}, \mathcal {E}) \leqslant (1/0.8)=1.25\). For the example in Fig. 2, since \(F_{f_1}(\mathcal {T}, \mathcal {E})=0.44\) for gender (\(f_1\)), the tree fails to satisfy the 80% rule due to bias against female. The bias is explicit in that \(f_1\) is actually used in the edge labels of the right most two paths of the decision tree. However, even if \(f_1\) is not used in \(\mathcal {T}\) explicitly, \(\mathcal {T}\) may still be biased against female, for example, if other non-sensitive features (or their combinations) are statistically correlated to \(f_1\) and, as a result, introduce bias against female. This is the reason why mitigating bias during decision tree learning is a challenging task.

3 Our Method

To minimize the bias and, at the same time, maximize the classification accuracy, we proposed to follow the top-level procedure in Algorithm 1, but formulate feature selection as a series of mixed-integer optimization (\({\texttt {MIO}}\)) subproblems.

As shown in Algorithm 3, each of our \({\texttt {MIO}}\) subproblems consists of an objective function O and a constraint \(\varPhi \), and the solution is an assignment of the numerical variables (shared by O and \(\varPhi \)) that minimizes O while satisfying \(\varPhi \). In the remainder of this section, we present our symbolic encoding of the objective function, O, and the constraint, \(\varPhi \), respectively.

figure f

3.1 The Objective Function O

We define the function as \(O := O_{accu} + \alpha O_{tree} - \beta O_{fair} \), consisting of components for accuracy loss (\(O_{accu}\)), tree size (\(O_{tree}\)), and fairness score (\(O_{fair}\)), respectively. The constants, \( \alpha \) and \(\beta \), are used to make trade-offs. In our implementation, \(\alpha \) is fixed to \(1/(2^{K+1}-2)\) while \(\beta \) is the optimal value in [0, 1] selected using n-fold cross-validation.

Specifically, we test the values 0.02, 0.04, 0.06, \(\ldots \) to 1.00 and, for each fold of the dataset, we compute the objective function and choose \(\beta \) with the minimal objective value. In general, a bigger \(\beta \) means more fairness. Our experiments show that, as \(\beta \) gets larger, \(O_{fair}\) remains constant initially and then starts increasing while \(O_{accu}\) remains constant, and then \(O_{accu}\) starts increasing.

Since the decision tree structure is not known a priori, we encode a complete binary tree while allowing all branch and leaf nodes to be activated or de-activated. Recall that \(\mathcal {L}\) is the subset of leaf nodes, \(\mathcal {V}\) is the subset of branch nodes, \(l\in \mathcal {L}\) denotes a leaf node, and \(v\in \mathcal {V}\) denotes a branch node.

Tree Size (\(O_{tree}:=\sum \nolimits _{v\in \mathcal {V}}{p_v}\)). We assign a variable \(p_v\) to each branch node \(v\in \mathcal {V}\), to indicate if a feature is used to split v. Thus, \(p_v=1\) means v is split, while \(p_v=0\) means v is not split. To get a valid decision tree, \(p_v\) must be constrained also by formula \(\varPhi \) (Sect. 3.2) . Assuming the number of \(p_v\) variables is \(|\mathcal {V}|\), the tree size is the number of \(p_v\) variables with value 1.

Accuracy Loss (\(O_{accu}:= \frac{1}{|\mathcal {L}|}\sum \nolimits _{l \in \mathcal {L}}^{}{L_l}\)). We assign a variable \(L_l\) to each leaf node \(l\in \mathcal {L}\) to represent the misclassification error at l. Since we start with a complete tree, each leaf node corresponds to a distinct path. The actual value of \(L_l\) is defined by formula \(\varPhi \) (Sect. 3.3). Assuming the number of \(L_l\) variables is \(|\mathcal {L}|\), the accuracy loss is measured by averaging the \(L_l\) values.

Fairness Score (\(O_{fair}:=F\)). We assign a variable F to represent the overall fairness score of the decision tree. The value of F is defined by formula \(\varPhi \) (Sect. 3.4) according to the definition of demographic parity.

Next, we present our encoding of formula \(\varPhi := \varPhi _{tree} \wedge \varPhi _{accu} \wedge \varPhi _{fair}\), where \(\varPhi _{tree}\) encodes the hierarchical structure of the tree, \(\varPhi _{accu}\) encodes the accuracy requirement, and \(\varPhi _{fair}\) encodes the fairness requirement. They share variables with \(O_{tree}\), \(O_{accu}\) and \(O_{fair}\) in the objective function, such as \(p_v\), \(L_l\), and F. Note that, since the constraint will be solved by an off-the-shelf \({\texttt {MIO}}\) solver, \(\varPhi \) must be encoded as a conjunction of equality/inequality constraints. If logical-or operators are needed, they must be converted to equality/inequality operators.

3.2 Encoding of the Decision Tree (\(\varPhi _{tree}\))

Given a node, which may be the root of the decision tree under construction, or any of its branch nodes, we consider a depth-K complete binary tree rooted at that node. Since it is a complete binary tree, there are precisely \(T_K = 2^{K+1}-1\) nodes with indices \(1 \dots T_K\) and, for any node n, the left and right child nodes have indices 2n and \(2n+1\), respectively. Furthermore, the set of leaf nodes is \(\mathcal {L}= \{2^K, 2^K+1\dots 2^{K+1}-1\}\), where \(|\mathcal {L}|=2^K\), and the set of branch nodes is \(\mathcal {V}= \{1, 2 \dots 2^K-1\}\), where \(|\mathcal {V}|=2^K-1\).

  • Every leaf node \(l \in \mathcal {L} \) has an output class label, and the path from root to l represents a classification rule, which assigns any input x that goes through the path to the output class.

  • Every branch node \(v \in \mathcal {V}\) has a vector \(w_v\) of bits for selecting the feature. Thus, at most one bit in \(w_v\) is 1, and \(w_v[i]=1\) means feature \(f_i\) is selected. For input x, the value of the selected feature is \(f_i(x) = w_v^T x\).

  • When node v is split by a feature, its outgoing edges are labeled (\(w_v^T x <b_v\)) and (\(w_v^T x \geqslant b_v\)), respectively. Here, \(b_v\in (0,1]\) is a symbolic threshold. When \(f_i(x) = w_v^T x \) is a Boolean feature and \(b_v=1\), for example (\(w_v^T x <1\)) means \(f_i(x) = 0\), and (\(w_v^T x \geqslant 1\)) means \(f_i(x)=1\).

Figure 3 shows a depth-2 binary tree whose branch nodes are colored in teal and leaf nodes are colored in red. The thresholds \(b_1\), \(b_2\) and \(b_3\) may be either 0 or a value in (0, 1]: only when they are non-zero, the corresponding nodes are split by features.

For instance, when \(b_2\) is set to 1, if edge condition \((w_2^T x < 1)\) holds, input x goes to the left child, and if \((w_2^T x \geqslant 1)\) holds, x goes to the right child. When \(b_2\) is set to 0, however, since edge condition \((w_2^T x <0)\) is always false and \((w_2^T x\geqslant 0)\) is always true, input x always goes to the right child. In other words, \(b_2=0\) disallows splitting at node \(v=2\).

Fig. 3.
figure 3

Example of a complete binary tree, where \(\mathcal {V}=\{1,2,3\}\) are branch nodes, \(\mathcal {L}=\{4,5,6,7\}\) are leaf nodes, and the decision thresholds \(b_1\), \(b_2\) and \(b_3\) belong to [0, 1]. (Color figure online)

Symbolic Variables. To model how a feature splits the training set, we define some symbolic variables first.

  • Input (\({\textbf {x}}_{ij}\)): We use \({\textbf {x}}_{ij}\) to model the j-th feature of the i-th input in \(\mathcal {E}\). Thus, \(i \in [1\dots n]\), \(j \in [1\dots k]\), \(n=|\mathcal {E}|\), and \(k=|\mathcal {F}|\). The value of \({\textbf {x}}_{i,j}\) may be any real number from 0 to 1, i.e., \({\textbf {x}}_{i,j} \in [0,1]\).

  • Split (\(p_v\)): For every branch node \(v\in \mathcal {V}\), we use \(p_v\) to model if v is split by a feature. The value of \(p_v\) is either 0 (no) or 1 (yes).

  • Selection (\(w_{vj}\)): We use \(w_{vj}\) to model if the j-th feature is selected by node \(v \in \mathcal {V}\). The value of \(w_{vj}\) is either 0 (no) or 1 (yes). Since both w and x are k-bit vectors, \(w_v^T x\) is the value of the selected feature for a given input x.

  • Threshold (\(b_v\)): We use \(b_v\) to control the activation of branch conditions at node \(v\in \mathcal {V}\). When \(b_v=0\), input x always goes to the right child since condition \((w_v^T x <0)\) is unsatisfiable. Otherwise, x goes to the left child when \((w_v^Tx < b_v)\), and to the right child when \((w_v^Tx\geqslant b_v)\).

  • Input Association (\(z_{it}\)): We use \(z_{it}\) to model if the i-th input, \(x_i\), is associated with node \(t \in \{\mathcal {L} \vee \mathcal {V}\}\). The value of \(z_{it}\) is either 0 (no) or 1 (yes).

  • Empty Association (\(I_t\)): For every leaf node \(t \in \mathcal {L}\), we use \(I_t\) to model if t has any associated input. The value of \(I_t\) is either 0 (no) or 1 (some).

Formula \(\varPhi _{tree}\) . We define the formula as \(\varPhi _{tree} := \varPi _{split} \wedge \varPi _{edge} \wedge \varPi _{leaf} \wedge \varPi _{branch}\) where \(\varPi _{split}\) encodes how features are used to split branch nodes, \(\varPi _{edge}\) encodes the constraints on edges, \(\varPi _{leaf}\) encodes the constraints on leaf nodes, and \(\varPi _{branch}\) encodes the constraints on branch nodes.

Subformula \(\varPi _{split}\) . We construct \(\varPi _{split}\) by constraining \(p_v\), \(w_{vj}\), and \(b_v\):

  1. 1.

    If \(p_v=1\), meaning \(v \in \mathcal {V}\) is split, we require \((\sum _{j \in \{1,\dots ,k\}} w_{vj}=1)\) to ensure exactly one feature is selected. We also require \((b_v >0)\) to activate the branch conditions on the outgoing edges, \((w_v^T x <b_v)\) and \((w_v^T x \geqslant b_v)\).

  2. 2.

    If \(p_v=0\), meaning v is not split, we require \((\sum _{j\in \{1,\ldots ,k\}} w_{vj}=0)\) to ensure no feature is selected, and \((b_v=0)\) to de-activate the left branch. That is, input x always goes to the right, while the left subtree stops growing.

Thus, we have \(\varPi _{split} := \bigwedge _{v\in \mathcal {V}} ~ (\sum _{j\in \{1,\ldots . k\}} w_{vj} = p_v) \wedge (0\leqslant b_v \leqslant p_v)\).

Subformula \(\varPi _{edge}\) . We construct \(\varPi _{edge}\) by constraining the \(p_v\) variables: If node \(v\in \mathcal {V}\) stops splitting, its child nodes also stop splitting. That is, when \(p_v = 0\), both \(p_{2v}\) and \(p_{2v+1}\) must also be 0.

Thus, we have \(\varPi _{edge} = \bigwedge _{v\in \mathcal {V}} ~ (p_v \geqslant p_{2v}) \wedge (p_v \geqslant p_{2v+1})\).

Subformula \(\varPi _{leaf}\) . We construct \(\varPi _{leaf}\) by constraining variables \(z_{it}\) and \(I_t\):

  1. 1.

    For each input \(x_i\), where \(i\in \{1,\dots ,n\}\) and \(n=|\mathcal {E}|\), we require that \(x_i\) is associated with exactly one leaf node \(l\in \mathcal {L}\), i.e., \((\sum _{l\in \mathcal {L}} z_{il} = 1)\).

  2. 2.

    If \(I_l=0\), meaning no input is associated with l, we require that \((z_{il} = 0)\) for all \(i\in \{1,\dots ,n\}\). This is encoded as \(\bigwedge _{l\in \mathcal {L}} (z_{il}\leqslant I_l)\).

Thus, we have \(\varPi _{leaf}:= \bigwedge _{i\in \{1,\dots ,n\}} ~ (\sum _{l\in \mathcal {L}} z_{il} = 1) \wedge \bigwedge _{l\in \mathcal {L}} (z_{il}\leqslant I_l)\).

Subformula \(\varPi _{branch}\) . We construct \(\varPi _{branch}\) by constraining \(w_{vj}\), \(b_v\), and \(z_{it}\):

  1. 1.

    In a complete binary tree, the depth-d nodes are \(v\in \{2^d,\dots , 2^{d+1}-1\}\). Since exactly one of them is associated with input \(x_i\), we require that condition \(\varPi _{br1}:= (\sum _{v\in \{2^d,\ldots ,2^{d+1}-1\}} z_{iv} = 1)\) holds.

  2. 2.

    At each node \(v\in \mathcal {V}\), since input \(x_i\) is associated with either the left child \(L=2v\) or the right child \(R=2v+1\), but not both, we require that the following three conditions hold:

    • \(\varPi _{br2}:= \bigwedge _{v\in \{2^d,\ldots ,2^{d+1}-1\}} ~ (z_{iv} = z_{i(2v)} + z_{i(2v+1)})\)

    • \(\varPi _{br3}:= \bigwedge _{v\in \{2^d,\ldots ,2^{d+1}-1\}} ~ (\sum _{j\in \{1,\ldots ,k\}} w_{vj} x_{ij} -\gamma _L (1- z_{iL}) <b_v)\)

    • \(\varPi _{br4}:= \bigwedge _{v\in \{2^d,\ldots ,2^{d+1}-1\}} ~ (\sum _{j\in \{1,\ldots ,k\}} w_{vj} x_{ij} + (1- z_{iR}) \geqslant b_v)\)

Thus, we have \(\varPi _{branch}:= \bigwedge _{i\in \{1,\ldots ,n\}} ~ \bigwedge _{d\in \{1,\dots ,K-1\}} (\varPi _{br1}\wedge \varPi _{br2}\wedge \varPi _{br3}\wedge \varPi _{br4}\)).

Explanation of \(\varPi _{br3}\) and \(\varPi _{br4}\) . What we would like to encode in \(\varPi _{br3}\) is the fact that branch condition \((\sum w_{vj} x_{ij} < b_v)\) may be either TRUE (\(x_i\) goes to the left child L when \(z_{iL}=1\) and \(b_v\in (0,1]\)) or FALSE (\(x_i\) goes to the right child R when \(z_{iL}=0\) and \(b_v\in (0,1]\), or when \(b_v=0\)). However, since off-the-shelf \({\texttt {MIO}}\) solvers do not support logical-or operators, we have to encode these different scenarios in a single inequality constraint. This is accomplished by adding a slack value, \(-\gamma _L(1-z_{iL})\), to the branch condition. Similarly, in \(\varPi _{br4}\), we add a slack value, \((1-z_{iR})\), to the branch condition \((\sum w_{vj} x_{ij} \geqslant b_v)\).

3.3 Encoding of the Accuracy Requirement (\(\varPhi _{accu}\))

To minimize the accuracy loss defined in \(O_{accu}:= \frac{1}{|\mathcal {L}|}\sum \nolimits _{l \in \mathcal {L}}^{}{L_l}\) (Sect. 3.1), we need to constrain the \(L_l\) variables in \(\varPhi _{accu}\) such that \(L_l\) models the misclassification error at the leaf node \(l\in \mathcal {L}\). In the depth-K complete binary tree, there are \(|\mathcal {L}|=2^K\) leaf nodes. For each leaf node l, variable \(L_l\) represents the number of misclassified examples \((x_i,y_i)\in \mathcal {E}\): it is misclassified if the given output \(y_i\) does not match the predicted output \(\mathcal {T}(x_i)\).

The formula \(\varPhi _{accu} := \varPhi _{p} \wedge \varPhi _{N} \wedge \varPhi _{\theta } \wedge \varPhi _{loss}\) consists of four subformulas.

Subformula \(\varPhi _{P}\) . For each \((x_i,y_i)\in \mathcal {E}\), where \(i\in \{1,\ldots ,n\}\) and \(n=|\mathcal {E}|\), and for each output value \(m\in \{0,1\}\), we use \(p_{im}\) to model if \((y_i=m)\). The value of \(p_{im}\), which is either 0 or 1, is \(const_{im} := (y_i=m)~?~0:1~\).

Thus, we have \(\varPhi _{p} := \bigwedge _{i=1}^{n} \bigwedge _{m=0}^{1} (p_{im} = const_{im})\).

Subformula \(\varPhi _{N}\) . We use variable \(N_l\) to represent the number of examples associated with leaf node l, and \(N_{lm}\) to represent those with output value m.

Thus, we have \(\varPhi _N := \bigwedge _{l\in \mathcal {L}} (N_l = \sum _{i=1}^n z_{il}) \wedge (N_{lm} = \frac{1}{2} \sum _{i=1}^n z_{il} (1+p_{im}) )\).

Subformula \(\varPhi _{\theta }\) . According to Lines 5–8 of Algorithm 1, each leaf node has an output class label \(\theta _l = {\textbf {argmax}}_{m\in \{0,1\}} ~ N_{lm}\). Since argmax cannot be directly encoded, we define a matrix of \(\theta _{lm}\) variables in \(\{0,1\}\), where \(\theta _{lm}=1\) means the output label of node l is m. By definition, only one \(\theta _{lm}\) variable can be 1.

Thus, we have \(\varPhi _\theta := \bigwedge _{l\in \mathcal {L}} ~ (\sum \nolimits _{m\in \{0,1\}} ~ \theta _{lm} = 1)\).

Subformula \(\varPhi _{loss}\) . Assuming that m is the output label predicted by the leaf node l. The misclassification error, \(L_l\), is equal to the number of examples associated with l, denoted \(N_l\), minus the number of examples that have the most common label m, denoted \({\textbf {max}}_{m\in \{0,1\}} N_{lm}\).

To avoid max/min in \( L_l = N_l - {\textbf {max}}_{m \in \{0,1\}} N_{lm} = {\textbf {min}}_{m \in \{0,1\}} (N_l - N_{lm}) \), we use \(\theta _{lm}\) variables and constant \(n=|\mathcal {E}|\) to rewrite the constraint as :

$$\begin{array}{ll} ( L_l \geqslant 0 ) \wedge \bigwedge _{m\in \{0,1\}} &{} ( L_l \geqslant N_l - N_{lm} - n(1-\theta _{lm}) ) \wedge ( L_l \leqslant N_l - N_{lm} + n\theta _{lm} ) \\ \end{array}$$

Thus, we have \(\varPhi _{loss} := \bigwedge _{l\in \mathcal {L}} ( ( L_l \geqslant 0 ) \wedge \bigwedge _{m\in \{0,1\}} ( L_l \geqslant N_l - N_{lm} - n(1-\theta _{lm}) ) \wedge \) \(( L_l \leqslant N_l - N_{lm} + n\theta _{lm} ) )\).

3.4 Encoding of the Fairness Requirement

Formula \(\varPhi _{fair} := \varPhi _{F_s} \wedge \varPhi _{FM}\) has two subformulas. Here, \(\varPhi _{F_s}\) encodes the fairness index and \(\varPhi _{FM}\) encodes the constraints on variables used in \(\varPhi _{F_s}\).

According to Eq. 1 (Sect. 2.3), the fairness index is defined as \(F_{s} = ( Pr_{\lnot f_s}^{+}/Pr_{f_s}^{+} )\), where \(f_s\) is a sensitive feature such that \(f_s(x)\), for any input \(x\in \mathcal {E}\), may be 0 or 1 (e.g., female and male) while \(\mathcal {T}(x)=+\) means the output generated by \(\mathcal {T}\) is positive (e.g., a job is offered). According to the “80% rule”, demographic parity is achieved if \(F_s\) is above \(80\%\). In this work, our goal is to find a solution that (1) satisfies \((F_s >0.8)\) and, at the same time (2) maximizes the value of \(F_s\).

However, the definition of \(F_s\) shown in Eq. 1 has division operators, which are not supported by off-the-shelf \({\texttt {MIO}}\) solvers. Furthermore, the divisor part of the equation varies even for a fixed set \(\mathcal {E}\) of examples, which makes the encoding a challenging task. To overcome the challenge, we refine the definition of as follows:

$$\begin{aligned} \frac{Pr_{f_s=0}^{+}}{Pr_{f_s=1}^{+}} = \frac{ |\{x\in \mathcal {E} \mid f_s(x)=0, \mathcal {T}(x)=+\}| ~/~ |\{x\in \mathcal {E} \mid f_s(x)=0\}| }{ |\{x\in \mathcal {E} \mid f_s(x)=1, \mathcal {T}(x)=+\}| ~/~ |\{x\in \mathcal {E} \mid f_s(x)=1\}|} = \frac{S_0^+/S_0}{S_1^+/S_1} \end{aligned}$$
(2)

For each of the four components, we create a symbolic variable. Variable \(S_0\) represents the number of examples whose sensitive feature has the value 0 (e.g., female) for the gender (\(f_1\)) feature. Variable \(S_0^+\) represents the number of examples in \(S_0\) that have the positive output (e.g., a job is offered). Variable \(S_1\) represents the number of examples whose sensitive feature has the value 1 (e.g., male) for the gender (\(f_1\)) feature. Variable \(S_1^+\) represents the number of examples in \(S_1\) that have the positive output.

Subformula \(\varPhi _{F_s}\) . We use \(\varPhi _{F_s}\) to enforce the 80% rule: \(F_s = \frac{S_0^+/S_0}{S_1^+/S_1} \geqslant 0.8\). Assuming \(S_0>0\), \(S_0^+>0\), \(S_1>0\), and \(S_1^+ > 0\), we encode the rule as follows:

$$\varPhi _{F_s} := (S_0^+ \times S_1 -0.8 \times S_0 \times S_1^+ \geqslant 0)$$

There are two advantages of this encoding. First, the resulting constraint can be solved by off-the-shelf \({\texttt {MIO}}\) solvers, whereas a direct encoding of Eq. 2 cannot. Second, the value of \((S_0^+ \times S_1 -0.8 \times S_0 \times S_1^+)\) increases as \(F_s\) increases; therefore, it can be used as part of the objective function, \(O_{fair}\), to maximize \(F_s\).

Subformula \(\varPhi _{FM}\) . We use \(\varPhi _{FM}\) to constrain the variables \(S_0\), \(S_0^+\), \(S_1\), and \(S_1^+\). Toward this end, we need to define the following variables:

  • \({S_0}_{i}\): We use variable \({S_0}_{i} \in \{0,1\}^n\) to model if the value of \(f_s(x_i)\) is 0. Thus, we require \({S_0}_{i} = 1\) when \(f_s(x_i) = 0\), and \({S_0}_{i}=0\) otherwise.

  • \({S_0}^+_{il}\): We use variable \({S_0}^+_{il} \in \{0,1\}^{n \times |\mathcal {L}|}\) to model, at each leaf node \(l\in \mathcal {L}\), if \(x_i\in \mathcal {E}\) is given the positive output. Thus, we require \({S_0}^+_{il}=1\) when the following condition holds, and \({S_0}^+_{il}=0\) otherwise:

    $$ (\theta _{lm} = 1~\wedge ~m = 1~\wedge ~ z_{il} = 1~\wedge ~{S_0}_{i} = 1) $$

    In the condition above, \((\theta _{lm} = 1)\) means the output label produced by the leaf node l is m, and \((m=1)\) means m is the positive output (“\(+\)”).

  • \({S_1}_i\) and \({S_1}^+_{il}\): We define variables \({S_1}_{i}\) and \({S_1}^+_{il}\) similar to \({S_0}_{i}\) and \({S_0}^+_{il}\).

Thus, we have \(\varPhi _{FM} := (S_0 = \sum \nolimits _{i\in \{1,\dots ,n\}} {S_0}_i) \wedge (S_0^+ = \sum \nolimits _{i\in \{1,\dots ,n\}} \sum \nolimits _{l\in \mathcal {L}} {S_0}^+_{il}) \wedge (S_1 = \sum \nolimits _{i\in \{1,\dots ,n\}} {S_1}_i) \wedge (S_1^+ = \sum \nolimits _{i\in \{1,\dots ,n\}} {S_1}^+_{il})\).

Putting It All Together. Recall that, in Sect. 3.3, we have constrained the accuracy loss, \(L_l\), in the objective function \(O_{accu}\), and defined the objective function \(O_{tree}\) in Sect. 3.1, which is used to minimize the tree size and thus reduce over-fitting. As for the objective function \(O_{fair}\) (Sect. 3.1), we define the fairness score as follows: \(F = (S_0^+ \times S_1 - 0.8 \times S_0 \times S_1^+)\).

Thus, we have the entire \({\texttt {MIO}}\) problem as follows:

$$\begin{aligned}&\begin{aligned} minimize\ \ \ \ \ ~&~ \frac{1}{|\mathcal {L}|}\sum _{l \in \mathcal {L}}^{}{L_l} + \alpha \sum _{v\in \mathcal {V}}{p_v} - \beta F \\ subject\ to\ \ \ \ \ ~&~ \varPhi _{accu}(L_l) \wedge \varPhi _{tree}(p_v) \wedge \varPhi _{fair}(F) \end{aligned} \end{aligned}$$
(3)

4 Generalization and Performance Enhancement

In this section, we first explain how our method relates to various existing algorithms (Sect. 4.1). Next, we present techniques for speeding up constraint solving while maintaining the quality of the solution (Sect. 4.2). Finally, we show that, beyond demographic parity, our method can encode other group fairness metrics, such as equal opportunity and equal odds (Sect. 4.3).

4.1 Relating to Existing Algorithms

Recall that our method performs feature selection by symbolically encoding a depth-K binary tree, to perform a bounded look-ahead search of the optimal feature using the \({\texttt {MIO}}\) solver. For ease of presentation, let us call the selected feature depth-K optimal, where \(K\in \{1,\dots ,+\infty \}\).

Depth-1 Optimal. When \(K=1\), the tree consists of the root node only and, as a result, look-ahead search is disabled. In this case, our method is the same as a purely greedy search method. Depending on whether fairness is encoded, there are two cases.

  • Without the fairness component, our method would compute the depth-1 optimal feature that minimizes only the tree size and the accuracy loss. This is similar to mainstream decision tree learning algorithms such as CART.

  • With the fairness component, our method would compute the depth-1 optimal feature that minimizes the tree size and the accuracy loss, and maximizes the fairness score. This is similar to IGCS [24], an discrimination-aware technique for learning decision trees.

Our experimental evaluation (in Sect. 5) shows that neither CART nor IGCS is effective in improving fairness, especially for larger datasets, primarily due to their inability to look beyond the current node.

Depth-\(\infty \) Optimal. When K is set to a sufficiently-large number, our method is able to find the globally optimal feature for not only the root node, but also other nodes in the decision tree. Thus, it would compute the entire decision tree in one shot.

  • Without the fairness component, our method would act like the technique introduced by Bertsimas and Dunn [7], which laid the ground work for encoding an optimal classification tree as a monolithic \({\texttt {MIO}}\) problem.

  • With the fairness component, our method would act like MIP, a fair learning technique introduced by Aghaei et al. [1].

Our experimental evaluation (in Sect. 5) shows that the computational overhead of the monolithic \({\texttt {MIO}}\) approach or MIP is too high to be practically useful. We discuss how to set the value of K in our method in the next subsection.

4.2 Performance Enhancement

We propose two techniques for speeding up our method by (1) choosing the K value adaptively and (2) sampling the training examples in \(\mathcal {E}\).

Choosing the K Value Adaptively. There is a trade-off between looking further ahead and reducing the constraint solving time. Given \(n=|\mathcal {E}|\) training examples, and \(2^K\) leaf nodes in a depth-K binary tree, the number of decision variables (such as \({S_0}_{il}\)) would be \((n\times 2^K)\). Since mixed-integer optimization is NP-hard, the complexity of constraint solving is \(O(2^{n\times 2^K})\). Empirically, we have found that Gurobi, a state-of-the-art solver, may take 1–2 h to solve a problem for \(n=1000\) training examples and tree depth \(K=7\)—this is consistent with prior experimental results, e.g., Bertsimas and Dunn [7]. Unfortunately, supervised learning datasets in practice often bring as many as 50,000 training examples to the root node of a decision tree, although the number decreases gradually and may reach 0 for some leaf nodes. Therefore, setting K to 7, or any predetermined value, would not work well in practice.

Instead, we propose to set the K value adaptively. Given a time-out limit (T/O) for learning a decision tree, we start with a relatively small K value, say \(K=2\), to synthesize a decision tree. Then, we increase the K value to synthesize a better decision tree. We keep increasing the K value as long as the time limit is not yet reached, and the quality of the decision tree is improved. We measure the quality of the tree using the value of the objective function, O, which consists of the tree size, the accuracy loss, and the fairness score.

Sampling the Training Examples. We propose to reduce the size of the constraints in \(\varPhi \) by sampling the training examples in \(\mathcal {E}\), before using them to construct the formula \(\varPhi \). Our experience shows that sampling can reduce the value of n significantly and, at the same time, maintaining the quality of the \({\texttt {MIO}}\) solution. For the adult dataset, which has 48, 842 training examples, even with a small K value, the symbolic constraints would take more than 1 h to solve.

Empirically, we have observed that the feature computed by depth-K look-ahead using 8,000 randomly-chosen examples is almost as good as the feature computed using all examples. Based on this observation, we set the threshold \((n\leqslant 8000)\), i.e., at most 8,000 examples from \(\mathcal {E}\) are used in the symbolic constraints in Algorithm 4, where \(\varPhi \) = (\(\mathcal {E}, \mathcal {F}, f_s\)) is invoked if \(|\mathcal {E}|\leqslant 8000\). Otherwise, \(\mathcal {E}\) is replaced by the randomly-sampled subset \(\mathcal {E}\mid _{sampled}\).

figure h

Our sampling method is not directly applicable to the original MIP approach because, if sampled data are used as input, the MIP solving procedure would permanently discard the rest of the data, which would significantly degrade its accuracy. In contrast, sampling in our method only causes the rest of the data to be ignored temporarily (for this particular node) but, for the child nodes in the subtree, the entire data will still be used in the subsequent computation.

4.3 Encoding Other Group Fairness Metrics

Beyond demographic parity, there are two popular metrics for group fairness, of which one is equal opportunity and the other is equalized odds.

Equal Opportunity. In addition to the sensitive feature \(f_s\), there is a decision-critical feature \(f_c\). Let \(P^+_{f_s = 0, f_c = 1} = \frac{|x~\in ~ \mathcal {E} ~\mid ~ f_s(x) = 0,~ f_c(x) = 1,~ \mathcal {T}(x) = +|}{|x~\in ~\mathcal {E} ~\mid ~ f_s(x) = 0,~ f_c(x) = 1|} = \frac{S_0^+}{S_0}\) and \(P^+_{f_s = 1, f_c = 1} = \frac{|x~\in ~ \mathcal {E} ~\mid ~ f_s(x) = 1,~ f_c(x) = 1,~ \mathcal {T}(x) = +|}{|x~\in ~\mathcal {E} ~\mid ~ f_s(x) = 1,~ f_c(x) = 1|} = \frac{S_1^+}{S_1}\). A decision tree \(\mathcal {T}\) satisfies equal opportunity if the following condition holds (for a small \(\epsilon \)).

$$\begin{aligned} P^+_{f_s = 1, f_c = 1} - P^+_{f_s = 0, f_c = 1} \leqslant \epsilon \end{aligned}$$
(4)

In our method, Eq. 4 may be encoded as \(\varPhi _{eq} := S_1^+S_0 - S_0^+S_1 - \epsilon S_0 S_1 \leqslant 0\), to replace \(\varPhi _{F_s}\) in the fairness requirement \(\varPhi _{fair}:= \varPhi _{F_s}\wedge \varPhi _{FM}\). The definitions of variables \(S_0\), \(S_0^+\), \(S_1\) and \(S_1^+\) are analogous to that in Sect. 3.4. Similarly, we can define fairness decision variables \({S_0}_{i}\), \({S_0}_{il}\), \({S_1}_{i}\), and \({S_1}_{il}\). For example, the value of \({S_0}_{i}\) is set to 1 if \(f_s(x_i) = 0 \wedge f_c(x_i) = 1\) and is set to 0 otherwise.

Equalized Odds. To satisfy equalized odds, we must satisfy Eq. 4, as well as the condition below:

$$\begin{aligned} P^+_{f_s = 1, f_c = 0} - P^+_{f_s = 0, f_c = 0} \leqslant \epsilon . \end{aligned}$$
(5)

Since Eq. 5 can be encoded similarly to Eq. 4, the details are omitted for brevity.

5 Experiments

We have implemented our method, SFTree, using Python, Julia 1.5.1 [15], and Gurobi 9.03 [21], where Julia is used to encode the \({\texttt {MIO}}\) constraints and Gurobi is used to solve the constraints. We compared SFTree with three state-of-the-art techniques: CART, which is a mainstream algorithm for decision tree learning, IGCS, which is a discrimination-aware learning algorithm, and MIP, which is a monolithic \({\texttt {MIO}}\) approach to learning fair tress. We conducted all experiments with Catalina running on a macOS with 2.4 GHz 8-Core CPU and 64G RAM.

Table 1. Comparing our method with existing methods on small benchmarks.

Benchmarks. Our evaluation uses six popular benchmarks from the fairness literature. They are divided to three small datasets and three large datasets. Since the small datasets can be handled by the less-scalable but more-accurate MIP to obtain globally optimal solutions, they are useful in evaluating the quality of our method. The large datasets, in contrast, are out of the reach of MIP and thus useful in evaluating the scalability of our method.

  • Among the small datasets, German [23] (predicting credit risks) has 1000 training examples and 20 features; Student [12] (predicting student performance) has 649 training examples and 33 features; and Salary [36] (predicting the salary level) has 52 training examples and 16 features. In these datasets, the sensitive feature is gender.

  • Among the large datasets, Adult [14] (predicting the earning power) has 48,842 training examples and 14 features (with race as the sensitive feature); Default [37] (predicting loan default risk) has 30,000 training examples and 23 features (with gender as the sensitive feature); and Compas [13] (predicting the recidivism risk) has 10,500 training examples and 16 features (with race as the sensitive feature).

During learning, we apply the standard 5-fold cross validation expect for Compas, to which we apply 4-fold cross validation to be consistent with prior work.

Results on the Small Benchmarks. We compare the quality of the decision trees learned by our method and three existing methods on the small benchmarks. The results are shown in Table 1, where Column 1 shows name of the dataset, Columns 2–3 shows the result of our method in terms of accuracy and fairness, computed by cross-validation, Columns 4–5 show the result of CART, Columns 6–7 show the result of IGCS, and Columns 8–9 show the result of MIP. Since the datasets are small, MIP is able to compute the best solutions: without violating the 80% Rule, it maximizes accuracy.

The result shows that, overall, CART has the best accuracy but the worst fairness score. IGCS improves over CART, but still violates the 80% Rule in 5 out of the 15 cases. In contrast, SFTree satisfies the fairness requirement in all 15 cases and, at the same time, achieves high accuracy. Furthermore, it runs more than 10 times faster than MIP.

Table 2. Comparing our method with existing methods on large benchmarks.

Results on the Large Benchmarks. We use these benchmarks to evaluate both the quality and the scalability of our method. Table 2 shows the result of the quality comparison, which has the same format as Table 1. CART has the highest accuracy but fails to satisfy the fairness requirement in all 14 cases. Although IGCS is somewhat effective for the small benchmarks in Table 1, here, it fails to satisfy the fairness requirement in 12 of the 14 cases. In contrast, our method is the only one that satisfies the fairness requirement in all cases and, at the same time, has accuracy comparable to CART and IGCS.

Table 3. Comparing the run time of methods on large benchmarks (T/O = 3h).

Table 3 shows the execution time comparison. MIP times out in all 14 cases (T/O = 3h), while our method finishes each within 1 h. Thus, our method runs more than 10 times faster than MIP. Although CART and IGCS are faster, they are equivalent to depth-1 look-ahead search in our method and, due to the limited ability to look ahead, they almost never satisfy the fairness requirement.

Evaluating the Impact of the K -value. We have also evaluated how the K value affects the quality of the learned decision tree using the Student Fold1 benchmark. Since the benchmark is small enough, we set K to fixed values \(1,\dots ,7\) instead of letting it adapt, so we can assess the impact. Figure 4 shows the result, where the x-axis is accuracy and the y-axis is the fairness score. Thus, the closer a dot is to the right-top corner, the higher the overall quality is. The result shows that the quality of our solution increases dramatically as the K value increases from 1 to 7, due to the increasingly deeper look-ahead search.

Fig. 4.
figure 4

How accuracy and fairness of the learned decision tree change with the K value for the Student dataset. For each \(K=1,\dots ,7\), we plot the fairness and accuracy scores.

Summary of Additional Results. While we have also evaluated the scalability of our method with respect to the dataset size, we omit the results for brevity and instead provide a summary. What we have found is that, as the dataset gets larger, the execution time of our method increases modestly at first, and then stops increasing after a threshold is reached. This is due to the use of performance enhancement techniques presented in Sect. 4. Thus, our method does not have scalability issues. In fact, among all four methods, SFTree is the only one that consistently produces fair and accurate decision trees for datasets with \({>}40{,}000\) training samples.

6 Related Work

At a high level, our method can be viewed as an in-processing approach to mitigating bias in machine learning models. Broadly speaking, there are three approaches: pre-processing  [17, 25, 31], in-processing [11, 19, 24, 30, 33] and post-processing [18, 22], depending on whether the focus is on de-biasing the training data, the learning algorithm, or the classification output.

Since the pre-processing approach focuses on de-biasing the training data [17, 25, 31], it is applicable to any machine learning model; however, it cannot remove bias introduced by the learning algorithms, which is problematic because, even if the training data is not biased, learning algorithms may introduce new bias. While the post-processing approach can remove such bias by modifying the predicted output [18, 22], the result is often hard to predict and difficult to explain. In contrast, our method does not have these limitations.

Compared to other in-processing techniques for fair learning decision trees, including IGCS [24] and similar greedy search methods [11, 19, 30, 33], our method has the advantage of being more systematic and quantifiable. This is because we encode both accuracy and fairness requirements explicitly as numerical constraints. Thus, it would be easy to explain, at every step, why a feature is chosen over another feature, and quantify how much more effective it is in minimizing bias and accuracy loss at the same time. Compared to the monolithic constraint solving approach, including MIP [1] and similar methods  [5, 35], our method has the advantage of being significantly more scalable.

Our method differs from the recent work of Torfah et al. [32] in that their method uses a small training set sampled from a known distribution and thus does not need techniques such as incremental solving. Furthermore, their method assumes the decision predicates are given, but in our method, the predicates are synthesized from real-valued features. Finally, our fairness constraint is also different from the explainability constraint.

Besides synthesis, there are techniques for improving fairness by repairing an existing machine learning model [4, 9, 20, 26], and techniques for verifying that an existing machine learning model is indeed fair, e.g., by using probabilistic analysis methods [3, 6, 28]. While these techniques are related, they differ from our method in that they cannot synthesize new decision trees from training data while ensuring the decision trees are fair by construction.

7 Conclusion

We have presented a method for synthesizing a fair and accurate decision tree, by formulating feature section as a series of mixed-integer optimization problems and solve them using an off-the-shelf constraint solver. The method is flexible in expressing group fairness metrics including demographic parity, equal opportunity, and equal odds. On popular datasets, it is able to learn decision trees that satisfy the fairness requirement and, at the same time, achieve a high classification accuracy.