Regularized impurity reduction: accurate decision trees with complexity guarantees

Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree size grows the model interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they will produce small trees. In this paper, we justify the use of a general family of impurity functions, including the popular functions of entropy and Gini-index, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting, where objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity. This approximation factor is tight up to a constant factor under mild assumptions. The algorithm recursively selects a test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third impurity-reduction component encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.


Introduction
Decision trees are known to provide a good trade-off between accuracy and interpretability. However, when their size grows, decision trees become harder to interpret, preventing their deployment in safety-critical applications and in domains where model transparency is highly valued, such as disease diagnosis. As interpretability remains an ill-defined notion (Lipton 2018), in this paper we consider tree complexity, a commonly accepted proxy, to quantify interpretability (Freitas 2014; Doshi-Velez and Kim 2017). In addition, low tree complexity promotes cheaper and faster evaluation. Note that post-pruning techniques, such as the standard minimal cost-complexity pruning (Breiman et al. 1984), are heuristics performed mainly to avoid overfitting. Therefore, in order to produce interpretable trees, we aim for an integrated tree-induction algorithm that considers both the accuracy and complexity of the inferred trees.
More concretely, given a set of labeled objects (examples) drawn from an arbitrary probability distribution, our goal is to learn a decision tree that outputs the correct class of a given input object. Each internal tree node is equipped with a single test, e.g., a projection split along a feature, and each test is associated with a non-uniform cost, i.e., the cost of evaluating the outcome of the test. For example, an input object may represent a person, a test may correspond to a blood sugar test, and one possible outcome can be "high." Our aim is to learn trees that are accurate and have low complexity. The latter complexity objective is measured by the expected cost to classify an object drawn from the input distribution; if all tests incur the same cost, this measure is simply the expected number of tests to classify an object. This complexity measure reflects a form of "local" interpretability: the more tests are involved in an if-then rule for a given object, the more obscure the rule becomes to a user (Freitas 2014). Figure 1 helps demonstrate this intuition by juxtaposing two decision trees with different complexity. Note that non-uniform test costs may arise in different real-world scenarios; for example, in a medical-diagnosis application some tests can be significantly more expensive than others.
The problem of minimizing the expected cost of a tree for perfect class identification has been extensively studied. Typically, the assumption of realizability (or consistency) is made, which states that for every two distinct objects there exists at least one test that can distinguish them. Thus, one can always expand the tree until it classifies every object in the training data perfectly. Then, the goal is to find the tree with the minimum expected cost that classifies each object perfectly. Note that, in practice, the realizability assumption can be easily fulfilled by data preprocessing, as we demonstrate later in our experiments. When each object belongs to a distinct class, the problem is referred to as entity identification (EI) (Gupta et al. 2017). Without this restriction, the problem is called group identification (GI) (Cicalese et al. 2014). A further generalization, called adaptive submodular ranking (ASR) (Navidi et al. 2020), characterizes the tree-building process as interaction among multiple submodular functions, one for each object, and achieves logarithmic approximation by a greedy algorithm. The above-mentioned works (Gupta et al. 2017; Cicalese et al. 2014; Navidi et al. 2020) consider the problem of building trees for the purposes of exact identification. They do not consider issues of accuracy and overfitting. In fact, exact identification on a set of (training) data leads precisely to overfitting.

Fig. 1 (a), (b) Decision trees for predicting if participants would like to see their date again after speed dating. Each internal node includes the test used and the number of participants in parentheses. Leaf nodes make a decision
The algorithm proposed for the ASR problem by Navidi et al. (2020) provides a very elegant solution for the identification task it has been designed for. However, in practice, it is not suitable for classification tasks in the context of statistical learning, because the chosen tests are geared towards small expected cost and are not necessarily discriminative. Discriminative power is generally measured by the homogeneity of the target variable within a tree node, and is essential for the generalization of model performance over unseen data. Their method selects splits that minimize the number of heterogeneous pairs (also known as impure pairs) of objects (Golovin et al. 2010;Cicalese et al. 2014). In Section A we provide a simple example where the criterion favors a non-discriminative (presumably random) test over a discriminative one. While random tests lead to a balanced tree with bounded expected depth, they are not "learning", that is, no statistical dependence is captured between tests and the target variable.
On the other hand, traditional decision-tree methods, such as CART (Breiman et al. 1984) and C4.5 (Quinlan 1993), rely on time-tested impurity-reduction heuristics that yield decision trees with high discriminative power. Although trees produced by these popular methods are accurate in practice, there has been no guarantee on the size, or depth, of the resulting trees. In fact, despite the popularity of these methods, their theoretical properties remain poorly understood (Bellala et al. 2012; Brutzkus et al. 2019; Blanc et al. 2020).
In this paper we propose a general family of methods that achieve the best of both worlds: they produce decision trees having both high accuracy and bounded depth. Our key discovery is that the ASR framework can be extended to effectively analyze a broad range of impurity functions for tree induction.
More formally, we introduce the non-overfitting group identification (NGI) problem, which is a natural generalization of group identification (GI), where we further allow early termination during tree expansion to avoid overfitting. We propose a novel greedy algorithm that takes into consideration the impurity reduction and maintains the strong approximation guarantee on the complexity of the resulting tree. Specifically, our greedy algorithm admits the use of a general family of decomposable impurity functions, which is defined to be in the form of a weighted sum over impurity scores in each class. This family includes the popular functions of entropy and Gini-index. Therefore, our approach generalizes many traditional tree-induction algorithms such as CART and C4.5 into a complexity-aware method.
Concretely, in this paper we make the following contributions.
• We extend the adaptive submodular ranking (ASR) framework of Navidi et al. (2020) and we propose a novel greedy algorithm to select discriminative tests for the non-overfitting group identification (NGI) problem. Our algorithm offers an asymptotically tight approximation guarantee on the complexity of the inferred tree under mild assumptions.
• We define a general family of decomposable impurity functions, which can be used by our algorithm as a surrogate for discriminative power. As a result, our algorithm generalizes traditional tree-induction algorithms, such as CART and C4.5, into complexity-aware methods.
• We provide a comprehensive experimental evaluation in which we show that the enhanced C4.5 and CART strike an excellent balance between predictive accuracy and tree complexity, compared to their corresponding original heuristics. Furthermore, the ASR formulation yields inferior predictive accuracy, compared to other learning methods. Our implementation is publicly available.

The rest of the paper is organized as follows. The related work is discussed in Sect. 2. The necessary notation and the formal definition of the NGI problem are introduced in Sect. 3. The main algorithm and its theoretical analysis follow in Sects. 4 and 5, respectively. Empirical experiments are conducted in Sect. 6, and we conclude in Sect. 7.

Related work
Decision-tree induction. Mainstream algorithms such as C4.5 and CART embrace a top-down greedy approach. Most of the greedy criteria proposed are essentially ad-hoc heuristics for measuring the strength of dependence between tests and the class, with no consideration for tree complexity (Murthy 1998). Theoretical understanding of such greedy methods is still lacking in the literature. A lower bound on the expected tree depth for C4.5 that depends on the shape of a given tree has been developed by Bellala et al. (2012). There also exist some recent results in the field of learning theory (Brutzkus et al. 2019; Blanc et al. 2020).

Tree complexity. Popular measures include the number of nodes in the tree, the tree height, and the expected path length. The first kind of measure is closer to a notion of "global" interpretability, in the sense that one could inspect the entire tree of a small size, while the second kind of measure provides a notion of "local" interpretability, in the sense that one could explain any given object using a small number of tests. Our choice in the paper, the third kind of measure, combines elements from both global and local interpretability. First, it obviously enables a form of local interpretability, i.e., a guarantee of a small expected number of tests when explaining a given object. This choice is considered to be more natural and less strict compared to worst-case tree height, as it may not be possible to classify every object using a small number of tests. Second, it also enables a form of global interpretability, as the global model knowledge is acquired by understanding the decision for every example in the dataset, and it also leads to smaller trees in general. Unfortunately, these complexity measures are proven to lead to NP-hard tasks (Hancock et al. 1996; Laurent and Rivest 1976). In particular, the expected path-length measure with an arbitrary probability distribution over objects does not admit sub-logarithmic approximation (Chakaravarthy et al. 2007).

Identification. The entity identification (EI) problem has been investigated in different contexts, including optimal decision trees, disease/fault diagnosis, and active learning (Adler and Heeringa 2008; Chakaravarthy et al. 2007; Dasgupta 2005; Garey 1972; Guillory and Bilmes 2009; Gupta et al. 2017; Kosaraju et al. 1999). A class-based generalization, the group identification (GI) problem, where objects are partitioned into groups (classes), has also been studied (Bellala et al. 2012; Cicalese et al. 2014; Golovin et al. 2010). The state-of-the-art method achieves O(log n)-approximation in a general setting with an arbitrary object distribution and non-uniform multi-way testing costs (Cicalese et al. 2014). Our paper further generalizes the latter work by considering the discriminative power of the selected tests. To the best of our knowledge, this is the first work to combine identification problems and traditional tree-induction algorithms.

Stochastic submodular coverage (StoSC). Tree induction can be seen as a sample-based stochastic submodular-coverage problem (Golovin and Krause 2011; Grammel et al. 2016), by relating a realization of items in the StoSC problem to an object in identification problems. The expected cost of a tree is then equivalent to the expected cost in item selection.

Adaptive submodular ranking (ASR). The ASR problem, proposed by Navidi et al. (2020), originates from the line of research of min-sum set cover (Feige et al. 2004; Im et al. 2012), and turns out to generalize the above-mentioned identification problems (Bellala et al. 2012; Cicalese et al. 2014; Golovin et al. 2010). Our formulation follows the framework of ASR, and extends its greedy criterion to incorporate an impurity-reduction component.

Problem definition
In this section, we first formalize the non-overfitting group identification (NGI) problem, and then define a family of decomposable impurity functions for tree induction.
An instance of the NGI problem is specified by a set of objects $X = \{x_1, \ldots, x_n\}$, a set of class labels $L = \{\ell_1, \ldots, \ell_k\}$, and a set of tests $D = \{d_1, \ldots, d_m\}$. The objects in $X$ are drawn from a probability distribution $p$, i.e., object $x$ in $X$ occurs with probability $p(x)$. Each object $x \in X$ is associated with a class $\ell(x)$ in $L$. A test $d \in D$ performed on an object $x \in X$ returns a value $d(x) \in \{1, \ldots, \nu_d\}$. We assume that employing test $d$ incurs cost $c(d)$. For simplicity and without loss of generality, we also assume that the cost function $c$ takes integral values. A useful quantity in our later analysis is the minimum object probability $p_{\min} = \min_{x \in X} p(x)$. Finally, we assume that a threshold parameter $\theta \in [0, 1]$ is given as input, which determines a stopping condition for the decision-tree construction, as we will see shortly.
We write $T(X)$ to refer to a decision tree built to classify the objects in $X$. We omit the reference to the set $X$ when it is clear from the context and just write $T$. We also write $T(S)$ to refer to a subtree of the decision tree to classify objects in a node $S$ of the tree, where $S \subseteq X$ is the subset of objects. Each internal node $S$ is equipped with a test $d$ in $D$. Objects in $S$ are partitioned by test $d$ into multiple subnodes according to their testing outcomes $d(x)$. Using this convention we refer to the root of the decision tree simply as $X$, that is, the complete set of objects to be classified by the tree. Finally, we define $p(S) = \sum_{x \in S} p(x)$.
We stop splitting a node $S \subseteq X$ in the tree $T$ when either (i) the node $S$ is homogeneous, i.e., all objects in $S$ belong to the same class, or (ii) the probability $p(S)$ is no greater than the threshold parameter $\theta$; for instance, in the case of uniform $p$, the node $S$ has at most $\theta n$ objects. As a surrogate for homogeneity, we adopt a function $\varphi$ over pairs of objects. We define $\varphi(S)$ to be the number of heterogeneous pairs of objects in the node $S$, i.e., pairs of objects with distinct classes. Note that $\varphi(S) = 0$ when $S$ is homogeneous.
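The heterogeneous-pair count and the two stopping conditions can be illustrated with a minimal sketch. The function names `heterogeneous_pairs` and `should_stop` are hypothetical helpers introduced here for illustration; the count uses the identity that the number of mixed-class pairs equals all pairs minus same-class pairs.

```python
from collections import Counter


def heterogeneous_pairs(labels):
    """phi(S): number of unordered pairs of objects with distinct classes."""
    n = len(labels)
    all_pairs = n * (n - 1) // 2
    same_class = sum(c * (c - 1) // 2 for c in Counter(labels).values())
    return all_pairs - same_class


def should_stop(labels, probs, theta):
    """Stop when the node is homogeneous or its probability mass is at most theta."""
    return heterogeneous_pairs(labels) == 0 or sum(probs) <= theta
```

Counting through class sizes keeps the computation linear in the node size, instead of enumerating all pairs.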
As a measure of complexity for a tree rooted at $X$, we adopt the measure of expected cost, which we denote by $c(T(X))$. In particular, we define $c(T, x)$ as the cost of evaluating an object $x$ in $T$, which is the sum of costs of all tests that $x$ goes through in $T$. The expected cost of a tree $T$ for a set of objects $X$ is then defined as $c(T(X)) = \sum_{x \in X} p(x)\, c(T, x)$. We are now ready to define the NGI problem.
Problem 1 (Non-overfitting group identification (NGI)) Given a problem instance $I = (X, L, D, \ell, p, c, \theta)$, with set of objects $X$, set of class labels $L$, set of tests $D$, object labels $\ell$, probability distribution $p$, cost function $c$, and a threshold $\theta$, find a tree $T(X)$ that minimizes the expected cost $c(T(X))$ such that every leaf node $S$ satisfies either $\varphi(S) = 0$ or $p(S) \le \theta$.
The NGI problem generalizes the GI problem, which is recovered by setting $\theta = 0$, and as stated in Sect. 2, the GI problem is NP-hard. Thus, we aim to find a tree $T$ that is an approximate solution, i.e., whose cost $c(T)$ is bounded with respect to the cost $c(T^*)$ of the optimal tree $T^*$.
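To make the complexity measure concrete, the following minimal sketch computes the path cost of one object and the expected cost of a tree. The `Node` class and its fields are a hypothetical tree representation chosen for illustration, not the paper's data structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    test: Optional[Callable[[dict], int]] = None  # None marks a leaf
    cost: float = 0.0                             # cost c(d) of the test at this node
    children: Dict[int, "Node"] = field(default_factory=dict)


def path_cost(tree, x):
    """c(T, x): sum of the costs of all tests that object x goes through."""
    total = 0.0
    node = tree
    while node.test is not None:
        total += node.cost
        node = node.children[node.test(x)]
    return total


def expected_cost(tree, objects, probs):
    """c(T(X)): probability-weighted average of the path costs."""
    return sum(p * path_cost(tree, x) for x, p in zip(objects, probs))
```

With unit costs on every test, `expected_cost` reduces to the expected number of tests along a root-to-leaf path.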
Our approach draws inspiration from the adaptive submodular ranking (ASR) problem (Navidi et al. 2020), which can be defined similarly, by replacing each object $x_i$ in $X$ with a non-decreasing submodular function $f_i : 2^D \to [0, 1]$ such that $f_i(\emptyset) = 0$ and $f_i(D) = 1$; recall that $D$ is the set of tests, and thus, each function $f_i$ takes as input a subset of tests. We denote the set of non-decreasing submodular functions by $F = \{f_1, \ldots, f_n\}$. We again consider a tree, which recursively partitions $F$. The tests $D$ and the probability distribution $p$ apply to the set of functions $F$ in the same way that they apply to their corresponding objects. For example, a function $f_i$ evaluated on a test $d \in D$ returns a value $d(f_i) = d(x_i)$, which determines the branch of the tree that $f_i$ will follow. Given a tree $T$, a function $f$ picks up all tests associated with the nodes it goes through and is fully covered when it reaches its maximum function value $f(D)$. Let $c(T, f)$ be the cost of covering $f$ in $T$, defined as the sum of costs of all tests that $f$ goes through in $T$ before it is fully covered. Note that a function is not necessarily covered in a leaf node; it may be covered in an internal node. The expected cost of a tree $T$ is defined in a similar manner as for the NGI problem. The adaptive submodular ranking problem is defined as follows.
Problem 2 (Adaptive submodular ranking (ASR) (Navidi et al. 2020)) Given a problem instance $I = (F, D, p, c)$, with set of submodular functions $F$, set of tests $D$, a probability distribution $p$, and cost function $c$, find a tree $T(F)$ that covers all functions in $F$ and minimizes the expected cost $c(T(F))$.

Decomposable impurity functions. When constructing decision trees for classification tasks, in addition to having small expected cost, the discriminative power of the selected tests is also vital. A number of different impurity measures have been widely used in deciding a discriminative test in decision trees, such as entropy and Gini index. Such impurity measures are defined as functions $h : [0, 1]^k \to \mathbb{R}_+$, taking as input the class distribution at a given tree node. Impurity functions are expected to satisfy certain conditions (Kearns and Mansour 1999), which capture the notion of "impurity." All impurity functions mentioned in this paper satisfy the following conditions: (1) they obtain the maximum value if the class distribution is uniform, and the minimum value zero if a node is pure (i.e., homogeneous); (2) they are concave; and (3) they are symmetric.
A typical splitting criterion compares the impurity before and after performing a test $d$, defined as $h(S)$ and $h(S \mid d)$, respectively. The impurity reduction of a test $d$ on a tree node $S$ is defined as the difference $h(S) - h(S \mid d)$. A test that causes larger impurity reduction is considered more discriminative. Based on the concavity property of $h$, it is easy to show that the impurity reduction is non-negative for any tree node $S$ and test $d$. We defer the proof of this claim to the Appendix, Section B.
Before we define a special family of impurity functions for our problem, we first introduce some additional notation. For a node $S$ of the tree, where $S \subseteq X$, we define $S_d^v$ as the child node of $S$ obtained by equipping $S$ with test $d$ and following the branch that takes on a specific testing value $v$. In particular, we define $S^{(i)}_{D'}$ as the ending node of a path that starts at $S \subseteq X$ and follows a sequence of nodes each equipped with a test $d$ in $D' \subseteq D$ by taking on a value of $d(x_i)$. Note that the order of tests in $D'$ does not matter in $S^{(i)}_{D'}$. Finally, we denote the total probability of objects of a specific class $\ell$ in $S$ as $p_\ell(S) = \sum_{x \in S:\, \ell(x) = \ell} p(x)$.
We are now ready to define $h(S)$ and $h(S \mid d)$ for our problem. We require $h$ to be decomposable, i.e., to be a weighted sum over impurity scores in each class (Eq. (1)), where the per-class score $h_\ell(S)$ can be any function of $p_\ell(S)/p(S)$, the proportion of objects of class $\ell$ in $S$, that ensures that $h$ satisfies the three requirements stated above (i.e., (1) being maximized at uniform class distribution and minimized at homogeneity, (2) concavity, and (3) symmetry).
A wide range of concave impurity functions adopt such a form. For example, $h$ becomes the entropy function when $h_\ell(S) = -\log \frac{p_\ell(S)}{p(S)}$, and it becomes the Gini index when $h_\ell(S) = 1 - \frac{p_\ell(S)}{p(S)}$.
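As an illustration of the decomposable form, the sketch below evaluates an impurity as a weighted sum of per-class scores and computes the impurity reduction of a test. It rests on two assumptions that are not taken from the paper's lost display equation: the weight of class $\ell$ is its probability mass $p_\ell(S)$, and $h(S \mid d)$ is the sum of the impurities of the child nodes. All function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy_score(q):
    """Per-class score that yields entropy: -log of the class proportion."""
    return -math.log(q)


def gini_score(q):
    """Per-class score that yields the Gini index: 1 minus the class proportion."""
    return 1.0 - q


def impurity(labels, probs, score=entropy_score):
    """Assumed form: h(S) = sum over classes of p_l(S) * score(p_l(S) / p(S))."""
    mass = defaultdict(float)
    for label, p in zip(labels, probs):
        mass[label] += p
    total = sum(mass.values())
    return sum(m * score(m / total) for m in mass.values() if m > 0)


def impurity_reduction(labels, probs, outcomes, score=entropy_score):
    """h(S) - h(S | d), with h(S | d) assumed to be the sum over child nodes."""
    branches = defaultdict(lambda: ([], []))
    for label, p, v in zip(labels, probs, outcomes):
        branches[v][0].append(label)
        branches[v][1].append(p)
    after = sum(impurity(ls, ps, score) for ls, ps in branches.values())
    return impurity(labels, probs, score) - after
```

Under these assumptions a perfectly separating test recovers the full impurity of the node, while a test whose branches keep the parent's class proportions yields zero reduction.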

Algorithm
The main idea of our approach is to cast the NGI problem as an instance of the ASR problem (Navidi et al. 2020). We achieve this by defining a non-decreasing submodular function for each object. The ASR problem is solved by a greedy algorithm that picks tests to maximize the coverage of the submodular functions while encouraging a balanced partition. We further incorporate the impurity-reduction objective into the greedy criterion to encourage the selection of discriminative tests, without losing the approximation guarantee.

Algorithm 1 Greedy tree-induction algorithm
Input: An instance $I = (S, L, D, \ell, p, c, \theta)$, a set of tests $D' \subseteq D$ used so far, impurity function $h$, trade-off parameter $\lambda \ge 0$
Output: A decision tree $T$
1: Return a decision tree Tree($I$, $\emptyset$) (a recursive call at the top level with $D' = \emptyset$)
2: function Tree($I$, $D'$)
3:   if $S$ is homogeneous or $p(S) \le \theta$ then
4:     Return a leaf node $S$, labeled with its majority class
5:   ...
     end for
15:   Return a tree rooted at $S$ with children $\{T_v\}$ by test $d^*$
16: end function
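The recursive structure of a top-down greedy induction of this kind can be sketched as follows. This is only a skeleton under stated assumptions: the greedy criterion is passed in as a hypothetical `score` callable, the stopping rule as a hypothetical `should_stop` callable, objects are `(features, label, prob)` triples, and the leaf label is taken to be the class with the largest probability mass.

```python
from collections import defaultdict


def majority_class(objects):
    """Class with the largest total probability among (features, label, prob) triples."""
    mass = defaultdict(float)
    for _, label, prob in objects:
        mass[label] += prob
    return max(mass, key=mass.get)


def build_tree(objects, tests, score, should_stop, used=frozenset()):
    """Recursively pick the best-scoring unused test and split on its outcomes."""
    candidates = [name for name in tests if name not in used]
    if should_stop(objects) or not candidates:
        return {"leaf": majority_class(objects)}
    best = max(candidates, key=lambda name: score(objects, name, used))
    branches = defaultdict(list)
    for obj in objects:
        branches[tests[best](obj[0])].append(obj)
    children = {
        value: build_tree(subset, tests, score, should_stop, used | {best})
        for value, subset in branches.items()
    }
    return {"test": best, "children": children}
```

Keeping the score and the stopping rule as parameters lets the same skeleton express a traditional impurity-only criterion or a regularized one.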
Our algorithm for the NGI problem is presented in Algorithm 1. It is a greedy algorithm, which, at each node $S$, selects a test $d$ that maximizes a cost-benefit greedy score $Z'(d)$ (Eq. (2)) consisting of three terms. The first term, $B(d)$, is the sum of the branch probabilities except the largest-cardinality branch. Maximizing $B(d)$ encourages selecting a test $d$ that yields a balanced split.
The second term, $E(d)$, is the re-weighted total sum of the marginal gain in each submodular function, which we will define for our objects shortly. Maximizing $E(d)$ accelerates the progress towards termination. The last term is the impurity reduction we defined in Sect. 3, which improves the discrimination of the selected test. The user-defined parameter $\lambda \ge 0$ controls the trade-off between tree complexity and discrimination.
One way to understand the greedy score $Z'(d)$ is to view the $B$ and $E$ terms as a regularizer. Notice that maximizing only the first two terms, Step 8 of Algorithm 1, is exactly the greedy criterion used by Navidi et al. (2020) to solve the ASR problem.
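A possible shape of the three-term score is sketched below. The exact expression of Eq. (2) is not reproduced here, so the sketch assumes the usual ASR form: the marginal gain of each function is re-weighted by its remaining gap to full coverage, and the sum of the three terms is divided by the test cost. The function names and the argument layout are hypothetical.

```python
from collections import defaultdict


def balance_term(probs, outcomes):
    """B(d): total probability of all branches except the largest-cardinality one."""
    size = defaultdict(int)
    mass = defaultdict(float)
    for p, v in zip(probs, outcomes):
        size[v] += 1
        mass[v] += p
    largest = max(size, key=size.get)
    return sum(m for v, m in mass.items() if v != largest)


def efficiency_term(probs, f_before, f_after):
    """E(d), assumed form: sum of p(x_i) * (marginal gain) / (1 - current value)."""
    return sum(
        p * (after - before) / (1.0 - before)
        for p, before, after in zip(probs, f_before, f_after)
        if before < 1.0  # fully covered functions contribute nothing
    )


def greedy_score(cost, balance, efficiency, reduction, lam):
    """Assumed cost-benefit score: (B + E + lambda * impurity reduction) / c(d)."""
    return (balance + efficiency + lam * reduction) / cost
```

Setting `lam` to zero leaves only the balance and efficiency terms, which corresponds to the ASR criterion mentioned in the text.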
We finish the description of our method by showing how to define the submodular function $f_i$ for each object $x_i$. We start by defining two monotonically non-decreasing submodular functions. For each object $x_i \in X$, both submodular functions take as input a subset of tests $D' \subseteq D$ and return a real value.

The first function $f^p_i(D')$ is defined as the scaled total probability of the objects that do not fall into $S^{(i)}_{D'}$, i.e., the objects that disagree with $x_i$ in at least one test in $D' \subseteq D$. Note that eventually, only object $x_i$ remains in $S^{(i)}_{D}$. The second function $f^\varphi_i(D')$ is defined as the number of heterogeneous pairs that do not fall into $S^{(i)}_{D'}$. Eventually, no heterogeneous pair will exist in $S^{(i)}_{D}$ and the ending node is homogeneous. The target (maximum) values for these two functions are both 1, for each object $x_i \in X$.

Next we define the disjunction function $\bar{f}_i$ of $f^p_i$ and $f^\varphi_i$, which remains monotonically non-decreasing and submodular (Deshpande et al. 2016; Guillory and Bilmes 2011). It is easy to see that with a reasonable value of $\theta$ (e.g., a multiple of the greatest common divisor of $\{p(x)\}$), the minimum positive incremental value of any element and any $\bar{f}_i$ is

Last, we examine the time complexity of Algorithm 1. The greedy score for a tree node $S$ with respect to a test can be computed in $O(|S|)$ time. At each level of the decision tree, the union of disjoint nodes has a total size $n$. Thus, the worst-case time complexity is $O(Hmn)$, where $m$ is the number of tests, $n$ is the number of objects, and $H$ is the tree height, which is upper bounded by $n$. In practice, the algorithm is more efficient than what this worst-case bound suggests; it has the same time complexity as standard tree-induction algorithms, such as CART.
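The two coverage functions and their disjunction can be sketched as below. The normalizations are assumptions made for illustration, since the defining equations are not available here: the probability function is scaled by $1 - \theta$ and capped at 1, the pair function is divided by the number of heterogeneous pairs at the root, and the disjunction uses the standard OR construction $1 - (1 - a)(1 - b)$ for functions with target value 1. All names are hypothetical.

```python
def f_prob(mass_outside, theta):
    """Assumed f^p_i: excluded probability mass, scaled so that the value 1 is
    reached once the mass remaining in the node is at most theta."""
    if theta >= 1.0:
        return 1.0
    return min(1.0, mass_outside / (1.0 - theta))


def f_pairs(pairs_in_node, pairs_at_root):
    """Assumed f^phi_i: fraction of heterogeneous pairs no longer in the node."""
    if pairs_at_root == 0:
        return 1.0  # the root is already homogeneous
    return 1.0 - pairs_in_node / pairs_at_root


def disjunction(a, b):
    """OR of two coverage values in [0, 1]: equals 1 as soon as either equals 1."""
    return 1.0 - (1.0 - a) * (1.0 - b)
```

The disjunction mirrors the two stopping conditions: an object counts as covered once its node is either small enough in probability or homogeneous.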

Approximation guarantee
In this section we establish the approximation guarantee of Algorithm 1 for the NGI problem. Our main result is the following.
Theorem 1 Algorithm 1 provides an O(log 1/ p min +log n +λ h ) approximation guarantee for the NGI problem.
As a practical consequence of Theorem 1, we have the following corollary, which follows from the fact that for the popular impurity functions the factor h in the approximation ratio of Theorem 1 can be effectively bounded. Note that in practice $p_{\min} \ge 1/n$ when given training data of $n$ points, and thus $O(\log 1/p_{\min}) = O(\log n)$. We omit the constant $\lambda$ for simplicity.

Corollary 2 Algorithm 1 provides an $O(\log 1/p_{\min} + \log n)$ approximation guarantee for the NGI problem when the impurity function $h$ is either the entropy or the Gini index function.
Proof In addition to the result of Theorem 1 we can show that h is small compared to the other terms, or bounded by a constant. When h is the entropy function, we have

When h is the Gini index function, we have

Assuming P ≠ NP, the result given by Theorem 1 is asymptotically the best possible among instances where $1/p_{\min}$ is polynomial in $n$. This follows directly from the hardness result of the EI problem (Chakaravarthy et al. 2007), which, in turn, is proved via a reduction from the minimum set-cover problem. Recall that by setting $\theta$ to zero, the NGI problem reduces to the EI or GI problem. The EI instance constructed in their reduction has a minimum object probability $p_{\min}$ such that $1/p_{\min}$ is on the order of $n^3$, and thus if NGI admitted an $o(\log 1/p_{\min}) = o(\log n)$ approximation, we could solve the set-cover problem with $o(\log n)$-approximation, which is conditionally impossible (Feige 1998).

Remark 1
The NGI problem does not admit an o(log n) approximation algorithm, unless P = NP.

Proof of Theorem 1
Our analysis is similar to the one by Navidi et al. (2020), except that we need a new proof of their key lemma for our new greedy selection rule (Eq. (2)). This is done by leveraging the special structure in the family of impurity functions (Eq. (1)) we employ.
In order to analyze the total cost along a path, we treat cost as discrete "time" (or continuous time if we allow continuous cost), and we divide time geometrically. We refer to the decision tree returned by Algorithm 1 as $T_A$, while we refer to the optimal decision tree as $T^*$. We denote the set of internal (i.e., unfinished) nodes up to time $t$ in $T_A$ as $C(t)$, and similarly as $C^*(t)$ in $T^*$.
We define $C_k = C(\gamma 2^k)$ and $C^*_k = C^*(2^{k-1})$, for a constant $\gamma$ to be defined shortly. That is, we are interested in the set of unfinished nodes at the end of the $k$-th geometrically increasing time interval. Notice that the interval length for $C$ is stretched by a factor of $2\gamma$, compared to $C^*$. We define $p(C(t)) = \sum_{S \in C(t)} p(S)$, i.e., the total probability of unfinished nodes at time $t$. Note that $p(C(t))$ is non-increasing as $t$ grows, and in the case of integral costs we have $p(C^*_0) = p(C^*(2^{-1})) = 1$, i.e., no test can be completed within a fractional cost.
The cost of some test may be truncated by the defined geometrical time intervals. To denote the actual cost of a test within an interval, we first define a path $\pi_{ik}$ in $T_A$ to be the sequence of tests involved within time $(\gamma 2^k, \infty)$ for each object $x_i$. A test $d$ selected by object $x_i$ appears in path $\pi_{ik}$ during time interval $(\gamma 2^k, \infty) \cap (t_{i,<d}, t_{i,<d} + c(d)]$, where $t_{i,<d}$ is the total cost before test $d$ for object $x_i$. The truncated cost of a test $d \in \pi_{ik}$ within that intersection is denoted by $c_{ik}(d)$. Note that $c_{ik}(d) \le c(d)$. We denote the set of tests before test $d$ in path $\pi_{ik}$ by $\pi_{ik,<d}$.
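The truncated cost is simply the length of the intersection of two intervals, which the following small sketch computes. The function name and argument names are hypothetical; `start` plays the role of the total cost spent before the test.

```python
def truncated_cost(start, cost, gamma, k):
    """Length of the intersection of (gamma * 2**k, inf) and (start, start + cost]."""
    threshold = gamma * 2 ** k
    return max(0.0, start + cost - max(start, threshold))
```

A test that finishes before the threshold contributes zero, a test that starts after it contributes its full cost, and a test that straddles the threshold contributes only the part after it, so the result never exceeds the original cost.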
Our greedy algorithm is identical to the algorithm of Navidi et al. (2020) except that their greedy-selection score $Z$ at Step 8 is replaced by the new score $Z'$ at Step 9, in order to encourage impurity reduction of the selected tests. The key inequality in the analysis of Navidi et al. is proven via an intermediate value $Z_k$ defined below. We restate their technical result here.

and d(S) is the greedy test for node S. Besides, the first inequality holds regardless of the value of γ and holds as long as d(S) is a greedy test with respect to an additive score Z + D, where D can be any non-negative function; the second inequality holds regardless of the choice of tests d(S) in the decision tree.
Our new greedy score $Z'$ is in the additive form required above for the first inequality. Therefore, in our case, the difficulty mainly lies in the second inequality. We can prove that a similar lemma holds for our new greedy score.

and d(S) is the greedy test for node S.
Proof The first inequality is easy to show. Notice that, compared to $Z$, the new score $Z'$ introduces a third term of impurity reduction D in Eq. (2), which is always non-negative (see Sect. 3), and thus $Z'(d) \ge Z(d)$. Thus, the first inequality in Lemma 3 also holds for the tree generated by the new greedy score. Since the first inequality in Lemma 3 does not depend on the value of $\gamma$, we replace it with the new $\gamma$, which completes the proof.
The second inequality requires more work. We denote the sum of the D terms in From Lemma 3 we know that Z ∞ k − G ≤ γ p(C k )/15. We now upper bound the additional term G. We omit S in d(S) when it is clear from the context.
where step (3) follows by enumerating the summands in a different order, step (4) is due to $c_{ik}(d) \le c(d)$, and step (5) follows by considering the telescoping series of the impurity reduction along a path of an object. Putting everything together gives

Next, we use another simple lemma that provides an upper bound on the expected cost $C_A$ of $T_A$, and a lower bound on the optimal cost $C^*$ of $T^*$. This result is a consequence of the geometrical division of time. For example, to obtain an upper bound for $C_A$, we assume that the set of unfinished nodes stays the same as $C(\gamma 2^k)$ during the time interval $(\gamma 2^k, \gamma 2^{k+1}]$. Recall that $p(C(t))$ is a non-increasing function of time $t$.
Lemma 5 (Navidi et al. 2020, Lemma 2.2) The expected cost $C_A$ of the tree $T_A$ produced by Algorithm 1, and the cost $C^*$ of the optimal tree $T^*$ for the NGI problem, satisfy the following inequalities

We are now ready to prove our main result, Theorem 1, stated in Sect. 5. The proof relies on combining the results of Lemma 4 with the upper and lower bounds provided by Lemma 5.

Experimental evaluation
In this section, we evaluate the performance of our enhanced decision-tree algorithms by comparing them against strong baselines on a large collection of real-world datasets. Some additional experimental results are presented in the Appendix, including further experimental results for CART (Section C), further experimental results on tree size (Section D), additional statistical tests (Section E), and more visual examples (Section G). Our implementation and pre-processing scripts can be found in a GitHub repository.

Datasets. We evaluate our methods on 20 datasets from the UCI Machine Learning Repository (Dua and Graff 2017) and OpenML (Vanschoren et al. 2013). Information about the datasets is shown in Table 1. We experiment with datasets containing up to 0.6 million objects and 5 thousand features. We set the limit θ to be 0.005 for all datasets except for small ones, whose θ is set so that the minimum leaf size is 2. For all datasets, 70% of the data points are used for training, 10% for validation and the rest for testing. Numerical features are categorized into 5 bins by the k-means strategy, which can adapt to uneven data distributions. All categorical features are then binarized to avoid biases towards features with a large number of levels (Strobl et al. 2007). All identical objects are coalesced into a single object, and the sampling probability is set accordingly. To fulfill the realizability assumption, the majority class is assigned to each identical data point in the training set, which may have different classes otherwise, due to noise or feature discretization. Apart from the original datasets with unit test cost, we additionally create more challenging scenarios, where each test cost is independently drawn from the set {1, . . . , 10}.

Algorithms and baselines. A summary of the algorithms is displayed in Table 2. The algorithms that implement the proposed approach are denoted as enhanced C4.5 (EC4.5) and enhanced CART (ECART). Baselines include the following:
• The ASR method (Navidi et al. 2020), which is the greedy algorithm without the newly-introduced impurity-reduction term. • Impure Pairs (IP), which maximizes the reduction in the number of impure pairs at each split, i.e., the unweighted edge cut among different classes (Golovin et al. 2010;Cicalese et al. 2014). • BAL, which is an unsupervised balanced-tree algorithm that greedily selects the test that splits the current node most evenly. • The two traditional algorithms C4.5 and CART, and their cost-benefit versions that select a test using a cost-weighted criterion (denoted with a prefix 'C').  (Breiman et al. 1984) [p][C]C4.5 Traditional C4.5 (Quinlan 1993) [p]ECART Enhanced CART (proposed method) [p]EC4.5 Enhanced C4.5 (proposed method) Prefix 'p-' indicates a variant with post-pruning To ensure a meaningful comparison, we measure performance for all methods based on the same stopping criteria. All algorithms perform two-way splitting. Splitting of tree nodes stops if homogeneity is achieved or if the minimum-probability limit is reached. We examine the performance of C4.5 and CART with post-pruning (denoted with a prefix 'p') or without. We adopt the standard minimal cost-complexity pruning approach (Breiman et al. 1984), which prunes a tree node having many leaves if its impurity is no much larger than the total impurity of its leaves. The parameter that controls the stringency of the pruning is determined by cross-validation over a logspace from 10 −5 to 1.
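The coalescing step described under Datasets (identical objects merged into one, with a matching sampling probability and the majority class as label) can be illustrated with a small sketch. This is not the authors' implementation: the function name `coalesce`, the representation of objects as tuples of binarized features, and the tie-breaking rule (the label seen first wins) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def coalesce(features, labels):
    """Merge identical feature vectors into single weighted objects.

    Hypothetical helper: returns one (feature_tuple, majority_label,
    sampling_probability) triple per distinct feature vector.
    """
    groups = defaultdict(Counter)
    for x, y in zip(features, labels):
        groups[tuple(x)][y] += 1          # count labels per distinct object
    n = len(labels)
    merged = []
    for x, counts in groups.items():
        # Majority class; ties go to the label encountered first (assumption).
        majority_label, _ = counts.most_common(1)[0]
        probability = sum(counts.values()) / n
        merged.append((x, majority_label, probability))
    return merged

# Toy data: three copies of (0, 1) with conflicting labels, one copy of (1, 0).
X = [(0, 1), (0, 1), (0, 1), (1, 0)]
y = ["a", "a", "b", "b"]
objects = coalesce(X, y)
```

In this toy input the three copies of `(0, 1)` collapse into one object labeled with the majority class and carrying three quarters of the probability mass, so the merged set is realizable by construction.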
The only hyperparameter in our algorithm (λ) controls the trade-off between complexity and discrimination. The effect of λ is summarized in Fig. 2. For large values of λ our algorithms turn into the traditional tree-induction algorithms C4.5 and CART; on the other hand, if λ is zero, our algorithms turn into the greedy algorithm for the ASR problem. As we are working with a bi-criteria optimization problem, there is no golden rule for choosing the best value of the hyperparameter. In this experiment, we aim for a value of λ that preserves comparable accuracy while reducing the complexity as much as possible. Thus, we tune λ by starting with a large value and gradually decreasing it, stopping just before a significant drop (larger than 1%) is seen in the predictive accuracy on the validation set. Note also that λ is invariant to the data size, as the greedy score only depends on the distributions before and after the split.

Fig. 2 The effect of the trade-off hyperparameter λ. ROC AUC is on validation data, and expected tree height or tree size on training data

Results
We evaluate all methods using ROC AUC as a measure of predictive power, expected cost as a measure of tree complexity, and tree size (i.e., the number of tree nodes) as an auxiliary measure of global tree complexity. Full results on tree size are deferred to Section D of the Appendix. Reported results, shown in Fig. 5, are averages over 5 executions with random train-test splits. We conduct the Bonferroni-Dunn test with significance level α = 0.05 for average ranks (Demšar 2006), and report the critical difference diagrams in Figs. 3 and 4, where the proposed method pEC4.5 (or pECART) is tested against the other methods, and methods closer to the right end have a better rank. We see that pECART and pEC4.5 are not statistically significantly different from the respective best performer in predictive power and tree complexity, while being significantly better than most other baselines. The two methods C4.5 and CART lead to similar behavior; we focus on C4.5 below and discuss its results in detail. Full results for CART and its enhancements are presented in the Appendix, Section C.
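The λ-selection rule described earlier in this section (start with a large λ and decrease it until just before the validation score drops by more than 1%) can be sketched as a simple search. The candidate grid, the `validation_score` callback, and the reading of "1%" as an absolute drop relative to the score at the largest λ are assumptions; the text does not fix these details.

```python
def select_lambda(candidates, validation_score, max_drop=0.01):
    """Return the smallest lambda reached before the score drops too far.

    candidates: lambda values (hypothetical grid), tried from largest to smallest.
    validation_score: hypothetical callback mapping lambda -> validation score.
    max_drop: tolerated absolute drop relative to the score at the largest
              lambda (one possible reading of the 1% rule; an assumption).
    """
    ordered = sorted(candidates, reverse=True)
    reference = validation_score(ordered[0])
    chosen = ordered[0]
    for lam in ordered[1:]:
        if reference - validation_score(lam) > max_drop:
            break  # significant drop: keep the previously accepted value
        chosen = lam
    return chosen

# Toy scores standing in for a real train-and-validate loop.
toy_scores = {8.0: 0.900, 4.0: 0.898, 2.0: 0.895, 1.0: 0.870, 0.5: 0.860}
best = select_lambda(toy_scores.keys(), toy_scores.get)
```

With these toy scores the search stops at λ = 2.0, the last value before the score falls by more than 0.01.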
It can be seen that post-pruning has a noticeable positive effect on both the accuracy and the complexity of the C4.5 algorithm. However, even after post-pruning, pC4.5 is still ranked close to unpruned EC4.5 in terms of expected cost, and on some datasets in Fig. 5 the expected cost of pC4.5 is about twice as large as that of EC4.5. This is reasonable because post-pruning mainly removes tree nodes near the bottom, but fails to rescue early bad splits near the root. On the other hand, post-pruning is significantly more beneficial than impurity reduction for the global tree size. Also note that post-pruning has less effect on EC4.5 in terms of accuracy, which indicates that unpruned EC4.5 alone is robust to overfitting.
The decision tree produced by BAL is the worst in both aspects. This is expected for predictive power, as BAL is an unsupervised method, but it is quite surprising for complexity. It turns out that BAL often has to keep expanding a balanced tree until the minimum leaf size is reached, as tree nodes rarely achieve homogeneity. This behavior reinforces the argument that discriminative tests help accelerate termination and reduce expected cost. The IP algorithm achieves better performance in both aspects than the ASR algorithm. However, IP has such a strong bias towards balanced splits that it favors a random test over a discriminative one in the example we provide in Section A. This bias is also reflected in Fig. 5, where IP falls behind ECART by more than 10% accuracy on some datasets. According to the further statistical tests we conduct in Section E, the predictive power of ASR and IP is statistically indistinguishable from that of the unsupervised BAL.

Fig. 3 Critical difference for the Bonferroni-Dunn test on significance level α = 0.05 for average ranks of algorithms among 20 tested datasets. Methods closer to the right end have a better rank. The method that is compared with other methods is pEC4.5, and methods lying outside the thick interval are significantly different from pEC4.5
Finally, regarding running time, algorithm EC4.5 typically runs 3-4 times longer than C4.5, but there are instances where C4.5 constructs very skewed trees and takes more time to complete (details in Appendix, Section F).
The benefit of the proposed method becomes more pronounced in non-uniform-cost scenarios, shown in Fig. 6. It turns out that the cost-benefit versions of the traditional trees fail to reduce the expected cost, which indicates the need for more sophisticated techniques like ours to tackle non-uniform costs. Our algorithms obtain comparable predictive power, while achieving up to 90% lower expected cost than the traditional trees. We conclude that our enhancement, given in the form of a regularizer, strikes an excellent balance between predictive power and expected tree height.
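To make the complexity measure concrete, the expected cost of a tree is the probability-weighted sum, over leaves, of the test costs paid on the path from the root; with unit costs this is the expected number of tests. The sketch below uses a nested-dictionary tree layout that is an assumption made for illustration, not the data structure of the authors' code.

```python
def expected_cost(node, cost_so_far=0.0):
    """Expected total test cost to classify an object with this tree.

    Assumed layout: a leaf is {"prob": p}, where p is the probability of
    reaching it; an internal node is {"cost": c, "children": [subtree, ...]}.
    """
    if "children" not in node:
        return node["prob"] * cost_so_far
    paid = cost_so_far + node["cost"]   # every object reaching this node pays c
    return sum(expected_cost(child, paid) for child in node["children"])

# Two unit-cost trees over four equally likely leaves.
balanced = {"cost": 1, "children": [
    {"cost": 1, "children": [{"prob": 0.25}, {"prob": 0.25}]},
    {"cost": 1, "children": [{"prob": 0.25}, {"prob": 0.25}]},
]}
skewed = {"cost": 1, "children": [
    {"prob": 0.25},
    {"cost": 1, "children": [
        {"prob": 0.25},
        {"cost": 1, "children": [{"prob": 0.25}, {"prob": 0.25}]},
    ]},
]}
```

The balanced tree costs 2 tests per object, while the skewed one costs 2.25 on average, which is the kind of gap that balance-encouraging criteria aim to close.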

Conclusion
In this paper, we proposed a novel algorithm to construct a general decision tree with an asymptotically tight approximation guarantee on the expected cost under mild assumptions. The algorithm can be used to assimilate many existing standard impurity functions so as to enhance their corresponding splitting criteria with a complexity guarantee. Through empirical evaluation on various datasets and scenarios, we verified the effectiveness of our algorithm both in terms of accuracy and complexity. Potential future directions include the study of different complexity measures, further termination criteria, and incorporating a broader family of impurity functions.

Fig. 5 Performance results with unit test costs. All plots in the same row share the same x- and y-axes. Error bars are also shown

Fig. 6 Performance results with non-uniform test costs

number of objects from the two different classes in a tree node. We now discuss the first criterion. In the left tree (Fig. 7a), the root, the left, and the right node have 50 × 50, 0, and 26 × 50 heterogeneous pairs, respectively. In the right tree (Fig. 7b), there are 50 × 50, 25 × 25, and 25 × 25 pairs, respectively. The left split decreases the number of heterogeneous pairs by 50 × 50 − 0 − 26 × 50 = 24 × 50. The right split decreases the number of heterogeneous pairs by 50 × 50 − 25 × 25 − 25 × 25 = 50 × 25. Since 50 × 25 = 1250 exceeds 24 × 50 = 1200, the ASR criterion will select the (non-discriminative) split in Fig. 7b.
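The pair counts in the Fig. 7 example can be checked with a few lines of code. The helper names are hypothetical, and the size of the pure left node in Fig. 7a (24 objects) is inferred from the stated counts (50 − 26) rather than given explicitly in the text.

```python
def impure_pairs(class_a, class_b):
    """Number of pairs of objects with different classes in a two-class node."""
    return class_a * class_b

def pair_reduction(parent, left, right):
    """Decrease in heterogeneous pairs achieved by splitting parent."""
    return impure_pairs(*parent) - impure_pairs(*left) - impure_pairs(*right)

root = (50, 50)
# Fig. 7a: a pure left node (24 objects, inferred) and a 26/50 right node.
reduction_a = pair_reduction(root, (24, 0), (26, 50))
# Fig. 7b: both children keep the 50/50 class mix of the root.
reduction_b = pair_reduction(root, (25, 25), (25, 25))
```

Here `reduction_a` is 1200 and `reduction_b` is 1250, so a criterion based only on pair reduction prefers the split that leaves both children as mixed as the root.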

Appendix B Impurity reduction is non-negative
By the concavity property of h, it is easy to show that the impurity-reduction function d(S, d) is non-negative, for any tree node S and test d. In particular, we have where p_S denotes the class distribution vector of S, and Inequality (B1) is by concavity.
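For reference, a standard way to obtain a bound of this kind is Jensen's inequality. The sketch below assumes a common form of impurity reduction, namely the impurity of the node minus the probability-weighted impurity of its children. The symbols Δ, S_i, and w_i are introduced here for illustration only and may differ from the paper's own definition and from Inequality (B1).

```latex
% Assumed notation: test d splits S into children S_1, ..., S_k,
% with w_i = \Pr(S_i \mid S), so that p_S = \sum_i w_i \, p_{S_i}.
\begin{align*}
\Delta(S, d)
  &= h(p_S) - \sum_{i=1}^{k} w_i \, h(p_{S_i}) \\
  &= h\Big(\sum_{i=1}^{k} w_i \, p_{S_i}\Big) - \sum_{i=1}^{k} w_i \, h(p_{S_i}) \\
  &\ge 0 \qquad \text{(Jensen's inequality for concave } h\text{)}.
\end{align*}
```

Under this assumed form, equality holds when all children have the same class distribution as S, which matches the intuition that a non-discriminative test yields no impurity reduction.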

Appendix C Further experimental results for CART
See Figs. 8 and 9.
Fig. 7 A simple example demonstrating that ASR split criteria may lead to non-discriminative decision trees

Fig. 8 Performance results with unit test costs. All plots in the same row share the same x- and y-axes. Error bars are also shown

Appendix F Running time
Note that results for algorithm BAL are missing for some datasets because its running time is too long (Fig. 14).
We also report the running time for two selected datasets with unit costs, one with a large number of data objects and one with a large number of features, to explore the impact of the number of objects and the number of features on the running time. In general, as reflected in the worst-case time complexity O(Hmn), the algorithms complete their computation quickly in the case of a large number of data objects (n), or a large number of features (m), but not both. Furthermore, the dependency on n is slightly worse than on m, as the tree height H typically has a logarithmic dependence on n (Fig. 15).

Appendix G Visual examples of real-life datasets
We visualize datasets that have meaningful features and whose trees are small enough to fit in the paper. We also set the minimum leaf size to 1% of the data size to produce smaller trees.

G.1 Visualization of decision trees for shoppers dataset
See Figs. 16, 17 and 18.