End-to-End Learning of Decision Trees and Forests
Abstract
Conventional decision trees have a number of favorable properties, including a small computational footprint, interpretability, and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable. Kontschieder et al. (ICCV, 2015) have addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here present an end-to-end learning scheme for deterministic decision trees and decision forests. Thanks to a new model and expectation–maximization training scheme, the trees are fully probabilistic at train time, but after an annealing process become deterministic at test time. In experiments we explore the effect of annealing visually and quantitatively, and find that our method performs on par with or superior to standard learning algorithms for oblique decision trees and forests. We further demonstrate on image datasets that our approach can learn more complex split functions than common oblique ones, and facilitates interpretability through spatial regularization.
Keywords
Decision forests · End-to-end learning · Efficient inference · Interpretability
1 Introduction
Neural networks are currently the dominant classifier in computer vision (Russakovsky et al. 2015; Cordts et al. 2016), whereas decision trees and decision forests have proven their worth when training data or computational resources are scarce (Barros et al. 2012; Criminisi and Shotton 2013). One can observe that both neural networks and decision trees are composed of basic computational units, the perceptrons and nodes, respectively. A crucial difference between the two is that in a standard neural network, all units are evaluated for every input, while in a reasonably balanced decision tree with I inner split nodes, only \(\mathcal {O}(\log I)\) split nodes are visited. That is, in a decision tree, a sample is routed along a single path from the root to a leaf, with the path conditioned on the sample’s features. Various works are now exploring the relation between both classification approaches (Ioannou et al. 2016; Wang et al. 2017), such as the Deep Neural Decision Forests (DNDFs) (Kontschieder et al. 2015). Similar to deep neural networks, DNDFs require evaluating all computational paths to all leaf nodes in a tree for each test sample, which results in high accuracy, but incurs large computational and memory costs especially as trees grow deeper.
Our work proposes an orthogonal approach. We seek to stick to traditional decision trees and forests as inference models for their advantages, while improving the learning of such trees through end-to-end training with backpropagation, one of the hallmarks of neural networks. It is efficiency, induced by the sparsity of the sample-dependent computational graph, that piques our interest in decision trees. Further, we also hope to profit from their relative interpretability. End-to-end training allows optimizing all levels of a decision tree jointly. Furthermore, features can now be learned jointly, not only through linear nodes but also through more complex split functions such as small convolutional neural networks (CNNs). This is a feature that has so far been missing in deterministic decision trees, which are usually constructed greedily without subsequent tuning. We propose a mechanism to remedy this deficit.
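The sample-dependent sparsity can be made concrete with a minimal sketch. Routing a sample through a balanced binary tree touches only one split per level; the node layout and names below are illustrative, not the paper's implementation.

```python
# Minimal sketch of deterministic routing in a binary tree stored with an
# implicit indexing scheme: node i has children 2*i + 1 (left) and
# 2*i + 2 (right). All names here are illustrative.

def route(x, splits, depth):
    """Route sample x down the tree; returns (leaf_index, splits_evaluated)."""
    node, evaluated = 0, 0
    for _ in range(depth):
        w, b = splits[node]  # oblique split parameters (weights, bias)
        go_right = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        node = 2 * node + (2 if go_right else 1)
        evaluated += 1
    return node, evaluated

# A balanced tree of depth d has I = 2**d - 1 split nodes, but a sample
# only ever evaluates d = log2(I + 1) of them.
```

For example, with depth 2 a sample evaluates 2 of the 3 split nodes, illustrating the \(\mathcal {O}(\log I)\) cost of deterministic inference.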
1.1 Related Work
Random forests are ensembles of decision trees, and were introduced by Breiman (2001). In this section, we review use cases for trees and forests, choices for split functions, different optimization strategies, and the connection to deep neural networks.
Applications While neural networks have these days superseded all other approaches in terms of achievable accuracy on many benchmarks (Cardona et al. 2010; Lin et al. 2014; Russakovsky et al. 2015; Cordts et al. 2016), state-of-the-art networks are not easy to interpret, are fairly hungry for training data, often require weeks of GPU training and have a computational and memory footprint that limits their use on small embedded devices. Decision trees and decision tree ensembles, such as random forests, generally achieve lower accuracy on large datasets, but are fundamentally more frugal. They have shown their effectiveness on a variety of classification tasks (Barros et al. 2012; Fernández-Delgado et al. 2014) and have also found wide application in computer vision, e.g. Viola and Jones (2001), Shotton et al. (2011), Criminisi and Shotton (2013), Dollár et al. (2014), Zhang et al. (2017), Cordts et al. (2017). They are well suited for tasks where computational resources are limited, e.g. real-time human pose estimation in the Microsoft Kinect (Shotton et al. 2011), or where few and unbalanced training samples are available, e.g. during online object tracking (Zhang et al. 2017).
Decision trees are also found in expert systems, since they are widely recognized as interpretable models which divide a complex classification task into several simpler ones. For instance, Worachartcheewan et al. (2010), Pinhas-Hamiel et al. (2013) and Huang et al. (2015) use their interpretability for diabetes research, and Guh et al. (2011) interpret their outcome prediction of in vitro fertilization. Likewise, De Ville (2006) explains the application of decision trees in business analytics.
Split functions There have been several attempts to train decision trees with more complex split functions. Menze et al. (2011) have benchmarked oblique random forests on various binary classification problems. These oblique random forests used linear and non-linear classifiers at each split in the decision trees and thereby combined more than one feature at a time. Montillo et al. (2013) have successfully approximated the information gain criterion using a sigmoid function and a smoothness hyperparameter. Expanding these ideas, Laptev and Buhmann (2014) have trained small convolutional neural networks (CNNs) at each split in a decision tree to perform binary segmentation. Rota Bulo and Kontschieder (2014) also apply a greedy strategy to learn neural networks for each split node and hence learn the structure of the tree. Notably, these approaches use gradient optimization techniques, but lack joint optimization of an entire tree, i.e. end-to-end learning of the entire model.
Optimization Hyafil and Rivest (1976) have shown that the problem of finding an optimal decision tree is NP-complete. As a consequence, the common approach is to find axis-aligned splits by exhaustive search, and learn a decision tree with a greedy level-by-level training procedure as proposed by Breiman et al. (1984). In order to improve their performance, it is common practice to engineer split features for a specific task (Lepetit et al. 2005; Gall and Lempitsky 2009; Kontschieder et al. 2013; Cordts et al. 2017). Evolutionary algorithms are another group of optimization methods which can potentially escape local optima, but they are computationally expensive and heuristic, and require tuning many parameters (cf. Barros et al. (2012) for a survey).
Norouzi et al. (2015b) propose an algorithm for optimization of an entire tree with a given structure. They show a connection between optimizing oblique splits and structured prediction with latent variables. As a result, they formulate a convex–concave upper bound on the tree’s empirical loss. The same upper bound is used to find an initial tree structure in a greedy algorithm. Their method is restricted to linear splits and relies on the kernel trick to introduce higher order split features. Alternating Decision Forests (Schulter et al. 2013) instead include a global loss when growing the trees, thereby optimizing the whole forest jointly.
Some works have already explored gradient-based optimization of a full decision tree model. While Suárez and Lutsko (1999) focused on a fuzzy approach to decision trees, Jordan (1994) introduced hierarchical mixtures of experts. In the latter model, the predictions of expert classifiers are weighted based on conditional path probabilities in a fully probabilistic tree.
Kontschieder et al. (2015) make use of gradient-based decision tree learning to learn a deep CNN and use it as a feature extractor for an entire ensemble of decision trees. They use sigmoid functions to model the probabilistic routes and employ a log-likelihood objective for training. However, their inference model is unlike a standard tree as it stays fuzzy or probabilistic after training. When predicting new samples, all leaves and paths need to be evaluated for every sample, which subverts the computational benefits of trees. Furthermore, they consider only balanced trees, so the number of evaluated split functions at test time grows exponentially with increased tree depth.
Connections to deep neural networks Various works explore the connections between neural networks and traditional decision tree ensembles. Sethi (1990) and Welbl (2014) cast decision tree ensembles to neural networks, which enables gradient descent training. As long as the structure of the trees is preserved, the optimized parameters of the neural network can also be mapped back to the decision forest. Subsequently, Richmond et al. (2016) map stacked decision forests to CNNs and find an approximate mapping back. Frosst and Hinton (2017) focus on using the learned predictions of neural networks as a training target for probabilistic decision trees.
A related research direction is to learn conditional computations in deep neural networks. In Ioannou et al. (2016), Bolukbasi et al. (2017), McGill and Perona (2017), Wang et al. (2018), Huang et al. (2018), several models of neural networks with separate, conditional data flows are discussed. Still, the structure of the resulting inference models is fixed a priori.
1.2 Contributions

We propose to learn deterministic decision trees and forests in an end-to-end fashion. Unlike related end-to-end approaches (Kontschieder et al. 2015), we obtain trees with deterministic nodes at test time. This results in efficient inference as each sample is only routed along one unique path of only \(\mathcal {O}(\log I)\) out of the I inner nodes in a tree. To reduce variance, we can also combine multiple trees in a decision forest ensemble. Furthermore, an end-to-end trainable tree can provide interpretable classifiers on learned visual features, similar to how decision trees are used in financial or medical expert systems on handcrafted features. In this context, we show the benefit of regularizing the spatial derivatives of learned features when samples are images or image patches.

To enable end-to-end training of a decision tree, we propose to use differentiable probabilistic nodes at train time only. We develop a new probabilistic split criterion that generalizes the long-established information gain (Quinlan 1990). A key aspect of this new tree formulation is the introduction of a steepness parameter for the decision (Montillo et al. 2013). The proposed criterion is asymptotically identical to information gain in the limit of very steep non-linearities, but allows better modeling of class overlap in the vicinity of a split decision boundary.

A matching optimization procedure is proposed. During training, the probabilistic trees are optimized using the Expectation–Maximization algorithm (Jordan and Jacobs 1994). Importantly, the steepness parameter is incrementally adjusted in an annealing scheme to make decisions ever more deterministic, and to bias the model towards crispness. The proposed procedure also constructs the decision trees level by level, hence trees will not grow branches any further than necessary. Compared to initialization with balanced trees (Kontschieder et al. 2015), our approach reduces the expected depth of the tree, which further improves efficiency.
2 Methods
Consider a classification problem with input space \(\mathcal {X} \subset \mathbb {R}^p\) and output space \(\mathcal {Y} = \{1,\ldots ,K\}\). The training set is defined as \(\{\varvec{x}_1,\ldots ,\varvec{x}_N\} = \mathcal {X}_t \subset \mathcal {X}\) with corresponding classes \(\{y_1,\ldots ,y_N\} = \mathcal {Y}_t \subset \mathcal {Y}\).
2.1 Standard Decision Tree and Notation
In binary decision trees (Fig. 1c), split functions \(s : \mathbb {R} \rightarrow [0,1]\) determine the routing of a sample through the tree, conditioned on that sample’s features. The split function controls whether the splits are deterministic or probabilistic. The prediction is made by the leaf node that is reached by the sample.
Split nodes Each split node \(i \in \{1,\ldots ,I\}\) computes a split feature from a sample, and sends that feature into a split function. That feature is computed by a map \(f_{\varvec{\beta }_i} : \mathbb {R}^p \rightarrow \mathbb {R}\) parametrized by \(\varvec{\beta }_i\). For example, oblique splits are a linear combination of the input, i.e. \(f_{\varvec{\beta }_i}(\varvec{x}) = (\varvec{x}^T, 1) \cdot \varvec{\beta }_i \) with \(\varvec{\beta }_i \in \mathbb {R}^{p+1}\). Similarly, an axis-aligned split perpendicular to axis a is represented by an oblique split whose only non-zero parameters are at indices a and \(p+1\).
Leaf nodes Each leaf \(\ell \in \{1,\ldots ,L\}\) stores the parameters of a categorical distribution over classes \(k \in \{1,\ldots ,K\}\) in a vector \(\varvec{\pi }_\ell \in [0,1]^K\). These vectors are normalized such that the probability of all classes in a leaf sum to \(\sum _{k=1}^K (\varvec{\pi }_\ell )_k = 1\). We define \(\varvec{\theta }_{\varvec{\pi }} = (\varvec{\pi }_1, \ldots ,\varvec{\pi }_L)\) to include all leaf parameters in the tree.
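The notation can be illustrated with a small sketch (pure Python, hypothetical values; the names `split_feature` and `pi_leaf` are ours, not the paper's):

```python
# Oblique split feature: f_beta(x) = (x^T, 1) . beta, with beta in R^(p+1).
def split_feature(x, beta):
    return sum(xi * bi for xi, bi in zip(list(x) + [1.0], beta))

# An axis-aligned split perpendicular to axis a is the special case where
# only entries a (the selected feature) and p (the bias) are non-zero.
p, a = 3, 1
beta_axis = [0.0] * (p + 1)
beta_axis[a] = 1.0     # select feature a
beta_axis[p] = -0.5    # threshold 0.5, encoded as a bias

# Leaf parameters: a normalized categorical distribution over K classes.
counts = [2.0, 1.0, 1.0]
pi_leaf = [c / sum(counts) for c in counts]
```

With this encoding, `split_feature([0.2, 0.9, 0.1], beta_axis)` evaluates the axis-aligned test \(x_a > 0.5\) in oblique form, and the entries of `pi_leaf` sum to one as required.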
In standard deterministic decision trees as proposed in Breiman et al. (1984), the split function is a step function \(s(x) = \varTheta (x)\) with \(\varTheta (x) = 1\) if \(x > 0\) and \(\varTheta (x) = 0\) otherwise.
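The probabilistic relaxation used at train time smooths this step function; a sketch assuming a logistic sigmoid scaled by the steepness parameter \(\gamma \) (the paper's exact parameterization may differ):

```python
import math

def step(x):
    """Deterministic split: Theta(x) = 1 if x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

def soft_split(x, gamma):
    """Probabilistic split s(x) = sigmoid(gamma * x); approaches the
    step function Theta(x) as the steepness gamma grows."""
    return 1.0 / (1.0 + math.exp(-gamma * x))
```

For a fixed split feature value, steeper \(\gamma \) makes the routing probability ever more deterministic, which is the basis of the annealing scheme described later.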
2.2 Probabilistic Decision Tree
2.3 Expectation–Maximization
In summary, each iteration of the algorithm requires evaluation of Eqs. 10 and 13, as well as at least one update of the split parameters based on Eq. 14. This iterative algorithm can be applied to a binary decision tree of any given structure.
2.4 Complex Splits and Spatial Regularization
The proposed optimization procedure only requires the split features f to be differentiable with respect to the split parameters. As a result, it is possible to implement more complex splits than axis-aligned or oblique splits. For example, it is possible to use a small convolutional neural network (CNN) as split feature extractor for f and learn its parameters (Sect. 3.4).
2.5 Decision Tree Construction
The previous sections outlined how to fit a decision tree to training data, given a fixed tree topology (parameter learning). In addition to this Finetune algorithm for deterministic decision trees, we propose a Greedy algorithm that constructs a tree by successively splitting nodes and optimizing them on subsets of the training data.
1. Initialize the decision tree with a single candidate node as the tree root.
2. Split the training data into subsets. Starting from the root node with the entire training dataset, the data is successively decomposed using deterministic routing. As a result, non-overlapping subsets are assigned to the candidate nodes.
3. For each candidate node:
(a) If the training data subset of the candidate node is pure or the maximum number of attempts has been reached, skip steps 3b to 3d for this node and fix it as a leaf node.
(b) Replace the node with a new tree stump, i.e. one split node and two leaf nodes.
(c) Optimize only the tree stump using the Finetune algorithm (see Sect. 2.3) on the assigned training data subset for the specified number of epochs.
(d) If training the stump failed, try training a new stump by repeating from step 3a.
4. Find leaf node candidates that may be split according to the specified tree limits.
5. If candidate nodes are found, repeat from step 2. Otherwise, stop; the decision tree construction is finished.
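The construction loop above can be sketched recursively as follows. This is a simplified outline under our own naming: the actual algorithm runs the EM-based Finetune step on each stump, counts failed attempts, and routes data level by level rather than recursing.

```python
def is_pure(labels):
    return len(set(labels)) <= 1

def grow(data, labels, depth, max_depth, fit_stump):
    """Recursively grow a tree. fit_stump returns a boolean split function
    (or None on failure) trained on the node's data subset."""
    if depth == max_depth or is_pure(labels):
        return {"leaf": True, "labels": labels}
    split = fit_stump(data, labels)
    if split is None:                      # training failed: keep as leaf
        return {"leaf": True, "labels": labels}
    left = [i for i, x in enumerate(data) if not split(x)]
    right = [i for i in range(len(data)) if i not in left]
    return {
        "leaf": False,
        "split": split,
        "children": [
            grow([data[i] for i in left], [labels[i] for i in left],
                 depth + 1, max_depth, fit_stump),
            grow([data[i] for i in right], [labels[i] for i in right],
                 depth + 1, max_depth, fit_stump),
        ],
    }
```

Because pure subsets immediately become leaves, the tree grows no deeper than the data requires, mirroring the efficiency argument made in Sect. 1.2.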
During training of a tree stump, only one split node and two leaf nodes are optimized. As a result, the log-likelihood objective (Eq. 4) then resembles an approximation of the widely used information gain criterion (Quinlan 1990, 1993) (Sect. 2.6).
After this greedy structure learning, the nodes in the entire resulting tree can be finetuned jointly as described in Sect. 2.3, this time with probabilistic routing of all training data.
2.6 Relation to Information Gain and Leaf Entropies
We now show that maximization of the loglikelihood of the probabilistic decision tree model approximately minimizes the weighted entropies in the leaves. The steeper the splits become, the better the approximation.
In conclusion, we have shown that for \(\gamma \rightarrow \infty \), maximizing the log-likelihood objective minimizes a weighted sum of leaf entropies. For the special case of a single split with two leaves, this is the same as maximizing the information gain. Consequently, the log-likelihood objective (Eq. 4) can be regarded as a generalization of the information gain criterion (Quinlan 1990) to an entire tree.
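A small numeric check of this relationship under deterministic (hard) routing: with leaf distributions set to the empirical class frequencies, the total log-likelihood equals the negative weighted sum of leaf entropies. This is an illustrative sketch with a hypothetical leaf assignment; Eq. 4 itself is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical hard assignment of training labels to two leaves.
leaves = [[0, 0, 0, 1], [1, 1, 0, 1, 1]]

# Log-likelihood with MLE leaf distributions pi_{l,k} = n_{l,k} / n_l ...
ll = 0.0
for labels in leaves:
    counts = Counter(labels)
    for y in labels:
        ll += math.log(counts[y] / len(labels))

# ... equals minus the weighted sum of leaf entropies, -sum_l n_l * H_l.
neg_weighted_entropy = 0.0
for labels in leaves:
    counts = Counter(labels)
    n = len(labels)
    neg_weighted_entropy += sum(c * math.log(c / n) for c in counts.values())
```

Since both expressions reduce to \(\sum _\ell \sum _k n_{\ell k} \log (n_{\ell k} / n_\ell )\), they agree exactly, which is the hard-routing limit of the statement above.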
2.7 Decision Forest
Following the ideas introduced by Breiman (2001), we combine decision trees into a decision forest. Specifically, each decision tree is constructed with our Greedy algorithm on the full dataset. Afterwards, using our Finetune algorithm, each tree is optimized end-to-end. Note that the result is a decision forest rather than a random forest, since each tree is trained independently on all training data rather than on random subsets.
In order to reduce correlation between the decision tree predictions, we train each split function only on a subset of the available features. For each split, this feature subset is sampled from a uniform distribution or, in the case of 2D images, consists of a connected 2D patch of the image.
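Feature subsampling for one split might look like the following sketch. The function names and the row-major flattening convention for images are assumptions for illustration:

```python
import random

def sample_feature_subset(p, m, rng=random):
    """Uniformly sample m of p feature indices for one split."""
    return rng.sample(range(p), m)

def sample_patch_indices(height, width, patch, rng=random):
    """Sample the indices of a connected patch x patch region from a
    row-major flattened image of shape (height, width)."""
    r0 = rng.randrange(height - patch + 1)
    c0 = rng.randrange(width - patch + 1)
    return [(r0 + r) * width + (c0 + c)
            for r in range(patch) for c in range(patch)]
```

Each split then only sees the sampled coordinates of the input, which decorrelates the trees in the ensemble.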
3 Experiments
We conduct experiments on data from various domains. For quantitative comparison of our end-to-end learned oblique decision trees (E2EDT), we evaluate the performance on the multivariate but unstructured datasets used in Norouzi et al. (2015b) (Sect. 3.1). In order to understand the learning process of the probabilistic training and deterministic inference model, we visually examine the models on an image segmentation dataset (Sect. 3.2). Next, we show that the proposed algorithm can learn meaningful spatial features on MNIST, Fashion-MNIST and ISBI, as has previously been demonstrated in neural networks but not in decision trees (Sect. 3.3). We also demonstrate that a deterministic decision tree with complex split nodes can be trained end-to-end, by using a small neural network in each split node (Sect. 3.4). Further, we quantitatively evaluate the effect of the steepness annealing in an end-to-end learned decision forest (E2EDF, Sect. 3.5) and compare the tradeoff between computational load and accuracy to state-of-the-art decision forests (Sect. 3.6).
3.1 Performance of Oblique Decision Trees
We compare the performance of our algorithm in terms of accuracy to all results reported in Norouzi et al. (2015b). In order to provide a fair comparison, we refrain from using pruning, ensembles and regularization.
Datasets Norouzi et al. (2015b) report results on the following four datasets: MNIST (LeCun et al. 1998), SensIT (Duarte and Hu 2004), Connect4 (Dua and Graff 2017) and Protein (Wang 2002). The multiclass classification datasets are obtained from the LIBSVM repository (Fan and Lin 2011). When a separate test set is not provided, we randomly split the data into a training set with 80% of the data and use 20% for testing. Likewise, when no validation set is provided, we randomly extract 20% of the training set as validation set.
Compared algorithms We compare algorithms that use a deterministic decision tree for prediction, with either oblique or axis-aligned splits. The following baselines were evaluated in Norouzi et al. (2015b): Axis-aligned: conventional axis-aligned splits based on information gain; OC1: oblique splits optimized with coordinate descent as proposed in Murthy (1996); Random: selects the best of randomly generated oblique splits based on information gain; CO2: greedy oblique tree algorithm based on structured learning (Norouzi et al. 2015a); Non-greedy: non-greedy oblique decision tree algorithm based on structured learning (Norouzi et al. 2015b).
We compare the results of these algorithms with two variants of our proposed method. Here, Greedy E2EDT denotes a greedy initialization where each oblique split is computed using the EM optimization. For each depth, we apply the Finetune E2EDT algorithm to the tree obtained from the Greedy E2EDT algorithm at that depth. In the following we refer to them as Greedy and Finetune.
Hyperparameters and initialization We keep all hyperparameters fixed and conduct a grid search over the number of training epochs in \(\{20, 35, 50, 65\}\), using a train/validation split. The test data is only used to report the final performance.
The split steepness hyperparameter is set to \(\gamma = 1.0\) initially and increased by 0.1 after each epoch (one epoch consists of the split parameter \(\varvec{\theta }_{\varvec{\beta }}\) updates of all training batches as well as the update of the leaf predictions \(\varvec{\theta }_{\varvec{\pi }}\)). Initial split directions are sampled from the unit sphere and the categorical leaf predictions are initialized uniformly.
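The schedule described above is linear in the epoch index; a one-line sketch:

```python
def steepness(epoch, gamma0=1.0, delta=0.1):
    """Linear annealing schedule: gamma starts at gamma0 and grows by
    delta after every completed epoch."""
    return gamma0 + delta * epoch
```

So after 10 epochs the splits are already twice as steep as at initialization, and the training model gradually approaches the deterministic inference model.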
Results Figure 2 shows the test and training accuracy of the different decision tree learning algorithms. The accuracy of a classifier is defined as the ratio of correctly classified samples in the respective set. It was evaluated for a single tree at various maximum depths. The red solid lines show the results of our proposed algorithm; the dashed lines represent results reported by Norouzi et al. (2015b).
Our algorithms achieve higher test accuracy than previous work, especially in extremely shallow trees. The highest increase in test accuracy is observed on the MNIST data set. Here, we significantly outperform previous approaches for oblique decision trees at all depths. In particular, an oblique decision tree of depth 4 is already sufficient to surpass all competitors.
On SensIT and Protein we perform better than or on par with the Non-greedy approach proposed in Norouzi et al. (2015b). Note that further hyperparameter tuning may reduce overfitting, e.g. on the Protein dataset, and thus the results may improve. We did not include this here, as we aimed to provide a fair comparison and show the performance given very little parameter tuning.
In conclusion, our proposed (E2EDT) algorithm is able to learn more accurate deterministic oblique decision trees than the previous approaches.
3.2 Visual Convergence of Training and Inference Model
During training, we gradually steer the probabilistic training model towards a deterministic model by increasing the steepness \(\gamma \). We now visually examine the difference between the probabilistic training model and the deterministic inference model. For this purpose, we train an oblique decision tree for a binary image segmentation task on the ISBI challenge dataset (Cardona et al. 2010). This challenging image segmentation benchmark comprises serial section Transmission Electron Microscopy images (Fig. 3a) and binary annotations of neurons and membranes (Fig. 3e). For every pixel, we take a 9\(\times \)9 window around the current pixel as input features to an oblique decision tree. Consequently, the learned parameters at each split node can be regarded as a spatial kernel. We initialize a balanced oblique decision tree of depth 6 and use the Finetune algorithm to optimize the entire tree. We use the default steepness increase of \(\varDelta \gamma = 0.1\) per epoch.
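Extracting one flattened window per pixel can be sketched in pure Python. The zero padding at the image border is an assumption for illustration; the paper does not specify its border handling.

```python
def pixel_windows(image, k=9):
    """Return one flattened k x k window per pixel of a 2D image given as
    a list of rows; pixels outside the image are treated as 0.0 (this
    padding strategy is an assumption, not from the paper)."""
    r = k // 2
    H, W = len(image), len(image[0])

    def px(i, j):
        return image[i][j] if 0 <= i < H and 0 <= j < W else 0.0

    return [[px(i + di, j + dj) for di in range(-r, r + 1)
                                for dj in range(-r, r + 1)]
            for i in range(H) for j in range(W)]
```

Each row of the result is the feature vector of one pixel, so an oblique split over it acts as a spatial kernel, which is what makes the learned parameters visualizable as images.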
Results Figure 3a shows a sample image of the input and Fig. 3e the corresponding ground-truth labels. Figure 3b–d illustrate the posterior probability (Eq. 2) predicted by the probabilistic training model at different training stages. The posterior probabilities of the corresponding inference models are shown below, in Fig. 3f–h. The visualization of the prediction shows pixels more likely to be of class “membrane” with darker color.
3.3 Interpretation of Spatially Regularized Parameters
We now investigate the effects of spatial regularization (Sect. 2.4) on the parameters of oblique decision trees learned with our algorithm. Recall that regularization penalizes differences in adjacent parameters. For this purpose, we train oblique decision trees on the MNIST digit dataset (LeCun et al. 1998), the Fashion-MNIST fashion product dataset (Xiao et al. 2017) and the ISBI image segmentation dataset (Cardona et al. 2010). For MNIST and Fashion-MNIST, the training samples are \(28\times 28\) images. For the segmentation task on ISBI, a sliding window of size \(31 \times 31\) is used as input features for each pixel in the center of the window.
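A minimal sketch of such a smoothness penalty, written as a squared finite-difference term over the weight image of one oblique split (the exact regularizer in Sect. 2.4 may differ):

```python
def spatial_penalty(beta, height, width):
    """Sum of squared differences between horizontally and vertically
    adjacent entries of an oblique split's weight image; the trailing
    bias entry of beta is excluded."""
    w = [beta[i * width:(i + 1) * width] for i in range(height)]
    pen = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                pen += (w[i][j] - w[i + 1][j]) ** 2
            if j + 1 < width:
                pen += (w[i][j] - w[i][j + 1]) ** 2
    return pen
```

A constant weight image incurs zero penalty while a noisy checkerboard is penalized heavily, which is exactly the pressure towards the smooth, interpretable parameter patterns shown in Fig. 4.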
Results In Fig. 4 we visualize selected parameters of the oblique splits at various depths, with and without regularization. The learned parameter vectors are reshaped to the respective training image dimensions and linearly normalized to the full grayscale range. In both cases, we select parameter vectors that display interesting visible structures.
The parameters without regularization appear very noisy. In contrast, with regularization the algorithm learns smoother parameter patterns, without decreasing the accuracy of the decision trees. The patterns learned on MNIST show visible sigmoidal shapes and even recognizable digits. On the Fashion-MNIST dataset, the regularized parameters display the silhouettes of coats, pants and sneakers. Likewise, our algorithm is able to learn the structures of membranes on the real-world biological electron microscopy images from the ISBI dataset.
3.4 CNN Split Features
We test the effectiveness of CNNs as split features in a decision tree on MNIST. At each split, we trained a very simple CNN of the following architecture: Convolution \(5\times 5\) kernel @ 3 output channels \(\rightarrow \) Max Pool \(2\times 2\) \(\rightarrow \) ReLU \(\rightarrow \) Convolution \(5\times 5\) @ 6 \(\rightarrow \) Max Pool \(2\times 2\) \(\rightarrow \) ReLU \(\rightarrow \) Fully connected layer \(96\times 50\) \(\rightarrow \) ReLU \(\rightarrow \) Fully connected layer \(50\times 1\). The final scalar output is the split feature, which is the input to the split function.
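The 96-dimensional input to the first fully connected layer follows from the shape arithmetic of this architecture on \(28\times 28\) MNIST inputs, assuming valid convolutions with stride 1 and non-overlapping pooling. A quick check in pure arithmetic:

```python
def conv_out(size, kernel):
    """Output side length of a valid convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, window):
    """Output side length of non-overlapping max pooling."""
    return size // window

s = 28
s = pool_out(conv_out(s, 5), 2)   # conv 5x5 -> 24, pool 2x2 -> 12
s = pool_out(conv_out(s, 5), 2)   # conv 5x5 -> 8,  pool 2x2 -> 4
flattened = s * s * 6             # 6 channels after the second convolution
```

The flattened size comes out to \(4 \times 4 \times 6 = 96\), matching the \(96\times 50\) fully connected layer.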
Again, we train greedily to initialize the tree, but here we split nodes in a best-first manner, based on highest information gain. As a result, the trees can be fairly unbalanced, and some leaves may remain impure. We choose to stop at a maximum of 10 leaves, as we aim to increase interpretability and efficiency by having one expert leaf per class.
Results In this setting, a single decision tree achieves a test accuracy of \(98.2\% \pm 0.3\%\) with deterministic evaluation of nodes. For comparison, a standard random forest ensemble with 100 trees only reaches \(96.79\% \pm 0.07\%\).
Such decision tree models provide interesting benefits in interpretability and efficiency, which are the main advantages of decision trees. When a sample is misclassified, it is straightforward to find the split node that is responsible for the error. This offers interpretability as well as the possibility to improve the overall model. Other methods, such as One-vs-One or One-vs-Rest multiclass approaches, provide similar interpretability, but at a much higher cost at test time. This is due to the fact that in a binary decision tree with K leaves, i.e. a leaf for each class, it is sufficient to evaluate \(\mathcal {O}(\log K)\) split nodes. In One-vs-One and One-vs-Rest it is necessary to evaluate \(K(K-1)/2\) and K different classifiers at test time, respectively.
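For \(K = 10\) classes these evaluation counts work out as follows (a balanced binary tree with one leaf per class is assumed for the tree case):

```python
import math

K = 10
tree_evals = math.ceil(math.log2(K))   # path length in a balanced tree
ovo_evals = K * (K - 1) // 2           # One-vs-One pairwise classifiers
ovr_evals = K                          # One-vs-Rest classifiers
```

That is 4 split evaluations per sample for the tree versus 45 and 10 classifier evaluations for One-vs-One and One-vs-Rest, respectively.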
3.5 Steepness Annealing Analysis
In Sect. 2, we motivate the introduction and annealing of the steepness hyperparameter based on two observations. Firstly, steeper decisions, although hindering end-to-end learning, reflect our final inference model more closely. Secondly, in the limit of steep decisions, our learning objective approximates the information gain (see Sect. 2.6), which is well established for decision tree learning.
In this experiment, we investigate the effectiveness of annealing the steepness hyperparameter. For this purpose, we train decision forest ensembles of oblique deterministic decision trees (see Sect. 2.7). We use different annealing schemes for the steepness to study the impact on the performance. The steepness is always initialized as \(\gamma = 1.0\), and \(\varDelta \gamma > 0\) denotes the enforced increase in steepness after each epoch. Thus, \(\varDelta \gamma = 0\) effectively ignores the hyperparameter as it will stay constant during training. We perform this comparison for three different settings of the number of epochs (15, 30, 45). This means that during the Greedy tree construction each split is trained for exactly this number of epochs. Afterwards, each tree is optimized end-to-end based on our Finetune algorithm for three times as many epochs as during the construction phase (e.g. 30 epochs Greedy and 90 epochs Finetune training). This choice is motivated by validation experiments which showed the importance of the Finetune algorithm in the decision forest, and it does not affect the comparison of different \(\varDelta \gamma \).
Datasets We follow the procedure described in Sect. 5.1 of Kontschieder et al. (2015), and use the same datasets, number of features, and number of trees as they do. These datasets are Letter (Frey and Slate 1991), USPS (Hull 1994) and MNIST (LeCun et al. 1998). The features are randomly chosen for each split separately. For completeness, details on the dataset and specific settings are listed in Table 2 in the Appendix.
Comparison of the validation accuracy of our endtoend learned deterministic decision forests for different values of the gradual steepness increase \(\varDelta \gamma \) on various datasets (see Appendix, Table 2)
Dataset  \(\varDelta \gamma \)  15 Epochs  30 Epochs  45 Epochs 

Letter  0.0  76.7  81.1  84.7 
Letter  0.01  81.8  89.2  92.5 
Letter  0.1  92.6  94.5  95.5 
USPS  0.0  88.5  92.3  94.1 
USPS  0.01  91.5  95.3  96.3 
USPS  0.1  96.1  96.6  96.7 
MNIST  0.0  97.9  98.1  98.1 
MNIST  0.01  97.9  98.1  98.1 
MNIST  0.1  97.7  97.8  97.4 
3.6 Tradeoff Between Computational Load and Accuracy
Due to the conditional data flow, deterministic decision forests only evaluate a fraction of the entire model during prediction and thus require significantly fewer computations than a probabilistic forest model. We now quantify the tradeoff in computational load and accuracy of our end-to-end learned deterministic decision forests compared to the state-of-the-art probabilistic shallow Neural Decision Forests (sNDF) by Kontschieder et al. (2015). For this purpose, we evaluate our decision forest (E2EDF) on the same datasets which were used in their evaluation (see Sect. 3.5). Both models, E2EDF and sNDF, are based on oblique splits, and we use the same maximum depth per tree and the same number of trees in a forest as the sNDF.
We additionally compare our results to other deterministic tree ensemble methods: the standard random forest (RF), boosted trees (BT) and alternating decision forests (ADF). The corresponding results were reported by Schulter et al. (2013) and are always based on 100 trees in the ensemble with maximum depth of either 10, 15 or 25. Since their work only lists ranges of explored parameter settings, we will base the estimated computational load (i.e. number of split evaluations) on the most favorable parameter settings.
Since BT, RF and ADF are in practice limited to linear split functions, we restrict our E2EDF models to oblique splits as well in this comparison. To train our E2EDF models, we use our default steepness increase of \(\varDelta \gamma = 0.1\) per epoch. On USPS as well as Letter, the models are trained for 45 epochs, whereas on MNIST, training is done only for 15 epochs due to the larger amount of training data. Note that, as in Sect. 3.5, we Finetune the final tree for three times as many epochs as during the Greedy training (e.g. for USPS: 45 epochs Greedy and 135 epochs Finetune). We train on the full training data, i.e. including the validation data, and evaluate on the provided test data. The reported accuracy is averaged over three runs.
Results The trade-off in computational load and accuracy of the different decision forest models is shown in Fig. 6. We find that deterministic E2EDF achieves higher average accuracy than RF and BT on all datasets, and outperforms all other methods on MNIST. Compared to ADF, the results of E2EDF are competitive, although relative performance varies between datasets. A possible explanation is that the ADF results were obtained with different hyperparameters that allow more and deeper trees, which can lead to significant differences, as shown in Schulter et al. (2013).
On Letter and USPS, sNDF achieves higher accuracy, but at several orders of magnitude higher computational cost, as it lacks the conditional data flow property. In fact, a single tree in the sNDF requires a total of 1023 split evaluations, which is more than our entire forest models require, namely up to 1000 evaluations on USPS. A complete overview of the number of split function evaluations per algorithm is given in Table 3 in the Appendix.
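The counts above follow from simple node arithmetic, assuming full binary trees of the stated depth: a probabilistic tree evaluates every inner node, a deterministic tree only one node per level.

```python
def splits_probabilistic(depth):
    """A probabilistic tree evaluates all inner nodes of a full binary
    tree: 2**depth - 1 split functions per sample."""
    return 2 ** depth - 1

def splits_deterministic(depth, n_trees=1):
    """A deterministic tree evaluates one split per level along the single
    root-to-leaf path, so a forest needs at most n_trees * depth."""
    return n_trees * depth

# A single depth-10 sNDF tree evaluates all 1023 inner nodes, which exceeds
# the at most 1000 evaluations of a deterministic forest of 100 depth-10 trees.
single_sndf_tree = splits_probabilistic(10)
full_e2edf_forest = splits_deterministic(10, n_trees=100)
```

This is why the gap widens exponentially with depth: the probabilistic cost doubles per level while the deterministic cost grows by one evaluation.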
Figure 6 further shows the impact of using fewer decision trees in our forest model by illustrating the performance of small ensembles (\(T \in \{1,3,5,10\}\)). On MNIST and USPS, we observe that even small E2EDF ensembles with only \(T=10\) trees already obtain competitive accuracy.
4 Conclusion
We presented E2EDT, a new approach to train deterministic decision trees with gradient-based optimization in an end-to-end manner. The approach uses a probabilistic tree formulation during training to facilitate backpropagation and to optimize all splits of a tree jointly.
We found that by adjusting the steepness of the decision boundaries in an annealing scheme, the method learns increasingly crisp trees that capture uncertainty as distributions at the leaf nodes, rather than as distributions over multiple paths. The resulting optimized trees are therefore deterministic rather than probabilistic, and run efficiently at test time, as only a single path through the tree is evaluated. This approach outperforms previous training algorithms for oblique decision trees. In a forest ensemble, our method shows competitive or superior results to the state-of-the-art sNDF, even though our trees only evaluate a fraction of the split functions at test time. Unlike ADF, we are not restricted to oblique split functions, thanks to the gradient-based optimization. We show that it is straightforward to include more complex split features, such as convolutional neural networks, or to add spatial regularization constraints. Another demonstrated benefit is that the learned decision tree can help interpret how the decision in a visual classification task is constructed from a sequence of simpler tests on visual features.
Future work can proceed in various directions. First, alternatives to the annealing scheme could be explored; e.g. the change in the steepness of tree splits might be adjusted dynamically rather than according to a fixed schedule. Second, we have so far optimized each tree independently, but optimizing and refining the whole forest jointly could yield further improvements, similar to ADF and sNDF.
Overall, the presented approach provides high flexibility and the potential for accurate models that maintain interpretability and efficiency due to the conditional data flow.
Footnotes
 1. Code is available at http://www.github.com/tomsal/endtoenddecisiontrees
References
 Barros, R. C., Basgalupp, M. P., De Carvalho, A. C., & Freitas, A. A. (2012). A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(3), 291–312.
 Bolukbasi, T., Wang, J., Dekel, O., & Saligrama, V. (2017). Adaptive neural networks for fast test-time prediction. arXiv:1702.07811.
 Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.
 Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Boca Raton: Chapman & Hall/CRC.
 Cardona, A., Saalfeld, S., Preibisch, S., Schmid, B., Cheng, A., Pulokas, J., et al. (2010). An integrated micro- and macro-architectural analysis of the Drosophila brain by computer-assisted serial section electron microscopy. PLOS Biology, 8(10), 1–17. https://doi.org/10.1371/journal.pbio.1000502.
 Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In 2016 IEEE computer society conference on computer vision and pattern recognition.
 Cordts, M., Rehfeld, T., Enzweiler, M., Franke, U., & Roth, S. (2017). Tree-structured models for efficient multi-cue scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), 1444–1454.
 Criminisi, A., & Shotton, J. (2013). Decision forests for computer vision and medical image analysis. Berlin: Springer.
 De Ville, B. (2006). Decision trees for business intelligence and data mining: Using SAS Enterprise Miner. Cary: SAS Institute.
 Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545.
 Dua, D., & Graff, C. (2017). UCI machine learning repository. Retrieved February 18, 2019 from http://archive.ics.uci.edu/ml.
 Duarte, M. F., & Hu, Y. H. (2004). Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7), 826–838.
 Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–121.
 Fan, R. E., & Lin, C. J. (2011). LIBSVM data: Classification, regression and multilabel. Retrieved May 30, 2017 from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
 Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15, 3133–3181.
 Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2), 161–182.
 Frosst, N., & Hinton, G. (2017). Distilling a neural network into a soft decision tree. arXiv:1711.09784.
 Gall, J., & Lempitsky, V. (2009). Class-specific Hough forests for object detection. In 2009 IEEE computer society conference on computer vision and pattern recognition (pp. 1022–1029). https://doi.org/10.1109/CVPR.2009.5206740.
 Guh, R. S., Wu, T. C. J., & Weng, S. P. (2011). Integrating genetic algorithm and decision tree learning for assistance in predicting in vitro fertilization outcomes. Expert Systems with Applications, 38(4), 4437–4449. https://doi.org/10.1016/j.eswa.2010.09.112.
 Hehn, T. M., & Hamprecht, F. A. (2018). End-to-end learning of deterministic decision trees. In German conference on pattern recognition (pp. 612–627). Berlin: Springer.
 Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., & Weinberger, K. (2018). Multi-scale dense networks for resource efficient image classification. In International conference on learning representations (ICLR).
 Huang, G. M., Huang, K. Y., Lee, T. Y., & Weng, J. T. Y. (2015). An interpretable rule-based diagnostic classification of diabetic nephropathy among type 2 diabetes patients. BMC Bioinformatics, 16(1), S5.
 Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550–554.
 Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1), 15–17.
 Ioannou, Y., Robertson, D., Zikic, D., Kontschieder, P., Shotton, J., Brown, M., & Criminisi, A. (2016). Decision forests, convolutional networks and the models in-between. arXiv:1603.01250.
 Jordan, M. I. (1994). A statistical approach to decision tree modeling. In Proceedings of the seventh annual conference on computational learning theory, New York, NY, USA, COLT '94 (pp. 13–20).
 Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214. https://doi.org/10.1162/neco.1994.6.2.181.
 Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.
 Kontschieder, P., Fiterau, M., Criminisi, A., & Rota Bulò, S. (2015). Deep neural decision forests. In ICCV.
 Kontschieder, P., Kohli, P., Shotton, J., & Criminisi, A. (2013). GeoF: Geodesic forests for learning coupled predictors. In 2013 IEEE computer society conference on computer vision and pattern recognition.
 Laptev, D., & Buhmann, J. M. (2014). Convolutional decision trees for feature learning and segmentation. In German conference on pattern recognition (pp. 95–106). Berlin: Springer.
 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
 Lepetit, V., Lagger, P., & Fua, P. (2005). Randomized trees for real-time keypoint recognition. In 2005 IEEE computer society conference on computer vision and pattern recognition (vol. 2, pp. 775–781). https://doi.org/10.1109/CVPR.2005.288.
 Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. arXiv:1405.0312.
 McGill, M., & Perona, P. (2017). Deciding how to decide: Dynamic routing in artificial neural networks. In Precup, D., & Teh, Y. W. (Eds.), Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of Machine Learning Research (vol. 70, pp. 2363–2372).
 Menze, B. H., Kelm, B. M., Splitthoff, D. N., Koethe, U., & Hamprecht, F. A. (2011). On oblique random forests. Berlin: Springer (pp. 453–469).
 Montillo, A., Tu, J., Shotton, J., Winn, J., Iglesias, J., Metaxas, D., & Criminisi, A. (2013). Entanglement and differentiable information gain maximization. In Decision forests for computer vision and medical image analysis, Chapter 19 (pp. 273–293). Springer.
 Murthy, K. V. S. (1996). On growing better decision trees from data. Ph.D. thesis, The Johns Hopkins University.
 Norouzi, M., Collins, M. D., Fleet, D. J., & Kohli, P. (2015a). CO2 forest: Improved random forest by continuous optimization of oblique splits. arXiv:1506.06155.
 Norouzi, M., Collins, M. D., Johnson, M., Fleet, D. J., & Kohli, P. (2015b). Efficient non-greedy optimization of decision trees. In NIPS.
 Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS-W.
 Pinhas-Hamiel, O., Hamiel, U., Greenfield, Y., Boyko, V., Graph-Barel, C., Rachmiel, M., et al. (2013). Detecting intentional insulin omission for weight loss in girls with type 1 diabetes mellitus. International Journal of Eating Disorders, 46(8), 819–825. https://doi.org/10.1002/eat.22138.
 Quinlan, J. R. (1990). Induction of decision trees. In Shavlik, J. W., & Dietterich, T. G. (Eds.), Readings in machine learning, Morgan Kaufmann; originally published in Machine Learning, 1, 81–106, 1986.
 Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.
 Richmond, D., Kainmueller, D., Yang, M., Myers, E., & Rother, C. (2016). Mapping auto-context decision forests to deep ConvNets for semantic segmentation. In Wilson, R. C., Hancock, E. R., & Smith, W. A. P. (Eds.), Proceedings of the British machine vision conference (BMVC), BMVA Press (pp. 144.1–144.12). https://doi.org/10.5244/C.30.144.
 Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65, 945–948. https://doi.org/10.1103/PhysRevLett.65.945.
 Rota Bulò, S., & Kontschieder, P. (2014). Neural decision forests for semantic image labelling. In 2014 IEEE computer society conference on computer vision and pattern recognition.
 Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.
 Schulter, S., Wohlhart, P., Leistner, C., Saffari, A., Roth, P. M., & Bischof, H. (2013). Alternating decision forests. In 2013 IEEE computer society conference on computer vision and pattern recognition (pp. 508–515). https://doi.org/10.1109/CVPR.2013.72.
 Sethi, I. K. (1990). Entropy nets: From decision trees to neural networks. Proceedings of the IEEE, 78(10), 1605–1613.
 Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In 2011 IEEE computer society conference on computer vision and pattern recognition (pp. 1297–1304). https://doi.org/10.1109/cvpr.2011.5995316.
 Suárez, A., & Lutsko, J. F. (1999). Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), 1297–1311.
 Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In 2001 IEEE computer society conference on computer vision and pattern recognition (p. 511). IEEE.
 Wang, J. Y. (2002). Application of support vector machines in bioinformatics. Master's thesis, National Taiwan University, Department of Computer Science and Information Engineering.
 Wang, S., Aggarwal, C., & Liu, H. (2017). Using a random forest to inspire a neural network and improving on it. In Proceedings of the 2017 SIAM international conference on data mining (pp. 1–9). SIAM.
 Wang, X., Yu, F., Dou, Z. Y., Darrell, T., & Gonzalez, J. E. (2018). SkipNet: Learning dynamic routing in convolutional networks. In The European conference on computer vision (ECCV).
 Welbl, J. (2014). Casting random forests as artificial neural networks (and profiting from it). In GCPR.
 Worachartcheewan, A., Nantasenamat, C., Isarankura-Na-Ayudhya, C., Pidetcha, P., & Prachayasittikul, V. (2010). Identification of metabolic syndrome using decision tree analysis. Diabetes Research and Clinical Practice, 90(1), e15–e18.
 Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
 Zhang, L., Varadarajan, J., Nagaratnam Suganthan, P., Ahuja, N., & Moulin, P. (2017). Robust visual tracking using oblique random forests. In 2017 IEEE computer society conference on computer vision and pattern recognition (pp. 5589–5598). IEEE.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.