Abstract
Oblique decision trees recursively divide the feature space by using splits based on linear combinations of attributes. Compared to their univariate counterparts, which only use a single attribute per split, they are often smaller and more accurate. A common approach to learning decision trees is to iteratively introduce splits on a training set in a top–down manner, yet determining a single optimal oblique split is in general computationally intractable. Therefore, one has to rely on heuristics to find near-optimal splits. In this paper, we adapt the cross-entropy optimization method to tackle this problem. The approach is motivated geometrically by the observation that equivalent oblique splits can be interpreted as connected regions on a unit hypersphere which are defined by the samples in the training data. In each iteration, the algorithm samples multiple candidate solutions from this hypersphere using the von Mises–Fisher distribution, which is parameterized by a mean direction and a concentration parameter. These parameters are then updated based on the best performing samples such that, when the algorithm terminates, a high probability mass is assigned to a region of near-optimal solutions. Our experimental results show that the proposed method is well-suited for the induction of compact and accurate oblique decision trees in a small amount of time.
1 Introduction
Decision trees are among the most popular classification and regression models in the field of data mining and machine learning. Due to their human-comprehensible structure, they are easy to interpret, which helps provide insight into the data under consideration. The most popular decision tree models are univariate, involving only a single attribute per split. These splits can be interpreted as axis-parallel hyperplanes in the feature space. The restriction to a single attribute simplifies the decision tree induction process and the interpretation of the individual splits. Nevertheless, these trees are often less accurate than other machine learning models. Moreover, they often become extensively large when the data cannot be split adequately by axis-parallel hyperplanes, which lowers their explanatory value. In these situations, the univariate splits can also be misleading, as they are only meaningful in combination with other splits further down the respective branch. Oblique decision trees overcome these shortcomings by employing splits based on linear combinations of attributes, which can be interpreted as affine hyperplanes in the feature space. Although the individual splits are harder to interpret, the reduced size can increase the explanatory value, and an expert in the respective domain is able to draw meaningful conclusions from the coefficients of the oblique splits, which can be viewed as weights for the individual attributes.
The problem of finding optimal oblique splits for a given splitting criterion, however, is far more complex than finding univariate splits, which complicates the task of inducing oblique decision trees in an acceptable amount of time. Motivated by the underlying problem’s structure, we develop an efficient cross-entropy optimization (CE) approach for finding near-optimal oblique splits that can be used in the well-known recursive partitioning scheme for inducing oblique decision trees. In our evaluation we show the advantages of our proposed method compared to the univariate recursive partitioning decision tree induction method and the popular OC1 (Murthy et al. 1993, 1994) algorithm for inducing oblique decision trees. Furthermore, we show that it is competitive with other popular prediction models.
2 Oblique decision tree induction
In this work, we address the problem of determining optimal splits for oblique decision tree induction. Given d real-valued attributes (features) \(X_1,\ldots ,X_d\) and a response variable Y taking values in the domain \(\text {dom}(Y)\), an oblique decision tree is a binary rooted tree structure for which every inner node is associated with a rule of the form

$$\begin{aligned} \sum _{j=1}^{d} a_jX_j+a_{d+1}\ge 0 \end{aligned}$$
and every leaf node is associated with a response value in \(\text {dom}(Y)\). These rules can be interpreted as hyperplanes that divide the feature space into two half-spaces. To predict the response variable of unseen data points \(x\in {\mathbb {R}}^d\), one pursues the unique path from the root node to one of the leaf nodes according to the specified rules and returns the value stored in the leaf node at the end of this path. As an example, Fig. 1 shows an oblique decision tree for the well-known Iris dataset (Fisher 1936) constructed with our proposed CE method.
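The path-following prediction procedure described above can be sketched as follows. This is a minimal illustration of ours with hypothetical `Node`/`Leaf` structures, not the paper's implementation:

```python
import numpy as np

# A hypothetical tree representation: inner nodes hold the split
# coefficients a in R^{d+1}; leaves hold a response value.
class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, a, left, right):
        self.a = np.asarray(a, dtype=float)  # (a_1, ..., a_d, a_{d+1})
        self.left = left                     # followed when the rule holds
        self.right = right                   # followed otherwise

def predict(tree, x):
    """Follow the unique root-to-leaf path for observation x."""
    while isinstance(tree, Node):
        # rule: a_1 x_1 + ... + a_d x_d + a_{d+1} >= 0
        if tree.a[:-1] @ x + tree.a[-1] >= 0:
            tree = tree.left
        else:
            tree = tree.right
    return tree.value

# Example: a single oblique split x_1 + x_2 - 1 >= 0
tree = Node([1.0, 1.0, -1.0], Leaf("A"), Leaf("B"))
print(predict(tree, np.array([0.9, 0.9])))  # "A": 0.9 + 0.9 - 1 >= 0
print(predict(tree, np.array([0.1, 0.2])))  # "B": 0.1 + 0.2 - 1 < 0
```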
To learn high-quality oblique decision trees one makes use of a training set (X, y) consisting of a matrix of n observations \(X\in {\mathbb {R}}^{n\times d}\) and a vector of responses \(y\in \text {dom}(Y)^n\). Throughout this work, we use the notation \(x_i\) for \(i=1,\ldots ,n\) to refer to a specific row of the matrix X, corresponding to the ith observation of the training set, and \(y_i\) for \(i=1,\ldots ,n\) to refer to the response variable of the ith observation. We follow the commonly applied strategy of greedily dividing the training data in a top–down manner until a stopping criterion is met or no further splitting is possible because the data is indistinguishable. For classification tasks the stopping criterion is usually that all leaf nodes are sufficiently pure, i.e. the response values in the associated subset of training samples are mostly equal. For continuous domains, one usually stops whenever the mean squared error or mean absolute error of the response values of the subset of training samples is below a certain threshold. Another commonly applied strategy is to stop branching along a path when the sample size associated with the terminal node falls below a predefined threshold. Finally, each leaf node is assigned a response variable. For classification, this is usually the class of the majority of the associated training samples, and for regression the mean value or the median of the responses is used.
To determine good splits an adequate splitting criterion has to be defined beforehand. For classification trees, criteria are typically based on impurity measures which are defined as functions \(i:{\mathcal {P}}_m\rightarrow {\mathbb {R}}_{\ge 0}\) on the standard \((m-1)\)-simplex \({\mathcal {P}}_m:=\lbrace (p_1,\ldots ,p_m)\in [0,1]^m: \sum _{k=1}^m p_k=1 \rbrace \). The most frequently used impurity measures are:

Classification error: \(e(p)=1-\max _{k=1,\ldots ,m} p_k\)

Entropy: \(h(p)=-\sum _{k=1}^{m} p_k\log p_k\)

Gini impurity: \(g(p)=1-\sum _{k=1}^{m} p_k^2\)
For a classification task with \(m\in {\mathbb {N}}\) different classes, let \(p=(p_1,\ldots ,p_m)\) denote the vector of relative frequencies of class labels of some subset of observations. Then, i(p) can be interpreted as a measure of heterogeneity of the class labels for the subset. Consequently, in order to evaluate a split, one uses the weighted sum of the impurities of the two resulting subsets. For regression trees typical criteria to measure the heterogeneity of a set of n observations (X, y) are:

Mean squared error: \(\frac{1}{n}\sum _{i=1}^n(y_i-\tilde{y_i})^2\)

Mean absolute error: \(\frac{1}{n}\sum _{i=1}^n|y_i-\tilde{y_i}|\)
where \(\tilde{y_i}\) denotes the predicted value for observation \(x_i\), which usually corresponds to the mean or median of the vector y. A splitting criterion is again derived by taking the weighted sum of these measures for the two subsets resulting from a split. Various other splitting criteria, such as the twoing rule for classification (Breiman et al. 1984), have been proposed in the literature, but we refrain from going into further detail as the presented heuristic is applicable to all of these criteria. Throughout the remainder of this work, the splitting criterion will be denoted by the function \(q:{\mathbb {R}}^{d+1}\times {\mathbb {R}}^{n\times d}\times \text {dom}(Y)^n\longrightarrow {\mathbb {R}}\) such that q(a, X, y) expresses the value of the splitting criterion of an oblique split defined by \(a\in {\mathbb {R}}^{d+1}\) on a training set (X, y). Without loss of generality, we assume that q should be minimized.
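As a concrete illustration, a splitting criterion q(a, X, y) based on the weighted gini impurity can be sketched as follows. This is a minimal NumPy version of ours under the conventions above; the handling of the homogeneous coordinate \(a_{d+1}\) is our assumption:

```python
import numpy as np

def gini(labels):
    """Gini impurity g(p) = 1 - sum_k p_k^2 of a label vector."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def q(a, X, y):
    """Weighted gini impurity of the oblique split defined by a in R^{d+1}."""
    a = np.asarray(a, dtype=float)
    left = X @ a[:-1] + a[-1] >= 0        # observations satisfying the rule
    n = len(y)
    return (left.sum() / n) * gini(y[left]) + \
           ((~left).sum() / n) * gini(y[~left])

# A perfectly separable toy example
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(q([1.0, 1.0, -1.0], X, y))  # 0.0: x1 + x2 - 1 >= 0 isolates class 1
print(q([1.0, 0.0, 0.0], X, y))   # 0.5: x1 >= 0 puts everything on one side
```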
The focus of this work is the method of determining high-quality oblique splits with respect to the employed splitting criterion, and in the remainder of this work we describe our adaptation of the CE method based on the von Mises–Fisher distribution.
3 Related work and contribution
Oblique decision tree induction has been an interesting topic of research over the past decades. The major problem related to this topic is the complexity of inducing these trees. Heath (1993) shows that even the task of finding a single optimal oblique split is NP-complete for some splitting criteria, including classification error. One approach to overcome this problem is to build oblique decision trees in a top–down manner by recursively introducing heuristically obtained oblique splits. Various heuristics have been applied for finding near-optimal splits in this context. The first one, called CART-LC, is introduced by Breiman et al. (1984), who suggest a deterministic hill-climbing approach that sequentially updates the coefficients of the split until a local optimum is reached. To escape local optima, Heath et al. (1993) propose a simulated annealing heuristic which perturbs the hyperplane parameters one at a time. In their algorithm called OC1, Murthy et al. (1993, 1994) improve Breiman’s hill-climbing approach by introducing randomization techniques, also with the goal of avoiding premature convergence. Other strategies for finding oblique splits are based on metaheuristics such as simulated annealing, genetic algorithms or evolutionary algorithms (Cantú-Paz and Kamath 2003), or on algorithms based on logistic regression (Mola and Siciliano 2002; Truong 2009), linear discriminants (Loh and Shih 1997; Li et al. 2003; Siciliano et al. 2008; López-Chau et al. 2013) or Householder transformations (Wickramarachchi et al. 2016). Recently, mathematical optimization approaches have also been proposed for inducing oblique decision trees that forgo the recursive partitioning scheme. Bertsimas and Dunn (2017) propose an integer linear programming formulation for optimizing decision trees of a predetermined depth. Blanquero et al. (2020) instead propose a continuous optimization approach for optimizing randomized oblique decision trees.
The major drawback of these approaches lies in the complexity of solving the proposed mathematical optimization programs. As a consequence, they can only be solved optimally in a reasonable amount of time for small to medium-sized datasets and for small depths.
The CE method was first introduced by Rubinstein for the estimation of rare event probabilities (Rubinstein 1997) and was later adapted for solving continuous and combinatorial optimization problems (Rubinstein 1999; Rubinstein and Kroese 2004; De Boer et al. 2005). CE is a model-based search method (Zlochin et al. 2004) that uses a parameterized probabilistic model for sampling feasible solutions, which is iteratively updated until the model assigns a high probability mass to a region of near-optimal solutions. It has been successfully applied to a wide range of optimization problems, and it has also been used in the context of machine learning by Mannor et al. (2005), who develop a CE method for building support vector machines to solve classification problems.
In this work, we follow the recursive top–down approach to induce oblique decision trees. We develop a CE method based on the von Mises–Fisher distribution that is well-suited for finding high-quality oblique splits for classification and regression tasks and therefore poses an interesting alternative to existing algorithms. It is inspired by the geometry of the underlying problem structure and uses easily comprehensible parameters for fine-tuning the execution of the algorithm. Our evaluation shows that it is suitable for inducing highly accurate and compact oblique decision trees in a small amount of time, which makes it an interesting option for real-life applications in the field of machine learning and data analytics.
4 Preliminaries
4.1 The crossentropy optimization method
In this section, we briefly summarize the CE method applied in this work. For a more comprehensive description of the method and its applications the interested reader is referred to Rubinstein and Kroese (2004), De Boer et al. (2005).
We consider a generic minimization problem of the form

$$\begin{aligned} \min _{x\in {\mathcal {A}}}\ q(x) \end{aligned}$$
with an arbitrary objective function \(q:{\mathcal {A}}\rightarrow {\mathbb {R}}\) defined on the set of feasible solutions \({\mathcal {A}}\).
Let \({\mathcal {F}}=\lbrace f_\theta : \theta \in \varTheta \rbrace \) denote a family of probability density functions (pdf) on the domain \({\mathcal {A}}\) parameterized by an element \(\theta \) in the parameter space \(\varTheta \). The CE method iteratively generates a sequence of parameters \(\theta _1,\ldots ,\theta _T\) and a sequence of levels \(\gamma _1,\ldots ,\gamma _T\) for \(T\in {\mathbb {N}}\) as follows: The first parameter \(\theta _1\) is given as an input. In each iteration \(t=1,\ldots ,T\) a random sample \(x_1,\ldots ,x_N\in {\mathcal {A}}\) of size \(N\in {\mathbb {N}}\) is drawn from \(f_{\theta _{t}}\). Then, the samples are ordered based on their performance, i.e. \(q(x_{(1)})\le \cdots \le q(x_{(N)})\), and \(\gamma _t\) is calculated as the \(\rho \)-quantile \(\gamma _t=q(x_{(\lceil \rho N\rceil )})\) for \(0<\rho <1\). Subsequently, \(\theta _{t+1}\) is derived by calculating the maximum likelihood parameter \(\theta _E\) for the elite samples \(E:=\lbrace x_i: i=1,\ldots ,N; q(x_i)\le \gamma _t\rbrace \) and setting \(\theta _{t+1}=\theta _E\). Alternatively, some smoothed update formula can be used that appropriately combines \(\theta _E\) and \(\theta _{t}\) to obtain \(\theta _{t+1}\). Intuitively, this update of the model parameter based on the best performing samples increases the new pdf’s likelihood of generating good solutions and therefore, the final pdf \(f_{\theta _T}\) concentrates its probability mass on a region of near-optimal solutions. Ideally, one hopes that the sequence of pdfs \((f_{\theta _t})_{t\in {\mathbb {N}}}\) converges to the point mass of an optimal solution \(a^*\) such that the sequence of levels \((\gamma _t)_{t\in {\mathbb {N}}}\) converges to the optimal objective value \(\gamma ^*\).
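For intuition, the generic scheme can be sketched on a one-dimensional toy problem with a Gaussian sampling family. This is an illustrative example of ours, not the oblique-split instantiation developed later in the paper:

```python
import numpy as np

def cross_entropy_minimize(q, theta0=(0.0, 10.0), N=200, rho=0.1,
                           T=50, seed=0):
    """Generic CE minimization with a N(mu, sigma^2) sampling family."""
    rng = np.random.default_rng(seed)
    mu, sigma = theta0
    for t in range(T):
        x = rng.normal(mu, sigma, size=N)          # sample N candidates
        x = x[np.argsort(q(x))]                    # order by performance
        elite = x[:int(np.ceil(rho * N))]          # elite samples
        mu, sigma = elite.mean(), elite.std()      # MLE update of theta
        if sigma < 1e-8:                           # distribution has collapsed
            break
    return mu

# Minimize q(x) = (x - 3)^2; the pdfs should concentrate around x* = 3
x_star = cross_entropy_minimize(lambda x: (x - 3.0) ** 2)
print(round(x_star, 3))
```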
4.2 The von Mises–Fisher distribution
The CE method relies on a family of pdfs on the solution space that has to be specified in advance. In this work, we make use of the von Mises–Fisher distribution (Fisher 1953). This distribution is often referred to as the equivalent of the Gaussian distribution on the \((d-1)\)-sphere \(S_{d-1}=\lbrace a\in {\mathbb {R}}^{d}:\Vert a \Vert _2=1\rbrace \). The pdf of the d-dimensional von Mises–Fisher distribution is given by

$$\begin{aligned} f_\theta (a)=C_d(\kappa )\exp (\theta ^Ta),\quad a\in S_{d-1}, \end{aligned}$$
for \(\theta \in {\mathbb {R}}^d\). The parameter \(\kappa :=\Vert \theta \Vert _2\) is referred to as the concentration parameter and \(\mu :=\frac{\theta }{\Vert \theta \Vert _2}\) denotes the mean direction. Intuitively, for \(\kappa =0\) one obtains a uniform distribution on \(S_{d-1}\) and a higher value of \(\kappa \) indicates a higher concentration of the distribution around the mean direction \(\mu \). The normalization constant \(C_d(\kappa )\) is defined by

$$\begin{aligned} C_d(\kappa )=\frac{\kappa ^{d/2-1}}{(2\pi )^{d/2}I_{d/2-1}(\kappa )}, \end{aligned}$$
where \(I_\alpha \) denotes the modified Bessel function of the first kind of order \(\alpha \). Figure 2 illustrates the von Mises–Fisher distribution for different concentration parameters \(\kappa \) and a constant mean direction \(\mu \).
Maximum likelihood estimation To update the parameters of the pdfs in the CE method, the maximum likelihood estimates (MLE) need to be determined. As mentioned in Banerjee et al. (2005), Mardia and Jupp (1999), Dhillon and Sra (2003), given a sample \(a_1,\ldots ,a_N\in {\mathbb {R}}^{d}\) the MLE of \(\mu \) and \(\kappa \) for the von Mises–Fisher distribution are given by

$$\begin{aligned} {\hat{\mu }}=\frac{\sum _{i=1}^N a_i}{\Vert \sum _{i=1}^N a_i\Vert _2},\qquad A_d({\hat{\kappa }})={\overline{R}}, \end{aligned}$$

where

$$\begin{aligned} A_d(\kappa )=\frac{I_{d/2}(\kappa )}{I_{d/2-1}(\kappa )}\qquad \text {and}\qquad {\overline{R}}=\frac{\Vert \sum _{i=1}^N a_i\Vert _2}{N}, \end{aligned}$$
Since no analytical solution exists for the inverse of the ratio of modified Bessel functions, \(\kappa \) can only be approximated. Nevertheless, sufficiently accurate approximations are available for our purposes:

Banerjee et al. (2005) give a simple approximation of \(\kappa \):
$$\begin{aligned} {\hat{\kappa }}_0=\frac{{\overline{R}}(d-{\overline{R}}^2)}{1-{\overline{R}}^2} \end{aligned}$$(4) 
Sra (2012) develops a more exact approximation by performing a few iterations of the Newton method starting from \({\hat{\kappa }}_0\), as derived in Eq. 4:
$$\begin{aligned} \begin{aligned} {\hat{\kappa }}_1&={\hat{\kappa }}_0-\frac{A_d({\hat{\kappa }}_0)-{\overline{R}}}{1-A_d({\hat{\kappa }}_0)^2-\frac{d-1}{{\hat{\kappa }}_0}A_d({\hat{\kappa }}_0)}\\ {\hat{\kappa }}_2&={\hat{\kappa }}_1-\frac{A_d({\hat{\kappa }}_1)-{\overline{R}}}{1-A_d({\hat{\kappa }}_1)^2-\frac{d-1}{{\hat{\kappa }}_1}A_d({\hat{\kappa }}_1)} \end{aligned} \end{aligned}$$(5)

This approximation is more accurate than the one in Eq. 4, yet the evaluation of \(A_d({\hat{\kappa }}_0)\) is expensive in high dimensions and also introduces the risk of floating point over- and underflows.
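These approximations can be sketched as follows, assuming SciPy's `scipy.special.iv` for the modified Bessel function; this is our transcription of Eqs. 4 and 5, not the paper's code:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def A(d, kappa):
    """Ratio A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)."""
    return iv(d / 2, kappa) / iv(d / 2 - 1, kappa)

def estimate_kappa(d, R_bar, newton_steps=2):
    """Approximate the concentration MLE from the mean resultant length R_bar."""
    # Banerjee et al. (2005), Eq. 4
    kappa = R_bar * (d - R_bar ** 2) / (1 - R_bar ** 2)
    # Sra (2012), Eq. 5: Newton refinements of the root of A_d(kappa) = R_bar
    for _ in range(newton_steps):
        Ak = A(d, kappa)
        kappa = kappa - (Ak - R_bar) / (1 - Ak ** 2 - (d - 1) / kappa * Ak)
    return kappa

# For d = 3 one has A_3(kappa) = coth(kappa) - 1/kappa, so kappa = 5 yields
# R_bar = coth(5) - 0.2; the estimate should recover a value close to 5.
R_bar = 1.0 / np.tanh(5.0) - 1.0 / 5.0
print(round(estimate_kappa(3, R_bar), 4))
```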
Sampling from the von Mises–Fisher distribution For our CE algorithm, in order to ensure low running times, we need an efficient algorithm for sampling from the von Mises–Fisher distribution. The description of such an algorithm is beyond the scope of this paper, yet, the interested reader is referred to Wood’s algorithm (Wood 1994) which is based on Ulrich’s results (Ulrich 1984).
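Although a full description is beyond the paper's scope, the referenced scheme can be condensed into a short NumPy sketch: a rejection step for the component along the mean direction following Wood (1994), a uniform tangent direction, and a Householder reflection onto \(\mu \). This is our compact reading of the algorithm, assuming \(d\ge 2\):

```python
import numpy as np

def sample_vmf(mu, kappa, N, rng=None):
    """Draw N samples from the von Mises-Fisher distribution on S_{d-1},
    following Wood (1994); mu is a unit vector in R^d and kappa >= 0."""
    if rng is None:
        rng = np.random.default_rng()
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[0]
    if kappa == 0:                        # uniform distribution on the sphere
        z = rng.standard_normal((N, d))
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    # rejection sampling for the component w = <x, mu>
    b = (-2 * kappa + np.sqrt(4 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0 ** 2)
    w = np.empty(N)
    for i in range(N):
        while True:
            z = rng.beta((d - 1) / 2, (d - 1) / 2)
            wi = (1 - (1 + b) * z) / (1 - (1 - b) * z)
            if kappa * wi + (d - 1) * np.log(1 - x0 * wi) - c >= np.log(rng.uniform()):
                w[i] = wi
                break
    # uniform tangent directions v on the unit sphere in R^{d-1}
    v = rng.standard_normal((N, d - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    x = np.hstack([w[:, None], np.sqrt(1 - w ** 2)[:, None] * v])
    # rotate e_1 onto mu with a Householder reflection
    h = np.zeros(d); h[0] = 1.0; h -= mu
    if np.linalg.norm(h) > 1e-12:
        h /= np.linalg.norm(h)
        x -= 2 * np.outer(x @ h, h)
    return x

samples = sample_vmf(np.array([0.0, 0.0, 1.0]), 50.0, 1000,
                     rng=np.random.default_rng(0))
print(np.allclose(np.linalg.norm(samples, axis=1), 1.0))
```

For large \(\kappa \) the sample mean direction should align closely with \(\mu \), matching the intuition from Fig. 2.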
5 The crossentropy method for optimization of oblique splits
5.1 Geometric motivation
In this work, the problem under consideration is finding the coefficients \(a\in {\mathbb {R}}^{d+1}\) of an oblique split on a training set (X, y) that minimizes a certain splitting criterion q. Note that the quality of an oblique split does not depend on the length \(\Vert a \Vert _2\) of the vector \(a\in {\mathbb {R}}^{d+1}\) but solely on its direction \(\nicefrac {a}{\Vert a\Vert _2}\). Thus, as \(a=0\) is not a meaningful solution, we can, more adequately, identify the feasible region with the d-sphere \(S_d=\lbrace a\in {\mathbb {R}}^{d+1}:\Vert a \Vert _2=1\rbrace \) instead of \({\mathbb {R}}^{d+1}\). Hence, the problem can alternatively be expressed as:

$$\begin{aligned} \min _{a\in S_d}\ q(a,X,y) \end{aligned}$$
Note that each observation \(x_i\) for \(i=1,\ldots , n\) from the training set X defines a hyperplane \(H(x_i):=\lbrace a\in {\mathbb {R}}^{d+1}: a_1x_{i1}+\cdots +a_dx_{id}+a_{d+1}= 0 \rbrace \) which divides \(S_d\) into two half-spheres

$$\begin{aligned} S_d^+(x_i):=\lbrace a\in S_d: a_1x_{i1}+\cdots +a_dx_{id}+a_{d+1}\ge 0 \rbrace \end{aligned}$$

and

$$\begin{aligned} S_d^-(x_i):=\lbrace a\in S_d: a_1x_{i1}+\cdots +a_dx_{id}+a_{d+1}< 0 \rbrace . \end{aligned}$$
Observation \(x_i\) satisfies the rule \(\sum _{j=1}^{d} a_jX_j+a_{d+1}\ge 0\) if and only if \(a\in S_d^+(x_i)\). Therefore, the nonempty sets in \({\mathcal {R}}:=\lbrace R= \bigcap _{i=1}^n S_d^{\sigma _i}(x_i): \sigma \in \lbrace +,-\rbrace ^n\ \wedge \ R\ne \emptyset \rbrace \) correspond to connected regions of solutions on the sphere \(S_d\) with the same value for the splitting criterion. Hence, the problem of finding an optimal oblique split is equivalent to finding an optimal region \(R\in {\mathcal {R}}\). It follows from Cover’s findings (Cover 1965) that there are up to \(2\sum _{j=0}^{d} \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \) regions in \({\mathcal {R}}\), which also explains the complexity of finding an optimal oblique split and the necessity of relying on heuristics. Figure 3 shows the resulting regions on \(S_2\) for a dataset with 30 observations chosen uniformly at random from \([-1,1]\times [-1,1]\).
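The bound on the number of regions can be computed directly; a small helper of ours for \(2\sum _{j=0}^{d}\binom{n-1}{j}\):

```python
from math import comb

def max_regions(n, d):
    """Upper bound on |R|: the number of regions into which n hyperplanes
    through the origin can divide the d-sphere S_d (Cover 1965)."""
    return 2 * sum(comb(n - 1, j) for j in range(d + 1))

# The setting of Fig. 3: 30 observations in the plane (splits live on S_2)
print(max_regions(30, 2))   # 872 candidate regions
# The bound grows polynomially in n but exponentially in d:
print(max_regions(1000, 10))
```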
The previous observation motivates a heuristic that randomly samples points from \(S_d\) and iteratively updates the sampling distribution by favoring the best points until the sampling density assigns most of its probability mass to a near-optimal region on the d-sphere. The CE optimization framework is suitable for achieving that goal, and the von Mises–Fisher distribution provides an efficient parametric sampling scheme and easy-to-compute, accurate approximations of the MLEs which are necessary for the CE method.
5.2 Algorithmic description
A summary of our proposed CE method is presented in Algorithm 1. The algorithm gets a training set (X, y) and the splitting criterion as an input. Apart from that, the sample size \(N\in {\mathbb {N}}\), the quantile parameter \(0<\rho <1\), the smoothing parameter \(0\le \alpha <1\) and a termination parameter \(K\in {\mathbb {N}}_0\) have to be specified.
First, our algorithm normalizes the dataset X. The resulting near-optimal solution \(a^*\) is denormalized at the end of the algorithm such that it divides the dataset in the same way as the solution found for the normalized dataset. The reason for normalizing is discussed in Sect. 5.4.
Initially, we set the iteration counter \(t=1\). The parameter \(\kappa _1\) is initialized to zero and \(\mu _1\) is set to the \((d+1)\)dimensional vector of zeros. This way, the first sample is drawn from the uniform distribution on the dsphere \(S_d\). The incumbent solution \(a^*\) is also initialized uniformly at random. The variable \(\gamma ^*\), which is used to track the minimum value of the level parameters \(\gamma _t\), is initially set to infinity. To count the number of consecutive iterations without improvement of \(\gamma ^*\), the auxiliary variable k is introduced and initialized to zero.
In the tth iteration, the algorithm draws N samples from the von Mises–Fisher distribution with density \(f_{\theta _{t}}\) for \(\theta _{t}=\kappa _{t}\mu _{t}\) using Wood’s algorithm. The samples are sorted in increasing order based on their performance and the level \(\gamma _t\) is computed as the \(\rho \)-quantile of the sample performances. If the best sample is better than the incumbent solution \(a^*\), the incumbent solution is updated accordingly. If \(\gamma _t<\gamma ^*\), the variable \(\gamma ^*\) is updated and k is reset to zero. Otherwise, the counter k is increased by one. Subsequently, the set of elite samples E with an objective value smaller than or equal to \(\gamma _t\) is determined and the maximum likelihood estimates \(\mu _{E}\) and \(\kappa _{E}\) for the elite samples are approximated as described in Sect. 4.2. The new parameters \(\kappa _{t+1}\) and \(\mu _{t+1}\) are then derived using the smooth update formulas

$$\begin{aligned} \mu _{t+1}=\frac{\alpha \mu _{t}+(1-\alpha )\mu _{E}}{\Vert \alpha \mu _{t}+(1-\alpha )\mu _{E}\Vert _2},\qquad \kappa _{t+1}=\alpha \kappa _{t}+(1-\alpha )\kappa _{E}, \end{aligned}$$
which combine the current mean direction and concentration with the maximum likelihood estimates for the elite samples. Note that the mean direction of the von Mises–Fisher distribution is a unit vector which explains the scaling in the update formula for \(\mu _{t+1}\). Without scaling, the concentration would be affected. If the parameter \(\alpha \) is set to 0, no smoothing is carried out and the new parameters for \(\mu _{t+1}\) and \(\kappa _{t+1}\) are set directly to the maximum likelihood estimates of the elite samples.
The algorithm stops when the minimum observed level \(\gamma ^*\) has not been updated for K consecutive iterations. This indicates that the algorithm has converged to a region of similar solutions and no further improvement of the incumbent solution can be expected.
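The loop described above can be sketched as a condensed, self-contained Python version of Algorithm 1 under our simplifications: the normalization/denormalization steps are omitted, the gini criterion is hard-wired, only Banerjee's \(\kappa \) approximation (without Newton refinement) is used, and Wood's sampler appears in compact form:

```python
import numpy as np

def gini_split(a, Xh, y):
    """Weighted gini impurity of the split a on homogeneous data Xh = (X 1)."""
    left = Xh @ a >= 0
    total = 0.0
    for side in (left, ~left):
        if side.any():
            _, counts = np.unique(y[side], return_counts=True)
            p = counts / counts.sum()
            total += side.mean() * (1.0 - np.sum(p ** 2))
    return total

def sample_vmf(mu, kappa, N, rng):
    """Compact form of Wood's (1994) von Mises-Fisher sampler on S_{d-1}."""
    d = len(mu)
    if kappa < 1e-12:                                # uniform on the sphere
        z = rng.standard_normal((N, d))
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    b = (-2 * kappa + np.sqrt(4 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0 ** 2)
    w = np.empty(N)
    for i in range(N):                               # rejection step for <x, mu>
        while True:
            z = rng.beta((d - 1) / 2, (d - 1) / 2)
            wi = (1 - (1 + b) * z) / (1 - (1 - b) * z)
            if kappa * wi + (d - 1) * np.log(1 - x0 * wi) - c >= np.log(rng.uniform()):
                w[i] = wi
                break
    v = rng.standard_normal((N, d - 1))              # tangent directions
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    x = np.hstack([w[:, None], np.sqrt(1 - w ** 2)[:, None] * v])
    h = np.zeros(d); h[0] = 1.0; h -= mu             # Householder: e_1 -> mu
    if np.linalg.norm(h) > 1e-12:
        h /= np.linalg.norm(h)
        x -= 2 * np.outer(x @ h, h)
    return x

def ce_oblique_split(X, y, N=100, rho=0.1, alpha=0.8, K=3, seed=0):
    rng = np.random.default_rng(seed)
    Xh = np.hstack([X, np.ones((len(X), 1))])        # homogeneous coordinates
    d = Xh.shape[1]                                  # splits live on S_{d-1} here
    mu, kappa = np.zeros(d), 0.0                     # first sample is uniform
    a_best = sample_vmf(mu, 0.0, 1, rng)[0]          # random incumbent
    gamma_best, k = np.inf, 0
    while k < K:
        A = sample_vmf(mu, kappa, N, rng)
        scores = np.array([gini_split(a, Xh, y) for a in A])
        order = np.argsort(scores)
        gamma = scores[order[int(np.ceil(rho * N)) - 1]]   # rho-quantile level
        if scores[order[0]] < gini_split(a_best, Xh, y):
            a_best = A[order[0]]                     # update incumbent
        k = 0 if gamma < gamma_best else k + 1
        gamma_best = min(gamma, gamma_best)
        elite = A[scores <= gamma]                   # elite samples
        r = elite.sum(axis=0)
        R = np.linalg.norm(r) / len(elite)           # mean resultant length
        mu_E = r / np.linalg.norm(r)
        kappa_E = R * (d - R ** 2) / (1 - R ** 2)    # Banerjee's approximation
        m = alpha * mu + (1 - alpha) * mu_E          # smoothed update; rescale
        mu = m / np.linalg.norm(m)
        kappa = alpha * kappa + (1 - alpha) * kappa_E
    return a_best

# Two well-separated classes: the search should find a near-perfect split
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
a = ce_oblique_split(X, y)
Xh = np.hstack([X, np.ones((40, 1))])
print(gini_split(a, Xh, y))
```

On such a linearly separable toy problem the level parameter typically reaches the optimum within the first iterations, so the K-counter triggers termination quickly.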
5.3 Illustration
In this section, we briefly illustrate the execution of our proposed algorithm for the Iris dataset. For illustration purposes, the dataset is restricted to the attributes sepal length and sepal width and it is normalized using the robust scaling method described in the next section. The weighted gini impurity is used as the objective function, and the parameters are set to \(\rho =0.1\), \(\alpha =0.8\) and \(K=1\).
Figure 4 summarizes the progress of our method. As we can see, in the first iteration the samples are taken uniformly at random from \(S_2\). In the following iterations the mean direction converges and the concentration increases such that in the final iteration the pdf \(f_{\theta _6}\) concentrates its probability mass tightly around a region of optimal solutions. It can be shown that the optimal solution has an objective value of \(\nicefrac {1}{3}\), which corresponds to the value taken on by the level parameter \(\gamma _5\). The algorithm terminates after iteration \(t=6\) as \(\gamma ^*\) could not be improved further.
Figure 5 shows the obtained split for both the normalized and the original dataset, and we can see that the split perfectly separates the samples of class setosa from the rest of the dataset.
5.4 Normalization
Empirically, we found that our algorithm can vastly benefit from data normalization techniques. We observed that normalization improves the quality of the splits as well as the runtime of our algorithm. The reason for this is that without normalization, the regions in \({\mathcal {R}}\) are not spread evenly across the sphere. To illustrate this, we again consider the Iris dataset restricted to the attributes sepal length and sepal width. Figure 6 shows how the \(S_2\) sphere is divided into regions by the samples in the dataset before and after normalization. For this example, we used robust normalization, i.e. we set \(x'_{ij}=\frac{x_{ij}-Med(X_j)}{IQR(X_j)}\) where \(Med(X_j)\) and \(IQR(X_j)\) correspond to the median and the interquartile range of feature \(X_j\), respectively.
As we can see, without normalization most of the regions in \({\mathcal {R}}\) are concentrated very tightly within a small portion of the unit sphere, while after normalization the regions are spread out more homogeneously. Thus, normalization enables our algorithm to find meaningful solutions earlier in the optimization process, which improves the overall runtime.
Additionally, normalization can help to reduce numerical instabilities in the sampling process and the approximation of the maximum likelihood estimates, as the regions of equivalent solutions become larger and thus smaller values of \(\kappa \) are required in the later stages of our algorithm.
Other normalization variants which might lead to better results in certain situations include:

Min–Max-Scaling: \(x'_{ij}=\frac{x_{ij}-{ Mean}(X_j)}{{ Max}(X_j)-{ Min}(X_j)}\)

Standard-Scaling: \(x'_{ij}=\frac{x_{ij}-{ Mean}(X_j)}{\sigma (X_j)}\)
Here \({ Mean}(X_j)\), \(\hbox {Max}(X_j)\), \(\hbox {Min}(X_j)\) and \(\sigma (X_j)\) denote the mean, maximum, minimum and standard deviation of feature \(X_j\), respectively.
At the end of our algorithm, we denormalize the obtained oblique split such that it divides the dataset in the same manner as the split for the normalized data. If we use any normalization method of the form \(x'_{ij}=\frac{x_{ij}-\delta _j}{\varDelta _j}\), this can be achieved by setting

$$\begin{aligned} {\overline{a}}_j={\left\{ \begin{array}{ll} \nicefrac {a^*_j}{\varDelta _j}, &{} j\le d\\ a^*_{d+1}-\sum _{i=1}^{d}\nicefrac {a^*_i\delta _i}{\varDelta _i}, &{} j=d+1 \end{array}\right. } \end{aligned}$$
for \(j=1,\ldots ,d+1\). Then, \({\overline{a}}\) has the same value of the splitting criterion on (X, y) as \(a^*\) on the normalized dataset \((X',y)\).
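The denormalization step can be sketched as follows; a small NumPy helper of ours that also verifies the claimed invariance of the induced partition:

```python
import numpy as np

def denormalize_split(a_star, delta, Delta):
    """Map a split found on x' = (x - delta)/Delta back to the original data."""
    a_star, delta, Delta = map(np.asarray, (a_star, delta, Delta))
    a = np.empty_like(a_star, dtype=float)
    a[:-1] = a_star[:-1] / Delta
    a[-1] = a_star[-1] - np.sum(a_star[:-1] * delta / Delta)
    return a

# The denormalized split induces exactly the same partition:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * [1, 10, 100, 0.1] + [0, 5, -20, 1]
delta = np.median(X, axis=0)                        # robust normalization
Delta = np.quantile(X, 0.75, axis=0) - np.quantile(X, 0.25, axis=0)
Xn = (X - delta) / Delta
a_star = rng.standard_normal(5)
a_star /= np.linalg.norm(a_star)
a = denormalize_split(a_star, delta, Delta)
same = (Xn @ a_star[:-1] + a_star[-1] >= 0) == (X @ a[:-1] + a[-1] >= 0)
print(same.all())
```

The check works because \(\sum _j a^*_j x'_j + a^*_{d+1}\) and \(\sum _j {\overline{a}}_j x_j + {\overline{a}}_{d+1}\) are algebraically identical.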
5.5 Feature reduction
It is sometimes desirable to exclude some of the attributes in a split in order to decrease the computational effort of the CE method and to increase the interpretability of the splits. As an example, consider the situation in which a node relatively deep within the tree should be split. It is quite common that in such a situation the number of attributes is higher than the number of observations associated with the node. As a result, there exist optimal oblique splits which do not require all of the attributes or, equivalently, satisfy \(a_j=0\) for some \(j=1,\ldots ,d\). It is, however, not clear beforehand which of the attributes can be left out safely without eliminating the optimal oblique splits.
Yet, there is a simple way to eliminate features due to linear dependencies that, for every oblique split involving all of the attributes, guarantees the existence of an equivalent split which uses fewer attributes than there are observations in the dataset. The key observation for this idea is summarized by the following lemma.
Lemma 1
Let (X, y) denote a training set consisting of n observations and d features and let \(X_1,\ldots ,X_d\) denote the columns of the matrix X. Further, let \({\mathbf {1}}\) denote the n-dimensional all-ones vector. Let \(J\subseteq \lbrace 1,\ldots ,d \rbrace \) denote a maximal subset of indices such that the set of vectors \(\lbrace X_j: j\in J\rbrace \cup \lbrace {\mathbf {1}}\rbrace \) is linearly independent. Then, for every \(a\in {\mathbb {R}}^{d+1}\) there exists \({\overline{a}}\in {\mathbb {R}}^{d+1}\) with \({\overline{a}}_j=0\) for every \(j\not \in J\) that satisfies

$$\begin{aligned} \sum _{j=1}^{d} a_jX_j+a_{d+1}{\mathbf {1}}=\sum _{j=1}^{d} {\overline{a}}_jX_j+{\overline{a}}_{d+1}{\mathbf {1}}. \end{aligned}$$
Proof
As the set \(\lbrace X_j: j\in J\rbrace \cup \lbrace {\mathbf {1}}\rbrace \) is linearly independent and J is maximal, each \(X_{j'}\) for \({j'}\not \in J\) can be expressed as a linear combination of these vectors. It follows that there exist \(\lambda ^{(j')}_j\in {\mathbb {R}}\) for \(j\in J\) and \(\lambda ^{(j')}_{d+1}\in {\mathbb {R}}\) such that

$$\begin{aligned} X_{j'}=\sum _{j\in J}\lambda ^{(j')}_jX_j+\lambda ^{(j')}_{d+1}{\mathbf {1}}. \end{aligned}$$

Consequently, it holds that

$$\begin{aligned} \sum _{j=1}^{d} a_jX_j+a_{d+1}{\mathbf {1}}=\sum _{j\in J}\Big (a_j+\sum _{j'\not \in J}a_{j'}\lambda ^{(j')}_j\Big )X_j+\Big (a_{d+1}+\sum _{j'\not \in J}a_{j'}\lambda ^{(j')}_{d+1}\Big ){\mathbf {1}}. \end{aligned}$$

Defining \({\overline{a}}\in {\mathbb {R}}^{d+1}\) by

$$\begin{aligned} {\overline{a}}_j={\left\{ \begin{array}{ll} a_j+\sum _{j'\not \in J}a_{j'}\lambda ^{(j')}_j &{} \text {if } j\in J\\ a_{d+1}+\sum _{j'\not \in J}a_{j'}\lambda ^{(j')}_{d+1} &{} \text {if } j=d+1\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
concludes the proof.\(\square \)
Clearly, as \({{\,\mathrm{rank}\,}}(({\mathbf {1}}\ X))\le \min (n,d+1)\), the previous lemma implies the existence of an optimal oblique split involving fewer attributes than observations if \(n\le d\). A subset J of attributes as required by the lemma can easily be obtained by performing Gaussian elimination on the transposed matrix \(({\mathbf {1}}\ X)^T\).
One should note that the choice of attributes is not necessarily unique and can influence the heuristic search process of the proposed CE method and thus also affect the objective value of the final solution. Intuitively, to simplify the heuristic search one would rather eliminate attributes that require additional information than ones that are expressive on their own. In order to accommodate this, we propose the following strategy: one first computes the best univariate split for each attribute and then ranks the attributes by their performance. In each iteration of the Gaussian elimination procedure, we then choose the next pivot element based on this ranking if there are multiple candidates.
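The selection of a maximal linearly independent attribute subset can be sketched as follows. This is a rank-based NumPy sketch of ours (greedy inclusion in ranking order) rather than the paper's pivoted Gaussian elimination on \(({\mathbf {1}}\ X)^T\), but it yields a subset J with the property required by the lemma:

```python
import numpy as np

def reduce_features(X, ranking=None):
    """Greedily select a maximal index set J such that {X_j : j in J} together
    with the all-ones vector is linearly independent, preferring
    better-ranked attributes."""
    n, d = X.shape
    ranking = ranking if ranking is not None else range(d)
    basis = [np.ones(n)]                 # start with the all-ones vector
    J = []
    for j in ranking:                    # try attributes in order of quality
        candidate = basis + [X[:, j]]
        if np.linalg.matrix_rank(np.column_stack(candidate)) == len(candidate):
            basis = candidate            # X_j adds a new direction: keep it
            J.append(j)
    return sorted(J)

# X_3 is a linear combination of X_1, X_2 and the constant, so it is dropped
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
X = np.column_stack([X, 2 * X[:, 0] - X[:, 1] + 3])
print(reduce_features(X))  # [0, 1]
```

When \(n\le d\), at most \(n-1\) attributes survive, matching the rank bound \({{\,\mathrm{rank}\,}}(({\mathbf {1}}\ X))\le \min (n,d+1)\).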
5.6 Parallelization
Our proposed CE method can vastly benefit from parallelization. In order to determine the value of the splitting criterion for a candidate \(a\in S_d\), n scalar products have to be evaluated, one for each observation in the training set. Consequently, the algorithm spends most of its execution time computing these products. Clearly, this can easily be parallelized by dividing the samples evenly into multiple batches and assigning each batch to a different thread which computes the respective scalar products. This way, we can leverage the capabilities of modern multi-core processors to speed up the execution of our algorithm.
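The batched evaluation can also be expressed as a single matrix product, which standard BLAS backends already execute across multiple cores. This is an illustrative NumPy sketch of ours; the paper's C++ implementation uses explicit threads:

```python
import numpy as np

def split_sides(A, X):
    """Evaluate all N candidate splits against all n observations with one
    matrix product; row i of the result marks candidate i's halfspace."""
    Xh = np.hstack([X, np.ones((len(X), 1))])   # homogeneous coordinates
    return A @ Xh.T >= 0                        # (N, n) boolean matrix

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))               # N = 100 candidates, d = 3
X = rng.standard_normal((1000, 3))              # n = 1000 observations
S = split_sides(A, X)
print(S.shape)  # (100, 1000)
```

Each row of `S` can then be reduced to the value of the splitting criterion for the corresponding candidate.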
6 Evaluation
In this section, a comprehensive evaluation of our proposed method, which in the remainder of this section will be denoted by CEDT, is presented. Our algorithm was implemented in C++ and all experiments were executed on a computer equipped with an Intel Xeon E3-1231 v3 @ 3.40 GHz (4 cores) and 32 GB DDR3 RAM running Ubuntu 20.04.
We use 20 classification datasets from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017) to conduct our experiments. An overview of the employed datasets is presented in Table 1. One thing to note here is that the number of attributes includes the dummy variables resulting from one-hot encoding of categorical attributes, as explained below. We evaluate three different quality criteria: out-of-sample accuracy, tree size in terms of the number of leaf nodes, and the time necessary for building the trees.
CEDT is compared to the OC1 oblique decision tree induction algorithm and to our own implementation of the greedy top–down algorithm for inducing univariate decision trees (DT). The latter basically corresponds to the well-known CART algorithm, but by using our own implementation we ensure maximum comparability to CEDT, as only the method for deriving the splits is replaced. We also include results of the decision trees when an additional post-pruning step is executed. For this, we choose minimal cost-complexity pruning, which was introduced by Breiman et al. (1984). Additionally, for out-of-sample accuracy we compare CEDT to two other popular classification models, namely neural networks (NN) and support vector machines (SVM). Moreover, we include results for the two tree-based ensemble methods random forest (RF) and gradient boosting (GB). For these comparison classifiers we use the implementations provided in Python’s Scikit-learn library (Pedregosa et al. 2011) with default parameters.
For each dataset, 10-fold cross-validation is performed and the average performance and the standard deviation, each rounded to two decimals, are reported. For the results of the pruned decision trees, 10% of the training data is held out for pruning. Missing numerical values are imputed beforehand with the median of the values in the training set for the respective attribute. For categorical attributes, the most frequent category is used instead. Subsequently, one-hot encoding is employed to deal with categorical attributes. This preprocessing guarantees that in each run of the 10-fold cross-validation scheme all of the classifiers are trained and tested on the same subsets of observations.
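The imputation and encoding steps can be sketched as follows. The function names are our own; the paper specifies only the procedure, not its implementation. The key detail is that the medians are computed on the training fold only, so no information leaks from the test fold.

```python
import numpy as np

def fit_imputer(num_train):
    """Median of each numerical attribute, computed on the training fold only."""
    return np.nanmedian(num_train, axis=0)

def impute(num_data, medians):
    """Replace missing (NaN) entries with the training-fold medians."""
    return np.where(np.isnan(num_data), medians, num_data)

def one_hot(cat_column, categories):
    """One-hot encode a single categorical attribute, one dummy per category."""
    return (np.asarray(cat_column)[:, None] == np.asarray(categories)).astype(float)
```

For categorical attributes, the most frequent training-fold category would be filled in before calling `one_hot`, analogously to `impute`.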
We use the Gini impurity as splitting criterion, and for all three decision tree induction algorithms the recursive partitioning is stopped when a leaf node holds fewer than four samples. This is also the standard setting used by OC1. For the CE method we choose the parameters \(\alpha =0.8\), \(\rho =0.1\), \(K=3\) and \(N=\max (100,2d\log _2(n_l))\), where d denotes the number of attributes and \(n_l\) denotes the number of training observations associated with the leaf node l to be split. This choice ensures that the number of samples drawn in each iteration of the CE method for the root node grows logarithmically with the total number of training observations in the dataset. Moreover, the number of samples decreases with every split in the tree but never falls below 100. This guarantees that at least 10 elite samples are used for updating the mean direction and the concentration even when the number of training observations associated with the node to be split is very small. Note that these parameters have not been optimized for each individual dataset; they generally resulted in good performance in our preliminary experiments. In practice, the given parameters can serve as an orientation but should be tuned further, for example by increasing the number of samples, to improve the performance of the algorithm.
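The sample schedule can be written down directly. The function name is ours, and rounding the elite count with a ceiling at \(\rho N\) is our own assumption for illustration; it reproduces the "at least 10 elite samples" guarantee at the floor of \(N=100\).

```python
import math

def ce_sample_counts(d, n_leaf, rho=0.1, floor=100):
    """Samples drawn per CE iteration, N = max(floor, 2*d*log2(n_leaf)),
    together with the resulting number of elite samples ceil(rho * N)."""
    N = max(floor, int(2 * d * math.log2(n_leaf)))
    return N, math.ceil(rho * N)
```

For example, a root node with 1024 observations and 10 attributes yields N = 200, while a small node falls back to the floor of 100 samples and hence 10 elites.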
For the evaluations, we follow Demšar’s (2006) guidelines for comparing multiple classifiers and perform a Friedman test (Friedman 1937, 1940) to check whether the performances of at least two classifiers differ significantly. This non-parametric statistical test ranks the performances of the algorithms for each dataset from 1 to k, where k is the number of algorithms to be compared. The lowest rank is assigned to the best and the highest rank to the worst performing algorithm. In case of ties, the average rank is assigned. The null hypothesis is that the classifiers perform equally well, in which case the observed ranks should be approximately equal. This hypothesis is rejected if the p-value computed by the Friedman test is below a certain significance level. If a significant difference for at least two classifiers is observed, a subsequent Holm test (Holm 1979) with CEDT as control method is carried out to check which of the other methods perform significantly better or worse than CEDT. For each other method the null hypothesis is therefore that it performs equally well as CEDT. The p-values for each of those are calculated and ordered from lowest to highest. If k denotes the number of hypotheses and i the position in the ordering, the Holm test adjusts these p-values by setting \(p'_{1}=\min \lbrace 1, kp_{1} \rbrace \) and \(p'_{i}=\min \lbrace 1, \max \lbrace p'_{i-1},(k+1-i)p_{i} \rbrace \rbrace \) for \(i=2,\ldots ,k\). Smaller p-values are therefore increased more strongly than larger ones, which ensures that the family-wise error rate stays below the required significance level. Then, in order of the original p-values, the adjusted p-values are compared to the significance level until it is exceeded for the first time. All hypotheses before that point are rejected and we can conclude that there is a significant difference to the performance of the control method. For our evaluation we set the significance level for both statistical tests to 0.05.
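The step-down adjustment can be sketched as follows. The function name is ours, and for convenience the sketch returns the adjusted p-values in the original input order rather than sorted.

```python
def holm_adjust(pvalues):
    """Holm adjustment of k p-values:
    p'_1 = min(1, k*p_1) and p'_i = min(1, max(p'_{i-1}, (k+1-i)*p_i))
    along the ascending ordering of the raw p-values."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for pos, idx in enumerate(order, start=1):
        # the running maximum enforces monotonicity of the adjusted values
        running_max = min(1.0, max(running_max, (k + 1 - pos) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted
```

Each adjusted p-value can then be compared directly to the significance level of 0.05 to decide which null hypotheses are rejected.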
6.1 Comparison of accuracy
In this section, we evaluate the accuracy of our proposed method, defined as the ratio of the number of correctly classified samples to the overall number of samples in the test data.
Table 2 summarizes the average accuracies over the ten runs of the cross-validation process of the unpruned decision trees using the decision tree induction algorithms CEDT, DT and OC1 for each of the employed datasets. With a value of 1.5, CEDT has the lowest average rank of the three methods, while OC1 and DT share the second place with an average rank of 2.25. With a p-value of 0.02, the Friedman test reports a significant difference for at least two methods and the post-hoc Holm test, as summarized in Table 3, rejects both null hypotheses. This allows the conclusion that CEDT induced significantly more accurate decision trees in this evaluation.
The results for the pruned decision trees, as summarized in Table 4, are similar. CEDT again has the lowest average rank of 1.45, OC1 has a rank of 2.25 and DT’s rank is 2.3. The Friedman test returns a p-value of 0.01 and the null hypothesis is again rejected. As for the unpruned trees, the Holm test, summarized in Table 5, confirms that CEDT significantly outperforms the other two methods.
Table 6 summarizes the results of pruned CEDT, the non-tree-based classifiers NN and SVM, and the tree-based ensemble methods GB and RF. RF has the lowest average rank of 2.3, followed by GB with a rank of 2.375. With a rank of 3.225, CEDT is in third place ahead of SVM and NN with ranks 3.4 and 3.7, respectively. Although the Friedman test reports a significant difference between at least two classifiers with a p-value of \(1.3\times 10^{-2}\), the post-hoc Holm test, which is summarized in Table 7, does not provide sufficient evidence for a significant difference between CEDT and any of the other methods.
6.2 Comparison of tree size
Next, we compare the sizes of the decision trees induced by CEDT, DT and OC1. The tree size is measured in terms of the number of leaf nodes, which corresponds to the number of distinct regions into which the decision tree divides the feature space.
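Counting the leaves is a one-line recursion over the tree. The tuple-based tree representation below is our own simplification for illustration, not the data structure used in the paper.

```python
def count_leaves(node):
    """Number of leaf nodes of a binary decision tree.

    Internal nodes are represented as (left_subtree, right_subtree)
    tuples; anything else (e.g. a class label) counts as a leaf. Each
    leaf corresponds to one region of the feature space partition.
    """
    if not isinstance(node, tuple):
        return 1
    left, right = node
    return count_leaves(left) + count_leaves(right)
```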
The results for the decision tree induction algorithms without pruning are presented in Table 8. For all but three datasets, CEDT induced the smallest decision trees and thus it has the lowest rank of 1.15. OC1 is in second place with a rank of 1.9 and DT has a rank of 2.95, as it induced the largest trees for all but one dataset. With a p-value of \(8\times 10^{-8}\), the Friedman test confirms a significant difference. The subsequent Holm test, which is summarized in Table 9, rejects both null hypotheses, from which we can conclude that CEDT induced significantly smaller decision trees than DT and OC1 in our experiments.
For the pruned decision trees the results are summarized in Table 10. CEDT again has the lowest rank of 1.325 followed by OC1 with a rank of 1.925 and DT ranks last with a value of 2.75.
The Friedman test reports a significant difference with a p-value of \(5\times 10^{-5}\), but the Holm test, summarized in Table 11, only confirms that CEDT performs significantly better than DT. It does not provide sufficient evidence to reject the null hypothesis for OC1.
6.3 Comparison of induction times
Finally, the induction times for the three tree-based induction methods are summarized in Table 12. DT is the fastest method for all datasets and thus has rank 1. OC1 has a rank of 2.4 and CEDT is in third place with a rank of 2.6. Not surprisingly, the Friedman test reports a significant difference with a p-value of \(3\times 10^{-7}\) and the post-hoc Holm test, summarized in Table 13, confirms that DT is significantly faster than CEDT. It does not, however, provide evidence for a significant difference between the induction times of OC1 and CEDT.
6.4 Discussion of the results
Overall, our evaluation shows that the presented CE method is well-suited for inducing compact and accurate decision trees. The induction algorithm itself, without additional post-pruning, significantly outperforms both the univariate and the other oblique decision tree induction algorithm in terms of accuracy and tree size. We therefore conclude that the induction algorithm is more powerful than the two existing methods regarding these criteria.
With additional post-pruning, CEDT is still significantly better in terms of accuracy. Although it still has the lowest rank for tree size, this difference can only be confirmed to be significant for DT but not for OC1. We conclude that OC1 benefits more from pruning than CEDT. We assume that this observation can be explained by the fact that CEDT’s splits are more effective at dividing the feature space and therefore induce smaller trees in the first place. As a consequence, the splits are less likely to be pruned.
Compared to DT, the performance increase comes at the cost of increased induction time. This is because determining optimal univariate splits is far less complex than finding optimal oblique splits. Still, our results indicate that the increase in induction time is acceptable in many situations. Considering our results, we assume that the runtime of the algorithm is of the same order of magnitude as that of other oblique decision tree induction algorithms such as OC1.
In terms of accuracy, CEDT also achieved the lowest rank compared to the non-tree-based classifiers NN and SVM. That the difference is not significant is not surprising, as no single prediction model works best in all situations. Nonetheless, the result supports the claim that CEDT is competitive with those algorithms and can even outperform them on a variety of datasets.
Combining multiple prediction models into ensembles typically yields more robust and accurate predictors than using single models. This explains why GB and RF are the most accurate classifiers in our evaluation. Although the difference to CEDT is not confirmed to be significant, we believe that ensemble methods are generally superior to single decision trees in terms of pure prediction accuracy. The drawback is that these ensemble methods cannot be interpreted, which is the main reason to use a single decision tree instead. These ensemble methods, however, are not restricted to univariate decision trees as base models, and therefore our proposed CE method can also be employed for gradient boosting or to build oblique random forests, which is an interesting open topic for further research.
7 Conclusion
In this work, we illustrate how the problem of finding optimal oblique splits can be equivalently interpreted as locating an optimal region on a hypersphere intersected by hyperplanes defined by the observations in a dataset. This view of the problem motivates the application of the proposed CE algorithm that uses the von Mises–Fisher distribution. Our evaluation indicates that this method is well-suited for efficiently inducing accurate and compact oblique decision trees that are often superior to decision trees induced by existing algorithms and competitive with other non-tree-based prediction models. As future work, we would like to test our method on real-life classification and regression tasks. Furthermore, it will be interesting to investigate the advantageous effects of our oblique decision trees on ensemble methods such as random forests or gradient boosting.
References
Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6(Sep):1345–1382
Bertsimas D, Dunn J (2017) Optimal classification trees. Mach Learn 106(7):1039–1082. https://doi.org/10.1007/s10994-017-5633-9
Blanquero R, Carrizosa E, MoleroRío C, Morales DR (2020) Sparsity in optimal randomized classification trees. Eur J Oper Res 284(1):255–272. https://doi.org/10.1016/j.ejor.2019.12.002
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth and Brooks, Monterey
CantúPaz E, Kamath C (2003) Inducing oblique decision trees with evolutionary algorithms. IEEE Trans Evol Comput 7(1):54–68. https://doi.org/10.1109/TEVC.2002.806857
Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput 3:326–334. https://doi.org/10.1109/PGEC.1965.264137
De Boer PT, Kroese DP, Mannor S, Rubinstein RY (2005) A tutorial on the cross-entropy method. Ann Oper Res 134(1):19–67. https://doi.org/10.1007/s10479-005-5724-z
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 2 October 2020
Dhillon IS, Sra S (2003) Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, USA
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
Fisher RA (1953) Dispersion on a sphere. Proc R Soc Lond Ser A Math Phys Sci 217(1130):295–305. https://doi.org/10.1098/rspa.1953.0064
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701. https://doi.org/10.1080/01621459.1937.10503522
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92. https://doi.org/10.1214/aoms/1177731944
Heath D, Kasif S, Salzberg S (1993) Induction of oblique decision trees. In: Proceedings of the thirteenth international joint conference on artificial intelligence. Morgan Kaufmann Publishers, pp 1002–1007
Heath DG (1993) A geometric framework for machine learning. Ph.D. Thesis, Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70
Li XB, Sweigart JR, Teng JT, Donohue JM, Thombs LA, Wang SM (2003) Multivariate decision trees using linear discriminants and tabu search. IEEE Trans Syst Man Cybern Part A Syst Hum 33(2):194–205. https://doi.org/10.1109/TSMCA.2002.806499
Loh WY, Shih YS (1997) Split selection methods for classification trees. Stat Sinica 7(4):815–840
LópezChau A, Cervantes J, LópezGarcía L, Lamont FG (2013) Fisher’s decision tree. Expert Syst Appl 40(16):6283–6291. https://doi.org/10.1016/j.eswa.2013.05.044
Mannor S, Peleg D, Rubinstein R (2005) The cross entropy method for classification. In: Proceedings of the 22nd international conference on machine learning, association for computing machinery, New York, NY, USA, ICML’05, pp 561–568, https://doi.org/10.1145/1102351.1102422
Mardia KV, Jupp PE (1999) Directional statistics, vol 494. Wiley, Chichester
Mola F, Siciliano R (2002) Discriminant analysis and factorial multiple splits in recursive partitioning for data mining. In: Roli F, Kittler J (eds) Multiple classifier systems. Springer, Berlin, pp 118–126. https://doi.org/10.1007/3-540-45428-4_12
Murthy SK, Kasif S, Salzberg S, Beigel R (1993) OC1: a randomized algorithm for building oblique decision trees. In: Proceedings of AAAI. Citeseer, pp 322–327
Murthy SK, Kasif S, Salzberg S (1994) A system for induction of oblique decision trees. J Artif Intell Res 2:1–32. https://doi.org/10.1613/jair.63
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikitlearn: machine learning in Python. J Mach Learn Res 12:2825–2830
Rubinstein R (1999) The crossentropy method for combinatorial and continuous optimization. Methodol Comput Appl Probab 1(2):127–190. https://doi.org/10.1023/A:1010091220143
Rubinstein RY (1997) Optimization of computer simulation models with rare events. Eur J Oper Res 99(1):89–112. https://doi.org/10.1016/S0377-2217(96)00385-2
Rubinstein RY, Kroese DP (2004) The crossentropy method: a unified approach to combinatorial optimization, MonteCarlo simulation, and machine learning, vol 133. Springer, Berlin
Siciliano R, Aria M, D’Ambrosio A (2008) Posterior prediction modelling of optimal trees. In: Brito P (ed) COMPSTAT 2008, Physica-Verlag HD, Heidelberg, pp 323–334. https://doi.org/10.1007/978-3-7908-2084-3_27
Sra S (2012) A short note on parameter approximation for von Mises–Fisher distributions: and a fast implementation of \(I_s (x)\). Comput Stat 27(1):177–190. https://doi.org/10.1007/s00180-011-0232-x
Truong AKY (2009) Fast growing and interpretable oblique trees via logistic regression models. Ph.D. Thesis, Oxford University, Oxford, United Kingdom
Ulrich G (1984) Computer generation of distributions on the m-sphere. J R Stat Soc Ser C (Appl Stat) 33(2):158–163. https://doi.org/10.2307/2347441
Wickramarachchi D, Robertson B, Reale M, Price C, Brown J (2016) HHCART: an oblique decision tree. Comput Stat Data Anal 96:12–23. https://doi.org/10.1016/j.csda.2015.11.006
Wood AT (1994) Simulation of the von Mises–Fisher distribution. Commun Stat Simul Comput 23(1):157–164. https://doi.org/10.1080/03610919408813161
Zlochin M, Birattari M, Meuleau N, Dorigo M (2004) Modelbased search for combinatorial optimization: a critical survey. Ann Oper Res 131(1–4):373–395. https://doi.org/10.1023/B:ANOR.0000039526.52305.af
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Cite this article
Bollwein, F., Westphal, S. Oblique decision tree induction by cross-entropy optimization based on the von Mises–Fisher distribution. Comput Stat 37, 2203–2229 (2022). https://doi.org/10.1007/s00180-022-01195-7
Keywords
 Oblique decision trees
 Cross-entropy optimization
 von Mises–Fisher distribution
 Classification
 Regression