1 Introduction

Decision trees are among the most popular classification and regression models in the field of data mining and machine learning. Due to their human-comprehensible structure, they are easy to interpret, which helps provide insight into the data under consideration. The most commonly employed decision tree models are univariate, involving only a single attribute per split. These splits can be interpreted as axis-parallel hyperplanes in the feature space. The restriction to a single attribute simplifies the decision tree induction process and the interpretation of the individual splits. Nevertheless, these trees are often less accurate than other machine learning models. Moreover, they often become excessively large when the data cannot be split adequately by axis-parallel hyperplanes, which lowers their explanatory value. In these situations, the univariate splits can also be misleading as they are only meaningful in combination with other splits further down the respective branch. Oblique decision trees overcome these shortcomings by employing splits based on linear combinations of attributes, which can be interpreted as affine hyperplanes in the feature space. Although the individual splits are harder to interpret, the reduced size can increase the explanatory value, and an expert in the respective domain is able to draw meaningful conclusions from the coefficients of the oblique splits, which can be viewed as weights for the individual attributes.

The problem of finding optimal oblique splits for a given splitting criterion, however, is far more complex than finding univariate splits, which complicates the task of inducing oblique decision trees in an acceptable amount of time. Motivated by the underlying problem’s structure, we develop an efficient cross-entropy optimization (CE) approach for finding near-optimal oblique splits that can be used in the well-known recursive partitioning scheme for inducing oblique decision trees. In our evaluation, we show the advantages of our proposed method compared to the univariate recursive partitioning decision tree induction method and the popular OC1 (Murthy et al. 1993, 1994) algorithm for inducing oblique decision trees. Furthermore, we show that it is competitive with other popular prediction models.

2 Oblique decision tree induction

In this work, we address the problem of determining optimal splits for oblique decision tree induction. Given d real-valued attributes (features) \(X_1,\ldots ,X_d\) and a response variable Y taking values in the domain \(\text {dom}(Y)\), an oblique decision tree is a binary rooted tree structure for which every inner node is associated with a rule of the form

$$\begin{aligned} \sum _{j=1}^{d} a_jX_j+a_{d+1}\ge 0 \end{aligned}$$

and every leaf node is associated with a response value in \(\text {dom}(Y)\). These rules can be interpreted as hyperplanes that divide the feature space into two halfspaces. To predict the response for an unseen data point \(x\in {\mathbb {R}}^d\), one pursues the unique path from the root node to one of the leaf nodes according to the specified rules and returns the value stored in the leaf node at the end of this path. As an example, Fig. 1 shows an oblique decision tree for the well-known Iris dataset (Fisher 1936) constructed with our proposed CE method.
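
To make this traversal concrete, the following minimal Python sketch (not the authors' implementation; the node fields and function names are illustrative assumptions) stores the coefficient vector \(a\) of each inner node and follows the rules down to a leaf:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    a: Optional[np.ndarray] = None   # split coefficients (a_1, ..., a_d, a_{d+1}); None for a leaf
    left: Optional["Node"] = None    # branch taken when the rule is satisfied
    right: Optional["Node"] = None   # branch taken otherwise
    value: object = None             # response value stored in a leaf

def predict(node: Node, x: np.ndarray):
    """Follow the unique root-to-leaf path defined by the oblique rules."""
    while node.a is not None:
        # rule: sum_j a_j * x_j + a_{d+1} >= 0
        node = node.left if node.a[:-1] @ x + node.a[-1] >= 0 else node.right
    return node.value
```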

Fig. 1 Oblique decision tree for the Iris dataset constructed with the CE method

To learn high quality oblique decision trees one makes use of a training set (X, y) consisting of a matrix of n observations \(X\in {\mathbb {R}}^{n\times d}\) and a vector of responses \(y\in \text {dom}(Y)^n\). Throughout this work, we use the notation \(x_i\) for \(i=1,\ldots ,n\) to refer to a specific row of the matrix X, corresponding to the i-th observation of the training set, and \(y_i\) for \(i=1,\ldots ,n\) to refer to the response variable of the i-th observation. We follow the commonly applied strategy of greedily dividing the training data in a top–down manner until a stopping criterion is met or no further splitting is possible because the data is indistinguishable. For classification tasks the stopping criterion is usually that all leaf nodes are sufficiently pure, i.e. the response values in the associated subset of training samples are mostly equal. For continuous domains, one usually stops whenever the mean squared error or mean absolute error of the response values of the subset of training samples is below a certain threshold. Another commonly applied strategy is to stop branching along a path when the sample size associated with the terminal node falls below a predefined threshold. Finally, each leaf node is assigned a response value. For classification, this is usually the class of the majority of the associated training samples and for regression the mean value or the median of the responses is used.
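
A compact sketch of this greedy top–down scheme for classification, reusing the Node structure from the sketch above; find_best_split is a placeholder for any split search (for instance the CE method of Sect. 5), and the stopping rules are simplified to node purity and a minimum node size:

```python
def build_tree(X, y, find_best_split, min_samples=4):
    """Greedy top-down induction sketch for classification (majority-vote leaves)."""
    values, counts = np.unique(y, return_counts=True)
    if len(y) < min_samples or len(values) == 1:       # stopping criteria
        return Node(value=values[np.argmax(counts)])
    a = find_best_split(X, y)                          # coefficients of an oblique split
    mask = X @ a[:-1] + a[-1] >= 0
    if mask.all() or (~mask).all():                    # data could not be separated further
        return Node(value=values[np.argmax(counts)])
    return Node(a=a,
                left=build_tree(X[mask], y[mask], find_best_split, min_samples),
                right=build_tree(X[~mask], y[~mask], find_best_split, min_samples))
```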

To determine good splits an adequate splitting criterion has to be defined beforehand. For classification trees, criteria are typically based on impurity measures which are defined as functions \(i:{\mathcal {P}}_m\rightarrow {\mathbb {R}}_{\ge 0}\) on the standard \((m-1)\)-simplex \({\mathcal {P}}_m:=\lbrace (p_1,\ldots ,p_m)\in [0,1]^m: \sum _{k=1}^m p_k=1 \rbrace \). The most frequently used impurity measures are:

  • Classification error: \(e(p)=1-\max _{k=1,\ldots ,m} p_k\)

  • Entropy: \(h(p)=-\sum _{k=1}^{m} p_k\log p_k\)

  • Gini impurity: \(g(p)=1-\sum _{k=1}^{m} p_k^2\)

For a classification task with \(m\in {\mathbb {N}}\) different classes, let \(p=(p_1,\ldots ,p_m)\) denote the vector of relative frequencies of class labels of some subset of observations. Then, i(p) can be interpreted as a measure of heterogeneity of the class labels for the subset. Consequently, in order to evaluate a split, one uses the weighted sum of the impurities of the resulting two subsets. For regression trees typical criteria to measure the heterogeneity of a set of n observations (X, y) are:

  • Mean squared error: \(\frac{1}{n}\sum _{i=1}^n(y_i-\tilde{y_i})^2\)

  • Mean absolute error: \(\frac{1}{n}\sum _{i=1}^n|y_i-\tilde{y_i}|\)

where \(\tilde{y_i}\) denotes the predicted value for observation \(x_i\), which usually corresponds to the mean or median of the vector y. A splitting criterion is again derived by taking the weighted sum of these measures for the two subsets resulting from a split. Various other splitting criteria, such as the twoing rule for classification (Breiman et al. 1984), have been proposed in the literature but we refrain from going into further detail as the presented heuristic is applicable to all of these criteria. Throughout the remainder of this work, the splitting criterion will be denoted by the function \(q:{\mathbb {R}}^{d+1}\times {\mathbb {R}}^{n\times d}\times \text {dom}(Y)^n\longrightarrow {\mathbb {R}}\) such that \(q(a,X,y)\) expresses the value of the splitting criterion of an oblique split defined by \(a\in {\mathbb {R}}^{d+1}\) on a training set (X, y). Without loss of generality, we assume that q should be minimized.
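
For illustration, a minimal NumPy sketch of one such criterion, the weighted Gini impurity of an oblique split \(a\in {\mathbb {R}}^{d+1}\) (the function names are our own and not part of the paper):

```python
import numpy as np

def gini(y):
    """Gini impurity g(p) = 1 - sum_k p_k^2 of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def q(a, X, y):
    """Weighted Gini impurity of the split defined by a (to be minimized)."""
    mask = X @ a[:-1] + a[-1] >= 0
    left, right = y[mask], y[~mask]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```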

The focus of this work is the method of determining high quality oblique splits with respect to the employed splitting criterion and in the remainder of this work we describe our adaption of the CE method based on the von Mises–Fisher distribution.

3 Related work and contribution

Oblique decision tree induction has been an interesting topic of research over the past decades. The major problem related to this topic is the complexity of inducing these trees. Heath (1993) shows that even the task of finding a single optimal oblique split is NP-complete for some splitting criteria including classification error. One approach to overcome this problem is to build oblique decision trees in a top–down manner by recursively introducing heuristically obtained oblique splits. Various heuristics have been applied for finding near-optimal splits in this context. The first one, called CART-LC, is introduced by Breiman et al. (1984) who suggest a deterministic hill-climbing approach that sequentially updates the coefficients of the split until a local optimum is reached. To escape local optima, Heath et al. (1993) propose a simulated annealing heuristic which perturbs the hyperplane parameters one at a time. In their algorithm called OC1, Murthy et al. (1993, 1994) improve Breiman’s hill-climbing approach by introducing randomization techniques, also with the goal of avoiding premature convergence. Other strategies for finding oblique splits are based on meta-heuristics such as simulated annealing, genetic algorithms or evolutionary algorithms (Cantú-Paz and Kamath 2003) or algorithms based on logistic regression (Mola and Siciliano 2002; Truong 2009), linear discriminants (Loh and Shih 1997; Li et al. 2003; Siciliano et al. 2008; López-Chau et al. 2013) or Householder transformations (Wickramarachchi et al. 2016). Recently, mathematical optimization approaches that forgo the recursive partitioning scheme have also been proposed for inducing oblique decision trees. Bertsimas and Dunn (2017) propose an integer linear programming formulation for optimizing decision trees of a predetermined depth. Blanquero et al. (2020) instead propose a continuous optimization approach for optimizing randomized oblique decision trees. The major drawback of these approaches lies in the complexity of solving the proposed mathematical optimization programs. As a consequence, they can only be solved optimally in a reasonable amount of time for small to medium-sized datasets and for small depths.

The CE method was first introduced by Rubinstein for the estimation of rare event probabilities (Rubinstein 1997) and was later adapted for solving continuous and combinatorial optimization problems (Rubinstein 1999; Rubinstein and Kroese 2004; De Boer et al. 2005). CE is a model-based search method (Zlochin et al. 2004) that uses a parameterized probabilistic model for sampling feasible solutions that is iteratively updated until the model assigns a high probability mass to a region of near-optimal solutions. It has been successfully applied to a wide range of optimization problems and it has also been used in the context of machine learning by Mannor et al. (2005) who develop a CE method for building support vector machines to solve classification problems.

In this work, we follow the recursive top–down approach to induce oblique decision trees. We develop a CE method based on the von Mises–Fisher distribution that is well-suited for finding high-quality oblique splits for classification and regression tasks and therefore poses an interesting alternative to existing algorithms. It is inspired by the geometry of the underlying problem structure and uses easily comprehensible parameters for fine-tuning the execution of the algorithm. Our evaluation shows that it is suitable for inducing highly accurate and compact oblique decision trees in a small amount of time which makes it an interesting option for real-life applications in the field of machine learning and data analytics.

4 Preliminaries

4.1 The cross-entropy optimization method

In this section, we briefly summarize the CE method applied in this work. For a more comprehensive description of the method and its applications the interested reader is referred to Rubinstein and Kroese (2004), De Boer et al. (2005).

We consider a generic minimization problem of the form

$$\begin{aligned} \min _{a\in {\mathcal {A}}} q(a) \end{aligned}$$
(1)

with an arbitrary objective function \(q:{\mathcal {A}}\rightarrow {\mathbb {R}}\) defined on the set of feasible solutions \({\mathcal {A}}\).

Let \({\mathcal {F}}=\lbrace f_\theta : \theta \in \varTheta \rbrace \) denote a family of probability density functions (pdf) on the domain \({\mathcal {A}}\) parameterized by an element \(\theta \) in the parameter space \(\varTheta \). The CE method iteratively generates a sequence of parameters \(\theta _1,\ldots ,\theta _T\) and a sequence of levels \(\gamma _1,\ldots ,\gamma _T\) for \(T\in {\mathbb {N}}\) as follows: The first parameter \(\theta _1\) is given as an input. In each iteration \(t=1,\ldots ,T\) a random sample \(x_1,\ldots ,x_N\in {\mathcal {A}}\) of size \(N\in {\mathbb {N}}\) is drawn from \(f_{\theta _{t}}\). Then, the samples are ordered based on their performance, i.e. \(q(x_{(1)})\le \cdots \le q(x_{(N)})\), and \(\gamma _t\) is calculated as the \(\rho \)-quantile \(\gamma _t=q(x_{(\lceil \rho N\rceil )})\) for \(0<\rho <1\). Subsequently, \(\theta _{t+1}\) is derived by calculating the maximum likelihood parameter \(\theta _E\) for the elite samples \(E:=\lbrace x_i: i=1,\ldots ,N; q(x_i)\le \gamma _t\rbrace \) and setting \(\theta _{t+1}=\theta _E\). Alternatively, some smoothed update formula can be used that appropriately combines \(\theta _E\) and \(\theta _{t}\) to obtain \(\theta _{t+1}\). Intuitively, this update of the model parameter based on the best performing samples increases the new pdf’s likelihood of generating good solutions and therefore, the final pdf \(f_{\theta _T}\) concentrates its probability mass on a region of near-optimal solutions. Ideally, one hopes that the sequence of pdfs \((f_{\theta _t})_{t\in {\mathbb {N}}}\) converges to the point mass of an optimal solution \(a^*\) such that the sequence of levels \((\gamma _t)_{t\in {\mathbb {N}}}\) converges to the optimal objective value \(\gamma ^*\).
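
The following schematic Python sketch mirrors this generic loop; sample and fit_mle are placeholders for a concrete parametric family, and the optional smoothed update is omitted:

```python
import math

def cross_entropy_minimize(q, theta, sample, fit_mle, N=100, rho=0.1, T=50):
    """Generic CE loop: sample(theta, N) draws a list of N candidates from f_theta,
    fit_mle(elite) returns the maximum likelihood parameter of the elite set."""
    best = None
    for t in range(T):
        xs = sorted(sample(theta, N), key=q)          # order samples by performance
        gamma_t = q(xs[math.ceil(rho * N) - 1])       # rho-quantile level gamma_t
        elite = [x for x in xs if q(x) <= gamma_t]    # elite samples E
        theta = fit_mle(elite)                        # parameter update theta_{t+1}
        if best is None or q(xs[0]) < q(best):
            best = xs[0]                              # keep the best solution seen
    return best, theta
```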

4.2 The von Mises–Fisher distribution

The CE method relies on a family of pdfs on the solution space that has to be specified in advance. In this work, we make use of the von Mises–Fisher distribution (Fisher 1953). This distribution is often referred to as the equivalent of the Gaussian distribution on the \((d-1)\)-sphere \(S_{d-1}=\lbrace a\in {\mathbb {R}}^{d}:\Vert a \Vert _2=1\rbrace \). The pdf of the d-dimensional von Mises–Fisher distribution is given by

$$\begin{aligned} f_{\theta }(x)=C_d(\Vert \theta \Vert _2)\exp (\theta ^Tx) \end{aligned}$$

for \(\theta \in {\mathbb {R}}^d\). The parameter \(\kappa :=\Vert \theta \Vert _2\) is referred to as the concentration parameter and \(\mu :=\frac{\theta }{\Vert \theta \Vert _2}\) denotes the mean direction. Intuitively, for \(\kappa =0\) one obtains a uniform distribution on \(S_{d-1}\) and a higher value of \(\kappa \) indicates a higher concentration of the distribution around the mean direction \(\mu \). The normalization constant \(C_d(\kappa )\) is defined by

$$\begin{aligned} C_d(\kappa )=\frac{\kappa ^{\nicefrac {d}{2}-1}}{ (2\pi )^{\nicefrac {d}{2}}I_{\nicefrac {d}{2}-1}(\kappa ) } \end{aligned}$$

where \(I_\alpha \) denotes the modified Bessel function of the first kind of order \(\alpha \). Figure 2 illustrates the von Mises–Fisher distribution for different concentration parameters \(\kappa \) and a constant mean direction \(\mu \).
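
A small sketch of evaluating this density in log-space with NumPy/SciPy; it assumes SciPy's exponentially scaled Bessel function ive to avoid overflow for large \(\kappa \):

```python
import numpy as np
from scipy.special import ive, gamma  # ive(nu, k) = I_nu(k) * exp(-k)

def vmf_log_pdf(x, theta):
    """Log-density of the von Mises-Fisher distribution on S_{d-1} for a unit
    vector x and natural parameter theta = kappa * mu."""
    d = len(theta)
    kappa = np.linalg.norm(theta)
    if kappa == 0.0:                                  # uniform distribution on the sphere
        return -np.log(2 * np.pi ** (d / 2) / gamma(d / 2))
    nu = d / 2 - 1
    # log C_d(kappa) = nu*log(kappa) - (d/2)*log(2*pi) - log I_nu(kappa)
    log_Cd = nu * np.log(kappa) - (d / 2) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)
    return log_Cd + theta @ x
```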

Fig. 2 Illustration of the von Mises–Fisher distribution for different concentration parameters and a constant mean direction \(\mu =\left( \frac{1}{3},\frac{2}{3},\frac{2}{3}\right) \)

Maximum likelihood estimation To update the parameters of the pdfs in the CE method, the maximum likelihood estimates (MLE) need to be determined. As mentioned in Banerjee et al. (2005), Mardia and Jupp (1999), Dhillon and Sra (2003), given a sample \(a_1,\ldots ,a_N\in {\mathbb {R}}^{d}\) the MLE of \(\mu \) and \(\kappa \) for the von Mises–Fisher distribution are given by

$$\begin{aligned} \mu =\frac{\sum _{i=1}^{N} a_i}{\Vert \sum _{i=1}^{N} a_i \Vert _2},\quad \kappa =A_{d}^{-1}({\overline{R}}) \end{aligned}$$
(2)

where

$$\begin{aligned} A_d(\kappa )=\frac{I_{\nicefrac {d}{2}}(\kappa )}{I_{\nicefrac {d}{2}-1}(\kappa )}=\frac{\Vert \sum _{i=1}^{N} a_i \Vert _2}{N}={\overline{R}}. \end{aligned}$$
(3)

Since no analytical solution exists for the inverse of the ratio of modified Bessel functions, \(\kappa \) can only be approximated. Nevertheless, sufficiently accurate approximations are available for our purposes (a code sketch of both approximations follows the list):

  • Banerjee et al. (2005) give a simple approximation of \(\kappa \):

    $$\begin{aligned} {\hat{\kappa }}_0=\frac{{\overline{R}}(d-{\overline{R}}^2)}{1-{\overline{R}}^2} \end{aligned}$$
    (4)
  • Sra (2012) develops a more exact approximation by performing a few iterations of the Newton method starting from \({\hat{\kappa }}_0\), as derived in Eq. 4:

    $$\begin{aligned} \begin{aligned} {\hat{\kappa }}_1&={\hat{\kappa }}_0-\frac{A_d({\hat{\kappa }}_0)-{\overline{R}}}{1-A_d({\hat{\kappa }}_0)^2-\frac{d-1}{{\hat{\kappa }}_0}A_d({\hat{\kappa }}_0)}\\ {\hat{\kappa }}_2&={\hat{\kappa }}_1-\frac{A_d({\hat{\kappa }}_1)-{\overline{R}}}{1-A_d({\hat{\kappa }}_1)^2-\frac{d-1}{{\hat{\kappa }}_1}A_d({\hat{\kappa }}_1)} \end{aligned} \end{aligned}$$
    (5)

    This approximation is more accurate than the one in Eq. 4, yet the evaluation of \(A_d({\hat{\kappa }}_0)\) is expensive in high dimensions and also introduces the risk of floating point over- and underflows.
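
The sketch referenced above, implementing both approximations with NumPy and SciPy; the exponential scaling of ive cancels in the Bessel ratio \(A_d\), which mitigates the mentioned overflow issues (the function names are our own):

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function of the first kind

def A_d(d, kappa):
    """Bessel ratio I_{d/2}(kappa) / I_{d/2-1}(kappa) from Eq. 3 (the scaling cancels)."""
    return ive(d / 2, kappa) / ive(d / 2 - 1, kappa)

def estimate_kappa(R_bar, d, newton_steps=2):
    """Banerjee et al.'s closed-form approximation (Eq. 4), refined by Newton
    steps as proposed by Sra (Eq. 5)."""
    if R_bar < 1e-12:
        return 0.0
    kappa = R_bar * (d - R_bar ** 2) / (1.0 - R_bar ** 2)
    for _ in range(newton_steps):
        A = A_d(d, kappa)
        kappa -= (A - R_bar) / (1.0 - A ** 2 - (d - 1) / kappa * A)
    return kappa

def mle(samples):
    """Approximate MLE of mean direction and concentration for unit vectors (Eq. 2)."""
    s = np.sum(samples, axis=0)
    R_bar = np.linalg.norm(s) / len(samples)
    mu = s / np.linalg.norm(s)
    return mu, estimate_kappa(R_bar, samples.shape[1])
```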

Sampling from the von Mises–Fisher distribution For our CE algorithm, in order to ensure low running times, we need an efficient algorithm for sampling from the von Mises–Fisher distribution. The description of such an algorithm is beyond the scope of this paper, yet, the interested reader is referred to Wood’s algorithm (Wood 1994) which is based on Ulrich’s results (Ulrich 1984).
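
For completeness, a hedged Python transcription of such a rejection sampler in the spirit of Wood's algorithm (our own illustrative sketch, not code from the cited works):

```python
import numpy as np

def sample_vmf(mu, kappa, N, rng=None):
    """Draw N samples from the von Mises-Fisher distribution with mean direction
    mu (unit vector of length d) and concentration kappa."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(mu)
    if kappa == 0.0:                                   # uniform distribution on the sphere
        x = rng.standard_normal((N, d))
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    # rejection sampling of the component w along the mean direction (Wood 1994)
    b = (d - 1) / (2 * kappa + np.sqrt(4 * kappa ** 2 + (d - 1) ** 2))
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0 ** 2)
    ws = []
    while len(ws) < N:
        z = rng.beta((d - 1) / 2, (d - 1) / 2)
        w = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        if kappa * w + (d - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
            ws.append(w)
    w = np.array(ws)[:, None]
    # uniform directions orthogonal to mu, then combine: x = w*mu + sqrt(1-w^2)*v
    v = rng.standard_normal((N, d))
    v -= (v @ mu[:, None]) * mu
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return w * mu + np.sqrt(1 - w ** 2) * v
```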

5 The cross-entropy method for optimization of oblique splits

5.1 Geometric motivation

In this work, the problem under consideration is finding the coefficients \(a\in {\mathbb {R}}^{d+1}\) of an oblique split on a training set (X, y) that minimizes a certain splitting criterion q. Note that the quality of an oblique split does not depend on the length \(\Vert a \Vert _2\) of the vector \(a\in {\mathbb {R}}^{d+1}\) but solely on its direction \(\nicefrac {a}{\Vert a\Vert _2}\). Thus, as \(a=0\) is not a meaningful solution, we can, more adequately, identify the feasible region with the d-sphere \(S_d=\lbrace a\in {\mathbb {R}}^{d+1}:\Vert a \Vert _2=1\rbrace \) instead of \({\mathbb {R}}^{d+1}\). Hence, the problem can alternatively be expressed as:

$$\begin{aligned} \min _{a\in S_d} q(a,X,y) \end{aligned}$$
(6)

Note that each observation \(x_i\) for \(i=1,\ldots , n\) from the training set X defines a hyperplane \(H(x_i):=\lbrace a\in {\mathbb {R}}^{d+1}: a_1x_{i1}+\cdots +a_dx_{id}+a_{d+1}= 0 \rbrace \) which divides \(S_d\) into two half spheres

$$\begin{aligned} S_d^+(x_i):=\left\{ a\in S_d: \sum _{j=1}^{d} a_jx_{ij}+a_{d+1}\ge 0 \right\} \end{aligned}$$

and

$$\begin{aligned} S_d^-(x_i):=\left\{ a\in S_d:\sum _{j=1}^{d} a_jx_{ij}+a_{d+1}< 0 \right\} . \end{aligned}$$

Observation \(x_i\) satisfies the rule \(\sum _{j=1}^{d} a_jX_j+a_{d+1}\ge 0\) if and only if \(a\in S_d^+(x_i)\). Therefore, the nonempty sets in \({\mathcal {R}}:=\lbrace R= \bigcap _{i=1}^n S_d^{\sigma _i}(x_i): \sigma \in \lbrace +,-\rbrace ^n\ \wedge \ R\ne \emptyset \rbrace \) correspond to connected regions of solutions on the sphere \(S_d\) with the same value for the splitting criterion. Hence, the problem of finding an optimal oblique split is equivalent to finding an optimal region \(R\in {\mathcal {R}}\). It follows from Cover’s findings (Cover 1965) that there are up to \(2\sum _{j=0}^{d} \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \) regions in \({\mathcal {R}}\) which also explains the complexity of finding an optimal oblique split and the necessity of relying on heuristics. Figure 3 shows the resulting regions on \(S_2\) for a dataset with 30 observations chosen uniformly at random from \([-1,1]\times [-1,1]\).

Fig. 3 Geometric interpretation of the solution space for a dataset with 30 observations randomly chosen from \([-1,1]\times [-1,1]\). The solution space can be represented by \(S_2\) and the dataset divides it into multiple connected regions

The previous observation motivates a heuristic that randomly samples points from \(S_d\) and iteratively updates the sampling distribution by favoring the best points until the sampling density assigns most of its probability mass to a near-optimal region on the d-sphere. The CE optimization framework is suitable for achieving that goal and the von Mises–Fisher distribution provides an efficient parametric sampling scheme and easy-to-compute, accurate approximations of the MLEs which are necessary for the CE method.

5.2 Algorithmic description

Algorithm 1 (pseudocode of the proposed CE method)

A summary of our proposed CE method is presented in Algorithm 1. The algorithm takes a training set (X, y) and the splitting criterion as input. Apart from that, the sample size \(N\in {\mathbb {N}}\), the quantile parameter \(0<\rho <1\), the smoothing parameter \(0\le \alpha <1\) and a termination parameter \(K\in {\mathbb {N}}_0\) have to be specified.

First, our algorithm normalizes the dataset X. The resulting near-optimal solution \(a^*\) is denormalized at the end of the algorithm such that it divides the original dataset in the same way as the solution found for the normalized dataset. The reason for normalizing is discussed in Sect. 5.4.

Initially, we set the iteration counter \(t=1\). The parameter \(\kappa _1\) is initialized to zero and \(\mu _1\) is set to the \((d+1)\)-dimensional vector of zeros. This way, the first sample is drawn from the uniform distribution on the d-sphere \(S_d\). The incumbent solution \(a^*\) is also initialized uniformly at random. The variable \(\gamma ^*\), which is used to track the minimum value of the level parameters \(\gamma _t\), is initially set to infinity. To count the number of consecutive iterations without improvement of \(\gamma ^*\), the auxiliary variable k is introduced and initialized to zero.

In the t-th iteration, the algorithm draws N samples from the von Mises–Fisher distribution with density \(f_{\theta _{t}}\) for \(\theta _{t}=\kappa _{t}\mu _{t}\) using Wood’s algorithm. The samples are sorted in increasing order based on their performance and the level \(\gamma _t\) is computed as the \(\rho \)-quantile of the sample performances. If the best sample is better than the incumbent solution \(a^*\), the incumbent solution is updated accordingly. If \(\gamma _t<\gamma ^*\), the variable \(\gamma ^*\) is updated and k is reset to zero. Otherwise, the counter k is increased by one. Subsequently, the set of elite samples E with an objective value smaller than or equal to \(\gamma _t\) is determined and the maximum likelihood estimates \(\mu _{E}\) and \(\kappa _{E}\) for the elite samples are approximated as described in Sect. 4.2. The new parameters \(\kappa _{t+1}\) and \(\mu _{t+1}\) are then derived using the smooth update formulas

$$\begin{aligned} \mu _{t+1}=\frac{\alpha \mu _{t}+(1-\alpha )\mu _{E}}{\Vert \alpha \mu _{t}+(1-\alpha )\mu _{E}\Vert _2},\quad \kappa _{t+1}=\alpha \kappa _{t}+(1-\alpha )\kappa _{E} \end{aligned}$$

which combine the current mean direction and concentration with the maximum likelihood estimates for the elite samples. Note that the mean direction of the von Mises–Fisher distribution is a unit vector which explains the scaling in the update formula for \(\mu _{t+1}\). Without scaling, the concentration would be affected. If the parameter \(\alpha \) is set to 0, no smoothing is carried out and the new parameters for \(\mu _{t+1}\) and \(\kappa _{t+1}\) are set directly to the maximum likelihood estimates of the elite samples.

The algorithm stops when the minimum observed level \(\gamma ^*\) has not been updated for K consecutive iterations. This indicates that the algorithm has converged to a region of similar solutions and no further improvement of the incumbent solution can be expected.
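
Putting the pieces together, the following condensed Python sketch mirrors the procedure summarized in Algorithm 1. It reuses the helpers sample_vmf, mle and q from the earlier sketches; normalization and denormalization (Sect. 5.4) as well as the parallel evaluation (Sect. 5.6) are omitted for brevity:

```python
import numpy as np

def ce_oblique_split(X, y, q, N=100, rho=0.1, alpha=0.8, K=3, rng=None):
    """Approximate the best oblique split a in S_d for the criterion q(a, X, y)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    mu, kappa = np.zeros(d + 1), 0.0                     # kappa = 0: uniform sampling on S_d
    a_best, best_val = None, np.inf
    gamma_star, k = np.inf, 0
    while k < K:                                         # stop after K iterations without improvement
        A = sample_vmf(mu, kappa, N, rng)                # draw N candidate splits
        scores = np.array([q(a, X, y) for a in A])
        order = np.argsort(scores)
        if scores[order[0]] < best_val:                  # update the incumbent solution
            best_val, a_best = scores[order[0]], A[order[0]]
        gamma_t = scores[order[int(np.ceil(rho * N)) - 1]]  # rho-quantile level
        if gamma_t < gamma_star:
            gamma_star, k = gamma_t, 0
        else:
            k += 1
        elite = A[scores <= gamma_t]                     # elite samples
        mu_E, kappa_E = mle(elite)
        mu = alpha * mu + (1 - alpha) * mu_E             # smoothed parameter update
        mu /= np.linalg.norm(mu)
        kappa = alpha * kappa + (1 - alpha) * kappa_E
    return a_best
```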

5.3 Illustration

In this section, we briefly illustrate the execution of our proposed algorithm for the Iris dataset. For illustration purposes, the dataset is restricted to the attributes sepal length and sepal width and it is normalized using the robust scaling method described in the next section. The weighted Gini impurity is used as the objective function and the parameters are set to \(\rho =0.1\), \(\alpha =0.8\) and \(K=1\).

Figure 4 summarizes the progress of our method. As we can see, in the first iteration the samples are taken uniformly at random from \(S_2\). In the following iterations the mean direction converges and the concentration increases such that in the final iteration the pdf \(f_{\theta _6}\) concentrates its probability mass tightly around a region of optimal solutions. It can be shown that the optimal solution has an objective value of \(\nicefrac {1}{3}\) which corresponds to the value taken on by the level parameter \(\gamma _5\). The algorithm terminates after iteration \(t=6\) as the level \(\gamma ^*\) could not be improved further.

Fig. 4 Progress of the cross-entropy method for the Iris dataset restricted to the first two attributes. The samples drawn in each iteration are represented by a cross and the elite samples by a circle. The arrow represents the mean direction \(\mu _{t}\) used for sampling in iteration \(t=1,\ldots ,6\)

Figure 5 shows the obtained split for both the normalized and the original dataset, and we can see that the split perfectly separates the samples of class setosa from the rest of the dataset.

Fig. 5 The resulting oblique split for the normalized and the original Iris dataset, restricted to the first two attributes

5.4 Normalization

Empirically, we found that our algorithm can vastly benefit from data normalization techniques. We observed that normalization improves the quality of the splits as well as the runtime of our algorithm. The reason for this is that without normalization, the regions in \({\mathcal {R}}\) are not spread evenly across the sphere. To illustrate this, we again consider the Iris dataset restricted to the attributes sepal length and sepal width. Figure 6 shows how the \(S_2\) sphere is divided into regions by the samples in the dataset before and after normalization. For this example, we used robust normalization, i.e. we set \(x'_{ij}=\frac{x_{ij}-Med(X_j)}{IQR(X_j)}\) where \(Med(X_j)\) and \(IQR(X_j)\) correspond to the median and the interquartile range of feature \(X_j\), respectively.

Fig. 6 Illustration of the solution space for the first two attributes of the Iris dataset before and after normalization

As we can see, without normalization most of the regions in \({\mathcal {R}}\) are concentrated very tightly within a small portion of the unit sphere, while after normalization the regions are spread out more evenly. Thus, normalization enables our algorithm to find meaningful solutions earlier in the optimization process which improves the overall runtime.

Additionally, normalization can help to reduce numerical instabilities in the sampling process and the approximation of the maximum likelihood estimates as the regions of equivalent solutions become larger and thus, smaller values of \(\kappa \) are required in the later stages of our algorithm.

Other normalization variants which might lead to better results in certain situations include:

  • Min–Max-Scaling: \(x'_{ij}=\frac{x_{ij}-\mathrm{Min}(X_j)}{\mathrm{Max}(X_j)-\mathrm{Min}(X_j)}\)

  • Standard-Scaling: \(x'_{ij}=\frac{x_{ij}-\mathrm{Mean}(X_j)}{\sigma (X_j)}\)

Here \(\mathrm{Mean}(X_j)\), \(\mathrm{Max}(X_j)\), \(\mathrm{Min}(X_j)\) and \(\sigma (X_j)\) denote the mean, maximum, minimum and standard deviation of feature \(X_j\), respectively.

At the end of our algorithm, we denormalize the obtained oblique split such that it divides the dataset in the same manner as the split for the normalized data. If we use any normalization method of the form \(x'_{ij}=\frac{x_{ij}-\delta _j}{\varDelta _j}\), this can be achieved by setting

$$\begin{aligned} {\overline{a}}_j:={\left\{ \begin{array}{ll} \frac{a^*_j}{\varDelta _j} ,&{} \text {if}\quad j<d+1\\ a^*_{d+1}-\sum _{j=1}^d \frac{a^*_j\delta _j}{\varDelta _j}, &{}\text {if}\quad j=d+1 \end{array}\right. } \end{aligned}$$

for \(j=1,\ldots ,d+1\). Then, \({\overline{a}}\) has the same value of the splitting criterion on (X, y) as \(a^*\) on the normalized dataset \((X',y)\).
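
A small sketch of robust scaling together with this denormalization step (helper names are our own, not from the paper):

```python
import numpy as np

def robust_scale(X):
    """Robust scaling x' = (x - median) / IQR as used above; returns delta and Delta."""
    delta = np.median(X, axis=0)
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    Delta = q75 - q25
    Delta[Delta == 0] = 1.0                       # guard against constant features
    return (X - delta) / Delta, delta, Delta

def denormalize_split(a_star, delta, Delta):
    """Map a split found on the normalized data back to the original scale."""
    a = np.empty_like(a_star)
    a[:-1] = a_star[:-1] / Delta
    a[-1] = a_star[-1] - np.sum(a_star[:-1] * delta / Delta)
    return a
```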

5.5 Feature reduction

It is sometimes desirable to exclude some of the attributes in a split in order to decrease the computational effort of the CE method and to increase the interpretability of the splits. As an example, consider the situation in which a node relatively deep within the tree should be split. It is quite common that in such a situation the number of attributes is higher than the number of observations associated with the node. As a result, there exist optimal oblique splits which do not require all of the attributes or, equivalently, satisfy \(a_j=0\) for some \(j=1,\ldots ,d\). It is, however, not clear beforehand which of the attributes can be left out safely without eliminating the optimal oblique splits.

Yet, there is a simple way to eliminate features due to linear dependencies that, for every oblique split involving all of the attributes, guarantees the existence of an equivalent split which uses fewer attributes than there are observations in the dataset. The key observation for this idea is summarized by the following lemma.

Lemma 1

Let (X, y) denote a training set consisting of n observations and d features and let \(X_1,\ldots ,X_d\) denote the columns of the matrix X. Further, let \({\mathbf {1}}\) denote the n-dimensional all-ones vector. Let \(J\subseteq \lbrace 1,\ldots ,d \rbrace \) denote a maximal subset of indices such that the set of vectors \(\lbrace X_j: j\in J\rbrace \cup \lbrace {\mathbf {1}}\rbrace \) is linearly independent. Then, for every \(a\in {\mathbb {R}}^{d+1}\) there exists \({\overline{a}}\in {\mathbb {R}}^{d+1}\) with \({\overline{a}}_j=0\) for every \(j\not \in J\) that satisfies

$$\begin{aligned} \sum _{j=1}^{d} a_jX_{j}+a_{d+1}{\mathbf {1}}= \sum _{j=1}^{d} {\overline{a}}_jX_{j}+{\overline{a}}_{d+1}{\mathbf {1}}. \end{aligned}$$

Proof

As the set \(\lbrace X_j: j\in J\rbrace \cup \lbrace {\mathbf {1}}\rbrace \) is linearly independent and J is maximal, each \(X_{j'}\) for \({j'}\not \in J\) can be expressed as a linear combination of these vectors. It follows that there exist \(\lambda _j\in {\mathbb {R}}\) for \(j\in J\) and \(\lambda _{d+1}\in {\mathbb {R}}\) such that

$$\begin{aligned} \sum _{j\in J} \lambda _jX_j+ \lambda _{d+1}{\mathbf {1}}= \sum _{j'\not \in J} a_{j'}X_{j'}. \end{aligned}$$

Consequently, it holds that

$$\begin{aligned} \sum _{j=1}^{d} a_jX_{j}+a_{d+1}{\mathbf {1}}=\sum _{j\in J} (a_j+\lambda _j)X_{j}+(a_{d+1}+\lambda _{d+1}){\mathbf {1}} \end{aligned}$$

Defining \({\overline{a}}\in {\mathbb {R}}^{d+1}\) by

$$\begin{aligned} {\overline{a}}_j:={\left\{ \begin{array}{ll} (a_j+\lambda _j) ,&{} \text {if}\quad j\in J\cup \lbrace d+1\rbrace \\ 0 ,&{} \text {else} \end{array}\right. } \end{aligned}$$

concludes the proof.\(\square \)

Clearly, as \({{\,\mathrm{rank}\,}}(({\mathbf {1}}\ X))\le \min (n,d+1)\) the previous lemma implies the existence of an optimal oblique split involving fewer attributes than observations if \(n\le d\). A subset J of attributes as required by the lemma can easily be obtained by performing Gaussian elimination on the transposed matrix \(({\mathbf {1}}\ X)^T\).

One should note that the choice of attributes is not necessarily unique and can influence the heuristic search process of the proposed CE method and thus, also affect the objective value of the final solution. Intuitively, to simplify the heuristic search one would rather eliminate attributes that require additional information than ones that are expressive on their own. In order to account for this, we propose the following strategy: One first computes the best univariate split for each attribute and then ranks the attributes by their performance. In each iteration of the Gaussian elimination procedure, we then choose the next pivot element based on this ranking if there are multiple candidates.
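
The following sketch selects such a subset J as a greedy, rank-ordered variant of the pivoted Gaussian elimination described above (attribute_ranking is assumed to list the attribute indices from best to worst univariate split; this is an equivalent reformulation for illustration, not the authors' exact procedure):

```python
import numpy as np

def select_features(X, attribute_ranking, tol=1e-10):
    """Return a maximal subset J of attribute indices such that the columns
    {X_j : j in J} together with the all-ones vector are linearly independent,
    preferring attributes with better univariate splits."""
    n, d = X.shape
    basis = [np.ones(n)]                               # the all-ones vector is always included
    J = []
    for j in attribute_ranking:                        # best-ranked attribute first
        candidate = np.column_stack(basis + [X[:, j]])
        if np.linalg.matrix_rank(candidate, tol=tol) > len(basis):
            basis.append(X[:, j])
            J.append(j)
        if len(basis) == min(n, d + 1):                # maximal possible rank reached
            break
    return sorted(J)
```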

5.6 Parallelization

Our proposed CE method can vastly benefit from parallelization. In order to determine the value of the splitting criterion for a sample \(a\in S_d\), n scalar products have to be evaluated, one for each sample in the training set. Consequently, the algorithm spends most of its execution time computing those. Clearly, this can easily be parallelized by dividing the samples evenly into multiple batches and assigning each batch to a different thread which computes the respective scalar products. This way, we can leverage the capabilities of modern multicore processors to speed up the execution time of our algorithm.
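
As an illustration, a NumPy-based sketch of such batched evaluation using threads; NumPy's BLAS-backed matrix products typically release the GIL, so plain Python threads already give a speed-up here. This is our own sketch, not the authors' C++ implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_values(X, A, n_threads=4):
    """Evaluate sum_j a_j x_ij + a_{d+1} for every observation i and every
    sampled split a in A (shape (N, d+1)) by processing row-batches in parallel."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])            # append the constant column
    batches = np.array_split(np.arange(X.shape[0]), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda idx: Xh[idx] @ A.T, batches)
    return np.vstack(list(parts))                            # shape (n, N)
```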

6 Evaluation

In this section, a comprehensive evaluation of our proposed method, which in the remainder of this section will be denoted by CE-DT, is presented. Our algorithm was implemented in C++ and all experiments were executed on a computer equipped with an Intel Xeon E3-1231v3 @3.40 GHz (4 Cores) and 32 GB DDR3 RAM running Ubuntu 20.04.

We use 20 classification datasets from the UCI Machine Learning repository (Dheeru and Karra Taniskidou 2017) to conduct our experiments. An overview of the employed datasets is presented in Table 1. One thing to note here is that the number of attributes includes the dummy variables resulting from one-hot encoding of categorical attributes, as explained below. We evaluate three different quality criteria which are out-of-sample accuracy, tree size in terms of the number of leaf nodes and the time necessary for building the trees.

Table 1 Dataset description

CE-DT is compared to the OC1 oblique decision tree induction algorithm and to our own implementation of the greedy top–down algorithm for inducing univariate decision trees (DT). The latter basically corresponds to the well-known CART algorithm but by using our own implementation, we ensure maximum comparability to CE-DT as only the method for deriving the splits is replaced. We also include results of the decision trees when an additional post-pruning step is executed. For this, we choose minimal cost-complexity pruning, which was introduced by Breiman et al. (1984). Additionally, for out-of-sample accuracy we compare CE-DT to two other popular classification models, namely neural networks (NN) and support vector machines (SVM). Moreover, we include results for the two tree-based ensemble methods random forest (RF) and gradient boosting (GB). For these comparison classifiers we use the implementations provided in Python’s Scikit-learn library (Pedregosa et al. 2011) with default parameters.

For each dataset 10-fold cross-validation is performed and the average performance and the standard deviation, each rounded to two decimals, are reported. For the results of the pruned decision trees 10% of the training data is held out for pruning. Missing numerical values are imputed beforehand with the median of the values in the training set for the respective attribute. For categorical attributes the most frequent category is used instead. Subsequently, one-hot encoding is employed to deal with categorical attributes. This preprocessing guarantees that in each run of the 10-fold cross-validation scheme all of the classifiers are trained and tested on the same subsets of observations.

We use the Gini impurity as a splitting criterion and for all three decision tree induction algorithms the recursive partitioning is stopped when a leaf node holds fewer than four samples. This is also the standard setting used by OC1. For the CE method we choose the parameters \(\alpha =0.8\), \(\rho =0.1\), \(K=3\) and \(N=\max (100,2d\log _2(n_l))\) where d denotes the number of attributes and \(n_l\) denotes the number of training observations associated with the leaf node l to be split. This choice ensures that the number of samples drawn in each iteration of the CE method for the root node grows logarithmically with the total number of training observations in the dataset. Moreover, the number of samples decreases with every split in the tree but it never falls below 100. This guarantees that at least 10 elite samples for updating the mean direction and the concentration are used even when the number of training observations associated with the node to be split is very small. Note that these parameters have not been optimized specifically for each individual dataset but they generally resulted in a good performance in our preliminary experiments. In practice the given parameters can serve as an orientation but should be tuned further, for example by increasing the number of samples, to improve the performance of the algorithm.

For the evaluations, we follow Demšar’s (2006) guidelines to compare multiple classifiers and perform a Friedman test (Friedman 1937, 1940) to check whether the performances of at least two classifiers differ significantly. This non-parametric statistical test ranks the performances of the algorithms for each dataset from 1 to k where k is the number of algorithms to be compared. The lowest rank is assigned to the best and the highest rank to the worst performing algorithm. In case of ties the average rank is assigned. The null-hypothesis is that the classifiers perform equally well and in this case the observed ranks should be approximately equal. This hypothesis is rejected if the p-value computed by the Friedman test is below a certain significance level. If a significant difference for at least two classifiers is observed, a subsequent Holm test (Holm 1979) with CE-DT as control method is carried out to check which of the other methods perform significantly better or worse than CE-DT. For each of the other methods the null-hypothesis is therefore that it performs as well as CE-DT. The p-values for each of those are calculated and ordered from lowest to highest. If k denotes the number of hypotheses and i the position in the ordering, the Holm test adjusts these p-values by setting \(p'_{1}=\min \lbrace 1, kp_{1} \rbrace \) and \(p'_{i}=\min \lbrace 1, \max \lbrace p'_{i-1},(k+1-i)p_{i} \rbrace \rbrace \) for \(i=2,\ldots ,k\). Smaller p-values are therefore increased more strongly than higher ones which ensures that the family-wise error rate is below the required significance level. Then, in order of the original p-values, the adjusted p-values are compared to the significance level until it is exceeded for the first time. All hypotheses before that are rejected and we can conclude that there is a significant difference to the performance of the control method. For our evaluation we set the significance level for both statistical tests to 0.05.
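
For reference, a short sketch of the Friedman test (via SciPy) and the Holm adjustment exactly as described above; the layout of the accuracies array is an assumption for illustration:

```python
import numpy as np
from scipy.stats import friedmanchisquare

def holm_adjust(p_values):
    """Holm step-down adjustment: p'_1 = min(1, k*p_1) and
    p'_i = min(1, max(p'_{i-1}, (k+1-i)*p_i)) for increasingly ordered p-values."""
    p_values = np.asarray(p_values, dtype=float)
    order = np.argsort(p_values)
    k = len(p_values)
    adjusted, prev = np.empty(k), 0.0
    for i, idx in enumerate(order, start=1):
        prev = min(1.0, max(prev, (k + 1 - i) * p_values[idx]))
        adjusted[idx] = prev
    return adjusted

# accuracies: one row per dataset, one column per classifier (illustrative layout)
# stat, p = friedmanchisquare(*[accuracies[:, j] for j in range(accuracies.shape[1])])
```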

6.1 Comparison of accuracy

In this section, we evaluate the accuracy of our proposed method, which is defined as the ratio of the number of correctly classified samples to the overall number of samples in the testing data.

Table 2 summarizes the average accuracies over the ten runs of the cross-validation process of the unpruned decision trees using the decision tree induction algorithms CE-DT, DT and OC1 for each of the employed datasets. With a value of 1.5 CE-DT has the lowest average rank of the three methods and OC1 and DT share the second place with an average rank of 2.25. With a p-value of 0.02 the Friedman test reports a significant difference for at least two methods and the post-hoc Holm test, as summarized in Table 3, rejects both null-hypotheses. This allows the conclusion that CE-DT induced significantly more accurate decision trees in this evaluation.

Table 2 Comparison of accuracy between unpruned CE-DT, DT and OC1
Table 3 Holm post-hoc test for accuracy of the unpruned trees

The results for the pruned decision trees, as summarized in Table 4, are similar. CE-DT again has the lowest average rank of 1.45, OC1 has a rank of 2.25 and DT’s rank is 2.3. The Friedman test returns a p-value of 0.01 and the null-hypothesis is again rejected. As for the unpruned trees, the Holm test, summarized in Table 5, confirms that CE-DT significantly outperforms the other two methods.

Table 4 Comparison of accuracy between pruned CE-DT, DT and OC1
Table 5 Holm post-hoc test for accuracy of the pruned trees

Table 6 summarizes the results of pruned CE-DT, the non-tree-based classifiers NN and SVM and the tree-based ensemble methods GB and RF. RF has the lowest average rank of 2.3, followed by GB with a rank of 2.375. With a rank of 3.225 CE-DT is in third place before SVM and NN with ranks 3.4 and 3.7, respectively. Though with a p-value of \(1.3\times 10^{-2}\) the Friedman test reports a significant difference between at least two classifiers, the post-hoc Holm test, which is summarized in Table 7, does not provide sufficient evidence for a significant difference between CE-DT and any of the other methods.

Table 6 Comparison of accuracy between pruned CE-DT, NN, SVM, GB and RF
Table 7 Holm post-hoc test for accuracy of pruned CE-DT, NN, SVM, RF and GB

6.2 Comparison of tree size

Next, we compare the sizes of the decision trees induced by CE-DT, DT and OC1. The tree size is measured in terms of the number of leaf nodes which corresponds to the number of distinct regions the feature space is divided into by the decision tree.

The results for the decision tree induction algorithms without pruning are presented in Table 8. For all but three datasets CE-DT induced the smallest decision trees and thus it has the lowest average rank of 1.15. OC1 is in second place with a rank of 1.9 and DT has a rank of 2.95 as it induced the largest trees for all but one dataset. With a p-value of \(8\times 10^{-8}\) the Friedman test confirms a significant difference. The subsequent Holm test, which is summarized in Table 9, rejects both null-hypotheses from which we can conclude that CE-DT induced significantly smaller decision trees than DT and OC1 in our experiments.

Table 8 Comparison of tree size between unpruned CE-DT, DT and OC1
Table 9 Holm post-hoc test for tree size of the unpruned trees

For the pruned decision trees the results are summarized in Table 10. CE-DT again has the lowest rank of 1.325 followed by OC1 with a rank of 1.925 and DT ranks last with a value of 2.75.

Table 10 Comparison of tree size between pruned CE-DT, DT and OC1

The Friedman test reports a significant difference with a p-value of \(5\times 10^{-5}\) but the Holm test, summarized in Table 11, only confirms that CE-DT performs significantly better than DT. It does not provide sufficient evidence that the null-hypothesis for OC1 can be rejected.

Table 11 Holm post-hoc test for tree size of the pruned trees

6.3 Comparison of induction times

Finally, the induction times for the three tree-based induction methods are summarized in Table 12. DT is the fastest method for all datasets and thus has an average rank of 1. OC1 has a rank of 2.4 and CE-DT is in third place with a rank of 2.6. Not surprisingly, the Friedman test reports a significant difference with a p-value of \(3\times 10^{-7}\) and the post-hoc Holm test, summarized in Table 13, confirms that DT is significantly faster than CE-DT. It does, however, not provide evidence for a significant difference between the induction times of OC1 and CE-DT.

Table 12 Comparison of induction time between unpruned CE-DT, DT and OC1
Table 13 Holm post-hoc test for induction time of the unpruned trees

6.4 Discussion of the results

Overall our evaluation shows that the presented CE method is well-suited for inducing compact and accurate decision trees. The induction algorithm itself, without additional post-pruning, significantly outperforms both the univariate and the other oblique decision tree induction algorithm in terms of accuracy and tree size. We therefore conclude that the induction algorithm is more powerful than the two existing methods regarding these criteria.

With additional post-pruning, CE-DT is still significantly better in terms of accuracy. Although it still has the lowest rank for tree size, this difference can only be confirmed to be significant for DT but not for OC1. We conclude that OC1 could benefit more from pruning than CE-DT. We assume that this observation can be explained by the fact that CE-DT’s splits are more effective at dividing the feature space and therefore induce smaller trees in the first place. As a consequence, the splits are less likely to be pruned.

Compared to DT, the performance increase comes at the cost of increased induction time. This is due to the fact that determining optimal univariate splits is far less complex than finding optimal oblique splits. Still, our results indicate that the increase in induction time is acceptable in many situations. Considering our results, we assume that the runtime of the algorithm is of the same order of magnitude as that of other oblique decision tree induction algorithms such as OC1.

In terms of accuracy, CE-DT also managed to achieve the lowest rank compared to the non-tree-based classifiers NN and SVM. The fact that the difference is not significant is not surprising, as there is clearly no prediction model that works best in all situations. Nonetheless, the result supports the claim that CE-DT is competitive with those algorithms and can even outperform them for a variety of datasets.

Combining multiple prediction models into ensembles typically yields more robust and accurate predictors than using single models. This explains why GB and RF are the most accurate classifiers in our evaluation. Although the difference to CE-DT is not confirmed to be significant, we believe that ensemble methods are generally superior in terms of pure prediction accuracy over single decision trees. The drawback is that these ensemble methods cannot be interpreted, which is the main reason to use a single decision tree instead. These ensemble methods, however, are not restricted to univariate decision trees as base models and therefore, our proposed CE method can also be employed for gradient boosting or to build oblique random forests, which is an interesting open topic for further research.

7 Conclusion

In this work, we illustrate how the problem of finding optimal oblique splits can be equivalently interpreted as locating an optimal region on a hypersphere intersected by hyperplanes defined by the observations in a dataset. This view of the problem motivates the application of the proposed CE algorithm that uses the von Mises–Fisher distribution. Our evaluation indicates that this method is well-suited for efficiently inducing accurate and compact oblique decision trees that are often superior to decision trees induced by existing algorithms and competitive with other non-tree-based prediction models. As future work, we would like to test our method on real-life classification and regression tasks. Furthermore, it will be interesting to investigate the advantageous effects of our oblique decision trees on ensemble methods such as random forests or gradient boosting.