1 Introduction

Outlier detection is an important data mining task and often one of the first steps when acquiring new data. In the financial sector, it is used to detect transactional fraud [24], money laundering [12, 13], and to solve many other related problems [4].

Due to its simplicity and speed, isolation forest (IF) is one of the most popular outlier detection algorithms [16]. It operates on the key observation that decision trees tend to isolate outlier examples relatively early in the tree. Thus, the path length of an example when sorted into a tree gives a (somewhat crude) indication of the outlierness of the observation. IF leverages this empirical insight into an ensemble algorithm that trains multiple isolation trees on bootstrap samples and scores observations based on their average path length. Due to its popularity, multiple variations of IF have been proposed. SCiForest proposes to select the split/feature combination more carefully by introducing a split criterion in [17], whereas the extended isolation forest (EIF) uses arbitrary random slopes instead of axis-aligned splits for splitting to improve its performance [14].

While a decent body of the literature exists on IF, there seems to be a gap in the theoretical understanding of it. More specifically, there seems to be no direct connection between the performance of IF and its variations and the assumptions we may have on the underlying data distribution. In this paper, we investigate this connection more carefully and analyze IF-based approaches from a distributional point of view. We show that all IF-based approaches approximate the underlying probability distribution and that the average path length can be considered an approximation of mixture weights if the data were generated by a mixture distribution. To leverage these insights, we propose the generalized isolation forest (GIF) algorithm. We show that GIF has better outlier detection performance than the IF, EIF, and SCiForest with comparable or better runtime. Additionally, we compare GIF against 9 nearest-neighbor outlier detection algorithms and show that GIF delivers state-of-the-art performance. Our contributions are as follows:

  • We theoretically show that tree-based methods approximate the underlying probability distribution and give a lower bound for the approximation error of fully grown trees.

  • We show that the average path length can be viewed as a (crude) approximation of the mixture weights of a mixture distribution, thereby explaining some of the success of the IF and EIF algorithms.

  • We formalize our theoretical analysis into the generalized isolation forest (GIF) algorithm. It uses randomly sampled representatives as splits and tries to directly estimate the mixture coefficients without relying on the average path length.

  • In our experimental evaluation, we compared GIF with 18 state-of-the-art outlier detection methods on 14 different datasets with over 350,000 hyper-parameter combinations. We show that the novel algorithm outperforms three existing state-of-the-art tree-based outlier detection algorithms and has comparable performance with nearest-neighbor-based approaches while having a lower runtime. Moreover, we link these results with existing practical studies showing that GIF offers the best performance on some datasets.

This paper is organized as follows. Section 2 surveys related work and discusses the preliminaries. Section 3 presents our theoretical analysis. Section 4 presents the generalized isolation forest algorithm which is then evaluated in Sect. 5. In Sect. 6, we highlight a use-case study of GIF for detecting fraudulent transactions in financial data. The last section concludes the paper.

2 Preliminaries and related work

We focus on unsupervised outlier detection where we are given a dataset containing outliers, and our goal is to find these outliers. We assume that we are given a sample \({\mathcal {S}} = \{x_1, \dots , x_N\}\) of N observations \(x_i \in {\mathbb {R}}^d \subseteq {\mathcal {X}}\) from an unknown distribution \({\mathcal {D}}\). The goal is to assign a score to each observation in \({\mathcal {S}}\) which measures its outlierness. In this paper, we focus on the intersection between density-based and isolation-based approaches and show that they can be understood in the same framework when they utilize trees.

2.1 Density-based approaches

Density-based approaches assume that observations are drawn from a mixture distribution where at least one of the mixtures is ‘rare’ [11]. Density-based approaches require a two-step procedure, with both steps often being intertwined. First, we need to model the underlying distribution as well as possible, and then we decide which of the observations might be outliers. A common example of this approach is a Gaussian mixture model [20] that assumes a mixture of Gaussians to be fitted with an EM-style algorithm [1].

Tree-based density estimation techniques have been proposed as a faster, assumption-free alternative [1, 9, 23, 26]. These approaches rely on variations of decision trees to accurately approximate the underlying distribution and formulate some post-fitting rules to detect outliers with the help of the trees. It is well-known that most variations of decision trees can approximate any distribution with sufficient accuracy and even randomly fitted trees converge to the true underlying distribution, given enough training data [26]. In general, training (random) trees is fast, since it only requires sampling a set of different splits and sorting the data accordingly. Moreover, trees can be combined into an ensemble to stabilize their performance, which can be parallelized easily, thus retaining the performance advantages of trees [2].

2.2 Isolation-based approaches

Isolation-based approaches assume that some observations can be easily isolated from the remaining ones and are therefore outliers. Arguably, the most popular method in this family is isolation forest (IF) [16]. Isolation forest utilizes an ensemble of randomly constructed trees to estimate the outlierness of each observation by measuring its average path length. More formally, consider a binary decision tree in which each node performs a comparison \(x_i \le t\) where \(i \in {\mathbb {N}}\) is a randomly chosen feature index and t is a randomly chosen threshold from the available feature values in \({\mathcal {S}}\). For each observation, we count the number of comparisons h(x) required to traverse the tree starting with its root node. We refer to this as the path length of x and let \({\mathbb {E}}[h(x)]\) denote the average path length across the ensemble of trees. Liu et al. empirically observed in [16] that outlier observations tend to be isolated earlier during tree traversal, which indicates that trees tend to isolate outlier observations. More formally, they propose to use

$$\begin{aligned} \mathrm{{score}}(x) = 2^{-\frac{{\mathbb {E}}[h(x)]}{C(N)}} \end{aligned}$$

as the scoring rule where C(N) is the harmonic number depending on the size of the dataset N. The original publication of isolation forest justified this scoring rule by empirical observations, but the authors later gave a more mathematical intuition [18]. They argue that the average path length of observations in randomly fitted trees on uniformly distributed observations from the interval [l, u] is smaller for points near the fringes l and u of the interval. They then show that in this case, the distribution of average path length is given by a Catalan number, which in turn can be approximated with the original ranking score used by IF.
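To make the scoring rule concrete, the following minimal Python sketch evaluates it for a given average path length. The text above only states that C(N) is based on the harmonic number; the sketch assumes the normalizer from the original IF publication [16], \(C(N) = 2H(N-1) - 2(N-1)/N\) with \(H(i) \approx \ln (i) + 0.5772\). The function names `c_norm` and `if_score` are illustrative and not taken from any library.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant used to approximate H(i)


def c_norm(n: int) -> float:
    """Assumed normalizer C(N): average path length of an unsuccessful
    binary-search-tree lookup over n points (as in the original IF paper)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def if_score(avg_path_length: float, n: int) -> float:
    """Scoring rule 2^(-E[h(x)] / C(N)); values near 1 indicate outliers."""
    return 2.0 ** (-avg_path_length / c_norm(n))


# An observation isolated after few comparisons scores higher than one
# that needs many comparisons.
early = if_score(avg_path_length=3.0, n=256)
late = if_score(avg_path_length=12.0, n=256)
```

Under this assumption, an observation whose average path length equals C(N) receives a score of exactly 0.5, and the score decreases monotonically as the path length grows.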

IF constructs trees using a random combination of split and feature value. Thus, a natural extension of this approach is to select the split/feature combination more carefully using a split criterion. Liu et al. proposed in [17] the SCiForest algorithm, which utilizes the dispersion of the sample to rate each split. Let \({\mathcal {S}} = {\mathcal {S}}_l \cup {\mathcal {S}}_r\) be the dataset split into two disjoint sets \({\mathcal {S}}_l\) and \({\mathcal {S}}_r\), then they propose to use the split which maximizes

$$\begin{aligned} d_\mathrm{{gain}}({\mathcal {S}}) = \frac{\sigma ({\mathcal {S}}) - 0.5\cdot (\sigma ({\mathcal {S}}_l) + \sigma ({\mathcal {S}}_r))}{\sigma ({\mathcal {S}})} \end{aligned}$$

where \(\sigma (\cdot )\) denotes the dispersion.
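The criterion can be illustrated with a short Python sketch for one-dimensional values. It assumes that the dispersion \(\sigma (\cdot )\) is the population standard deviation, which the text above does not specify; the function name `dispersion_gain` and the sample values are hypothetical.

```python
import statistics


def dispersion_gain(values, threshold):
    """d_gain for splitting `values` at `threshold`.

    Assumption: dispersion is measured by the population standard deviation.
    Returns 0.0 for degenerate splits (empty side or zero dispersion).
    """
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    if not left or not right:
        return 0.0
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        return 0.0
    sigma_l = statistics.pstdev(left)
    sigma_r = statistics.pstdev(right)
    return (sigma - 0.5 * (sigma_l + sigma_r)) / sigma


# Hypothetical sample: a tight cluster around 1 and one far-away point.
values = [1.0, 1.1, 0.9, 1.05, 10.0]
gain_separating = dispersion_gain(values, threshold=5.0)  # isolates 10.0
gain_inside = dispersion_gain(values, threshold=1.0)      # cuts the cluster
```

A threshold that separates the far-away point from the cluster yields a larger gain than a threshold that cuts through the cluster, which is the behavior the criterion is meant to reward.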

Recently, an extension to IF was proposed by Hariri et al. in [14] called extended isolation forest (EIF). The EIF algorithm improves on the split strategy of the original IF formulation by considering the selection of a random slope rather than a random variable and value. The authors motivate this split strategy by the restriction of IF that only considers horizontal and vertical branch cuts leading to artifacts in the resulting anomaly scores.
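A random-slope split of this kind can be sketched as follows. The sketch assumes the construction described by Hariri et al. [14], namely a normal vector drawn from a standard normal distribution and an intercept point drawn uniformly from the bounding box of the data in the node; the helper names are illustrative and the branching convention (left if the signed projection is non-positive) is an assumption of this sketch.

```python
import random


def sample_eif_split(points, rng):
    """Sample a random hyperplane (normal vector, intercept point).

    Assumption: the normal is standard normal per coordinate and the
    intercept is uniform within the bounding box of `points`.
    """
    dim = len(points[0])
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    intercept = [
        rng.uniform(min(p[j] for p in points), max(p[j] for p in points))
        for j in range(dim)
    ]
    return normal, intercept


def goes_left(x, normal, intercept):
    """Branch on the sign of (x - intercept) . normal."""
    projection = sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal))
    return projection <= 0.0


rng = random.Random(1)
points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
normal, intercept = sample_eif_split(points, rng)
left_child = [p for p in points if goes_left(p, normal, intercept)]
right_child = [p for p in points if not goes_left(p, normal, intercept)]
```

Setting all but one coordinate of the normal vector to zero recovers an axis-aligned split as used by the original IF.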

2.3 Proximity-based approaches

Proximity-based approaches assume that similar objects behave similarly. Thus, rare outliers might only have a few similar objects nearby. Proximity-based approaches usually introduce a distance metric (or similarity function) to quantify the differences in observations. This is arguably the largest class of outlier detection methods. We will not look at this family theoretically, but compare them against isolation-based approaches in a principled manner. This algorithm family mainly consists of two different lines of work. K-nearest-neighbor methods can be seen as global methods that base their scoring on the neighborhood of \(k \in {\mathbb {N}}\) points for a given observation. For example, KNN [27] uses the largest distance in the k-neighborhood as a scoring rule, whereas KNNW uses the sum of distances in the neighborhood [5]. Local methods, on the other hand, usually use a reachability neighborhood which includes all points in the \(\varepsilon \)-ball around a given observation. This way, they use the density of points in the neighborhood to score observations. For example, local outlier factor (LOF) [7] uses the inverse, normalized reachability to score observations, whereas SimplifiedLOF [30] simplifies the reachability computation. A more detailed discussion and comparison between global and local proximity-based methods can be found in [8]. The authors kindly made their source code for a variety of methods and experiments available (Footnote 1), on which we will base our experimental analysis. However, we note that this work does not include recent advances in the ensembling of proximity-based approaches. As suggested by multiple authors, the ensembling of proximity-based methods such as KNN using bootstrap samples can improve the overall results. Therefore, we also include these in our experimental analysis, thereby extending the analysis by Campos et al. in [8]. More specifically, for evaluation, we also use aNNE [31], LeSiNN [21], and iNNE [6].
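The two global scoring rules mentioned above can be written down in a few lines. The following Python sketch is a brute-force illustration under the assumption of Euclidean distances; it is not the implementation used in [8], and the function names and toy data are hypothetical.

```python
import math


def knn_distances(x, data, k):
    """Distances from x to its k nearest neighbors in data (x itself excluded)."""
    dists = sorted(math.dist(x, y) for y in data if y is not x)
    return dists[:k]


def knn_score(x, data, k):
    """KNN rule: largest distance within the k-neighborhood."""
    return knn_distances(x, data, k)[-1]


def knnw_score(x, data, k):
    """KNNW rule: sum of distances within the k-neighborhood."""
    return sum(knn_distances(x, data, k))


# Toy data: four points on a unit square and one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
inlier, outlier = data[0], data[4]
scores = {
    "knn_inlier": knn_score(inlier, data, k=2),
    "knn_outlier": knn_score(outlier, data, k=2),
    "knnw_inlier": knnw_score(inlier, data, k=2),
    "knnw_outlier": knnw_score(outlier, data, k=2),
}
```

Both rules assign the distant point a much larger score than the points on the square; they differ only in how the k distances are aggregated.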

3 Isolation-based approaches as density estimation

Before we present our method, we want to formalize outlier detection more precisely. Dixon proposed in [11] to write outlier detection as a mixture of distributions, where at least one distribution is ‘rare.’ More formally, we assume that \({\mathcal {D}}\) is a mixture of K distributions where neither K nor the individual distributions are known:

$$\begin{aligned} p_{{\mathcal {D}}}(x) = \sum _{i=1}^K w_i p_i(x) \end{aligned}$$
(1)

Here, \(\mathbf {w} = (w_1, \dots , w_K)\) is the probability vector of a categorical distribution. For outlier detection, we assume that at least one mixture distribution has a probability near zero, that is \(w_i \approx 0\). Our goal is to characterize the corresponding distribution with small mixture weights and therefore distinguish it from the remaining mixtures. To do so, we employ a two-step procedure: First, we approximate \(p_{{\mathcal {D}}}\) as well as possible using the given sample \({\mathcal {S}}\). Then, we use this characterization to find ‘rare’ events in the data which are potential outliers.

3.1 Approximating the mixture distribution

Let us tackle the first challenge now. To approximate \(p_{{\mathcal {D}}}\), we wish to find a function \(f^* \in {\mathcal {F}}\) from some set of functions \({\mathcal {F}}\) which matches the true distribution as closely as possible:

$$\begin{aligned} f^*&= \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} (f(x) - p_{{\mathcal {D}}}(x))^2 \mathrm{d}x \\&= \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} (f(x))^2 - 2f(x)p_{{\mathcal {D}}}(x) + (p_{{\mathcal {D}}}(x))^2 \mathrm{d}x \\&= \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} (f(x))^2 - 2f(x)p_{{\mathcal {D}}}(x) \mathrm{d}x \end{aligned}$$

where the second line is due to the binomial formula and the third line is because \((p_{{\mathcal {D}}}(x))^2\) has no impact on the minimization over f. As usually done in machine learning, we may approximate the true distribution \(p_{{\mathcal {D}}}\) with Monte Carlo approximation using the given sample \({\mathcal {S}}\):

$$\begin{aligned} f^* = \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} (f(x))^2 \mathrm{d}x - 2\sum _{i=1}^N f(x_i) \frac{1}{N} \end{aligned}$$
(2)

This approximation is justified by the law of large numbers and becomes more and more exact the larger N becomes. Ram and Gray showed that for \(N\rightarrow \infty \) this minimizer is exact and consistent [26]. It is still difficult to solve this problem without any assumptions on \({\mathcal {F}}\) since we need to integrate over \({\mathcal {X}}\). To efficiently find a minimizer for this function, we assume that f breaks the space \({\mathcal {X}}\) into L non-overlapping regions \({\mathcal {R}}_1,\dots ,{\mathcal {R}}_L\) where the points in each region follow a uniform distribution. More formally:

$$\begin{aligned} f(x) = \sum _{i=1}^L \mathbb {1}\{x \in {\mathcal {R}}_i\} \sum _{j=1}^N \frac{\mathbb {1}\{x_j \in {\mathcal {R}}_i\}}{N} = \sum _{i=1}^L \mathbb {1}\{x \in {\mathcal {R}}_i\} g_i \end{aligned}$$
(3)

A common example of this type of function would be a histogram. Substituting f(x) in Eq. 2 with Eq. 3 leads to

$$\begin{aligned} f^*&= \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} (f(x))^2 \mathrm{d}x - 2\sum _{i=1}^N f(x_i) \frac{1}{N} \\&= \arg \min _{f \in {\mathcal {F}}} \int _{{\mathcal {X}}} \left( \sum _{j=1}^L \mathbb {1}\{x \in {\mathcal {R}}_j\} g_j\right) ^2 \mathrm{d}x \\&\quad - 2\sum _{i=1}^N \sum _{j=1}^L \mathbb {1}\{x_i \in {\mathcal {R}}_j\} g_j \frac{1}{N} \\&= \arg \min _{f \in {\mathcal {F}}} \sum _{i=1}^L \left( g_i\right) ^2 V({\mathcal {R}}_i) - 2 \sum _{j=1}^L (g_j)^2 \\&= \arg \min _{f \in {\mathcal {F}}} \sum _{i=1}^L g_i^2 (V({\mathcal {R}}_i) - 2) \end{aligned}$$

where \(V({\mathcal {R}}_i)\) denotes the volume of the i-th region and the third line is due to the fact that all except one summand is 0. Now consider the equivalent maximization problem:

$$\begin{aligned} f^* = \arg \max _{f \in {\mathcal {F}}} \sum _{i=1}^L (2-V({\mathcal {R}}_i))g_i^2 \end{aligned}$$
(4)

Informally, to maximize Eq. 4, we need to find small, dense regions so that \(V({\mathcal {R}}_i)\) is small, but \(g_i\) is large. Note that the number of regions L is part of our model function and as such can be chosen to maximize Eq. 4. Also note that if we isolate a single point in a region, we have \(V(R_i) \rightarrow 0\) and \(g_i \rightarrow 1/N\). It follows that any tree-based algorithm which fully isolates single points with fully grown trees (where \(L = N\)) solves problem 4 to some extent by providing the following lower bound:

$$\begin{aligned} \sum _{i=1}^L (2-V({\mathcal {R}}_i))g_i^2 = \sum _{i=1}^N \frac{2}{N^2} = \frac{2}{N} \end{aligned}$$

In other words, any tree-based algorithm which has sufficiently many fine-grained splits guarantees some approximation quality of the underlying probability distribution.
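As a small numerical illustration of Eq. 4, the following Python sketch evaluates the objective for one-dimensional histograms, where each region is a bin and its volume is the bin width. The function names and the sample are hypothetical, and the setup is a simplified special case chosen for this sketch rather than part of the analysis above.

```python
def histogram_objective(sample, edges):
    """Evaluate sum_i (2 - V(R_i)) * g_i^2 for bins [edges[i], edges[i+1])."""
    n = len(sample)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in sample if lo <= x < hi)
        g = count / n          # empirical frequency of the region
        volume = hi - lo       # V(R_i) in one dimension
        total += (2.0 - volume) * g * g
    return total


def isolated_objective(n):
    """Limit for fully grown trees: V(R_i) -> 0 and g_i = 1/N for N regions."""
    return sum((2.0 - 0.0) * (1.0 / n) ** 2 for _ in range(n))


# Hypothetical sample on [0, 1): three points close together, one far away.
sample = [0.1, 0.12, 0.15, 0.9]
single_bin = histogram_objective(sample, [0.0, 1.0])
dense_bin = histogram_objective(sample, [0.0, 0.2, 1.0])
fully_isolated = isolated_objective(len(sample))
```

In this toy setting, a partition with a small bin around the dense cluster achieves a larger objective than a single bin, while the fully isolated partition attains the value 2/N.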

3.2 Finding outlier mixtures

Now, consider the second challenge: How do we find outlier distributions given an approximation of \(p_{{\mathcal {D}}}\)? By the previous discussion, we assume a tree-based model with sufficient approximation quality. Hence:

$$\begin{aligned} p_{{\mathcal {D}}}(x) = \sum _{i=1}^K w_i p_i(x) \approx \sum _{i=1}^L g_i \mathbb {1}\{x \in {\mathcal {R}}_i\} \end{aligned}$$

For \(L = K\), we may view \(p_i(x) \approx \mathbb {1}\{x \in {\mathcal {R}}_i\}\) and \(w_i \approx g_i\). Recall that per definition, the outlier distributions are characterized by a very small mixture weight \(w_i\approx 0\), so our goal is to find small \(g_i\). Most directly, we can present the regions to an expert who could examine all points in a region \({\mathcal {R}}_i\) and estimate the mixture weight of \(w_i\) given her expert belief. In this case, we can directly identify outlier regions.

However, what can we do when no such expert is available? For \(L = K\), we can directly check the mixture weights \(g_i\) and use these as outlier scores since there is a one-to-one correspondence between both. Interestingly, for \(L > K\), we find a similar relationship. Let \(L = n \cdot K\) with \(n\in {\mathbb {N}}\). The intuition is that we wish to match the L leaf nodes of the tree with the number of unknown mixtures. To do so, we now introduce artificial mixtures that use the same density \(p_i\) but only a fraction of the original mixture weight \(w_i\). Let, without loss of generality, the mixtures be sorted so that \(w_1 \ge w_2 \ge \dots \ge w_K\). We copy each mixture n times and rescale the probability accordingly:

$$\begin{aligned} p_{{\mathcal {D}}}(x)&= \sum _{i=1}^K w_i p_i(x) = \sum _{j=1}^n \sum _{i=1}^K \frac{1}{n} w_i p_i(x) = \sum _{i=1}^L \frac{1}{n} w_i p_i(x) \\&=\sum _{i=1}^L \frac{K}{L} w_i p_i(x) = \sum _{i=1}^L \widetilde{w}_i p_i(x) \end{aligned}$$

where the second line is due to \(n = \frac{L}{K}\). Note that if L is not a multiple of K, we copy the mixtures \(\lfloor \frac{L}{K} \rfloor \) times and then rescale the remaining \(L \mod K\) mixtures according to their sorting, starting with the largest one. This scaling preserves the relative mixture order \(\widetilde{w}_1 \ge \widetilde{w}_2 \ge \dots \ge \widetilde{w}_L\). It follows that we can use the estimated mixture weights \(g_i\) to rate the outlierness of regions if \(L\ge K\).
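The copy-and-rescale argument for the case \(L = n \cdot K\) can be checked numerically. The Python sketch below uses hypothetical mixture weights and an illustrative function name; it only covers the case where L is a multiple of K.

```python
def replicate_mixture_weights(weights, n_copies):
    """Copy each mixture n_copies times and rescale its weight by 1/n_copies.

    The weights are sorted in non-increasing order first, so the returned
    list contains L = n_copies * K rescaled weights in non-increasing order.
    """
    ordered = sorted(weights, reverse=True)
    return [w / n_copies for w in ordered for _ in range(n_copies)]


# Hypothetical mixture with K = 3 components, one of them rare.
weights = [0.25, 0.7, 0.05]
rescaled = replicate_mixture_weights(weights, n_copies=2)  # L = 6
```

The rescaled weights still sum to one and keep the rare component at the end of the ordering, so ranking regions by their estimated weights remains meaningful for \(L > K\).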

3.3 Relationship to isolation forest and its siblings

Before we present our algorithm, we want to discuss the isolation forest algorithm and its siblings extended isolation forest and SCiForest within the context of our theoretical framework. As presented in the previous section, any tree-based algorithm can be used to approximate \(p_{{\mathcal {D}}}\) to some degree, hence including IF, EIF and SCiForest. Now, consider the scoring rule used by these methods

$$\begin{aligned} \mathrm{{score}}(x) = 2^{-\frac{{\mathbb {E}}[h(x)]}{C(N)}} \end{aligned}$$

Let \({\mathcal {R}}(x)\) denote the region to which the observation x belongs and let \(|{\mathcal {R}}(x)|\) denote the number of training observations that fall into that region. Recall that we are interested in giving an ordering of outlierness for each observation and therefore we may use any scoring rule as long as it preserves the original outlier ordering. We assume that regions that imply a longer decision path generally contain fewer examples. More formally:

$$\begin{aligned} \frac{1}{|{\mathcal {R}}(x)|} \sim {\mathbb {E}}[h(x)] \end{aligned}$$

This assumption is justified by the fact that a longer (average) path length means that we are becoming increasingly selective meaning we have fewer and fewer examples in each node. Following this assumption, it is straightforward to show that IF’s scoring rule is a monotone function of the mixture weight:

$$\begin{aligned} \frac{1}{|{\mathcal {R}}(x)|}&\sim {\mathbb {E}}[h(x)] \\ \frac{N}{|{\mathcal {R}}(x)|}&\sim \frac{{\mathbb {E}}[h(x)]}{C(N)} \\ \log _2\left( \frac{N}{|{\mathcal {R}}(x)|}\right)&\sim \log _2\left( \frac{{\mathbb {E}}[h(x)]}{C(N)}\right) \\ \log _2\left( \frac{|{\mathcal {R}}(x)|}{N}\right)&\sim -\log _2\left( \frac{{\mathbb {E}}[h(x)]}{C(N)}\right) \end{aligned}$$

where the second line holds since N and C(N) are positive constants. The third line applies \(\log _2\) to both sides, and the fourth line negates both sides using \(\log _2\left( \frac{|{\mathcal {R}}(x)|}{N}\right) = -\log _2\left( \frac{N}{|{\mathcal {R}}(x)|}\right) \), where \(\frac{|{\mathcal {R}}(x)|}{N} \le 1\) and thus \(\log _2\left( \frac{|{\mathcal {R}}(x)|}{N}\right) \le 0\). Note that \(\log _2(\cdot )\) is a monotone function, and therefore it does not change the original ordering of its arguments. IF and EIF’s scoring rule ignores the \(\log _2\) on the right side of the equation which leads to

$$\begin{aligned} \log _2\left( \frac{|{\mathcal {R}}(x)|}{N}\right)&\Uparrow -\frac{{\mathbb {E}}[h(x)]}{C(N)} \\ \frac{|{\mathcal {R}}(x)|}{N}&\Uparrow 2^{-\frac{{\mathbb {E}}[h(x)]}{C(N)}} \end{aligned}$$

where \(\Uparrow \) denotes the fact that if the left side increases, so does the right side and vice-versa. It follows that IF, EIF and SCiForest preserve the original ordering of mixture weights using the average path length as an approximation if longer decision paths in the tree imply fewer points in the regions. This assumption is crucial for the algorithm to work well and justified to some extent as discussed.

4 Generalized isolation forests

The previous section presented a theoretical framework for outlier detection with isolation trees and elaborated on how IF, EIF, and SCiForest fit in. We note two general propositions about these three algorithms: First, they indirectly maximize Eq. 4 for fitting the ensemble and second they estimate the mixture coefficients using the average path length. In this section, we present the generalized isolation forest (GIF) method which utilizes these statements by taking Eq. 4 into account and by directly using the mixture coefficients.

GIF represents a bagging-style ensemble of Generalized Isolation Trees (GTr). GTr partitions the observation space \({\mathcal {X}}\) into increasingly smaller regions and uses independent probability estimates for each region. Formally, we represent a tree as a directed graph with a root node where each node has up to K child nodes. Each node in the tree belongs to a sub-region \({\mathcal {R}} \subseteq {\mathcal {X}}\) and all children of each node recursively partition the region of their parent node into K non-overlapping smaller regions. The root node belongs to the entire observation space \({\mathcal {R}}_0 = {\mathcal {X}}\). Each node uses up to K split functions \({\mathcal {S}} = \{s_{{\mathcal {R}}} :{\mathcal {X}} \rightarrow \{0,1\} \}\), where \(s_{{\mathcal {R}}}(x) = 1\) indicates that x belongs to the corresponding region \({\mathcal {R}}\) and \(s_{{\mathcal {R}}}(x) = 0\) indicates that it does not. Note that during split construction, we need to enforce that splits partition the observation space into non-overlapping regions, so that exactly one split is ‘1’ and the remaining ones are ‘0’. Most commonly, we find binary DTs which split the space into 2 subspaces at each node, sometimes called the ‘left’ and ‘right’ split. Once the observation space is sufficiently partitioned, a density function \(g\in {\mathcal {G}} = \{g:{\mathcal {X}} \rightarrow [0,1]\}\) is used for density estimation. As discussed previously, we may use the frequency \(g_i(x) = \frac{1}{N}\sum _{x_j\in {\mathcal {S}}} \mathbb {1}\{x_j\in {\mathcal {R}}_i\} = \frac{|{\mathcal {S}}_i|}{N}\) where \({\mathcal {S}}_i\) is the portion of the training sample belonging to region \({\mathcal {R}}_i\).

For training a GTr, we use a greedy algorithm similar to classic decision trees. Suppose we have already trained a GTr with n nodes and want to divide the region \({\mathcal {R}}_i\) by another split hypothesis. Let \({\mathcal {S}}_i\) be the part of the training data which falls into region \({\mathcal {R}}_i\). We then randomly sample K points from \({\mathcal {S}}_i\) so that each point induces a sub-region. More formally, we define a split function with

$$\begin{aligned} s_i(x) = {\left\{ \begin{array}{ll} 1 \quad \text {if}~ i = \arg \max \{k(x,x_j) | j = 1,\dots ,K\} \\ 0 \quad \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(x_j\in {\mathcal {S}}_i\) are the selected representatives and \(k:{\mathcal {R}}_i \times {\mathcal {R}}_i \rightarrow [0,1]\) is a kernel function. Once we have partitioned the observation space into enough regions, we stop tree induction. Recall that we aim to maximize

$$\begin{aligned} \arg \max \sum _{i=1}^L (2-V(R_i)) g_i^2 \end{aligned}$$

which is maximized if \(g_i \rightarrow 1\) and \(V(R_i) \rightarrow 0\). Informally, we seek small dense areas that contain many points. However, we are becoming more and more selective the more nodes we add to the tree, so that \(g_i\) becomes smaller, the smaller \(V(R_i)\) gets. Thus, we propose to use a threshold \(\tau \), and whenever \(\tau \ge (2-V(R_i))g_i^2\), we stop tree induction. Now, the computation of \(V(R_i)\) can be complex for high-dimensional data and irregular-shaped regions. To overcome this, we propose to use the average inner kernel distance in each region. Given the representatives of each region, we compute the average kernel similarity of all points in that region with the respective representative. Intuitively, this has the same meaning as before, because we stop tree induction once we find small, dense regions.
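A single representative-based split can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the C++ implementation evaluated later: the RBF kernel, the tie-breaking rule (first representative wins), and all function names are choices made for this sketch, and the exact stopping rule based on the average similarity is deliberately left out.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """RBF similarity in (0, 1]; assumed kernel for this sketch."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def split_by_representatives(points, k_children, kernel, rng):
    """Sample up to k_children representatives and assign every point to the
    sub-region of its most similar representative (argmax of the kernel)."""
    reps = rng.sample(points, min(k_children, len(points)))
    regions = [[] for _ in reps]
    for x in points:
        sims = [kernel(x, r) for r in reps]
        regions[sims.index(max(sims))].append(x)  # ties go to the first rep
    return reps, regions


def mean_similarity(rep, region, kernel):
    """Average kernel similarity of a region's points to its representative."""
    if not region:
        return 0.0
    return sum(kernel(x, rep) for x in region) / len(region)


rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
reps, regions = split_by_representatives(points, 2, rbf_kernel, rng)
similarities = [mean_similarity(r, reg, rbf_kernel) for r, reg in zip(reps, regions)]
```

Because every point is assigned to exactly one representative, the resulting sub-regions are non-overlapping, and a high average similarity indicates a small, dense region.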

Algorithm 1 summarizes the training of GTr, where samplePoints(D, K) randomly samples up to K points from D if available. If not, all points are selected. Algorithm 2 displays the application of GTr once trained.

figure a (Algorithm 1: training of a GTr)
figure b (Algorithm 2: application of a trained GTr)

Randomly constructed trees have large variations in their density estimations, so that we may have very different results between individual trees. To counter this behavior, we combine multiple GTr into a bagging-style algorithm similar to IF and EIF. Bagging samples different subsets of the data and/or features to introduce diversity into the ensemble. In the case of GTr, we can also vary the similarity function k as well as the minimum lower bound \(\tau \). Algorithm 3 summarizes this approach.

figure c (Algorithm 3: bagging-style GIF ensemble of GTr)
Table 1 Datasets used for the evaluation of our method

5 Evaluation

With our empirical evaluation, we want to answer several questions: (1) Does GIF offer better predictive performance than its tree-based siblings IF and EIF? (2) How does GIF perform if we compare it beyond tree-based algorithms, such as k-NN or local outlier factor (LOF)? (3) How does the runtime suffer from considering generalized trees, instead of binary isolation trees? (4) How sensitive is GIF regarding its hyper-parameters?

To evaluate our method, we consider 14 different datasets in total, which demand the detection of outliers for different real-world applications (cf. Table 1). For the first experiment, we compare GIF to other tree-based outlier detectors, namely the original IF method implemented in scikit-learn [22], the EIF algorithm implemented by the original authors, and the SCiForest (SCiF) algorithm (Footnote 2). Our method is currently implemented in C++ and provides an easy-to-use Python interface. We intend to publish our code after submission. To compare methods adequately, we perform a grid search of hyper-parameters. In all cases, we choose the number of trees to be \(t = 128\). The accompanying subset size \(\psi \) is chosen to be \(\max (0.25\cdot N, 256)\) where N is the size of the dataset. For the GIF, we vary the kernel function using the RBF-kernel and three different versions of the Matern kernel [28]:

$$\begin{aligned} k_\mathrm{{RBF}}(x_i,x_j)&= \exp \left( -\frac{1}{2\sigma ^2}\cdot p \right) \\ k_{1/2}(x_i,x_j)&= l^2 \exp \left( -\frac{p}{\sigma }\right) \\ k_{3/2}(x_i,x_j)&= l^2 \left( 1 + \frac{\sqrt{3} p}{\sigma } \right) \exp \left( - \frac{\sqrt{3} p}{\sigma }\right) \\ k_{5/2}(x_i,x_j)&= l^2 \left( 1 + \frac{\sqrt{5} p}{\sigma } + \frac{5p^2}{3\sigma } \right) \exp \left( - \frac{\sqrt{5} p}{\sigma }\right) \end{aligned}$$

where \(p = ||x_i - x_j||^2_2\) is the squared Euclidean distance between \(x_i\) and \(x_j\). The accompanying scaling parameters are chosen using \(S = \left\{ 0.01, 0.5, 0.75, 1, 2, 5, 7.5, 10, 12.5, 15 \right\} \) as \(\sigma = \left\{ s\sqrt{d}^{-1} \mid s \in S \right\} \) with d being the dimension of the respective dataset. The similarity threshold \(\tau \) is chosen by selecting five equidistant values from the interval [0.0, 0.2].
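The kernels and parameter grids above can be transcribed into Python as follows. The sketch follows the formulas exactly as written in this section, with p being the squared Euclidean distance; the length-scale factor l is not specified in the text and is assumed to default to 1 here, and all function names are illustrative.

```python
import math

S = [0.01, 0.5, 0.75, 1, 2, 5, 7.5, 10, 12.5, 15]


def sq_dist(x, y):
    """p = squared Euclidean distance between x and y."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_rbf(x, y, sigma):
    return math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2))


def k_matern12(x, y, sigma, l=1.0):  # l = 1 is an assumption of this sketch
    return l ** 2 * math.exp(-sq_dist(x, y) / sigma)


def k_matern32(x, y, sigma, l=1.0):
    a = math.sqrt(3.0) * sq_dist(x, y) / sigma
    return l ** 2 * (1.0 + a) * math.exp(-a)


def k_matern52(x, y, sigma, l=1.0):
    p = sq_dist(x, y)
    a = math.sqrt(5.0) * p / sigma
    return l ** 2 * (1.0 + a + 5.0 * p ** 2 / (3.0 * sigma)) * math.exp(-a)


def sigma_grid(d):
    """Scaling parameters sigma = s / sqrt(d) for s in S."""
    return [s / math.sqrt(d) for s in S]


def tau_grid(num=5, low=0.0, high=0.2):
    """Equidistant similarity thresholds from [low, high]."""
    return [low + (high - low) * i / (num - 1) for i in range(num)]


KERNELS = [k_rbf, k_matern12, k_matern32, k_matern52]
x, y = (0.0, 0.0), (0.3, 0.4)
values = [k(x, y, sigma_grid(2)[3]) for k in KERNELS]
```

With l = 1, all four kernels return 1 for identical inputs and decay towards 0 as the distance grows, so they can serve directly as similarity functions in the split construction.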

For the second experiment, we will compare the tree-based outlier detectors to nearest-neighbor-based methods. To do so, we consider the aNNE [31] and LeSiNN [21] algorithms, which we have implemented on our own, and the iNNE algorithm [6], which has been implemented by the authors. These three methods employ a bagging-style ensemble of nearest-neighbor-based estimators. Hence, they also require the specification of t and \(\psi \), which we will vary in the same fashion as we did with the tree-based methods in the first experiment. These and all tree-based experiments are conducted on an Intel Xeon E5-2690 CPU with 56 cores and 504 GB RAM. From the number of methods we used and the employed grid parameter optimization, it follows that we conducted 351,690 experiments in total.

Additionally, we will reuse the results provided by Campos et al. [8] in their large-scale study of outlier detection algorithms. The authors of this study focused, among others, on the questions of how outlier detection methods differ practically, how parameter choices affect the detection quality, and how much inherent difficulty can be attributed to various real-world and synthetic datasets. To do so, the study focuses on a large group of detection algorithms, namely neighborhood-based approaches (like LOF or k-NN), and especially discusses the choice of the neighborhood parameterization (i.e., “k”). In a series of experiments, Campos et al. evaluated 12 different methods on 23 datasets with different hyper-parameters leading to a total of 1,300,758 experiments. Please note, however, that we selected 10 real-world datasets from this study and ignored artificial datasets. Also note that we found that Campos et al. focused on smaller datasets, so we also include four larger ones in our experimental analysis. To provide a more meaningful characterization of the used datasets, we also adopt the Difficulty metric established in [8]. This metric ranges from 0 (not difficult) to 10 (very difficult) and indicates how difficult it is to identify outlying observations in different datasets correctly, given some set of methods (nearest-neighbor methods, in this case). For this discussion, we evaluated the difficulty metric on our own by running the reproduction package provided by Campos et al. [8]. The resulting difficulty for every dataset is given in Table 1. In summary, we considered 115,215 experimental results from [8], which leads to a grand total of 466,905 processed results in this paper.

Table 2 ROC AUC score of the generalized isolation forest (GIF), extended isolation forest (EIF), the isolation forest (IF), and the SCiForest (SCiF) algorithms, which represent the set of tree-based algorithms we have evaluated

5.1 Comparison of tree-based methods

In this experiment, we want to compare GIF with its three tree-based siblings. We measure the predictive performance of these algorithms by the ROC AUC score and report the best score for every dataset and hyper-parameter combination. We also report a \(\varDelta _\text {Iso}\) value, indicating the difference between the ROC AUC score achieved by GIF and the best sibling. The results are presented in Table 2. Note that we focus on the GIF, EIF, IF, SCiF, and \(\varDelta _\text {Iso}\) column for our evaluation.
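For clarity, the two quantities used in this comparison can be computed as in the following Python sketch. It uses the standard pairwise definition of the ROC AUC (the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half). The function names and all numbers are hypothetical and are not taken from Table 2.

```python
def roc_auc(scores, labels):
    """Pairwise ROC AUC; labels are 1 for outliers and 0 for inliers."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def delta_iso(gif_auc, sibling_aucs):
    """Difference between GIF's ROC AUC and the best tree-based sibling."""
    best_name = max(sibling_aucs, key=sibling_aucs.get)
    return gif_auc - sibling_aucs[best_name], best_name


# Hypothetical outlier scores and labels for a tiny dataset.
auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
auc_chance = roc_auc([0.9, 0.1, 0.8, 0.2], [1, 1, 0, 0])

# Hypothetical best ROC AUC values per method on one dataset.
delta, best_sibling = delta_iso(0.80, {"EIF": 0.70, "IF": 0.75, "SCiF": 0.72})
```

A positive delta means that GIF outperforms every tree-based sibling on the dataset, whereas a negative delta means that at least one sibling achieves a higher ROC AUC.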

The results show that GIF exhibits the best predictive performance in 9 of 14 cases when compared to EIF, IF, and SCiF. From a relative point of view, the improvement in terms of ROC AUC score is quite small for some datasets (e.g., PageBlocks, ForestCover, and PenDigits having \(\varDelta _\text {Iso} < 0.05\)), while it is quite large for other datasets (e.g., cardiotocography, satellite and waveform having \(\varDelta _\text {Iso} > 0.1\)). Interestingly, in those cases in which GIF exhibited inferior performance, the degradation of the ROC AUC score is mostly rather low with delta values ranging from \(-0.0078\) to \(-0.0708\). Only for the Shuttle dataset do we observe a large degradation with \(\varDelta _\text {Iso} = -0.1251\), which results from a superior performance by the SCiF algorithm. The results nevertheless indicate that GIF can improve on the predictive performance of IF, EIF, and SCiF in a meaningful way, while remaining competitive with the other algorithms in those cases in which our method did not show a better performance.

Regarding the absolute ROC AUC scores, it can be observed that the outlier analysis of the Wilt dataset in particular seems to be troublesome for tree-based outlier detectors. While GIF provides a notable improvement w.r.t. the ROC AUC score and achieves the best results in this group of algorithms, the predictive performance never exceeds approx. 0.57. This dataset motivates further comparison with other methods to investigate whether there is a systematic problem among this group of algorithms or whether this dataset in particular is hard to analyze for outliers. The difficulty metric, however, already suggests, at least from the perspective of nearest-neighbor methods, that this dataset is quite hard to analyze correctly (\(D = 7.74\)).

5.2 Comparison to neighborhood-based outlier detectors

From the previous section, it becomes clear that GIF improves the predictive performance in a meaningful way when compared to other tree-based outlier detectors. In this section, we want to compare GIF to other outlier detection methods. The goal of this comparison is not to show that GIF is the best method for all problems but to critically evaluate tree-based outlier detection methods when compared against proximity-based approaches. In addition to aNNe, iNNE, and LeSiNN, Campos et al. provided in [8] an extensive study of these approaches for outlier detection, which we will use as an additional baseline here. Please note that the Creditfraud, Forestcover, Satellite, and Mammography datasets are not part of the original study by Campos et al. but are listed as such in Table 2. We have applied their experimental routines to the mentioned datasets and report the results under the [8] column. Again, we present the relative performance differences in the \(\varDelta _\text {NN}\) column, which indicates the difference between all neighborhood-based methods and GIF.

From Table 2, it can be seen that nearest-neighbor-based approaches exhibit the best results in 9 of 14 cases when compared to every other method. However, the \(\varDelta _\text {NN}\) column indicates that the performance degradation of GIF is quite small (\(|\varDelta _\text {NN}| < 0.05\)) in six of these nine cases, resulting in a competitive performance. Conversely, the delta is larger (\(\varDelta _\text {NN} > 0.1\)) in two of these nine cases, which we briefly discuss here. The first case is given by the Wilt dataset, for which the difference is especially large with \(\varDelta _\text {NN} \approx 0.21\). It seems that this particular dataset is hard to analyze for outliers using tree-based methods, with GIF still being the best choice in this group of algorithms. The second case is given by the Annthyroid dataset, which exhibits a large performance degradation with \(\varDelta _\text {NN} \approx 0.12\) and for which IF performs better with a ROC AUC of 0.708. Hence, we conclude that in the majority of cases neighborhood-based and tree-based approaches deliver similar performances, with some notable exceptions in which GIF generally seems to be the best choice among tree-based methods.

Fig. 1

Critical difference diagrams depicting the pairwise statistical difference between different sets of methods

From Table 2, we can also observe that GIF outperforms nearest-neighbor methods in five cases, which is noteworthy since we compare against 15 different algorithms. Additionally, it can be seen that for cardiotocography, waveform, and satellite, GIF improves the ROC AUC by a large margin with \(\varDelta _\text {NN} > 0.1\) while also providing better performance than EIF and IF. Regarding the Pima and Mammography datasets, we can still observe an improvement in ROC AUC with \(\varDelta _\text {NN} \approx 0.06\) and \(\varDelta _\text {NN} \approx 0.028\), respectively, which is not as large as for the other datasets but still meaningful, keeping in mind that GIF is outperforming 15 different nearest-neighbor-based outlier detection methods and all tree-based methods, i.e., EIF, IF, and SCiF.

Finally, we show critical difference (CD) diagrams in Fig. 1, which leverage Wilcoxon–Holm analysis to assess the pairwise statistical difference between different methods [15]. As expected, Fig. 1b shows that the ROC AUC scores from [8] generally seem to be the best, given that we compare the best configuration of 12 different methods with GIF. However, we also see that even this very powerful ensemble of classifiers is not statistically significantly better than GIF, which achieves second place. From Fig. 1b, we can also derive that aNNE, iNNE, and LeSiNN exhibit worse ranks than [8], GIF, IF, and EIF, but are still not worse than these methods in terms of statistical significance. Last, we see that IF and EIF are ranked after GIF, where IF is surprisingly ranked before EIF.
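The Holm step of such a Wilcoxon–Holm analysis can be sketched as follows. The sketch assumes that the raw p-values of the pairwise Wilcoxon signed-rank tests have already been computed (for example, with a statistics library). The function name, the significance level, and the p-values below are hypothetical and are not taken from Fig. 1.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure for multiple pairwise comparisons.

    p_values: dict mapping a comparison (e.g., a pair of method names)
              to its raw p-value.
    Returns a dict mapping each comparison to True if the null hypothesis
    "both methods perform equally" is rejected at level alpha.
    """
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for i, (name, p) in enumerate(ordered):
        # The (i+1)-th smallest p-value is compared against alpha / (m - i).
        # After the first failure, all remaining hypotheses are retained.
        if still_rejecting and p <= alpha / (m - i):
            decisions[name] = True
        else:
            still_rejecting = False
            decisions[name] = False
    return decisions


# Hypothetical raw p-values for three pairwise comparisons.
raw = {("GIF", "IF"): 0.004, ("GIF", "EIF"): 0.020, ("GIF", "[8]"): 0.300}
result = holm_reject(raw)
for pair, rejected in result.items():
    print(pair, "significantly different" if rejected else "not significant")
```

In a CD diagram, methods whose pairwise difference is not rejected by this procedure are typically connected by a horizontal bar.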

Table 3 Mean runtimes for the generalized isolation forest (GIF), the extended isolation forest (EIF), the isolation forest (IF), the SCiForest (SCiF), and three distinct nearest-neighbor methods, namely aNNe, iNNE, and LeSiNN, w.r.t. different datasets

5.3 Runtime analysis

After discussing the predictive performance of our algorithm, we want to take a look at its runtime. We measure the total time from the setup of the algorithm until the scoring of every individual observation in the dataset becomes available. The results in Table 3 correspond to those in Table 2.
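The measurement protocol described above (total time from the setup of the algorithm until the scores of all observations are available) can be sketched as follows. The `timed_run` helper and the `MeanDistanceDetector` stand-in are hypothetical and only illustrate where the timer starts and stops. They are not the benchmarking code used for Table 3.

```python
import time


def timed_run(make_detector, data):
    """Measure setup + fitting + scoring of all observations as one total."""
    start = time.perf_counter()
    detector = make_detector()      # setup of the algorithm
    detector.fit(data)              # model construction
    scores = detector.score(data)   # one score per observation
    elapsed = time.perf_counter() - start
    return scores, elapsed


class MeanDistanceDetector:
    """Hypothetical stand-in: scores points by distance to the mean."""

    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def score(self, data):
        return [abs(x - self.mean) for x in data]


scores, elapsed = timed_run(MeanDistanceDetector, [1.0, 2.0, 3.0, 10.0])
print(scores, f"{elapsed:.6f} s")
```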

GIF has a lower or equal runtime in 7 of 14 cases when compared to other tree-based (i.e., EIF, IF, and SCiF) and nearest-neighbor-based outlier detectors (i.e., aNNe, iNNE, and LeSiNN). The relative runtime improvement of GIF w.r.t. non-GIF methods (the \(\varDelta \)-column) mostly ranges well below one second. These lower runtimes are noteworthy, especially in those cases in which GIF is not only faster but also produces a better predictive performance (e.g., waveform, Pima, cardiotocography, etc.). Moreover, a very interesting case is given by the forestcover dataset. Here, GIF not only improved the ROC AUC w.r.t. tree-based methods by \(\varDelta _\text {Iso} \approx 0.017\) but also did so at a much smaller runtime of approx. 18 seconds. Comparing GIF to nearest-neighbor-based methods, we observe a slight degradation in predictive performance with \(\varDelta _\text {NN} = -\, 0.016\) but a large improvement in terms of runtime. Conversely, there are cases in which the GIF algorithm tends to exhibit larger runtimes than competing tree-based algorithms. This is the case for some datasets like Wilt and PenDigits. However, GIF in these cases also yields higher predictive performances with ROC AUC gains being \(\varDelta _\text {Iso} \approx 0.045\), hence constituting a viable runtime-performance trade-off. On the other hand, there are datasets like Annthyroid and Spambase, for which GIF yields inferior predictive performances and a runtime that is measurably higher than that of the competing algorithms evaluated in Table 3.

This is especially visible for the Creditfraud dataset. This particular dataset is quite large (cf. Table 1) and therefore naturally increases the runtime of all algorithms because their time complexity is (also) a function of the subset size, which we set to 25% of the dataset size. Additionally, the evaluation of the exit condition for every node becomes more costly for datasets with larger dimensionality. Nevertheless, we can observe much higher average runtimes for GIF, which are approx. 12x larger than those of EIF and 208x larger than those of IF, while yielding a relative degradation in predictive performance of \(\varDelta \approx -\,0.01\). It is conceivable that GIF in this case heavily suffers from an inappropriately chosen exit condition, which results in very large trees. This is also a possible explanation for why we observe much higher runtimes for Creditfraud when compared to Forestcover, although both datasets do not differ considerably in size.

5.4 Parameter sensitivity

Fig. 2

Visualization of resulting runtimes and ROC AUC scores for the GIF algorithm, different datasets, and different K values, where K determines how many child nodes are created for every yet unpartitioned node

In the last experiment, we want to evaluate how sensitive the GIF algorithm is to specific parameter changes w.r.t. runtime and predictive performance (i.e., the ROC AUC score). The GIF algorithm requires us to choose the number of child nodes per node (K), an appropriate exit threshold value, and the kernel function. For easier understanding, we focus on a subset of 7 datasets and use the RBF kernel in all experiments. Since the behavior of the exit threshold \(\tau \) also depends on the kernel function and its parameters, we use a constant exit threshold of \(\tau = 0.1\) in all experiments but vary the scaling parameter of the RBF kernel.

First, we want to take a look at the parameter sensitivity of GIF regarding the K parameter. Here, we used the RBF scaling parameter \(\sigma \) that maximized the ROC AUC score. The result of this procedure is shown in Fig. 2.

The plot suggests a rather stable runtime for a large fraction of datasets, as is to be expected. The runtime of the GIF algorithm seems to increase moderately for larger K. Interestingly, the satellite dataset in general also seems to show increasing runtimes for a larger K, but the increase is much more irregular when compared to the other datasets, while the dataset characteristics do not seem to be significantly different. The plot shows lower runtimes for \(K = 3\) and \(K=6\), suggesting that GIF might be sensitive to the choice of the K parameter for some specific datasets and otherwise tends to build larger, deeper trees that lead to an unusual increase in runtime. Nevertheless, it can be stated that the runtimes of the GIF algorithm do not seem to be heavily impacted by an increase in K, although some datasets exhibit irregularities in their runtime behavior.

Regarding the ROC AUC score, we find quite stable predictive performances for most datasets when varying the K parameter, indicating that the GIF algorithm is insensitive to this parameter. It is interesting to note, however, that the ROC AUC score seems to (moderately) decrease with a larger K for datasets like mammography and waveform. Here, it is conceivable that these specific datasets do not profit from, but rather suffer from, finding more partitions in their data space. The satellite dataset does not seem to exhibit any regularity with regard to the K parameter.

Fig. 3

Visualization of resulting runtimes and ROC AUC scores for the GIF algorithm, different datasets, and different \(\sigma = s\sqrt{d}^{-1}\) values, where \(\sigma \) controls the scaling of the RBF kernel and hence how similar different pairs of observations are regarded to be

Second, we want to investigate how much the scaling parameter of the RBF kernel influences runtime and predictive performance. From the definition of the RBF kernel, we know that smaller scalings lead to more dissimilar observations. Conversely, larger scalings lead to more similar observations. As the GTr induction routine seeks to find compact and thus also very similar partitions, the trees potentially become a lot deeper for smaller \(\sigma < 1\), which in turn impacts the runtime negatively. Hence, we expect that the runtime for smaller \(\sigma \) values is measurably higher than for larger \(\sigma \) values. For this analysis, we only consider experiments with \(K = 5\). The results are shown in Fig. 3.
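The effect of the scaling on pairwise similarity can be illustrated with a small sketch. It assumes the common RBF form \(k(x, y) = \exp (-\Vert x - y\Vert ^2 / (2\sigma ^2))\) together with the parameterization \(\sigma = s / \sqrt{d}\) used in Fig. 3. The exact kernel definition is given elsewhere in the paper and may differ by a constant factor, and the function name and example points are hypothetical.

```python
import math


def rbf_kernel(x, y, s):
    """RBF similarity with sigma = s / sqrt(d), where d is the dimension.

    Assumed form: exp(-||x - y||^2 / (2 * sigma^2)).
    """
    d = len(x)
    sigma = s / math.sqrt(d)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Two fixed points in d = 4 dimensions (hypothetical example data).
x = [0.0, 0.0, 0.0, 0.0]
y = [0.5, 0.5, 0.5, 0.5]

# Larger s makes the same pair of points look more similar.
similarities = [rbf_kernel(x, y, s) for s in (0.5, 1.0, 2.5, 5.0)]
print(similarities)
```

For a fixed pair of points, the similarity grows monotonically with s and approaches 1. This matches the intuition that, for large scalings, most observations are already regarded as similar.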

The observed runtime behavior confirms our expectations across all chosen datasets. However, it is interesting to see that the runtime not only decreases for increasing kernel scalings but also seems to reach a plateau for \(s \ge 2.5\). It seems that the tree induction routines experience a saturation effect, in which the runtime cannot decrease any further once the scaling parameter exceeds some value. A possible explanation for this is that GIF precludes smaller trees since observations become more similar for larger kernel scaling values. From the accompanying ROC AUC plot, we see a clear runtime vs. performance trade-off in some cases. Datasets like PenDigits, Mammography, Annthyroid, and Wilt seem to benefit from smaller scalings, which admittedly lead to higher runtimes but improve the predictive performance.

6 Application to financial transaction data

In this section, we highlight the usability of our method in the context of finding transactional fraud in financial data. Transaction fraud is a well-known problem in the financial sector in which criminals perform unauthorized transactions with stolen credit card information [3]. Fraud detection algorithms try to automatically find financial fraud, possibly before a dubious transaction has even been processed, while keeping the regular transactions untouched. This introduces a multitude of challenges for detection algorithms [3, 10]:

  • Transaction fraud data is highly imbalanced since most transactions are non-fraudulent.

  • To provide a higher quality of service and not to interfere with regular transactions, a small false-positive rate is desired.

  • Detection must be performed on time so that regular transactions are processed timely.

  • The detection algorithms should offer some form of interpretability for the operator.

Unfortunately, there is a lack of publicly available datasets in financial services, and transaction fraud data in particular is limited due to the private nature of these transactions. Thus, for this use-case study, we use the publicly available PaySim [19] dataset, which simulates financial transactions modeled after real-world private datasets. The goal of this experiment is not to compare our GIF method against other methods (as done in the previous section), but to show a real-world-oriented use case for financial transaction data. The dataset contains roughly 24 million transactions corresponding to a total timeframe of 30 days. Due to the size of this dataset, we did not consider it for the large-scale experiments performed before. For this experiment, we selected the first 572,500 transactions, corresponding to one day of transactions.

The dataset contains five different transaction types (CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER) between two parties (ORIGINAL, DESTINATION) represented as unique identifiers in the simulation. Each transaction is accompanied by the amount (AMOUNT) of the transaction, as well as the balance before and after the transaction for both parties (NEWBALANCE, OLDBALANCE). Each transaction also contains a weak label IS-FLAGGED originating from a simple rule-based system as well as the true label IS-FRAUD, which indicates whether the transaction was fraudulent or not. For our study, we decided to ignore the unique identifiers (ORIGINAL, DESTINATION) as well as the weak label so as not to depend on external systems. Also, we added the balance difference after the transaction for both parties (NEWBALANCE + AMOUNT - OLDBALANCE) as a feature. It is interesting to note that, due to the simulation, only two of the five transaction types (CASH-OUT, TRANSFER) include fraudulent transactions, whereas the other three (CASH-IN, DEBIT, PAYMENT) are always non-fraudulent.
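The preprocessing described above can be sketched as follows. The sketch assumes that each transaction is available as a dictionary. The per-party field names (e.g., `OLDBALANCE_ORIG`), the `TYPE` field, and the example values are hypothetical, since the exact column names of the dataset are not given here. The balance difference is computed literally as stated in the text, i.e., NEWBALANCE + AMOUNT - OLDBALANCE for both parties.

```python
def build_features(tx):
    """Drop identifiers and the weak label, add balance-difference features.

    The true label IS-FRAUD is returned separately: it is only used for
    evaluation, never as an input to the unsupervised detector.
    """
    features = {
        "TYPE": tx["TYPE"],
        "AMOUNT": tx["AMOUNT"],
        "OLDBALANCE_ORIG": tx["OLDBALANCE_ORIG"],
        "NEWBALANCE_ORIG": tx["NEWBALANCE_ORIG"],
        "OLDBALANCE_DEST": tx["OLDBALANCE_DEST"],
        "NEWBALANCE_DEST": tx["NEWBALANCE_DEST"],
        # Added features: NEWBALANCE + AMOUNT - OLDBALANCE for both parties.
        "DIFF_ORIG": tx["NEWBALANCE_ORIG"] + tx["AMOUNT"] - tx["OLDBALANCE_ORIG"],
        "DIFF_DEST": tx["NEWBALANCE_DEST"] + tx["AMOUNT"] - tx["OLDBALANCE_DEST"],
    }
    return features, tx["IS-FRAUD"]


# Hypothetical example transaction.
example = {
    "TYPE": "TRANSFER",
    "ORIGINAL": "C123", "DESTINATION": "C456",  # identifiers, ignored
    "AMOUNT": 200.0,
    "OLDBALANCE_ORIG": 1000.0, "NEWBALANCE_ORIG": 800.0,
    "OLDBALANCE_DEST": 50.0, "NEWBALANCE_DEST": 250.0,
    "IS-FLAGGED": 0,                             # weak label, ignored
    "IS-FRAUD": 0,
}
features, label = build_features(example)
print(features["DIFF_ORIG"], features["DIFF_DEST"], label)
```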

We performed 33,210 experiments using different configurations of GIF. The best solution achieved a ROC AUC of 0.82137. However, this solution also tagged 3513 non-fraudulent transactions as fraudulent. Therefore, we decided to use a configuration which achieves the highest true-negative detection rate while keeping the number of false positives at no more than 20. The resulting confusion matrix is given in Table 4. As expected, most transactions are non-fraudulent and are rightfully tagged as such. Moreover, the algorithm is able to identify nine fraudulent activities without any supervised knowledge about them. Unfortunately, there are 260 fraudulent activities which are not found, but only 20 non-fraudulent activities which are wrongly identified. During our experiments, we found a clear trade-off between the true-negative rate and the false-negative rate, which must be carefully adjusted for the specific problem at hand. For example, we found configurations with a higher true-negative rate at the expense of a higher number of false-negative predictions. Last, we note that the entire day of transactions was processed in 99.993 seconds, i.e., in well below 2 min. We are therefore confident that this method could be run in near real time in a real-world banking situation.
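The counts reported above can be turned into standard detection rates with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The true-negative count is not quoted in the text, so it is derived here under the assumption that all 572,500 selected transactions appear in the confusion matrix.

```python
# Counts reported in the text for the selected configuration.
true_positives = 9      # fraudulent transactions that were found
false_negatives = 260   # fraudulent transactions that were missed
false_positives = 20    # regular transactions wrongly tagged as fraud

# Assumption: every one of the 572,500 selected transactions is counted
# exactly once in the confusion matrix.
total = 572_500
true_negatives = total - true_positives - false_negatives - false_positives

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
false_positive_rate = false_positives / (false_positives + true_negatives)

# Throughput implied by the reported total processing time.
runtime_seconds = 99.993
throughput = total / runtime_seconds  # transactions per second

print(f"recall={recall:.4f} precision={precision:.4f}")
print(f"false-positive rate={false_positive_rate:.2e}")
print(f"throughput={throughput:.0f} transactions/s")
```

These rates make the trade-off explicit: the chosen configuration keeps false alarms very rare, at the price of missing most fraudulent transactions.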

Table 4 Confusion matrix after applying GIF on the PaySim data

7 Conclusion

Outlier detection is an important data mining problem and plays a key role in the financial sector. Isolation forest is one of the most used outlier detection algorithms due to its excellent practical performance. However, the theoretical properties of this algorithm are not very well understood. It is especially unclear under which assumptions IF and its siblings work well. In this paper, we presented a theoretical framework for tree-based outlier detection methods which builds on the widely accepted assumption that outliers are events from rare mixtures in mixture distributions. We showed that trees are well-suited to approximate the underlying mixture distribution and that they can be used to find mixture components with small weights, thereby finding potential outliers. Moreover, we showed that IF, EIF, and SCiForest can be analyzed in this framework. Furthermore, we showed that the average path length can be used as a scoring rule for outliers if longer decision paths in the tree imply fewer points in the regions. We used these insights to derive a new algorithm called generalized isolation forest. GIF constructs trees with K regions at each node to split the data, making it more powerful than traditional isolation trees. In addition, we directly estimate the mixture coefficients instead of relying on the average path length. In an extensive evaluation, we compared GIF with 18 state-of-the-art outlier detection methods on 14 different datasets with over 350,000 hyper-parameter combinations. We showed that GIF comfortably outperforms other tree-based methods such as IF, EIF, and SCiForest. Additionally, we showed that our algorithm could improve on the state-of-the-art in some cases.