1 Introduction

According to the well-known definition of Hawkins (1980), an anomaly is an object that is notably different from most remaining objects. Chandola et al. (2009) subdivided anomalies into three types: point anomalies (an object is considered anomalous when compared against the rest of the objects), contextual anomalies (an object is anomalous in a specific context), and collective anomalies (a collection of objects is anomalous with respect to the entire dataset). The analysis of anomalies has a wide range of applications, such as network security (Ahmed et al. 2016a), bioinformatics (Spinosa and Carvalho 2005), fraud detection (Ahmed et al. 2016b), and fault detection and isolation (Hwang et al. 2009).

Anomaly analysis consists of two equally important tasks: anomaly detection and anomaly explanation. A wealth of ‘shallow’ machine learning based methods, i.e., methods not based on deep learning, has been proposed to detect anomalies (Chandola et al. 2009). More recently, many deep learning based anomaly detection methods have also been developed (Pang et al. 2021). However, deep learning based anomaly detection methods are notorious for not being interpretable: generally, both the model itself is non-transparent and the resulting anomaly scores are hard to interpret without the use of a post-hoc explainer. It is especially the latter that we consider problematic, as post-hoc explanations depend not only on the model but also on the specific explainer used. In addition, deep learning methods typically require large amounts of data, and training the models is a time-consuming process. In many real-world applications, however, ‘native’ interpretability (i.e., without a post-hoc explainer) may be required, and limited data and/or computation time may be available. For these reasons, in this paper we restrict our focus to ‘shallow’ machine learning based methods. In correspondence with this choice, we focus on settings where the amount of data is smaller than is typically required to learn accurate deep models. Most existing shallow methods only consider point anomaly detection, largely ignoring contextual anomaly detection. Moreover, anomaly explanation has received very limited attention. In this paper, we address both the problem of contextual anomaly detection and that of anomaly explanation, for small to moderately sized tabular data having categorical and/or quantitative features.

Fig. 1

Different types of anomaly detection. a Distance- and density-based methods both consider object B to be abnormal, because it is far away from other objects and in a low-density region. b Dependency-based methods model the relationship between height and weight, and consequently consider object B to be normal. c Conditional quantiles provide more information about the conditional distribution than just the mean, and d can be used to visualise—using beanplots (or, rather, a variation of a beanplot combined with boxplots, see Sect. 5.2 for details)—why a certain object is considered (ab)normal

Shallow anomaly detectors are typically categorised into distance-based, density-based, and distribution-based approaches (Wang et al. 2019). Distance-based and density-based methods use knowledge about the spatial proximity of objects to identify anomalies, while distribution-based methods use knowledge about the distribution of the data to detect anomalies. These methods assume, respectively, that objects at a large distance from or with a very different density than their spatial neighbours are anomalous, and that objects taking rare values under a marginal or joint distribution of the features are anomalous. Either assumption may lead to false positives though. For example, as shown in Fig. 1a, these methods will mistakenly identify object B as an anomaly if the aim is to detect people that are over- or underweight.

As this is often undesirable, we aim to define and detect anomalies from another perspective, that is, we assume that objects that violate a dependency, i.e., a relationship between features, are anomalous. This is the idea behind dependency-based anomaly detection (Lu et al. 2020a), which is not new but has received limited attention compared to other types of anomalies. It leverages the intrinsic structure and properties of the data to find potentially more relevant anomalies. For example, Fig. 1b shows that this approach will identify object B as normal by modelling the dependency between height and weight.

Traditional anomaly detection techniques—including distance-, density-, and dependency-based methods—treat all features equally when identifying anomalies. However, in domains such as healthcare, sensor networks, and environmental protection, some features should never be used directly to quantify anomalousness. For instance, forest fire detection systems should not treat ‘deviating’ values of latitude, longitude, date, and time as an indication of an anomaly. We should not simply discard these features either though, as they may contain relevant information. For example, the ‘normal’ temperature may be higher for certain regions than for others. This motivates us to investigate contextual anomaly detection, which can take such extra features into account.

Contextual anomaly detection assumes that an object is anomalous if it strongly deviates from objects within its ‘context’ of similar objects. The apparent contradiction in this assumption is explained by the division of the features into two disjoint subsets, i.e., contextual features and behavioral features. The contextual features are only used to define the contexts, using some similarity measure, while the behavioral features are only used to determine whether an object deviates from other objects within its context. Domain knowledge often leads to a natural division between contextual and behavioral features.

We observe that both contextual and dependency-based anomaly detection methods identify anomalies by explicitly or implicitly exploring dependencies between features. Concretely, dependency-based anomaly detection methods model dependency relationships between all features explicitly, while contextual anomaly detection methods model dependency relationships between behavioral and contextual features implicitly or explicitly. As far as we know this connection has not yet been pointed out in the literature.

1.1 Approach and contributions

In this paper, we introduce an approach for contextual anomaly detection and explanation that integrates the core principle of dependency-based anomaly detection into contextual anomaly detection to obtain a very accurate approach. As is common, we use regression analysis to model dependencies. Existing methods for dependency-based and contextual anomaly detection that use regression, however, typically only estimate the conditional mean of the response variable and directly interpret that as ‘normal’ value. This strongly limits how well anomalies can be detected, as the conditional mean provides very limited information about the conditional distribution. We therefore use quantile regression, which can model a conditional distribution in much more detail by estimating conditional quantiles. Figure 1c shows how conditional quantiles provide more information about the relationship between weight and height than a conditional mean could.

More specifically, a subset of the features, dubbed contextual features, is used to define the context of an object, while the remaining features, dubbed behavioral features, are used for detecting deviations within a context. In this paper we assume that the contextual features can be mixed but all behavioral features are numerical. Given this context and our aims, we use Quantile Regression Forests (Meinshausen 2006) to perform predictions for each behavioral feature and obtain corresponding uncertainty quantifications. By summing the quantified uncertainties (with a wider quantile interval representing a higher level of uncertainty) for all individual behavioral features, we obtain the anomaly score for a data instance. By attributing parts of the anomaly score to individual behavioral features, the approach intrinsically provides explanations in the sense that it can convey to what extent each feature contributed to making the instance an anomaly. This offers advantages over post-hoc explanation methods such as SHAP (Lundberg and Lee 2017), which we do not consider ‘attributable’ for the following two reasons. First, the post-hoc explanation may not match the information/rationale used by the model to detect an anomaly (Li et al. 2022). Second, it has recently been shown (Fokkema et al. 2022) that attribute-based explanations cannot be both recourse sensitive and robust, which is a good reason to avoid such explainers when possible and use a ‘native’ attribution-based method instead.

As far as we are aware, we are the first to use quantile regression for dependency-based or contextual anomaly detection. Specifically, we choose to employ Quantile Regression Forests for three reasons. First, they can model both linear and non-linear dependencies between features. Second, in the paper that introduced the method it was empirically shown to outperform other conditional quantile estimators in most cases. Third, the estimated quantiles are helpful to explain why an object is considered an anomaly without having to explicitly refer to other objects. Figure 1d shows how the estimated conditional quantiles can be used to approximate the conditional probability density at different locations, which we can use to accurately detect anomalies.

The main contributions of our work can be summarized as follows: (1) We identify a connection between dependency-based traditional anomaly detection methods and contextual anomaly detection methods, and exploit this observation to introduce a novel high-level approach to contextual anomaly detection and explanation; (2) We instantiate this generic approach using quantile regression (and Quantile Regression Forests specifically) for anomaly detection and a beanplot-based visualization for anomaly explanation; and (3) We perform extensive experiments on synthetic and real-world datasets to empirically demonstrate the effectiveness and interpretability of the proposed method when compared to state-of-the-art methods.

The remainder of this paper is organized as follows. Section 2 discusses related work, both in contextual and traditional anomaly detection. Section 3 introduces notation, formalizes the problem, and presents the high-level approach that we propose. Section 4 then describes some technical preliminaries, most notably Quantile Regression Forests. Section 5 introduces QCAD, our proposed method for Quantile-based Contextual Anomaly Detection and Explanation that instantiates the high-level approach. Section 6 empirically compares QCAD to its competitors, and provides a case study that investigates the use of QCAD to find exceptional football players from data. Section 7 concludes the paper.

2 Related work

We first discuss related work on contextual anomaly detection and explanation, and then proceed with the two most closely related types of traditional anomaly detection: dependency-based and subspace-based anomaly detection.

2.1 Contextual anomaly detection and explanation

Contextual anomaly detection has received particular attention in spatial data (Cai et al. 2013), temporal data (Salvador et al. 2004), and spatio-temporal data (Smets et al. 2009), where spatial and/or temporal features are used to define contexts. These methods are not directly applicable to other domains, where the contexts are defined by other types of features; ‘generic’ contextual anomaly detection has received limited attention in the community.

CAD (Song et al. 2007) is a seminal work that introduced generic contextual anomaly detection. It assumes a user-specified partition of features into contextual (called ‘environmental’) and behavioral (called ‘indicator’) features, and uses Gaussian Mixture Models to fit the distributions of the contextual and behavioral feature spaces. Dependencies between contextual features and behavioral features are then learned by means of ‘mapping functions’, and an object is considered anomalous if it violates the learned functions. For this to work, CAD assumes that both the contextual and behavioral features are generated by a mixture of an unknown number of Gaussian components, which may be a strong assumption in practice. Further, CAD can only handle numerical features and is computationally very expensive. Our proposed method makes no assumptions about the distribution of the features, can deal with mixed contextual features and numerical behavioral features, and we will show empirically that it is computationally more efficient than CAD while achieving a higher detection accuracy.

ROCOD (Liang and Parthasarathy 2016) is also closely related, and uses local and global models of expected behavior to describe the dependencies between contextual and behavioral features. Concretely, standard regression models such as CART are used to learn global patterns, with contextual features as predictor variables and behavioral features as response variables. Local patterns are computed based on the means of the behavioral feature values of an object’s neighbours. An object’s actual value is compared to the local and global pattern, and the weighted average of these differences forms the anomaly score. The conditional mean, however, describes only one aspect of a conditional distribution and is not necessarily the point with the highest probability of occurrence (i.e., the mode). To address this, our method employs quantile regression analysis to estimate conditional quantiles, which provide a much more complete description of the conditional distribution. As a result, our method empirically outperforms ROCOD in terms of accuracy.

With the increasing use of anomaly detection algorithms in safety-critical domains, such as healthcare and manufacturing, the ethical and regulatory obligations to provide explanations for the high-stakes decisions made by these algorithms have become more pronounced (Li et al. 2022). Existing contextual anomaly detection methods do not provide such explanations though, and non-trivial modifications would be needed for them to do so. Specifically, CAD (Song et al. 2007) computes the anomaly score of a data instance by measuring its deviation from the mapping functions that are learned from the majority of data instances. This approach poses a challenge in generating intrinsic explanations, as it does not allow for the attribution of the anomaly score to individual features. Further, ROCOD (Liang and Parthasarathy 2016), the other known method for contextual anomaly detection, calculates the weighted average of an instance’s differences to the learned local and global patterns as its anomaly score. The local pattern is obtained using its neighbours, while the global pattern is learned using all instances, making it difficult to associate the anomaly score with individual features.

Despite the existence of numerous methods for explaining anomalies, such as those outlined in recent (survey) papers (Panjei et al. 2022; Li et al. 2022; Xu et al. 2021), there has been very limited research on contextual anomaly explanation. Particularly, COIN (Liu et al. 2018) explains outliers by reporting their outlierness score, the features that contribute to its abnormality, and a contextual description of its neighbourhoods. COIN treats all features equally though, while our method divides features into contextual and behavioral features. Further, we develop a visualisation that helps explain contextual anomalies in addition to reporting the overall anomaly score, feature importance, contextual neighbours, and individual anomaly score for each behavioral feature.

The following methods are less relevant because they consider slightly different problems. Valko et al. (2011) construct a non-parametric graph-based method for conditional anomaly detection, which only addresses the problem of a single categorical behavioral feature. Hayes and Capretz (2014) first identify anomalies on behavioral features, and then refine the detected anomalies by clustering all objects on contextual features. Tang et al. (2015) detect group anomalies from multidimensional categorical data. Hong and Hauskrecht (2015) present a contextual anomaly detection framework dedicated to categorical behavioral features which also considers the dependencies between behavioral features. Moreover, Zheng et al. (2017) apply robust metric learning on contextual features to find more meaningful contextual neighbours and then leverage k-NN kernel regression to predict the behavioral feature values. Meghanath et al. (2018) develop ConOut to automatically find and incorporate multiple contexts to identify and interpret outliers.

2.2 Traditional anomaly detection

Although traditional anomaly detection considers a problem that is different from contextual anomaly detection, dependency-based and subspace-based anomaly detection leverage techniques that are related to our method.

2.2.1 Dependency-based anomaly detection

Dependency-based anomaly detection aims to identify anomalies by exploring dependencies between features. Teng (1999) explores the dependency between non-target and target features to identify and correct possible noisy data points. To detect networking intrusions, Huang et al. (2003) present Cross-Feature Analysis to capture the dependency patterns between features in normal networking traffic. To detect disease outbreaks, Wong et al. (2003) propose to explore the dependency between features using a Bayesian network. Noto et al. (2010) propose to detect anomalies by using an ensemble of models, with each model exploring the dependency between a response feature and other features. Babbar and Chawla (2012) use Linear Gaussian Bayesian networks to detect anomalies that violate causal relationships.

LoPAD (Lu et al. 2020b) first uses a Bayesian network to find the Markov Blanket of each feature. Then, a predictive model (e.g., CART) is learnt for each individual feature as response variable, with its Markov Blanket as predictor variables. Given an object, LoPAD computes the Euclidean distance between its actual value and predicted value (i.e., conditional mean) as its anomaly score, for each feature. The resulting anomaly scores are normalized and summed to obtain the final anomaly score for an object. The method does not distinguish contextual and behavioral features and—like ROCOD—uses conditional means to represent the conditional distribution. Consequently, LoPAD cannot (accurately) detect contextual anomalies.

2.2.2 Subspace-based anomaly detection

Subspace-based anomaly detection seeks to find anomalies in part of the feature space. Specifically, Kriegel et al. (2009) propose SOD to identify outliers in varying subspaces. Concretely, they construct axis-parallel subspaces spanned by the neighbours of a given object. On this basis, they investigate whether this object deviates significantly from its neighbours on any of these subspaces. Furthermore, Kriegel et al. (2012) extend this work to determine whether an object is anomalous on arbitrarily oriented subspaces spanned by its neighbours. Nguyen et al. (2013) propose to find subspaces with strong mutual correlations and then identify anomalies on these subspaces. Finally, Cabero et al. (2021) use archetype analysis to project the feature space into various subspaces with linear correlations based on nearest neighbours. On this basis, they explore outliers by ensembling the results obtained on relevant subspaces. Overall, these methods aim to identify anomalies in a subset of features, but treat all features equally and are thus not suitable to identify contextual anomalies.

3 Contextual anomaly detection and explanation

We first introduce the necessary terminology and notations, and illustrate this with the running example depicted in Table 1. A dataset \({\textbf{X}} =\{{\textbf{x}}_{1},\ldots ,{\textbf{x}}_{i},\ldots ,{\textbf{x}}_{N}\}\) contains N instances (or data points) over the set of features (a.k.a. attributes or variables) denoted by \({\textbf{F}} = \{{\textbf{f}}^{1},\ldots ,{\textbf{f}}^{j},\ldots ,{\textbf{f}}^{D}\}\). \(x_{i}^{j}\) denotes the value of the i-th object, \({\textbf{x}}_{i}\), for the j-th feature, \({\textbf{f}}^{j}\). In the running example, we have \(N=16\), \(D=6\), \({\textbf{F}}= \{Latitude, Longitude, Season, Temperature, Rain, Wind\}\), and \(x_{1}^{Rain} = 69\).

Table 1 Running example: fictional climate data for Dutch cities. Temperature is measured in degrees Celsius, rain in mm, and wind in miles per hour. The city is not used for anomaly detection. Latitude, Longitude and Season are treated as contextual features, while Temperature, Rain and Wind are considered the behavioral features

We assume that feature set \({\textbf{F}}\) is divided into two disjoint feature sets (typically using domain knowledge): a contextual feature set \({\textbf{C}} =\{{\textbf{c}}^{1},\ldots ,{\textbf{c}}^{p},\ldots ,{\textbf{c}}^{P}\}\) and a behavioral feature set \({\textbf{B}} =\{{\textbf{b}}^{1},\ldots ,{\textbf{b}}^{q},\ldots ,{\textbf{b}}^{Q}\}\), such that \(D=P+Q\), \({\textbf{F}} = {\textbf{C}} \cup {\textbf{B}}\), and \({\textbf{C}} \cap {\textbf{B}} = \emptyset\). Without loss of generality, we can rearrange the features of an object \({\textbf{x}}_{i}=(x_{i}^{1},\ldots ,x_{i}^{j},\ldots ,x_{i}^{D})\) and represent it as \({\textbf{x}}_{i}=(x_{i}^{1},\ldots ,x_{i}^{P},x_{i}^{P+1},\ldots ,x_{i}^{P+Q}) = (c_{i}^{1},\ldots ,c_{i}^{p},\ldots ,c_{i}^{P},b_{i}^{1},\ldots ,b_{i}^{q},\ldots ,b_{i}^{Q})=({\textbf{c}}_{i}, {\textbf{b}}_{i}),\) so that \({\textbf{c}}_{i}\) and \({\textbf{b}}_{i}\) denote its contextual and behavioral feature values, respectively. Accordingly, we refer to the space spanned by \({\textbf{C}}\) as contextual space, i.e., \({\mathcal {C}} = {\textbf{c}}^{1}\times \cdots \times {\textbf{c}}^{p}\times \cdots \times {\textbf{c}}^{P}\), and to the space spanned by \({\textbf{B}}\) as behavioral space, i.e., \({\mathcal {B}} = {\textbf{b}}^{1}\times \cdots \times {\textbf{b}}^{q}\times \cdots \times {\textbf{b}}^{Q}\).

Finally, let \({{\,\textrm{Pow}\,}}\) denote the powerset, i.e., \({{\,\textrm{Pow}\,}}({\textbf{X}}) = \{X \subseteq {\textbf{X}}\}\).

3.1 Problem statement

In contextual anomaly detection, contextual features are used to determine the so-called context of an object. An object’s context is used to estimate whether it is anomalous. The latter is achieved by comparing the object’s values for the behavioral features to what is ‘normal’ within the object’s context—if the object’s behavioral values strongly deviate, it is flagged as an anomaly.

For contextual anomaly detection to be meaningful, we must assume that there exist dependencies between the contextual and behavioral data. If such a relationship does not exist, there is no need to use contextual anomaly detection; one could simply remove the contextual features and reduce the problem to a traditional anomaly detection problem.

We illustrate this using the running example in Table 1. We can detect anomalous weather taking into account different regions and seasons by specifying contextual feature set \({\textbf{C}} = \{Latitude, Longitude, Season\}\) and behavioral feature set \({\textbf{B}} = \{Temperature, Rain, Wind\}\). Each city is now only compared to cities with a similar latitude, longitude, and season, i.e., its context. If a city has values for temperature, rain, and/or wind that strongly deviate from those of the cities in its context, it is marked as anomalous.

For example, Amsterdam and Rotterdam could form the context of Leiden; they are both nearby and we have measurements for the same season. Temperature and wind are similar for all three cities, but there was substantially more rain in Leiden than in Amsterdam and Rotterdam. Hence, for that reason Leiden could be flagged as an anomaly.

To formalise the problem, we introduce a generic ‘context function’ that maps each possible data point to a subset of the dataset, i.e., its context, based on the data point’s contextual features.

Problem 1: Contextual anomaly detection Given a dataset \({\textbf{X}}\) with a feature set \({\textbf{F}} = ({\textbf{C}},{\textbf{B}})\), a context function \(Context : {\mathcal {C}} \rightarrow {{\,\textrm{Pow}\,}}({\textbf{X}})\), an anomaly detector \(Anomaly : {\mathcal {B}} \times {{\,\textrm{Pow}\,}}({\textbf{X}}) \rightarrow {\mathbb {R}}^{+}\), and a threshold \(\phi\), find all data points for which the anomaly scores exceed \(\phi\) and are thereby flagged as anomalous—based on the behavioral features—within their individual contexts, i.e., \(\{({\textbf{c}},{\textbf{b}}) \in {\textbf{X}} \mid Anomaly ({\textbf{b}}, Context ({\textbf{c}})) \ge \phi \}\).

Note that the context is determined based only on contextual feature values, and that the anomaly detector may only use the behavioral feature values of the data points in the given context when establishing if the given data point is anomalous or not.
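To make Problem 1 concrete, the following minimal Python sketch expresses the interface of the problem statement. All names (`detect_contextual_anomalies`, `context_fn`, `anomaly_fn`) are hypothetical, and the toy context and anomaly functions are illustrative placeholders, not the method proposed in this paper:

```python
import numpy as np

def detect_contextual_anomalies(X_behavior, context_fn, anomaly_fn, phi):
    """Flag objects whose behavioral values deviate within their own context.

    context_fn(i)        -> indices of the context of object i, determined
                            from the contextual features only
    anomaly_fn(b, B_ref) -> non-negative anomaly score of behavioral values b
                            against the behavioral values B_ref of the context
    Returns the indices of flagged objects and all anomaly scores.
    """
    scores = np.array([anomaly_fn(X_behavior[i], X_behavior[context_fn(i)])
                       for i in range(len(X_behavior))])
    return np.flatnonzero(scores >= phi), scores

# Toy instantiation: context = 2 nearest neighbours in contextual space,
# anomaly = summed absolute deviation from the context's behavioral mean.
X_context = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
X_behavior = np.array([[1.0], [1.1], [0.9], [10.0], [10.2], [3.0]])

D = np.linalg.norm(X_context[:, None] - X_context[None, :], axis=-1)
np.fill_diagonal(D, np.inf)                 # exclude the object itself
context_fn = lambda i: np.argsort(D[i])[:2]
anomaly_fn = lambda b, B: float(np.abs(b - B.mean(axis=0)).sum())

flagged, scores = detect_contextual_anomalies(X_behavior, context_fn, anomaly_fn, phi=5.0)
```

Here the last object behaves like the first group while its context is the second group, so only it exceeds the threshold; per the note above, the anomaly function sees only the behavioral values of the context.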

In practice, analysts are not only interested in identifying anomalies, but also need to know the underlying reasons why a specific object is reported as an anomaly. This leads to the second problem that we consider.

Problem 2: Contextual anomaly explanation Given an anomalous object \({\textbf{x}} \in {\textbf{X}}\) and the context function \(Context\) and anomaly detector \(Anomaly\) that were used to detect it, find the behavioral features \(\mathbf {B'} \subseteq {\textbf{B}}\) for which \({\textbf{x}} = ({\textbf{c}}, {\textbf{b}})\) substantially deviates from \(Context ({\textbf{c}})\).

3.2 Overall approach

The problem statement in the previous section suggests a three-pronged approach based on (1) context generation, (2) anomaly detection, and (3) anomaly explanation. In this subsection we explain and illustrate the overall approach that we propose, using the running example from Table 1.

Fig. 2

Reference Group Generation and Anomaly Explanation. The Reference Group Generation phase computes the distance matrix using only the contextual features, and finds a reference group for each object on this basis. The Anomaly Explanation phase produces an explanation for each identified anomaly by reporting its contextual neighbours, final anomaly score, and raw anomaly scores in each behavioral feature

Phase 1: Reference group generation Many choices are possible for the context function; in this manuscript we choose to use an object’s k nearest neighbours, which we refer to as reference group. The most important reasons for this choice are that (1) once a global contextual distance matrix has been computed the nearest neighbours of any object can be found relatively quickly; (2) this approach only requires a distance metric to be chosen, which can be defined for any type of data and be adapted to the problem at hand; and (3) it is generic enough to allow for different uses of the resulting contexts.

Given a dataset \({\textbf{X}}\) with contextual feature set \({\textbf{C}}\), we first compute the distance matrix \({\textbf{M}}\) between all objects using only the contextual features. Second, for any object \({\textbf{x}} \in {\textbf{X}}\), we find its k nearest neighbours in contextual space based on \({\textbf{M}}\). As a result, the k nearest neighbours of \({\textbf{x}}\) form a reference group, denoted as \(R({\textbf{x}},k)\), which serves as context in Problems 1 and 2.

For instance, as shown in Fig. 2a, given \(\{Latitude, Longitude, Season\}\) as the contextual feature set, we first calculate the distance matrix in the contextual space \(Latitude \times Longitude \times Season\). Second, based on the distance matrix, we can find the three nearest neighbours of any object as its reference group. Concretely, the three nearest neighbours for \({\textbf{x}}_{1}\) are \(\{{\textbf{x}}_{2},{\textbf{x}}_{3}, {\textbf{x}}_{4}\}\), which forms a reference group for \({\textbf{x}}_{1}\), denoted as \(R({\textbf{x}}_{1},3)\).
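The choice of distance metric is left open here; since the contextual features can be mixed, a Gower-style distance is one possibility. The following numpy sketch (our own naming, an assumption rather than the paper's prescribed metric) range-normalises numeric features and counts categorical mismatches as distance 1:

```python
import numpy as np

def gower_distance_matrix(X_num, X_cat):
    """Pairwise Gower-style distances over mixed contextual features.

    X_num: (N, p) numeric contextual features, range-normalised per feature;
    X_cat: (N, q) categorical contextual features, mismatch = distance 1.
    Returns an (N, N) symmetric distance matrix with zero diagonal.
    """
    rng = X_num.max(axis=0) - X_num.min(axis=0)
    rng[rng == 0] = 1.0                                    # guard constant features
    num_d = np.abs(X_num[:, None, :] - X_num[None, :, :]) / rng
    cat_d = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)
    return np.concatenate([num_d, cat_d], axis=-1).mean(axis=-1)

def reference_group(M, i, k):
    """Indices of the k nearest neighbours of object i under M, excluding i."""
    order = np.argsort(M[i])
    return np.array([j for j in order if j != i][:k])

# Toy contextual data: (latitude-like, longitude-like) numeric + season category.
X_num = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [10.0, 10.0]])
X_cat = np.array([["w"], ["w"], ["w"], ["s"]])
M = gower_distance_matrix(X_num, X_cat)
```

With this matrix, the reference group of the first object is `reference_group(M, 0, 2)`; the far-away object with a different season is never selected.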

Phase 2: Anomaly detection Given an object \({\textbf{x}} \in {\textbf{X}}\) with its reference group \(R({\textbf{x}},k)\), we apply an anomaly detector \(Anomaly\) to obtain an anomaly score based only on the behavioral attribute values. We repeat the above process for all objects, leading to N anomaly scores \(S= \{s_{1},s_{2},\ldots ,s_{N}\}\). We sort S and use threshold \(\phi\) to obtain a ranked list of contextual anomalies \(A= \{{\textbf{a}}_{1},\ldots ,{\textbf{a}}_{m},\ldots ,{\textbf{a}}_{M}\}\), with \(M \ll N\).

For example, we can apply an anomaly detector on the behavioral space \(Temperature \times Rain \times Wind\) of \({\textbf{x}}_{1}\) and its reference group \(R({\textbf{x}}_{1},3)=\{{\textbf{x}}_{2},{\textbf{x}}_{3},{\textbf{x}}_{4}\}\). As a result, we obtain anomaly score \(s_{1}\) for \({\textbf{x}}_{1}\). Accordingly, repeating this process leads to a set of anomaly scores \(S= \{s_{1},\ldots ,s_{16}\}\). In our running example we find \(\{{\textbf{x}}_{4},{\textbf{x}}_{12},{\textbf{x}}_{13}\}\) as contextual anomalies, as they behave differently in the behavioral space \(Temperature \times Rain \times Wind\) when compared to their corresponding neighbours defined in the contextual space \(Latitude \times Longitude \times Season\). For example, Oss (\({\textbf{x}}_{4}\)) has relatively stronger winds in winter when compared to its nearest cities Leiden, Amsterdam and Rotterdam.

We require anomaly detectors to be attributable, meaning that an anomaly score s generated by an anomaly detector must be decomposable into individual contributions towards anomalousness for each behavioral feature using the anomaly detector’s native structure.

Phase 3: Anomaly explanation For each anomalous object \({\textbf{a}}_{m}\), we first report its reference group \(R({\textbf{a}}_{m},k)\) using the distance matrix obtained in Phase 1. Second, we report its anomaly score \(s_{m}\), as obtained in Phase 2. Moreover, we decompose \(s_{m}\) into individual contributions from each behavioral feature \({\textbf{b}}^{q} \in {\textbf{B}}\), resulting in a list of raw anomaly scores \({\textbf{s}}_{m}=\{s_{m1},\ldots ,s_{mq},\ldots ,s_{mQ}\}\). This is possible because \(s_{m}\) is attributable. Next, we report the top-h raw anomaly scores in \({\textbf{s}}_{m}\) with their corresponding behavioral features, where h can be specified by the analyst. The top-h behavioral features and raw scores enable analysts to better understand why a specific object is flagged as a contextual anomaly.

For example, Fig. 2b reports object \({\textbf{x}}_{12}\) as an anomaly. To explain why, we first inspect its reference group \(R({\textbf{x}}_{12},3)=\{{\textbf{x}}_{10},{\textbf{x}}_{11},{\textbf{x}}_{12}\}\), which contains the three objects most similar to \({\textbf{x}}_{12}\). Second, we report its anomaly score, i.e., 90. Next, we decompose the score and report the behavioral features with the highest deviations, e.g., Wind. We can interpret the result as: \({\textbf{x}}_{12}\) deviates substantially in Wind when compared to \(\{{\textbf{x}}_{10},{\textbf{x}}_{11},{\textbf{x}}_{12}\}\), which are all similar in terms of Latitude, Longitude and Season.
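Given an attributable detector, the explanation phase largely reduces to bookkeeping over the raw scores. A minimal sketch (hypothetical function name; the raw scores below are made-up numbers chosen to match the final score of 90 in the example, not values from the paper):

```python
import numpy as np

def explain_anomaly(raw_scores, behavioral_features, h=2):
    """Decompose an attributable anomaly score into per-feature contributions.

    raw_scores: raw anomaly scores s_m1, ..., s_mQ, one per behavioral
    feature, whose sum is the final anomaly score s_m.
    Returns the final score and the top-h contributing features.
    """
    raw_scores = np.asarray(raw_scores, dtype=float)
    top = np.argsort(raw_scores)[::-1][:h]          # h highest raw scores
    return {"anomaly_score": float(raw_scores.sum()),
            "top_features": [(behavioral_features[q], float(raw_scores[q]))
                             for q in top]}
```

For instance, hypothetical raw scores of 5, 15, and 70 for Temperature, Rain, and Wind would yield a final score of 90 with Wind reported as the dominant contributor, mirroring the explanation given above.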

4 Preliminaries

We first define the conditional mean and conditional quantiles, which we use to motivate the use of conditional quantiles for anomaly detection. Next, we detail quantile regression forests, a model class that can robustly estimate conditional quantiles. Readers familiar with these concepts can skip this section.

4.1 From conditional mean to conditional quantiles

Given a real-valued variable V and a set of variables \({\textbf{U}}\), regression analysis aims to model the distribution of V conditioned on \({\textbf{U}}\). Specifically, it takes V as the response variable and \({\textbf{U}}\) as the predictor variables to construct a model that can be used to predict V based on \({\textbf{U}}\). Standard regression uses training data \(\{({\textbf{U}}_{1},V_{1}),\ldots ,({\textbf{U}}_{n},V_{n})\}\) to learn a model that estimates the conditional mean \(E(V \mid {\textbf{U}}={\textbf{u}})\) and uses that as prediction for V when \({\textbf{U}}={\textbf{u}}\) is given. For example, Least Squares Regression fits a model \({\hat{\theta }}\) by minimising the expected squared error loss, namely \({\hat{\theta }} = \underset{\theta }{\textrm{argmin}}\, E\{(V-{\hat{V}}(\theta ))^{2}\mid {\textbf{U}}={\textbf{u}}\}\).

The conditional mean, however, is only a single statistic of the conditional distribution and is thereby limited in what it can capture. For example, if the conditional distribution is a multi-modal distribution, the mean is insufficient to describe it (regardless of whether we also consider the standard deviation).

To allow for more comprehensive descriptions of conditional distributions, Koenker and Hallock (2001) proposed to estimate conditional quantiles. As usual, quantiles are splitting points that divide the range of the probability distribution into consecutive intervals having equal probability. Given a continuous variable V, its conditional \(\alpha\)-quantile given \({\textbf{U}}={\textbf{u}}\) is defined by \(Q_{\alpha }({\textbf{u}}) = \inf \{v: F(v \mid {\textbf{U}}={\textbf{u}}) \ge \alpha \}\), where \(F(v \mid {\textbf{U}}={\textbf{u}}) = P(V \le v \mid {\textbf{U}}={\textbf{u}})\) is the cumulative distribution function. Conditional quantiles have the potential to describe the full conditional distribution of response variable V.
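To make the definition concrete, the following sketch (plain Python, not from the paper) computes \(Q_{\alpha } = \inf \{v: F(v) \ge \alpha \}\) for an empirical, unconditional distribution; the conditional case replaces the empirical CDF with an estimate of \(F(v \mid {\textbf{U}}={\textbf{u}})\):

```python
import numpy as np

def empirical_quantile(values, alpha):
    """Empirical alpha-quantile following Q_alpha = inf{v : F(v) >= alpha},
    with F the empirical CDF of `values`."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    # F(v_i) = (i + 1) / n for the i-th sorted value; take the first index
    # at which the empirical CDF reaches alpha.
    idx = int(np.ceil(alpha * n)) - 1
    return v[max(idx, 0)]

sample = [1.0, 2.0, 3.0, 4.0]
# F(1) = 0.25, F(2) = 0.5, ... so the 0.5-quantile is 2.0
print(empirical_quantile(sample, 0.5))  # -> 2.0
```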

Existing regression-based anomaly detection methods typically estimate the conditional mean of V, denoted \({\hat{v}}\), and then use the distance between the actual value and this expected value, i.e., \(dist(v,{\hat{v}})\), as anomaly score. It is hard to interpret this distance though, as the shape and range of the conditional distribution are unknown. In this paper we address this by estimating conditional quantiles instead. In particular, we will show that we can use conditional quantiles to approximate the probability of observing a certain data point in its context, which is then used for the anomaly score.

4.2 Quantile regression forests

Tree-based regression approaches—such as CART, M5, and Random Forests—are often used to learn both linear and non-linear dependencies. Meinshausen (2006) extended the Random Forest to Quantile Regression Forest, which estimates and predicts conditional quantiles instead of means.

Specifically, a quantile regression forest (QRF) is constructed by building an ensemble of K independent decision trees to estimate the full conditional cumulative distribution function of V given \({\textbf{U}} = {\textbf{u}}\), based on n independent observations \(\{({\textbf{U}}_{1},V_{1}),\ldots ,({\textbf{U}}_{i},V_{i}),\ldots ,({\textbf{U}}_{n},V_{n})\}\). Each \({\textbf{U}}_{i}\) consists of d dimensions. The estimated full conditional distribution can be written as

$$\begin{aligned} {\hat{F}}(v \vert {\textbf{U}}={\textbf{u}}) = {\hat{P}}(V \le v \vert {\textbf{U}}={\textbf{u}}) = {\hat{E}}( {\mathbb {I}}(V \le v) \vert {\textbf{U}}={\textbf{u}}) = \sum _{i=1}^{n}\omega _{i}({\textbf{u}}){\mathbb {I}}(V_i \le v), \end{aligned}$$
(1)

where \(\omega _{i}({\textbf{u}})\) denotes the weight assigned to observation \(({\textbf{U}}_{i},V_{i})\).

The decision trees that make up a quantile regression forest are constructed similarly to how a random forest is learned: for each individual tree, \(m \le n\) data points are sampled (with replacement), \(d^{'} \ll d\) features are randomly selected, and a criterion such as information gain is used to recursively split the data. In contrast to a standard random forest, however, each leaf node keeps all of its observations. K decision trees, namely \(T_{1}(\theta ),\ldots ,T_{K}(\theta )\), are independently grown to form a forest.

Once the forest has been constructed, for a given \({\textbf{U}} = {\textbf{u}}\) each decision tree \(T_{j}(\theta )\) is traversed to find the leaf node that \({\textbf{u}}\) resides in. A weight \(\omega _{i}({\textbf{u}},T_{j}(\theta ))\) is then computed for each observation \({\textbf{U}}_{i}\), with \(i \in \{1,\ldots ,n\}\): if observation \({\textbf{U}}_{i}\) and \({\textbf{u}}\) reside in the same leaf node, then \(\omega _{i}({\textbf{u}},T_{j}(\theta ))\) is defined as 1 divided by the number of observations residing in that leaf node; otherwise, \(\omega _{i}({\textbf{u}},T_{j}(\theta ))\) is 0. Next, these weights are averaged over all decision trees, i.e.,

$$\begin{aligned} \omega _{i}({\textbf{u}}) = \frac{1}{K} \sum _{j=1}^{K}\omega _{i}({\textbf{u}},T_{j}(\theta )), \end{aligned}$$
(2)

which is the weight assigned to an observation \({\textbf{U}}_{i}\). Finally, conditional quantile \(Q_{\alpha }({\textbf{u}}) = \inf \{v: F(v \mid {\textbf{U}}={\textbf{u}}) \ge \alpha \}\) can be estimated by \({\hat{Q}}_{\alpha }({\textbf{u}}) = \inf \{v: {\hat{F}}(v \mid {\textbf{U}}={\textbf{u}}) \ge \alpha \}\). Under reasonable assumptions, Meinshausen (2006) proved quantile regression forests to be consistent, i.e.,

$$\begin{aligned} \Big \vert {{\hat{F}}(v | {\textbf{U}}={\textbf{u}}) - F(v | {\textbf{U}}={\textbf{u}})}\Big \vert {\mathop {\longrightarrow }\limits ^{p}} 0, \text { with } n \longrightarrow \infty \end{aligned}$$
(3)

holds pointwise for every \({\textbf{u}}\).
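The weight computation of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, not Meinshausen's implementation: the per-tree leaf assignments are assumed to be given as arrays rather than computed by actual trees.

```python
import numpy as np

def qrf_weights(train_leaves, query_leaves):
    """Eq. (2): train_leaves is an (n, K) array giving, for each of the K
    trees, the leaf that each training observation falls into; query_leaves
    (length K) gives the leaf that u falls into. Observation i receives
    weight 1/(leaf size) in every tree where it shares u's leaf, else 0,
    averaged over trees."""
    n, K = train_leaves.shape
    w = np.zeros(n)
    for j in range(K):
        in_leaf = train_leaves[:, j] == query_leaves[j]
        if in_leaf.any():
            w[in_leaf] += 1.0 / in_leaf.sum()
    return w / K

def qrf_cdf(v, V_train, weights):
    """Eq. (1): weighted fraction of training responses not exceeding v."""
    return float(np.sum(weights * (V_train <= v)))

def qrf_quantile(alpha, V_train, weights):
    """Estimated conditional quantile: inf{v : F_hat(v) >= alpha}."""
    order = np.argsort(V_train)
    cum = np.cumsum(weights[order])
    idx = int(np.searchsorted(cum, alpha))
    return V_train[order][min(idx, len(order) - 1)]

# Toy example: n = 4 training observations, K = 2 trees.
train_leaves = np.array([[0, 1],
                         [0, 1],
                         [1, 0],
                         [1, 0]])
query_leaves = np.array([0, 1])      # u's leaf in tree 1 and in tree 2
V = np.array([1.0, 2.0, 3.0, 4.0])
w = qrf_weights(train_leaves, query_leaves)
print(w)                   # -> [0.5 0.5 0.  0. ]
print(qrf_cdf(2.0, V, w))  # -> 1.0
```

Note that the weights returned by `qrf_weights` always sum to one, so Eq. (1) indeed yields a valid CDF estimate.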

5 Quantile-based contextual anomaly detection and explanation

We present an instance of the generic approach for contextual anomaly detection presented in Sect. 3.2 that is based on quantile regression forests.

The main idea of our method is to estimate the deviation of an object’s behavioral values within a given context using uncertainty quantification around predictions, where the predictions are assumed to capture ‘normal’ behavior. Then, a higher uncertainty implies a higher deviation within the context and thus a higher degree of anomalousness. To this end several approaches could be explored. For example, one might use multi-target regression models—such as Multivariate Random Forests (Segal and Xiao 2011)—on all behavioral features, or single-target regression models—such as Random Forests (Breiman 2001)—on individual behavioral features followed by aggregation and conformal inference (Lei et al. 2018). In this paper we instead opt to use Quantile Regression Forests (Meinshausen 2006), a single-target regression model, because it is (relatively) simple, offering advantages with regard to interpretability, and directly provides uncertainty quantifications. More concretely, we derive intervals for the behavioral features from the underlying quantile regression forests as inherent uncertainty quantifications around predictions, resulting in statistically sound and interpretable measures of degree of anomaly.

In the first phase, we generate reference groups using a distance matrix computed on the contextual space of all data points. Specifically, we use Gower’s distance (Gower 1971) (see Sect. 6.1.1 for more detail), to be able to deal with both quantitative and categorical features, and select the k objects having the smallest distances to an object \({\textbf{x}}\) as its reference group \(R({\textbf{x}}, k)\).
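Given a precomputed Gower distance matrix, selecting a reference group reduces to picking the k indices with the smallest distances; a minimal sketch:

```python
import numpy as np

def reference_group(dist_matrix, i, k):
    """Return the indices of the k objects closest to object i in
    contextual space; dist_matrix holds pairwise Gower distances.
    Note that object i itself (distance 0) is included, as in the
    paper's example R(x12, 3)."""
    order = np.argsort(dist_matrix[i], kind="stable")
    return order[:k]

D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
print(reference_group(D, 0, 2))  # -> [0 1]
```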

Next, in the second phase, an anomaly score is computed for each individual data point, based on the values in the behavioral space of the data point and its reference group. The algorithm is dubbed QCAD—for Quantile-based Contextual Anomaly Detection—and forms the core of our approach; it is introduced in Sect. 5.1. Finally, Sect. 5.2 describes how the found anomalies are explained by decomposing the anomaly score in the third phase.

Algorithm 1: QCAD

5.1 Detecting anomalies with quantile regression forests

Algorithm 1 outlines the QCAD algorithm, which takes a dataset and a number of hyperparameters as input and outputs a list of all data points together with their reference groups and computed anomaly scores. Specifically, we assume that the contextual features can be mixed but all behavioral features are numerical. We will first describe the overall algorithm, and then go into the specifics of the score computation.

5.1.1 Algorithm

After initializing the empty score list (Line 2), the algorithm iterates over all data points in dataset \({\textbf{X}}\) (Ln 3–11). For each object \({\textbf{x}} =(c^{1},\ldots ,c^{p},\ldots ,c^{P},b^{1},\ldots ,b^{q},\ldots b^{Q})=({\textbf{c}},{\textbf{b}})\) we first obtain the reference group that was computed in the first phase (Ln 4). We then iterate over all features in behavioral feature set \({\textbf{B}}\) in order to compute a partial anomaly score for each behavioral feature (Ln 5–8). These partial anomaly scores are summed to obtain the anomaly score for \({\textbf{x}}\) (Ln 9), i.e., we assume that all behavioral features have equal potential to contribute to the overall anomaly score. After this, the data point, its reference group, and its anomaly score are appended to the anomaly score list (Ln 10). Finally, the anomaly score list is returned as output (Ln 12).

Within the inner for loop, we first learn a quantile regression forest using behavioral feature \({\textbf{b}}^{q}\) as response variable and the data point’s reference group \(R({\textbf{x}}, k)\) as training data (Ln 6). All contextual features \({\textbf{C}}\) are used as predictor variables for every constructed QRF. We then use the learned quantile regression forest, the features, and the data point to compute the raw partial anomaly score (Ln 7), which we will motivate and explain in detail next.
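The structure of Algorithm 1 can be sketched as follows. This is an illustrative simplification: the QRF-based partial score detailed in Sect. 5.1.2 is replaced here by a stand-in that computes empirical percentile-interval widths directly on the reference group's behavioral values.

```python
import numpy as np

def partial_score(b_value, ref_b_values):
    """Stand-in for the QRF-based partial score of Sect. 5.1.2: the width
    of the empirical percentile interval (of the reference group's
    behavioral values) that b_value falls into."""
    taus = np.percentile(ref_b_values, np.arange(101))
    i = int(np.clip(np.searchsorted(taus, b_value) - 1, 0, 99))
    return taus[i + 1] - taus[i]

def qcad(C_dist, B, k):
    """Sketch of Algorithm 1: for each object, look up its reference group
    in contextual space and sum the partial anomaly scores over all
    behavioral features."""
    N, Q = B.shape
    scores = []
    for idx in range(N):
        ref = np.argsort(C_dist[idx], kind="stable")[:k]   # reference group
        s = sum(partial_score(B[idx, q], B[ref, q]) for q in range(Q))
        scores.append((idx, ref, s))
    return scores

# Toy demo: one contextual and one behavioral feature; object 0 is
# perturbed so that it deviates strongly within its context.
rng = np.random.default_rng(7)
C = rng.uniform(size=(20, 1))
B = C + 0.01 * rng.normal(size=(20, 1))
B[0, 0] += 5.0
D = np.abs(C - C.T)                  # contextual distance matrix
scores = qcad(D, B, k=8)
top = max(scores, key=lambda t: t[2])[0]
print(top)  # -> 0
```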

5.1.2 QRF-based anomaly score

As argued in the Introduction and Sect. 4.1, taking the distance between a data point and a conditional mean as basis for a contextual anomaly score may be too limiting: this only works if all ‘normal’ data points reside close to the mean. Instead, we aim to—conceptually—consider the entire conditional probability density function and use the local density of a given data point as a proxy for anomalousness: the lower the density, the higher the anomaly score. Directly estimating the density function accurately is hard though, especially in areas of low density, which are of particular importance to us.

Fig. 3

a Estimated conditional quantiles \(\{Q_{0.1}, Q_{0.2}, \ldots ,Q_{0.9}\}\) for a behavioral feature conditioned on contextual features. b Zooming in on the area between \(Q_{0.40}\) and \(Q_{0.50}\), we see that percentiles are more likely to be sufficiently detailed than the quantiles in (a)

This is where the quantile regression forests come in: given sufficient training data, they accurately learn the conditional cumulative distribution function, which can be queried in inverse form, i.e., through conditional quantiles. A quantile regression forest can be queried at different granularities. For example, Fig. 3a shows that conditional quantiles \(\{Q_{0.1}, Q_{0.2}, \ldots ,Q_{0.9}\}\) may very well be insufficient to accurately describe a conditional distribution, as they overly smooth the underlying distribution and thus fail to capture its nuances, while Fig. 3b shows that conditional percentiles, i.e., \(\{Q_{0.01}, Q_{0.02}, \ldots ,Q_{0.99}\}\), are much more likely to provide sufficient detail.

We use conditional percentiles for our anomaly score, because they provide a high level of granularity while not requiring very large amounts of data to be estimated accurately. As additional benefit, the difference between each two consecutive percentiles is always assessed by a weighted combination of a comparable number of training data points, i.e., at the cost of some smoothing we do not suffer from extremely poor local density estimates in low density areas.

To formally develop our anomaly score, we first define \(\tau _i\) to be the ith percentile, i.e., \(\tau _i = Q_{i/100}\) for \(i \in \{0,1,\ldots ,100\}\). For any percentile interval, i.e., an interval \([\tau _i, \tau _{i+1}]\) defined by two consecutive percentiles i and \(i+1\), we have by definition that it spans exactly 0.01 probability (see Fig. 3b). We could estimate the local density of a percentile interval and use that for our anomaly score, but we aim for a score that becomes larger when a data point is deemed more anomalous.

To this end we define percentile interval width \(w_i\) as the difference between two consecutive percentiles i and \(i+1\), i.e., \(\tau _{i+1} - \tau _i\). As such, interval width can be regarded as ‘inverse density’, meaning that width will increase as the local density decreases. For example, in Fig. 3b we have \(\tau _{46} = 0.5\) and \(\tau _{47} = 0.513\), which gives \(w_{46} = 0.013\). Data points that fall in percentile intervals having relatively large widths are more likely to be contextual anomalies, as they reside in low-density areas of the conditional distribution.

The basic idea is thus to define the anomaly score for an object \({\textbf{x}}\) and behavioral feature \({\textbf{b}}^{q}\) as \(w({\textbf{x}}|{\textbf{b}}^{q})\), i.e., the width of the (QRF-predicted) percentile interval in which the behavioral value of the data point falls. We need to consider a special case though: the actual behavioral feature value \(b^{q}\) may be less than the smallest estimated conditional quantile, i.e., \(\tau _{0}^{q}\), or greater than the largest estimated conditional quantile, i.e., \(\tau _{100}^{q}\). That is, the actual value may not fall in any estimated conditional percentile interval. To address this we extrapolate beyond \(\tau _{0}^{q}\) and \(\tau _{100}^{q}\), leading to intermediate anomaly score

$$\begin{aligned} is({\textbf{x}}\vert {\textbf{b}}^{q})=\left\{ \begin{aligned}\left( 1+\frac{\tau _{0}^{q}-b^{q}}{\tau _{75}^{q}-\tau _{25}^{q}}\right) {\max}(w({\textbf{x}}|{\textbf{b}}^{q}))&, \text { if } b^{q} < \tau _{0}^{q}; \\ \left( 1+\frac{b^{q}-\tau _{100}^{q}}{\tau _{75}^{q}-\tau _{25}^{q}}\right) {\max}(w({\textbf{x}}|{\textbf{b}}^{q}))&, \text { if } b^{q} > \tau _{100}^{q}, \\ w({\textbf{x}}|{\textbf{b}}^{q})&, \text { otherwise},\\ \end{aligned} \right. \end{aligned}$$
(4)

where \({\max}(w({\textbf{x}}|{\textbf{b}}^{q}))\) represents the maximum interval width of all conditional percentile intervals for the behavioral feature \({{\textbf {b}}}^{q}\).

Unfortunately, directly using Eq. (4) as partial anomaly score would make our approach prone to the so-called dictator effect: a data point that strongly deviates in only a few behavioral features would obtain a larger sum of partial scores than an anomalous data point deviating moderately in many behavioral features. As a result, data points with few, strong deviations would ‘dictate’ the highest scores; this is undesirable.

To avoid the dictator effect, we truncate the partial anomaly scores to a predefined maximum and arrive at the final partial anomaly score as

$$\begin{aligned} s({\textbf{x}}|{\textbf{b}}^{q})=\left\{ \begin{aligned}&\frac{\eta }{100},&\text { if } is({\textbf{x}}|{\textbf{b}}^{q}) > \frac{\eta }{100}; \\&is({\textbf{x}}|{\textbf{b}}^{q}),&\text { otherwise}, \\ \end{aligned} \right. \end{aligned}$$
(5)

where \(\eta\) is a hyperparameter. The rationale for \(\eta\) is as follows: if the conditional distribution is uniform and the behavioral feature has range [0, 1], then the expected width of any percentile interval is 0.01, and \(\eta\) can thus be interpreted as the maximum number of expected percentile interval widths. In the experiments we use \(\eta = 10\) because of its strong empirical performance (also shown in the ablation study in Appendix D).
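A sketch of Eqs. (4) and (5), assuming the 101 estimated conditional percentiles \(\tau _0,\ldots ,\tau _{100}\) for one behavioral feature are given as an array:

```python
import numpy as np

def intermediate_score(b, taus):
    """Eq. (4): taus holds the estimated conditional percentiles
    tau_0, ..., tau_100 for one behavioral feature."""
    widths = np.diff(taus)                    # w_0, ..., w_99
    iqr = taus[75] - taus[25]
    if b < taus[0]:                           # extrapolate below tau_0
        return (1 + (taus[0] - b) / iqr) * widths.max()
    if b > taus[100]:                         # extrapolate above tau_100
        return (1 + (b - taus[100]) / iqr) * widths.max()
    i = int(np.clip(np.searchsorted(taus, b) - 1, 0, 99))
    return widths[i]                          # w(x | b^q)

def partial_score(b, taus, eta=10):
    """Eq. (5): truncate at eta/100 to avoid the dictator effect."""
    return min(intermediate_score(b, taus), eta / 100)

taus = np.linspace(0.0, 1.0, 101)  # a uniform conditional distribution
print(round(partial_score(0.50, taus), 4))  # -> 0.01
print(partial_score(100.0, taus))           # far outside the range -> 0.1
```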

By scaling the behavioral features to [0, 1] before anomaly detection, e.g., using min-max normalization, the estimated conditional percentiles should also lie in this interval and the interval widths obtained for the individual behavioral features should thus be comparable. In turn, this implies they can be summed to obtain the final anomaly score for a data point (Algorithm 1, Line 9).

5.2 Explaining anomalies with anomaly beanplots

In the third phase, we provide explanations for the reported anomalies. From the definition of our anomaly score it is clear that it is attributable: the overall score can be decomposed into partial scores for individual behavioral features.

For each identified anomaly \({\textbf{a}}\), we report its contextual neighbours \(R({\textbf{a}})\) and the final anomaly score as computed on Line 9 of Algorithm 1. Further, we report the partial anomaly scores corresponding to the behavioral features, i.e., we report \(s({\textbf{x}}|{\textbf{b}}^{q})\) for all q, ranked from highest to lowest to indicate in which behavioral features the anomaly deviates most.

Fig. 4

Anomaly beanplot giving insight in what the quantile regression forest learned for a particular data point and behavioral feature, and why the data point is (not) considered an anomaly. The short blue lines indicate the conditional percentiles \(\tau _0,\tau _1,\ldots ,\tau _{100}\) as learned by the QRF (top to bottom), for the given data point and behavioral feature. As in a box plot, the (cyan) box indicates first quartile (\(\tau _{25}\)), median (\(\tau _{50}\)), and third quartile (\(\tau _{75}\)). The wider red area represents probability densities as estimated based on the conditional percentiles. Finally, the black line indicates the actual value that the data point has (Color figure online)

Since visualisation often helps to quickly provide valuable insight, we propose the anomaly beanplot, a variant of the beanplot by Kampstra (2008). Figure 4 shows an example, visually depicting the learned conditional percentiles, their interval widths, and the probability densities that can be estimated from those, all for a particular data point and behavioral feature. (Remember that a quantile regression forest is learned on the reference group of a data point, hence the anomaly beanplot for each data point may be different.)

By including the actual value of the data point in the beanplot (as a horizontal black line), the analyst can easily see how its behavioral feature value is positioned relative to those of its reference group, and why and to what extent it contributes to its anomalousness.

6 Experiments

To demonstrate the effectiveness of our overall approach and proposed method QCAD, we conduct experiments on a wide range of synthetic and real-world datasets. We will first explain the choices regarding datasets, baseline algorithms, and evaluation criteria, after which we present both quantitative results and a case study that investigates interpretability and practical utility.

In addition, we demonstrate the robustness of QCAD with regard to the ‘number of nearest neighbours’ hyperparameter by means of a sensitivity analysis, and conduct several ablation studies to investigate the impact of the hyperparameters. For reasons of space and brevity, the sensitivity analysis and ablation studies are given in Appendices C and D.

6.1 Data

It is challenging to evaluate unsupervised anomaly detection algorithms due to the lack of commonly agreed-upon benchmark data, and down-sampling classification datasets has been criticized because the nature of the resulting outliers varies strongly between datasets (Färber et al. 2010; Campos et al. 2016).

When evaluating unsupervised contextual anomaly detection algorithms, this problem is further compounded by the requirement to have both contextual and behavioral features, and—more importantly—treating those differently (Liang and Parthasarathy 2016). Consequently, a generally accepted approach is to artificially inject contextual anomalies into existing datasets using a perturbation scheme.

6.1.1 Data preprocessing

To make the datasets suitable for all anomaly detection methods, we need to preprocess them before injecting contextual anomalies. First, we leverage Label Encoding (Seger 2018) to transform categorical contextual features to numerical form. Second, we employ Min-Max normalisation to scale all behavioral features to [0, 1]. Min-Max normalization is routinely used in many anomaly detection methods and generally improves performance (Kandanaarachchi et al. 2020).
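These two preprocessing steps can be sketched with plain NumPy; the column names are illustrative only:

```python
import numpy as np

# Hypothetical toy columns: one categorical contextual feature and one
# numerical behavioral feature.
season = np.array(["winter", "summer", "summer", "spring"])
wind = np.array([12.0, 30.0, 27.0, 18.0])

# Label encoding: map each category to an integer code (categories are
# sorted, so spring -> 0, summer -> 1, winter -> 2).
categories, season_codes = np.unique(season, return_inverse=True)

# Min-max normalisation of a behavioral feature to [0, 1].
wind01 = (wind - wind.min()) / (wind.max() - wind.min())

print(season_codes)                # -> [2 1 1 0]
print(wind01.min(), wind01.max())  # -> 0.0 1.0
```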

Gower’s distance To be able to calculate the similarity between two data points containing both categorical and numerical features, we can utilise Gower’s distance (Gower 1971). Specifically, the Gower’s distance between data points \({\textbf{c}}_{i} = (c_{i}^{1},\ldots ,c_{i}^{p},\ldots c_{i}^{P})\) and \({\textbf{c}}_{j} = (c_{j}^{1},\ldots ,c_{j}^{p},\ldots c_{j}^{P})\) is defined as \(1- \frac{1}{P}\sum _{p=1}^{P}ps_{ij}^{p}\), where \(ps_{ij}^{p}\) represents the partial similarity between data instances \({\textbf{c}}_{i}\) and \({\textbf{c}}_{j}\) in the p-th dimension. For a numerical feature, \(ps_{ij}^{p} = 1- \frac{|{c_{i}^{p}-c_{j}^{p}}|}{{\max}({\textbf{c}}^p)-{\min}({\textbf{c}}^p)}\), with \({\max}({\textbf{c}}^p)\) and \({\min}({\textbf{c}}^p)\) denoting the maximum and minimum value of all data points for the p-th feature, respectively. For a categorical feature, \(ps_{ij}^{p} = {\mathbb {I}}(c_{i}^{p} = c_{j}^{p})\), where \({\mathbb {I}}(\cdot )\) represents the indicator function. Consequently, Gower’s distance between two data points is always in [0, 1], with a lower value indicating a higher similarity.
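A direct transcription of this definition (an illustrative sketch, not an official implementation):

```python
import numpy as np

def gower_distance(ci, cj, is_categorical, col_min, col_max):
    """Gower's distance between two mixed-feature data points.
    is_categorical[p] marks categorical dimensions; col_min/col_max hold
    the per-feature minima and maxima over the entire dataset."""
    P = len(ci)
    sim = 0.0
    for p in range(P):
        if is_categorical[p]:
            sim += 1.0 if ci[p] == cj[p] else 0.0  # indicator I(c_i^p = c_j^p)
        else:
            sim += 1.0 - abs(ci[p] - cj[p]) / (col_max[p] - col_min[p])
    return 1.0 - sim / P

# One numerical feature (dataset range [0, 1]) and one categorical feature.
d = gower_distance([0.2, "red"], [0.8, "red"],
                   [False, True], [0.0, None], [1.0, None])
print(round(d, 3))  # partial similarities 0.4 and 1.0 -> distance 0.3
```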

6.1.2 Perturbation scheme for outlier injection

Song et al. (2007) proposed a perturbation scheme to inject contextual anomalies in datasets without ground-truth anomalies, which has become a de-facto standard for the evaluation of contextual anomaly detection (Song et al. 2007; Liang and Parthasarathy 2016; Zheng et al. 2017; Calikus et al. 2021).

This perturbation scheme, however, has also been criticised for two reasons (Song et al. 2007; Kuo et al. 2018). First, the objects obtained by simply swapping feature values are still likely to be contextually normal. Second, some very common types of anomalies cannot be generated by this perturbation scheme. For example, since most anomaly detection methods assume the training dataset to be uncontaminated, one cannot obtain extreme values by swapping feature values in such a clean dataset. To avoid these problems, Kuo et al. (2018) introduced another perturbation scheme to inject anomalies. We develop a new perturbation scheme by refining this scheme, as follows.

Given a dataset \({\textbf{X}}\) containing N data instances with contextual feature set \({\textbf{C}} = \{{\textbf{c}}^{1},\ldots ,{\textbf{c}}^{P}\}\) and behavioral feature set \({\textbf{B}} = \{{\textbf{b}}^{1},\ldots ,{\textbf{b}}^{Q}\}\), we first use Min-Max normalization to scale the behavioral features of all objects to [0, 1] (keeping the contextual features intact), resulting in a new dataset \(\tilde{{\textbf{X}}}\). Here, for each \({\textbf{x}} =({\textbf{c}},{\textbf{b}})\) in \(\tilde{{\textbf{X}}}\), \({\textbf{c}}\) represents the contextual feature values and \({\textbf{b}}\) denotes the behavioral feature values. Second, to inject m anomalies into \(\tilde{{\textbf{X}}}\), with \(0<m \ll N\), we select m objects \({\textbf{x}}_{1},\ldots ,{\textbf{x}}_{m}\) uniformly at random from \(\tilde{{\textbf{X}}}\). Third, for a selected object \({\textbf{x}}_{i} = ({\textbf{c}}_{i}, {\textbf{b}}_{i})= (c^{1}_{i},\ldots ,c^{p}_{i},\ldots ,c^{P}_{i},b^{1}_{i},\ldots ,b^{q}_{i},\ldots ,b^{Q}_{i})\) and behavioral feature \({\textbf{b}}^{q}\), we sample a number uniformly at random from \([-0.5,-0.1] \bigcup [0.1,0.5]\), and then add this random number to the behavioral feature value of \({\textbf{x}}_{i}\), namely \(b^{q}_{i}\), resulting in \({\hat{b}}^{q}_{i}\). In the same manner, we repeat this process for each behavioral feature, resulting in \(({\hat{b}}^{1}_{i},\ldots ,{\hat{b}}^{Q}_{i})\), or \(\hat{{\textbf{b}}}_{i}\). Accordingly, we generate a new object \(\tilde{{\textbf{x}}} = ({\textbf{c}}, \hat{{\textbf{b}}})\) as contextual anomaly. Fourth, we repeat the third step for each selected object, leading to m perturbed objects. Fifth and finally, we replace the selected objects in the original dataset with their corresponding perturbed objects. To allow extreme values to be injected, we deliberately do not truncate the values outside [0, 1] after adding a random number in each behavioral feature.
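The scheme can be sketched as follows, assuming the behavioral features have already been min-max scaled:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(B01, m):
    """Inject m contextual anomalies into the (already min-max scaled)
    N x Q behavioral matrix B01: for each selected object and each
    behavioral feature, add a number drawn uniformly at random from
    [-0.5, -0.1] U [0.1, 0.5]. Values outside [0, 1] are deliberately
    not truncated, so extreme values can occur."""
    B = B01.copy()
    N, Q = B.shape
    chosen = rng.choice(N, size=m, replace=False)
    for i in chosen:
        for q in range(Q):
            delta = rng.uniform(0.1, 0.5) * rng.choice([-1.0, 1.0])
            B[i, q] += delta
    return B, chosen

B = rng.uniform(size=(100, 3))
Bp, idx = perturb(B, 5)
changed = np.abs(Bp - B).max(axis=1) > 0
print(changed.sum())  # -> 5
```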

Our perturbation scheme has the following advantages. When compared to the swapping perturbation scheme proposed by Song et al. (2007), the objects obtained by our perturbation scheme are very unlikely to remain contextually normal. In addition, our perturbation scheme can—but does not always—lead to extreme values. Note that we do not strictly follow the perturbation scheme proposed by Kuo et al. (2018) because their method only adds a non-negative number to the behavioral features. On the one hand, this non-negative number is sometimes zero, leading to injecting a normal object as ‘anomaly’. On the other hand, it is sometimes huge, which makes the injected object (too) easy to detect. Our perturbation scheme avoids these problems by first normalising the behavioral feature values, and then setting more reasonable lower and upper bounds for the perturbation.

6.1.3 Datasets

To evaluate and compare our method on a diverse range of datasets, we generate 10 synthetic datasets and select 20 real-world datasets; their properties are summarised in Tables 2 and 3 respectively.

Table 2 Summary of synthetic datasets

We first discuss the synthetic data. To be able to produce data with various forms and degrees of dependencies between behavioral and contextual features, we propose the following generation schemes. For \(q \in \{1,\ldots ,Q\}\), we have

  1. (S1)

    \({\textbf{b}}^{q} = \sum _{p=1}^{P}\left( \alpha _{qp}\cdot {\textbf{c}}^{p}\right) +\varvec{\epsilon }\);

  2. (S2)

    \({\textbf{b}}^{q} = \sum _{p=1}^{P}\left( \alpha _{qp}\cdot ({\textbf{c}}^{p})^{3}\right) +\varvec{\epsilon }\);

  3. (S3)

    \({\textbf{b}}^{q} = \sum _{p=1}^{P}\left( \alpha _{qp}\cdot \textrm{sin}({\textbf{c}}^{p})\right) +\varvec{\epsilon }\);

  4. (S4)

    \({\textbf{b}}^{q} = \sum _{p=1}^{P}\left( \alpha _{qp}\cdot {\log}(1+\vert {\textbf{c}}^{p}\vert )\right) +\varvec{\epsilon }\);

  5. (S5)

    \({\textbf{b}}^{q} = \sum _{p=1}^{P}\left( \alpha _{qp}\cdot {\textbf{c}}^{p}+\beta _{qp}\cdot ({\textbf{c}}^{p})^{3}+\gamma _{qp}\cdot \textrm{sin}({\textbf{c}}^{p})+\delta _{qp}\cdot {\log}(1+\vert {\textbf{c}}^{p}\vert )\right) +\varvec{\epsilon }\),

where \(\alpha _{qp}, \beta _{qp},\gamma _{qp}, \delta _{qp} {\mathop {\sim }\limits ^{i.i.d.}} {\mathcal {U}}(0,1)\) and are replaced by zero with a probability of 1/3. Further, \(\varvec{\epsilon } = (\epsilon _{1},\ldots ,\epsilon _{n},\ldots ,\epsilon _{N})^{T}\), with \(\epsilon _{n} {\mathop {\sim }\limits ^{i.i.d.}} {\mathcal {U}}(0,0.05)\).

In addition, for \(p \in \{1,\ldots ,P\}\), \({\textbf{c}}^{p}\) is generated from a Gaussian mixture model with five clusters. If the feature is numerical, each of the Gaussian centroids is sampled uniformly at random from [0, 1] and the diagonal element of the covariance matrix is fixed at 1/4 of the average pairwise distance between the centroids. If it is categorical, the centroids are sampled uniformly at random from \(\{0,1,\ldots ,10\}\) with the same covariance setting, and every generated number is rounded to an integer to ensure that it is categorical. On this basis, we generate a wide collection of synthetic datasets by varying the generation scheme and the numbers of contextual and behavioral features. Sample size is always set to 2000, and the rate of injected anomalies is fixed at \(2.5\%\). See Table 2 for an overview.
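The generation of numerical contextual features and scheme (S1) can be sketched as follows; as a simplification, the mixture spread is fixed rather than derived from the pairwise centroid distances as in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_contextual(N, P):
    """Each contextual feature is drawn from a Gaussian mixture with five
    clusters whose centroids are sampled uniformly from [0, 1]. The spread
    is simplified here to a fixed standard deviation."""
    C = np.empty((N, P))
    for p in range(P):
        centroids = rng.uniform(size=5)
        component = rng.integers(0, 5, size=N)
        C[:, p] = rng.normal(centroids[component], 0.1)
    return C

def gen_behavioral_S1(C, Q):
    """Scheme (S1): b^q = sum_p alpha_qp * c^p + eps, with
    alpha_qp ~ U(0, 1) zeroed with probability 1/3 and eps ~ U(0, 0.05)."""
    N, P = C.shape
    alpha = rng.uniform(size=(Q, P))
    alpha[rng.uniform(size=(Q, P)) < 1 / 3] = 0.0
    eps = rng.uniform(0.0, 0.05, size=(N, Q))
    return C @ alpha.T + eps

C = gen_contextual(500, 3)
B = gen_behavioral_S1(C, 2)
print(C.shape, B.shape)  # -> (500, 3) (500, 2)
```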

Table 3 Summary of real-world datasets

Next, we employ the above-mentioned perturbation scheme to inject contextual anomalies into 20 real-world datasets. We use these datasets because they are representative of the potential application areas of our QCAD framework. That is, they stem from healthcare & life sciences (e.g., Bodyfat, Heart Failure, Indian Liver Patient, Hepatitis, Parkinson Telemonitoring, Abalone, Fish Weight, QSAR Fish Toxicity), social sciences (e.g., Boston House Price, University Ranking), environmental-protection area (e.g., El Nino, Forest Fires), and engineering (e.g., Gas Turbine CO and NOx Emission, Yacht Hydrodynamics, Condition Based Maintenance of Naval Propulsion Plants, Synchronous Machine, Airfoil Self-Noise, Concrete Compressive Strength, Combined Cycle Power Plant). A summary is given in Table 3; each dataset is described in more detail in Appendix B.

6.2 Baseline algorithms and implementations

We empirically compare our method to state-of-the-art algorithms, including traditional anomaly detection methods (distance-based, density-based, dependency-based, etc.) and contextual anomaly detection methods. For a fair comparison, we select the following anomaly detectors, which all return an anomaly score—rather than a binary outcome—for each data instance.

  • Local Prediction Approach to Anomaly Detection (LoPAD) by Lu et al. (2020b), which is the state-of-the-art dependency-based traditional anomaly detector;

  • Conditional Anomaly Detection (CAD) by Song et al. (2007), which was the first anomaly detector dedicated to identifying contextual anomalies;

  • Robust Contextual Outlier Detection (ROCOD) by Liang and Parthasarathy (2016), which is the state-of-the-art contextual anomaly detector;

  • Isolation Forest (IForest) by Liu et al. (2008), which is one of the state-of-the-art isolation-based traditional anomaly detectors;

  • Local Outlier Factor (LOF) by Breunig et al. (2000), which is one of the state-of-the-art density-based traditional anomaly detectors;

  • k-NN anomaly detector by Angiulli and Pizzuti (2002), which is one of the state-of-the-art distance-based traditional anomaly detectors;

  • Anomaly detector using axis-parallel subspaces (SOD) by Kriegel et al. (2009), which is one of the state-of-the-art subspace-based traditional anomaly detectors; and

  • Histogram-Based Outlier Score (HBOS) by Goldstein and Dengel (2012), which is one of the state-of-the-art histogram-based traditional anomaly detectors.

We implemented and ran all algorithms in Python 3.8 on a computer with an Apple M1 8-core CPU and 8 GB unified memory. For classical algorithms such as IForest, LOF, k-NN, SOD, and HBOS, we use their publicly available implementations in PyOD (Zhao et al. 2019) with their default settings.

Unfortunately, for LoPAD, CAD and ROCOD no implementation was publicly available. For LoPAD, we first use the ‘bnlearn’ package (Scutari et al. 2019) in R to find the Markov blankets of each dataset based on Fast-IAMB (Yaramakala and Margaritis 2005); we then implement the LoPAD algorithm in Python using CART (Breiman et al. 2017) as the prediction model with bagging size 200. For CAD, we implement the CAD-GMM-Full algorithm as recommended by Song et al. (2007), with default settings except for the ‘number of Gaussian components’ parameter: with its default value of 30, it would take more than one month to finish all the experiments on our computer, while preliminary experiments showed that the difference in results between setting it to 30 and to 5 is negligible for most datasets. We therefore set this parameter to 5 in all experiments. We implemented ROCOD in Python with its default settings. One important parameter, namely the distance threshold used to find neighbours, is not discussed in Liang and Parthasarathy (2016). For different datasets and distance metrics it is hard to obtain a single best value for this parameter, and preliminary experiments revealed that ROCOD is sensitive to it. Nevertheless, we also observed that ROCOD often achieves relatively good results when this threshold is set to 0.9; hence, we set it to 0.9 by default.

6.2.1 QCAD parameters setting

As summarised in Table 4, the QCAD algorithm also requires several parameters to be set. First, in our contextual anomaly framework, we need to set the number of nearest neighbours (k) used to generate reference groups. Second, when constructing quantile regression forests, we need to specify the following parameters: the number of trees, the maximum number of features used to construct a tree, and the minimum sample size required to split a node of a tree. Third, we need to specify the number of conditional quantiles (\(n_q\)) to estimate for an object in each behavioral feature. Last, we also need to set the number of features (h) used to generate explanations. We set these parameters as follows.

Table 4 Summary of parameters involved in QCAD
  • The number of nearest neighbours (k): the sensitivity analysis (see appendix) indicates that our approach is robust with respect to this parameter as long as its value is not overly small. By default, we set this parameter to \({\min}(N/2,500)\), where N is the sample size.

  • The number of conditional quantiles to estimate (\(n_q\)): theoretically, an increase in this number will result in better performance in terms of accuracy, at the expense of a larger running time. However, preliminary experiments show that increasing this number beyond 100 will only produce slightly better results. Therefore, we set it to 100 by default.

  • The number of trees used to construct a quantile regression forest (\(n_t\)): in theory, a larger number of trees produces better performance in terms of accuracy, but at the cost of a larger running time. We empirically found that 100 trees usually give good results and that further increasing the number of trees leads to negligible improvement. Due to time constraints, we set this number to 10 in all experiments in this paper.

  • The maximal number of features used to construct a tree in a quantile regression forest (\(n_f\)): Meinshausen (2006) demonstrated the stability of quantile regression forests with respect to this parameter, and set it to 1/3 of the number of variables in their experiments. However, in our experiments, the number of contextual features is sometimes less than 3. To render quantile regression forests applicable to various datasets, we set this parameter to the total number of variables by default.

  • The minimal sample size for the node of a tree to be split (\(n_s\)): as indicated in Meinshausen (2006), different values of this parameter do not seem to have much effect on the results, and our preliminary experiments are also in line with this statement. Therefore, we set this number to 10 by default, as also used in Meinshausen (2006).

  • The number of features used to generate explanations (h): This parameter is set to \({\min}(\vert {\textbf{B}} \vert ,3)\) by default. However, the end-users can set this parameter according to their preferences as long as its value is between 0 and \(\vert {\textbf{B}} \vert\), where \(\vert {\textbf{B}} \vert\) represents the number of behavioral features.
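To make the role of these parameters concrete, the following minimal sketch estimates conditional quantiles with a random forest in the spirit of Meinshausen's Quantile Regression Forests. It is not the QCAD implementation; the toy data are illustrative, and the variable names \(n_t\), \(n_s\), \(n_q\) follow the list above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))             # contextual features
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=500)   # one behavioral feature

n_t, n_s, n_q = 100, 10, 100                      # trees, min split size, quantiles
rf = RandomForestRegressor(n_estimators=n_t, min_samples_split=n_s,
                           max_features=None,     # use all features, as in our default
                           random_state=0).fit(X, y)

def conditional_quantiles(x, quantiles):
    """Pool the training targets of the leaves that x falls into."""
    leaves = rf.apply(x.reshape(1, -1))[0]        # leaf id of x in each tree
    train_leaves = rf.apply(X)                    # (n_samples, n_trees)
    pooled = np.concatenate(
        [y[train_leaves[:, t] == leaves[t]] for t in range(n_t)])
    return np.quantile(pooled, quantiles)

qs = np.linspace(0.01, 0.99, n_q)
est = conditional_quantiles(np.array([0.0, 0.0]), qs)
```

An object whose observed behavioral value falls in the extreme tails of such estimated conditional quantiles would be scored as contextually anomalous.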

For reproducibility, we make all code and datasets publicly available.Footnote 3

6.3 Evaluation criteria

PRC AUC (Liang and Parthasarathy 2016; Kuo et al. 2018), ROC AUC (Micenková et al. 2014, 2015; Pasillas-Díaz and Ratté 2016), and Precision@n (Aggarwal and Sathe 2017) are widely used for the evaluation and comparison of anomaly detection methods that generate a full list of anomaly scores for all observations. They are defined as follows:

  • Receiver Operating Characteristic (ROC), which is obtained by plotting the true positive rate (y-axis) versus the false positive rate (x-axis) at various threshold settings. The area under this curve, namely ROC AUC, is a threshold-agnostic performance measure widely used in anomaly detection;

  • Precision-Recall Curve (PRC), which is created by plotting precision (y-axis) against recall (x-axis) at various threshold settings. The area under this curve, namely PRC AUC, is another widely-used, threshold-agnostic performance measure, and is also called Average Precision;

  • Precision at n, or P@n, is defined as the precision of the observations ranked among the top-n, where \(n\in \{1,2,\ldots ,N\}\). In our experiments, we set n to the number of injected contextual anomalies.

For completeness: precision is defined as \(\frac{\#(\{\text{Real anomalies}\} \cap \{\text{Reported anomalies}\})}{\#\{\text{Reported anomalies}\}}\), while recall is defined as \(\frac{\#(\{\text{Real anomalies}\} \cap \{\text{Reported anomalies}\})}{\#\{\text{Real anomalies}\}}\). We perform ten independent trials of injecting contextual anomalies into each dataset and report the means and standard deviations of each of the three evaluation criteria.
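As a concrete illustration, the three criteria can be computed as follows; this is a sketch with scikit-learn on made-up labels and scores, not the actual experimental code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Ground truth (1 = injected contextual anomaly) and anomaly scores
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([.1, .2, .15, .3, .25, .4, .35, .9, .8, .2])

roc_auc = roc_auc_score(y_true, scores)
prc_auc = average_precision_score(y_true, scores)  # PRC AUC / Average Precision

def precision_at_n(y_true, scores, n):
    """Precision among the n highest-scoring observations."""
    top = np.argsort(scores)[::-1][:n]
    return y_true[top].mean()

n = int(y_true.sum())  # n = number of injected anomalies
p_at_n = precision_at_n(y_true, scores, n)  # 2 of the top 3 are true anomalies
```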

6.4 Anomaly detection performance

Results on the 10 synthetic datasets and the 20 real-world datasets are presented in Table 5 and Table 6, respectively.

Table 5 Performance in terms of PRC AUC, ROC AUC, and P@n, on synthetic data, with 10 independent runs of injecting contextual anomalies into each dataset

From Table 5, we observe that QCAD generally dominates the other methods in terms of anomaly detection accuracy according to the average ranks. More specifically, QCAD achieved the best results on 9 out of 10 synthetic datasets in terms of PRC AUC and ROC AUC, and on all datasets in terms of Precision@n. On Syn8, QCAD is on par with its best competitor (i.e., CAD) in terms of ROC AUC, whereas it is slightly worse than CAD and ROCOD in terms of PRC AUC. This demonstrates the effectiveness of QCAD in identifying contextual anomalies for different forms and degrees of dependencies between the behavioral and contextual features. More importantly, the poor performance of LoPAD indicates the importance of distinguishing behavioral features from contextual features. Additionally, the other traditional anomaly detectors, i.e., IForest, LOF, k-NN, SOD, and HBOS, perform poorly on most datasets because they treat all features equally.

Another important observation is that QCAD is generally superior to the other methods in terms of robustness. That is, QCAD attains high PRC AUC, ROC AUC, and Precision@n values with small standard deviations on Syn1, Syn2, Syn3, Syn4, and Syn5, indicating that it is robust to different forms and degrees of dependency relationships, including linear and non-linear ones. In contrast, its strongest contender, CAD, has high standard deviations on Syn1 and Syn6. One possible reason is that CAD gets trapped in a bad local minimum when using the expectation-maximization algorithm to learn its parameters. Despite being a contextual anomaly detector, ROCOD performs poorly on datasets Syn2 and Syn5 in terms of PRC AUC and Precision@n. From the results on Syn6, Syn7, Syn8, Syn9, and Syn10, it appears that a larger number of contextual features does not substantially affect the performance of QCAD; however, on these synthetic datasets QCAD does not perform substantially better than the other contextual anomaly detectors, and specifically CAD.

Table 6 Performance in terms of PRC AUC, ROC AUC, and P@n, on real-world data, with 5 or 10 independent runs of injecting contextual anomalies into each dataset\(^{1}\)

The results on real-world datasets presented in Table 6 show that QCAD performs best overall when compared to its contenders, as witnessed by its average ranking results. Concretely, QCAD outperforms the other methods on 13 out of 20 real-world datasets in terms of PRC AUC and ROC AUC, and on 11 out of 20 datasets in terms of P@n. On most of the remaining datasets, QCAD is on par with its strongest competitors. For instance, HBOS achieves the best performance on Forest Fires, Heart Failures, Hepatitis, and Indian Liver Patient; QCAD’s performance on these datasets is comparable. The number of contextual features in these datasets is often substantially less than the number of behavioral features, reducing the contextual anomaly detection problem to a traditional anomaly detection problem. QCAD is only slightly worse than ROCOD on Power Plant and Toxicity in terms of PRC AUC. Note that CAD produces surprisingly good results on the Bodyfat dataset, and surpasses other methods including QCAD. One possible explanation is that the contexts and behaviors are both composed of well-separated Gaussian components, and that the association between contextual and behavioral components is strong (Fig. 5).

Fig. 5
figure 5

Critical difference diagram showing statistical difference comparisons between QCAD and its contenders in terms of PRC AUC, ROC AUC, and Precision@n. To achieve this, we use Friedman tests (Friedman 1937) followed by Nemenyi post-hoc analysis (Nemenyi 1963) with a significance level of 0.05. The post-hoc Nemenyi test indicates that there are no significant differences among QCAD, CAD, and ROCOD in terms of PRC AUC; no significant differences among QCAD, ROCOD, HBOS, and IForest in terms of ROC AUC; and no significant differences among QCAD, ROCOD, HBOS, and CAD in terms of Precision@n. Nevertheless, one can see that QCAD consistently outperforms its contenders by a large margin on all three metrics
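The statistical procedure behind such a diagram can be sketched as follows. The scores are made up, and the Nemenyi critical value 2.569 assumes k = 4 methods at a significance level of 0.05, whereas the actual comparison involves all contenders.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
# rows = datasets, columns = methods (e.g. QCAD, CAD, ROCOD, HBOS)
scores = rng.uniform(0.4, 0.9, size=(20, 4))
scores[:, 0] += 0.15                    # make the first method clearly best

stat, p = friedmanchisquare(*scores.T)  # one score vector per method

# Average rank per method (rank 1 = best, i.e. highest score)
ranks = rankdata(-scores, axis=1).mean(axis=0)

N, k = scores.shape
q_alpha = 2.569                         # Nemenyi critical value, k=4, alpha=0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
# Two methods differ significantly if their average ranks differ by more than cd
```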

We also observe that QCAD performs well on datasets with varying sample size, dimensionality, and rate of injected anomalies. Specifically, QCAD achieved a high ROC AUC (\(\ge 0.85\)) on all datasets. This is much better than the ROC AUC values close to 0.50 obtained by ROCOD, CAD, HBOS, and IForest on some datasets, which imply performance no better than random guessing. In addition, QCAD attained high PRC AUC (\(\ge 0.8\)) and Precision@n (\(\ge 0.7\)) values on most datasets. The lowest PRC AUC and Precision@n values are 0.46 and 0.49, respectively, both obtained on the Toxicity dataset. Possible reasons for QCAD's moderate performance on Airfoil, BodyFat, Power Plant, and Toxicity are that the dependency relationship between behavioral features and contextual features is not strong, or that it is too complex for the Quantile Regression Forests to capture. ROCOD, CAD, HBOS, and IForest, however, obtain PRC AUC values less than 0.70 and Precision@n values less than 0.60 on most datasets. Moreover, HBOS and IForest sometimes attain PRC AUC and Precision@n values lower than 0.10, which is extremely poor. This implies that traditional anomaly detection methods are not suitable for identifying contextual anomalies, as they treat all features equally. Furthermore, the relatively poor performance of ROCOD and CAD compared to our method demonstrates the importance of modelling more properties of the conditional distribution than just the mean.

6.5 Runtime analysis

Fig. 6
figure 6

Runtime analysis by varying the sample size, the number of contextual features (\(\#{{\textbf {C}}}\)), and the number of behavioral features (\(\#{{\textbf {B}}}\)). The results are obtained with 5 independent trials. The data was synthesized using scheme S1. Note the varying y-axis scales

To investigate the scalability and efficiency of QCAD, we perform a runtime analysis by varying the sample size, the number of contextual features, and the number of behavioral features. As shown in Fig. 6, we can empirically observe that the runtime of QCAD scales linearly with respect to the sample size, the number of contextual features, and the number of behavioral features on small and medium datasets. In contrast, CAD is computationally prohibitively expensive even on small datasets. To further understand the complexity of QCAD, we perform a time complexity analysis in Appendix A. The theoretical analysis is in line with our empirical results and analysis.
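A minimal version of such a runtime experiment might look as follows; the detector and data-generating scheme are placeholders, not scheme S1 or QCAD itself.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_time(n):
    """Wall-clock time to fit a small forest on n synthetic samples."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, 5))
    y = rng.normal(size=n)
    t0 = time.perf_counter()
    RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
    return time.perf_counter() - t0

# Double the sample size and record the runtime, as in the scalability plots
times = {n: fit_time(n) for n in (500, 1000, 2000)}
```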

6.6 Using QCAD to find promising football players

In this section, we illustrate QCAD's capability to identify potentially meaningful contextual anomalies and to provide intuitive and understandable explanations for the reported anomalies when applied to a real-world problem.

Fig. 7
figure 7

Distribution of all players in the behavioral space \(Assists \times Goals\), and their corresponding anomaly scores given by QCAD. The red curves represent the estimated densities of the joint distribution at different levels, the histograms indicate the marginal distributions. A larger and darker green dot represents a larger anomaly score as given by QCAD (Color figure online)

As a use case, we consider the problem of finding exceptional players in the English Premier League. The datasetFootnote 4 describes the background information and performance statistics of 532 football players from 2020 to 2021. We use the variables Position1, Position2 (positions in which the player plays), Age (age of the player), Matches (number of matches played), Starts (number of matches in which the player was in the starting lineup), and Mins (total number of minutes played) as contextual features. Two performance statistics, Goals (number of goals scored by the player) and Assists (number of assists given by the player), are regarded as behavioral features.

Figure 7 depicts the distribution of all data objects in the behavioral space \(Goals\times Assists\). Without considering contextual information, traditional anomaly detectors such as HBOS, LOF, and IForest will treat objects in dense areas (i.e., objects residing inside the red curves) as normal. In contrast, our contextual anomaly detector QCAD can detect anomalies in dense areas by taking contextual information into account (i.e., green objects residing inside the red curves). Concretely, Fig. 7 also presents the anomaly scores given by QCAD (with default settings). For example, player Matheus Pereira, who achieved 11 goals and 6 assists, is reported as an ‘anomaly’ (i.e., as having a relatively high anomaly score) by QCAD but as ‘normal’ by traditional anomaly detectors. Conversely, QCAD considers Patrick Bamford, a player with 17 goals and 7 assists (in a sparse area), to be normal, while traditional anomaly detectors consider him to be an anomaly.

Fig. 8
figure 8

Explanations for the identified anomalous player Matheus Pereira based on his reference group. Shown are the values for Matheus Pereira (red bins) relative to the distribution of his contextual neighbours for each contextual feature

Fig. 9
figure 9

Explaining why Matheus Pereira is considered an anomaly: exceptionally many goals compared to contextually similar players, and relatively many assists. The beanplot shows the estimated conditional distribution of each behavioral feature, with a wider red area representing a higher probability of occurrence. The horizontal black lines represent the values reported for the player being investigated

In addition to giving an anomaly score for each data object, QCAD can also provide an explanation. For instance, for Matheus Pereira (Position1 = "MF", Position2 = "FW", Age = 24, Matches = 33, Starts = 30, Mins = 2577), Fig. 8 shows the distribution of the values of each contextual feature for all players in his contextual neighbourhood (i.e., similar players). It can be seen that most of these players are midfielders (MF) and/or forwards (FW), are aged from 22 to 28, played more than 25 matches, etc. The anomaly score is 65.8, which puts this player in the top 10 of most anomalous players. This score can be decomposed according to the behavioral features to identify the behavioral feature(s) that contribute most to the obtained anomaly score; Goals, in this case. Moreover, Fig. 9 shows how the beanplot-like visualisation can help to give insight into how the behavioral feature values deviate from those in the player's contextual neighbourhood. From these plots we can see that Matheus Pereira has been exceptionally good at scoring goals and slightly better than expected at giving assists.
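The explanation step described above can be sketched as follows. This is an illustrative analogue, not the QCAD code: it finds an object's contextual neighbours and uses a simple empirical tail probability (instead of QCAD's quantile-regression-forest estimates) to indicate which behavioral feature deviates most. All data and feature names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
context = rng.normal(size=(532, 4))          # e.g. Age, Matches, Starts, Mins
behavior = rng.poisson(3, size=(532, 2)).astype(float)  # e.g. Goals, Assists

k = 100                                      # size of the reference group
nn = NearestNeighbors(n_neighbors=k).fit(context)
_, idx = nn.kneighbors(context[:1])          # contextual neighbours of object 0
ref = behavior[idx[0]]                       # their behavioral values

# Per-feature one-sided empirical tail probability of the object's value
# within its reference group (smaller = more surprising)
obj = behavior[0]
tail = np.minimum((ref >= obj).mean(axis=0), (ref <= obj).mean(axis=0))

# The feature with the smallest tail probability contributes most to the
# anomaly, analogous to reporting Goals for the player discussed above
most_anomalous_feature = int(np.argmin(tail))
```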

These results can be interpreted as follows: Matheus Pereira performed surprisingly well in terms of goals and slightly better than expected in terms of assists when compared to other midfielders and/or forwards aged 22–28 who played more than 25 matches and more than 2000 minutes. Although we are not football experts, we expect the automated detection and interpretable score explanation of such ‘anomalies’ to be meaningful and possibly useful to football coaches and scouts.

7 Conclusions

In this paper, we for the first time explicitly establish a connection between dependency-based traditional anomaly detection methods and contextual anomaly detection methods. On this basis, we propose a novel approach to contextual anomaly detection and explanation. Specifically, we use Quantile Regression Forests to develop an accurate and interpretable anomaly detection method, QCAD, that explores dependencies between features. QCAD can handle tabular datasets with mixed contextual features and numerical behavioral features. Extensive experimental results on various synthetic and real-world datasets demonstrate that QCAD outperforms state-of-the-art anomaly detection methods in identifying contextual anomalies, in terms of both accuracy and interpretability.

From the case study on football player data, we conclude that QCAD can detect potentially meaningful and useful contextual anomalies that are directly interpretable by a domain expert. The beanplot-based visualisations help to explain why a certain object is (not) considered an anomaly within its context. This is important because anomaly detection is an unsupervised problem and the explanations can help analysts and domain experts to verify the results. It also opens up opportunities for human-guided anomaly detection, where feedback from the analyst can be used to guide the analysis.

Since QCAD can currently only handle static features, in the future we plan to extend it to streaming settings.