1 Introduction

Anomaly detection is a process that aims to identify objects that deviate from the expected behavior. Many current techniques for anomaly detection do not distinguish between different features in the data, treating them all equally when searching for anomalies. However, in various real-world applications, certain features provide direct information about the normality or abnormality of objects (i.e., behavioral attributes), while others provide clues about environmental factors affecting the system (i.e., contextual attributes). For example, a heart rate above 100 bpm is abnormal for a resting adult, while it is considered normal for an infant or an adult training at the gym. In such cases, features such as age and intensity of physical activity do not directly correspond to abnormality but provide information to determine whether the heart rate falls within the expected range under a certain context.

In contextual (or conditional) anomaly detection, an object is considered anomalous if it significantly deviates from objects sharing similar contextual information (i.e., its reference group). Contextual attributes are used to define the context and determine the reference groups, while behavioral attributes help measure whether the object deviates significantly from its reference group. This necessitates dividing the features into two disjoint subsets, i.e., contextual and behavioral attribute sets, referred to as context and behavior throughout this paper. Making this division effectively is crucial, since uninformative contexts can lead to unreliable reference groups, which can disguise contextual anomalies (see Fig. 1).

Fig. 1

This figure demonstrates the importance of correctly identifying the reference group in detecting contextual anomalies and how different contexts can impact this process. A red square represents the contextual anomaly, while other squares represent its reference group. The first context (useful context) correctly identifies the reference group, allowing the anomaly to be easily detected in the behavioral space. The second context leads to an incorrect estimation of its reference group, so the red square remains concealed within its reference group in the behavioral space. This makes it difficult or impossible to detect the contextual anomaly.

However, defining the right context and behavior is challenging in practice, even for domain experts, as it can arise from combining different attributes in numerous ways, particularly in high-dimensional datasets. Furthermore, complex real-world systems produce anomalies under various contexts, and it is not realistic to expect a single context and behavior to reveal all interesting anomalies.

In this paper, we introduce a novel unsupervised approach, ConQuest, that automatically discovers relevant contexts in data for multi-context anomaly detection. ConQuest adopts a multi-objective search strategy, where the set of criteria to assess the quality (i.e., relevance) of different contexts is derived from the desired properties of contextual anomaly detection without requiring any labeled ground-truth data.

Furthermore, we propose a new contextual anomaly detection algorithm, the multi-context anomaly factor (MCAF), which jointly considers neighborhood structures in contextual and behavioral spaces. MCAF models the “reference group” (i.e., a group of objects that have similar contextual attribute values) using nearest neighbors in the contextual space and refers to this structure when measuring deviations using behavioral attributes. The main novelty in this method is that not only does it consider whether an object significantly deviates from its reference group in the behavioral space, but also whether the relationship between this object and its reference group members differs significantly between the contextual and behavioral spaces.

In summary, the major contributions of this paper are:

ConQuest approach: We propose ConQuest, a contextual anomaly detection approach that automatically discovers useful contexts in high-dimensional feature spaces and detects diverse types of anomalies by leveraging discovered contexts.

Objective functions: We intuitively define a set of criteria to determine the relevance of a context based on general assumptions of contextual anomaly detection. Then, we define them formally as heuristics to formulate our multi-objective function that guides context discovery.

Multi-Context Anomaly Factor (MCAF): We propose a novel contextual anomaly detector that uses a distance metric combining contextual and behavioral distances. It allows us to differentiate objects whose neighborhood structures within the reference groups are significantly different between two attribute spaces (contextual and behavioral).

2 Related work

Although there is a large number of studies on anomaly detection [1], contextual anomaly detection, which distinguishes contextual attributes from behavioral ones, is relatively new and has not been explored as much.

Early contextual methods were mostly studied for time-series [2, 3] and spatial data [4, 5], where spatial or temporal attributes are considered as context. These techniques cannot be easily adapted to other areas where the context is determined by different types of characteristics.

Song et al. [6] introduced a significant contribution to the field of contextual anomaly detection, presenting a method, conditional anomaly detection (CAD), for detecting contextual anomalies in a generic manner. This method assumes that attributes can be divided into contextual (referred to as “environmental”) and behavioral (referred to as “indicator”) attributes, which are modeled using Gaussian Mixture Models (GMMs) to learn their distributions. The approach identifies dependencies between these attributes by utilizing “mapping functions” and detects anomalies by assessing whether an object violates these learned functions.

Angiulli et al. [7] also employed GMMs to model relationships between contextual and behavioral attributes. This method handles high-dimensional data by modeling each pair of contextual and behavioral attributes separately, referring to these models as correlation patterns. However, this approach is computationally expensive, requiring a separate model for each context-behavior pair, and still necessitates the pre-classification of attributes into contextual and behavioral categories.

Other works, such as ROCOD [8] and MELODY [9], have also been proposed. ROCOD combines local and global estimation of behavioral attributes, while MELODY employs metric learning to effectively measure similarities in the contextual space. Similarly, the work in [10] builds two variational autoencoders (VAEs) to model contextual and behavioral variables, where latent contextual variables serve as additional inputs to the other VAE’s decoder for reconstructing behavioral data from their latent representations. Samples with large reconstruction errors are assumed to be contextual anomalies. However, all these approaches assume a single, user-defined context and cannot accommodate multiple contexts.

In contrast, a recent method, ICL [11], utilizes a contrastive learning objective to divide attributes into two disjoint sets and detect anomalies, offering a novel approach distinct from previous methods.

ConOut [12] and WisCon [13] are the methods most similar to ConQuest, as both maintain an ensemble over multiple contexts. However, ConOut does not consider whether included contexts are actually useful, while WisCon requires labels for active learning.

Another line of related work is subspace outlier detection, which identifies anomalies in a high-dimensional feature space by exploring different subspaces. Early work by Aggarwal et al. [14] employed an evolutionary algorithm to find lower-dimensional subspaces for detecting axis-parallel outliers. Kriegel et al. [15] introduced distance-based outlier detection with subspace outlier degree.

Ensemble methods, like Feature Bagging (FB) [16], sample subspaces and combine scores from various feature subsets for improved detection robustness. FB [16], LODA [17], rotated bagging [18], subspace histograms [19], high-contrast subspaces [20], and relevant subspaces [21, 22] are examples. Unlike contextual anomaly detection, subspace methods do not differentiate between attributes as contextual or behavioral.

Fig. 2

Overview of the stages of ConQuest

3 Problem definition

Suppose we have a dataset \(Z \in \mathbb {R}^{n \times d}\), where n is the number of points, and d is the number of features. We denote a point from Z as \(z \in \mathbb {R}^d\) and the set of real-valued features in Z as \(F = \{f_1, f_2,..., f_d\}\).

Technically, one can extract \(2^{d-1}\) different context-behavior pairs \((C_a,B_a)\) from a dataset with d features, which makes the computations for high-dimensional datasets infeasible. In this work, we follow [11] and extract context-behavior pairs in a sliding-window fashion and only consider subsets of consecutive variables.

Definition 1

(Context and Behavior) Given a dataset \(Z \in \mathbb {R}^{n \times d}\) with a feature set F, we define a context C as a set of contextual attributes and a behavior B as the corresponding set of behavioral attributes such that \(C \subset F\), \(B = F \setminus C\), \(|C| > 0\), \(|B| > 0\) and, therefore, \(|C| + |B| = d\).

Based on the above definition, one can extract \(2^{d-1}\) different context-behavior pairs (C, B) from a dataset with d features, which makes the computations for high-dimensional datasets infeasible. In this work, we follow [11] and extract context-behavior pairs in a sliding-window fashion, only considering subsets of consecutive variables with a window size w. Given a dataset \(Z \in \mathbb {R}^{n \times d}\), we obtain \(\eta \) pairs \((C_a, B_a)\), such that \(C_a=\{f_a,f_{a+1},\ldots ,f_{a+w-1}\}\), \(B_a=\{f_1,f_2,\ldots ,f_{a-1},f_{a+w},\ldots ,f_d\}\) and \(\eta =d+1-w\). Finally, for each context \(C_a\) and its corresponding behavior \(B_a\), we represent the initial dataset Z as \(Z^a=(X^a,Y^a)\) such that each data point \(z_i^a = (x_i^a,y_i^a)\) is composed of a contextual attribute vector \(x_i^a = [z_i^a,z_i^{a+1},\ldots ,z_i^{a+w-1}]\) and a behavioral attribute vector \(y_i^a = [z_i^1,z_i^2,\ldots ,z_i^{a-1},z_i^{a+w},\ldots ]\), such that \(|C_a|=c\), \(|B_a|=b\), and \(c+b=d\).
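The sliding-window extraction can be illustrated with a short Python sketch. It is a minimal sketch rather than the authors' implementation: the function names `sliding_context_behavior_pairs` and `split_point` are hypothetical, and feature indices are 0-based here, whereas the text indexes features from 1.

```python
def sliding_context_behavior_pairs(d, w):
    """Return the eta = d + 1 - w (context, behavior) index pairs.

    Each context is a window of w consecutive feature indices; the
    behavior is every remaining feature index.
    """
    if not 0 < w < d:
        raise ValueError("window size w must satisfy 0 < w < d")
    pairs = []
    for a in range(d - w + 1):
        context = list(range(a, a + w))
        behavior = [f for f in range(d) if f < a or f >= a + w]
        pairs.append((context, behavior))
    return pairs


def split_point(z, context, behavior):
    """Split one data point z into contextual vector x and behavioral vector y."""
    x = [z[f] for f in context]
    y = [z[f] for f in behavior]
    return x, y


pairs = sliding_context_behavior_pairs(d=5, w=2)
x, y = split_point([10, 20, 30, 40, 50], *pairs[1])
```

With d = 5 and w = 2 this yields four pairs, and every pair covers all five features exactly once.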

Definition 2

(Reference Group) Given a context \(C_a\) and a data point \(z_i^a = (x_i^a,y_i^a)\), its reference group \(R_i^a \subset Z^a\) is a group of points in \(Z^a\) that are similar to \(z_i^a\) w.r.t. \(C_a\).

Problem 1 (Context Discovery for Anomaly Detection) Given a dataset \(Z \in \mathbb {R}^{n \times d}\) described by real-valued features F, in which useful contexts are unknown a priori,

First, find a set of contexts \(\hat{C}=\{C_1, C_2,.. C_m\}\) such that each establishes a suitable reference group unveiling contextual anomalies hidden in the global feature space or in other contexts.

Then, score each point in Z based on the set of contexts \(\hat{C}\), such that each score correctly indicates the degree of being a contextual anomaly.

To address this problem, we propose a novel approach, named ConQuest, as depicted in Fig. 2. The proposed approach comprises two major steps, namely (i) context discovery and (ii) multi-context anomaly detection. In the context discovery step, we take the initial dataset and a collection of candidate contexts as the input, and employ a multi-objective genetic algorithm along with a set of newly defined objectives to discover a set of relevant contexts. The multi-context anomaly detection step then uses a novel algorithm, the multi-context anomaly factor (MCAF), which leverages the contexts obtained in the previous step to produce the final anomaly scores.

In the subsequent sections, we start by presenting the MCAF algorithm, followed by a delineation of the objective functions and the context discovery process. The rationale behind this sequence is that the distance function introduced in the MCAF algorithm is utilized by one of the objective functions explained later.

4 Multi-context anomaly factor (MCAF)

According to the definition of contextual anomaly, an object is considered abnormal if its behavior significantly deviates from its reference group. One approach could be to cluster the instances using their contextual attributes to find reference groups and use an anomaly detection method to find deviations in each cluster using behavioral attributes. However, this approach heavily depends on the success of clustering, making it inconvenient for sparse datasets or datasets with many outlying individuals. Furthermore, it requires training a separate anomaly detector per cluster. Using nearest neighbors provides a higher granularity of analysis and therefore can improve the quality of reference group estimation when the contextual space is sparse or noisy.

Definition 3

(Reference Object) Given a sample \(z^a_i=(x^a_i,y^a_i)\) in a context \(C_a\), the reference group of \(z^a_i\) is defined as the k-nearest neighbors of the contextual attribute vector \(x^a_i\) and denoted as \(R_{i}^a\). We refer to each \(x^a_j \in R_{i}^a\) as a reference object.

Following the definition of contextual anomalies, we assume normal instances to have similar neighborhood structures in both contextual and behavioral spaces. Therefore, we expect the distances between anomalies and their reference groups, measured using behavioral attributes (i.e., behavioral distances), to be larger in comparison to normal objects. In addition, we consider that normal samples tend to have similar neighborhood structures within their reference groups in both the context and the behavior. More precisely, given an object \(x^a_i\) and its reference objects \(x^a_j, x^a_t \in R_{i}^a\), \(x^a_i\) preserves a similar neighborhood structure in both spaces if and only if \({\text{ dist }}(y^a_i,y^a_j) < {\text{ dist }}(y^a_i,y^a_t)\) and \({\text{ dist }}(x^a_i,x^a_j) < {\text{ dist }}(x^a_i,x^a_t)\). A significant change between these relationships measured in the context and the behavior also indicates anomalousness. To capture this, we need to incorporate both contextual and behavioral distances.

However, it is not trivial to simply combine distances (or similarities) in two spaces since they may have different dimensions and, therefore, different scales. It is important to note that our goal is still to determine the deviations using behavioral attributes; therefore, it is important to avoid giving stronger influences to contextual distances.

Shared nearest-neighbors (SNN) similarity has been shown to be an effective alternative to traditional similarity measures, while not being as sensitive to the dimensionality changes [23]. SNN similarity defines the similarity between two points by the number of nearest neighbors they have in common. It is simply measured using the intersection between two nearest neighbor lists as \(|N(i) \cap N(j)|/k\). SNN similarity is akin to the cosine of the angle between the zero–one set membership vectors for N(i) and N(j). Then, the SNN distance that measures the distance between two points using their shared nearest neighbors is given by

$$\begin{aligned} {\text{dis}}_{\text{SNN}}(i,j)= \arccos \left( \frac{|N(i) \cap N(j)|}{k}\right) \end{aligned}$$
(1)
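As a rough illustration of Eq. (1), the following Python sketch computes SNN distances from brute-force nearest-neighbor lists. The helper names `knn_sets` and `snn_distance` are hypothetical, and excluding a point from its own neighbor list is an assumption made here, since the text does not specify it.

```python
import math


def knn_sets(points, k):
    """Brute-force k-nearest-neighbor index sets under Euclidean distance.

    Assumption: a point is not counted among its own neighbors.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_distance(neighbors, i, j, k):
    """Eq. (1): arccos of the fraction of nearest neighbors shared by i and j."""
    shared = len(neighbors[i] & neighbors[j])
    return math.acos(shared / k)


points = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
nb = knn_sets(points, k=2)
near = snn_distance(nb, 0, 1, k=2)  # one shared neighbor out of two
far = snn_distance(nb, 0, 3, k=2)   # no shared neighbors
```

In this toy example, points 0 and 1 share one of two neighbors, giving arccos(1/2), while points 0 and 3 share none, giving the maximum distance arccos(0).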

SNN distance is symmetric and satisfies the triangle inequality. Finally, we define the contextual nearest-neighbor distance, the distance metric used to measure deviations by this base detector.

Definition 4

(Contextual Nearest-neighbor Distance) Assume \(z^a_j\) is a reference object of \(z^a_i\) in the context \(C_a\). Formally, the contextual nearest-neighbor distance of \(z^a_i\) with respect to \(z^a_j\) is defined as

$$\begin{aligned} {\text{ dis }}_{\text{ CoNN }}(z^a_i,z^a_j) = \frac{{\text{ dis }}_{\text{ EUC }}(y^a_i,y^a_j) }{{\text{ dis }}_{\text{ SNN }}(x^a_i,x^a_j)} \end{aligned}$$
(2)

where \(x^a_j\in R_i^a\) and \({\text{dis}}_{\text{SNN}}(x^a_i,x^a_j)=\arccos \left( \frac{|R_i^a \cap R_j^a|}{k}\right) \).

CoNN distance measures the distance between a sample \(z^a_i\) and its reference object \(z^a_j\) by combining the SNN distance measured using the contextual attributes and the Euclidean distance measured using the behavioral attributes.
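A minimal Python sketch of Eq. (2) is given below. The names `knn_sets` and `conn_distance` are hypothetical. The small constant `eps` is an assumption added here to avoid division by zero when two points have identical reference groups (SNN distance 0); the text does not say how that case is handled.

```python
import math


def knn_sets(vectors, k):
    """Brute-force k-nearest-neighbor index sets (the point itself is excluded)."""
    result = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(v, vectors[j]))
        result.append(set(others[:k]))
    return result


def conn_distance(X, Y, ref, i, j, k, eps=1e-12):
    """Eq. (2): behavioral Euclidean distance divided by contextual SNN distance.

    X holds contextual vectors, Y behavioral vectors, and ref the
    reference groups computed from X.
    """
    d_snn = math.acos(len(ref[i] & ref[j]) / k)
    d_euc = math.dist(Y[i], Y[j])
    return d_euc / max(d_snn, eps)  # eps guards d_snn == 0 (assumption)


X = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
Y = [[1.0], [1.0], [5.0], [1.0], [1.0], [1.0]]
ref = knn_sets(X, k=2)
same_behavior = conn_distance(X, Y, ref, 0, 1, k=2)
different_behavior = conn_distance(X, Y, ref, 0, 2, k=2)
```

Here point 2 behaves differently from its contextual neighbors, so its CoNN distance to point 0 is large, whereas points 0 and 1 behave identically and have CoNN distance 0.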

Definition 5

(Contextual Anomaly Density) Assume \(R_i^a\) is the reference group of \(z^a_i=(x^a_i,y^a_i)\) in a context \(C_a\), the contextual anomaly density of \(z^a_i\) is given by

$$\begin{aligned} {\text {CAD}}(z^a_i)= \frac{1}{\sum _{j\in R_i^a} \frac{{\text {dis}}_{\text {CoNN}}(z^a_i,z^a_j)}{|R_i^a|} } \end{aligned}$$
(3)

Intuitively, the contextual anomaly density of an object \(z^a_i\) is the inverse of its average CoNN distance to the reference group. This mirrors the relationship between reachability distance and local reachability density in the local outlier factor (LOF) algorithm [24].

Definition 6

(Contextual Anomaly Factor) The contextual anomaly factor of \(z^a_i\) is defined as

$$\begin{aligned} {\text {CAF}}(z^a_i)= \frac{\sum _{j\in R_i^a} \frac{{\text {CAD}}(z^a_j)}{{\text {CAD}}(z^a_i)}}{|R_i^a|} \end{aligned}$$
(4)

The contextual anomaly factor of a sample \(z^a_i\) captures the degree of deviation of \(z^a_i\) w.r.t. its reference group \(R_i^a\). It is given by the average ratio of the contextual anomaly density of \(z^a_i\) and its reference objects in \(R_i^a\). The CAF score of \(z^a_i\) is higher when the contextual anomaly density is lower for \(z^a_i\) than for its reference group.

So far, we have given the formulation of contextual anomaly detection for a given context. We extend CAF to the multi-context case by combining scores produced under different contexts.

Definition 7

(Multi-Context Anomaly Factor) Given a context set \(\hat{C}=\{C_1, C_2,\ldots , C_m\}\), we produce multi-context anomaly factor scores by taking the mean of CAF scores for each context. The MCAF score of the i-th sample in the dataset, \(s_i\), is given by

$$\begin{aligned} {\text {MCAF}}(z_i)= \frac{\sum _{a=1}^{m} {\text {CAF}}(z_i^a) }{m} \end{aligned}$$
(5)

where \({\text {CAF}}(z_i^a)\) is the contextual anomaly factor of \(z_i^a\) in context \(C_a \in \hat{C}\).

\({\text {MCAF}}(z_i) > 1\) means \(z_i\) has a lower average density than the reference group across all contexts and indicates \(z_i\) is most likely an anomaly. The algorithm for MCAF is shown in Algorithm 1.
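The scoring steps of Definitions 3 to 7 can be put together in a short Python sketch. This is an illustrative sketch under stated assumptions, not the authors' Algorithm 1: the function names are hypothetical, nearest neighbors are found by brute force and exclude the point itself, and `eps` guards against zero denominators. The ratio in the CAF step is taken as the reference objects' density over the point's own density, the LOF-style orientation under which a score above 1 signals a lower density than the reference group, as the text states.

```python
import math


def knn_sets(vectors, k):
    """Brute-force k-nearest-neighbor index sets (the point itself is excluded)."""
    result = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(v, vectors[j]))
        result.append(set(others[:k]))
    return result


def caf_scores(X, Y, k, eps=1e-12):
    """Contextual anomaly factor of every point for one context."""
    n = len(X)
    ref = knn_sets(X, k)  # reference groups from contextual vectors

    def conn(i, j):
        d_snn = math.acos(len(ref[i] & ref[j]) / k)
        return math.dist(Y[i], Y[j]) / max(d_snn, eps)

    # Contextual anomaly density: inverse of the mean CoNN distance.
    cad = []
    for i in range(n):
        mean_conn = sum(conn(i, j) for j in ref[i]) / len(ref[i])
        cad.append(1.0 / max(mean_conn, eps))

    # Average ratio of reference-object density to own density.
    return [sum(cad[j] / cad[i] for j in ref[i]) / len(ref[i]) for i in range(n)]


def mcaf_scores(Z, contexts, k):
    """Mean CAF over contexts; each context is a list of feature indices."""
    d = len(Z[0])
    totals = [0.0] * len(Z)
    for context in contexts:
        behavior = [f for f in range(d) if f not in context]
        X = [[z[f] for f in context] for z in Z]
        Y = [[z[f] for f in behavior] for z in Z]
        for i, score in enumerate(caf_scores(X, Y, k)):
            totals[i] += score
    return [t / len(contexts) for t in totals]


# Feature 0 is the context, feature 1 the behavior. Point 4 has a behavior
# value (6.0) that is typical for the second group but not for its own.
Z = [[0.0, 1.0], [0.1, 1.1], [0.2, 0.9], [0.3, 1.2], [0.4, 6.0],
     [10.0, 6.0], [10.1, 6.1], [10.2, 5.9], [10.3, 6.2], [10.4, 6.05]]
scores = mcaf_scores(Z, contexts=[[0]], k=4)
```

In this toy dataset, the behavior value 6.0 is unremarkable globally, but point 4 receives the highest score because it deviates from the points that share its context.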

Algorithm 1

5 Context discovery

In this section, we consider how to find relevant contexts for MCAF, so that each establishes suitable reference groups revealing contextual anomalies. First, we intuitively define a list of desired properties determining the relevance of a context based on general assumptions of contextual anomaly detection. Then, we exploit these rules of thumb to formulate our objective functions that guide the search for relevant contexts. To solve the multi-objective optimization problem, we utilize a multi-objective genetic algorithm that efficiently searches the context space and returns a Pareto front comprising a diverse set of non-dominated solutions.

5.1 Objective functions

5.1.1 Objective 1: Maximum context-behavior dependency

The general assumption in contextual anomaly detection is that objects sharing a similar context are expected to have similar behavior. Ideally, the suitable context and its corresponding behavior should constitute a high correlation so that the instances (except anomalies) would belong to similar groups in both spaces. Therefore, our first objective is to maximize the dependency between the given context and behavior.

Considering both the context and the behavior may have multiple and unequal dimensionalities, we cannot use classical correlation measures such as Pearson correlation. Instead, we use distance correlation that measures multivariate independence for variables, in which variable dimensions are not necessarily equal. Furthermore, this measure can spot nonlinear relationships. Given two random variables X and Y, the distance correlation can be measured as

$$\begin{aligned} {\text {Cor}}_d(X,Y)=\frac{{\text {Cov}}_d(X,Y)}{\sqrt{{\text {Var}}_d(X){\text {Var}}_d(Y)}}, \end{aligned}$$
(6)

where \({\text {Var}}_d(X)\) is the distance variance of X and \({\text {Cov}}_d(X,Y)\) is the distance covariance between X and Y.
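For illustration, the empirical distance correlation can be computed from double-centered pairwise distance matrices. The sketch below assumes the standard sample estimator, in which the distance covariance and distance variances are the square roots of averaged products of centered distances; the function names are hypothetical and this is not claimed to be the authors' code.

```python
import math


def centered_distances(rows):
    """Pairwise Euclidean distance matrix of the rows, double-centered."""
    n = len(rows)
    dist = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in dist]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(X, Y):
    """Sample distance correlation; X and Y may have different dimensions."""
    n = len(X)
    A, B = centered_distances(X), centered_distances(Y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_x = sum(a * a for row in A for a in row) / n ** 2
    dvar2_y = sum(b * b for row in B for b in row) / n ** 2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


X = [[0.0, 1.0], [1.0, 3.0], [2.0, 0.0], [4.0, 4.0]]  # two contextual features
Y = [[0.5], [2.0], [1.0], [4.0]]                      # one behavioral feature
dependency = distance_correlation(X, Y)
```

The value lies in [0, 1] and does not require the context and the behavior to have the same number of attributes.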

Given a context C and a behavior B, we define the context-behavior dependency of C and B as \({\text {Cor}}_d(C,B)\). Here, our objective is to find a set of contexts that maximizes the average context-behavior dependency. Since we define all objectives as minimization, the optimization function can be formulated to minimize the total context-behavior independence.

$$\begin{aligned} f_1(\hat{C})= \min \sum _{C_a\in \hat{C}} \left( 1- {\text {Cor}}_d(C_a, B_a)\right) \end{aligned}$$
(7)

5.1.2 Objective 2: Minimum context redundancy

The objective here is to find a set of contexts minimizing average redundancy between them so that we avoid picking contexts that result in similar reference groups. The redundancy between two contexts is measured by distance correlation, as explained above. The idea is that if the dependency between two contexts is high, they are more likely to capture similar relations in data, and there is no use in including both in \(\hat{C}\). Given two contexts \(C_a\) and \(C_b\), we can define the redundancy as \({\text {Cor}}_d(C_a,C_b)\).

Then, the optimization objective can be formulated to minimize the total pairwise redundancy among contexts.

$$\begin{aligned} \begin{aligned} f_2(\hat{C})= \min \sum _{C_a, C_b \in \hat{C}} {\text {Cor}}_d(C_a, C_b) \\ \end{aligned} \end{aligned}$$
(8)

5.1.3 Objective 3: Maximum discrimination

We start from the intuition that an anomaly is, to some extent, farther away from normal instances and, therefore, can be more easily separated (discriminated). Specifically, a contextual anomaly is farther off from its reference objects, given contextual distances. The relevant context ideally reveals many contextual anomalies that are hidden otherwise. As described in the previous section, we refer to CoNN distances to measure deviations. Let us consider the distribution of average CoNN distances of all the samples in data for a given context. A distribution with light tails, therefore, is less likely to be relevant in terms of uncovering anomalies. Based on this idea, we use kurtosis to quantify the “tailedness” of average CoNN distances of all samples. We expect an irrelevant context to show lower kurtosis than a relevant one. Given a context \(C_a\) and a dataset \(Z^a=(X^a, Y^a)\), the average CoNN of a sample \(z_i^a\in Z^a\) is given by

$$\begin{aligned} \overline{{\text {dis}}_{\text {CoNN}}}(z_i^a) = \frac{\sum _{z_j^a\in R_i^a} {\text {dis}}_{\text {CoNN}}(z_i^a,z_j^a)}{|R_i^a|} \end{aligned}$$
(9)

Then, we measure the kurtosis in \(C_a\) as

$$\begin{aligned} {\text {kurtosis}}(C_a)= \frac{\sum _{z_i^a\in Z^a} (\overline{{\text {dis}}_{\text {CoNN}}}(z_i^a) - \mu _{a})^4}{(n-1)\sigma _{a}^4} -3 \end{aligned}$$
(10)

where n is the number of samples in \(Z^a\), \(\mu _{a}\) is the mean of \(\{\overline{{\text {dis}}_{\text {CoNN}}}(z_1^a),\overline{{\text {dis}}_{\text {CoNN}}}(z_2^a),\ldots , \overline{{\text {dis}}_{\text {CoNN}}}(z_n^a)\}\) and \(\sigma _{a}\) is the standard deviation.
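Eq. (10) can be evaluated directly on a list of average CoNN distances, as in the following minimal Python sketch. The function name `excess_kurtosis` is hypothetical, and using the sample standard deviation (with n - 1 in the denominator) for \(\sigma _a\) is an assumption, since the text only says "standard deviation".

```python
import statistics


def excess_kurtosis(values):
    """Eq. (10): fourth central moment over (n - 1) * sigma^4, minus 3.

    Assumption: sigma is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    fourth = sum((v - mu) ** 4 for v in values)
    return fourth / ((n - 1) * sigma ** 4) - 3.0


light_tailed = excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
heavy_tailed = excess_kurtosis([1.0, 1.0, 1.0, 1.0, 10.0])
```

The second list has one value far from the rest and therefore a higher kurtosis than the evenly spread first list, which is the property the objective rewards.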

We finally define the objective function as the minimization of negative kurtosis.

$$\begin{aligned} \begin{aligned} f_3(\hat{C})= \min \sum _{C_a\in \hat{C}} - {\text {kurtosis}}(C_a) \end{aligned} \end{aligned}$$
(11)

5.2 Multi-objective optimization

Here, we present the details of our context search strategy, in which we utilize the Elitist Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [25] to approximate a Pareto frontier between multiple and possibly conflicting objectives described above. We implement custom encoding and crossover procedures and also apply a multi-criteria selection procedure to obtain the best context sets from the global non-dominated solutions in the Pareto front.

5.2.1 Encoding

Given \(\eta \) contexts constructed from the dataset, our goal is to find m relevant contexts, where m is a user-specified parameter. Therefore, the search space would include \(\binom{\eta }{m}\) sets of contexts, in which each set containing m contexts is a candidate solution. We use a common binary encoding representation of these solutions, where each solution is a chromosome represented as a binary string of zeros and ones with the length of the total number of initially provided contexts (\({\eta }\)). A zero or one at the i-th gene specifies the absence or presence of the i-th context in the final context set. Since the size of the context set m is pre-specified, the number of ones in the string should always be exactly m.
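The encoding can be sketched in a few lines of Python. The helper names `encode`, `decode`, and `is_valid` are hypothetical, and context indices are 0-based in this sketch.

```python
import math


def encode(selected, eta):
    """Binary chromosome of length eta with a 1 at every selected context index."""
    chromosome = [0] * eta
    for a in selected:
        chromosome[a] = 1
    return chromosome


def decode(chromosome):
    """Indices of the contexts that are present in the chromosome."""
    return [a for a, bit in enumerate(chromosome) if bit == 1]


def is_valid(chromosome, m):
    """A candidate solution must contain exactly m contexts."""
    return sum(chromosome) == m


eta, m = 9, 4
chromosome = encode([0, 3, 4, 8], eta)
search_space_size = math.comb(eta, m)  # number of candidate context sets
```

For example, with 9 candidate contexts and m = 4 there are 126 candidate context sets.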

5.2.2 Crossover and mutation

We create a homogeneous crossover operator which takes two selected chromosomes as parents and creates offspring (a new context set) by inheriting and recombining the building blocks from them. The crossover operator is designed considering two properties: (i) carrying the potentially most relevant parts to the offspring by preserving the common context shared between both parents, (ii) maintaining the same number of contexts by limiting the number of 1 bits in the resulting chromosome.

First, we apply the AND operator across parent binary strings to make sure the contexts that are common in them will be present in their offspring as well. Then, we randomly assign 1 bits to reach m contexts in the set and leave the rest as 0. To increase the diversity of the population and consequently improve the exploration of relevant contexts, we also apply a two-bit mutation operation. We randomly flip two opposite bits (i.e., switch a random 1 to 0 and another 0 to 1), ensuring the number of contexts remains the same.
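The two operators can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and drawing the additional 1 bits uniformly from all positions that are 0 after the AND step is an assumption, since the text only says they are assigned randomly.

```python
import random


def crossover(parent_a, parent_b, m, rng):
    """Keep contexts common to both parents, then fill randomly up to m ones."""
    child = [a & b for a, b in zip(parent_a, parent_b)]
    free = [i for i, bit in enumerate(child) if bit == 0]
    for i in rng.sample(free, m - sum(child)):
        child[i] = 1
    return child


def mutate(chromosome, rng):
    """Two-bit mutation: switch one random 1 to 0 and one random 0 to 1."""
    ones = [i for i, bit in enumerate(chromosome) if bit == 1]
    zeros = [i for i, bit in enumerate(chromosome) if bit == 0]
    child = list(chromosome)
    if ones and zeros:
        child[rng.choice(ones)] = 0
        child[rng.choice(zeros)] = 1
    return child


rng = random.Random(0)  # fixed seed only to make the example repeatable
parent_a = [1, 1, 0, 1, 0, 1, 0, 0]
parent_b = [1, 0, 1, 1, 0, 0, 1, 0]
child = crossover(parent_a, parent_b, m=4, rng=rng)
mutated = mutate(child, rng)
```

Both operators keep the number of 1 bits at m, so every offspring remains a valid candidate solution under the encoding.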

5.2.3 Final selection

The Pareto frontier includes multiple non-dominated solutions, where each solution comprises m contexts to be used by MCAF or any multi-context anomaly detector. We cannot consider all solutions, as not all of them are equally useful or relevant in practice. Therefore, we apply TOPSIS [26], a simple yet effective method that chooses the best alternative based on the shortest and farthest Euclidean distances from the positive ideal solution (PIS) and negative ideal solution (NIS), respectively. We define our objective functions as minimization, i.e., cost functions. Therefore, PIS, in our case, comprises the minimum values observed among the three criteria (i.e., reverse dependency, redundancy, and negative kurtosis), and NIS is the opposite. TOPSIS ranks the Pareto set based on how close each solution is to the PIS and how far it is from the NIS. Using their rankings, an analyst can prioritize different context sets found by ConQuest or select only a few from the top to limit investigation effort.
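A compact Python sketch of TOPSIS for cost criteria is shown below. It assumes vector normalization of each criterion and equal weights for the three objectives; neither choice is specified in the text, and the function name `topsis_rank` is hypothetical.

```python
import math


def topsis_rank(costs):
    """Rank solutions (rows of cost values, lower is better) by TOPSIS closeness.

    Assumptions: vector normalization per criterion and equal weights.
    Returns (indices ordered best first, closeness per solution).
    """
    n_obj = len(costs[0])
    norms = [math.sqrt(sum(row[c] ** 2 for row in costs)) or 1.0
             for c in range(n_obj)]
    scaled = [[row[c] / norms[c] for c in range(n_obj)] for row in costs]
    pis = [min(row[c] for row in scaled) for c in range(n_obj)]  # best cost values
    nis = [max(row[c] for row in scaled) for c in range(n_obj)]  # worst cost values
    closeness = []
    for row in scaled:
        d_pos = math.dist(row, pis)
        d_neg = math.dist(row, nis)
        total = d_pos + d_neg
        closeness.append(d_neg / total if total > 0 else 0.0)
    order = sorted(range(len(costs)), key=lambda i: closeness[i], reverse=True)
    return order, closeness


# Columns: reverse dependency, redundancy, negative kurtosis (illustrative values).
pareto_costs = [[0.2, 0.1, -3.0], [0.8, 0.9, -0.5], [0.5, 0.5, -1.5]]
order, closeness = topsis_rank(pareto_costs)
```

A solution that attains the minimum of every criterion coincides with the PIS and gets closeness 1, so it is ranked first.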

5.3 Summary

The overall ConQuest approach is illustrated in Fig. 2. We conclude by describing a high-level abstraction of the approach as follows:

  1. Context Discovery

    • Using objective functions described in Sect. 5.1, find non-dominated solutions, where each comprises m contexts, as described in Sect. 5.2.

    • Select the top solution with m contexts using TOPSIS algorithm described in Sect. 5.2.

  2. Multi-context anomaly detection

    • For each context in m contexts, compute anomaly scores using the contextual anomaly factor algorithm described in Sect. 4.

    • Combine anomaly scores from m contexts to produce final anomaly scores.

6 Experiments

6.1 Setup and datasets

In the subsequent experiments, we set the number of nearest neighbors in MCAF as \(k=40\). For NSGA-II, we set the population size \(p=20\) if the number of features \(d<10\) and \(p=50\) if \(d\ge 10\). Similar to ICL [11], we set the initial sliding window size w used to extract the context-behavior pairs as \(w = 2\) if \(d<40\), \(w = 10\) if d in the range [40, 300], and \(w = 100\) for \(d > 300\). Our algorithm is much faster than ICL and allows a larger search space. Therefore, in our implementation, we repeat this procedure multiple times. In each repeat, the window size w is doubled until it reaches \(w= d/2\). Finally, the number of contexts m to be searched is chosen to be \(m = 4\). Although the best m may differ in each dataset, empirically, we found that this value generally provides satisfactory results.
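The parameter schedule described above can be written down as a small Python sketch. The function names are hypothetical, and the stopping rule is one reading of the text: the initial window is always used, and it is doubled for as long as the doubled size does not exceed d/2.

```python
def initial_window(d):
    """Initial sliding-window size w as a function of the number of features d."""
    if d < 40:
        return 2
    if d <= 300:
        return 10
    return 100


def window_sizes(d):
    """Window sizes tried across repeats (assumed rule: double while <= d / 2)."""
    w = initial_window(d)
    sizes = [w]
    while 2 * w <= d / 2:
        w *= 2
        sizes.append(w)
    return sizes


def population_size(d):
    """NSGA-II population size p as a function of the number of features d."""
    return 20 if d < 10 else 50


K_NEIGHBORS = 40   # number of nearest neighbors k in MCAF
N_CONTEXTS = 4     # number of contexts m to be searched
```

Under this reading, a dataset with 100 features would be searched with window sizes 10, 20, and 40.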

We evaluate ConQuest on two synthetic and 23 real datasets from ODDS and DAMI [27]. Synthetic datasets contain contextual anomalies generated using the perturbation scheme described in [6], a de facto standard for evaluating contextual anomaly detection methods. The first dataset, Synthetic-Single, includes anomalies generated from only one context, while Synthetic-Multi has anomalies from three different contexts.

The datasets are split into training (70%) and testing (30%), and the results are averaged across ten independent runs for all the methods reported in this section. Evaluations are performed using the Area Under the Precision-Recall Curve (i.e., AUC-PR or Average Precision) and the Area Under the Receiver Operating Characteristic curve (i.e., AUC-ROC), both commonly used metrics in the machine learning community [27]. However, in many anomaly detection applications, AUC-PR is often deemed more suitable than AUC-ROC as it places greater emphasis on the performance of the positive class [28]. Although AUC-ROC serves as a reliable measure for evaluating performance across both anomaly and normal classes, it may lead to overly optimistic performance assessments due to the high class imbalance typical in anomaly detection datasets [27]. In contrast, AUC-PR specifically assesses the accuracy of positive predictions (precision) and the proportion of actual positives that are correctly identified (recall), offering a more realistic view of performance on the anomaly class. Therefore, it is recommended to use AUC-ROC in conjunction with AUC-PR, rather than as a standalone performance indicator [27].
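For readers unfamiliar with the two metrics, the following Python sketch computes both from anomaly scores and binary labels (1 for anomaly). It is a plain illustration with hypothetical function names; the average precision routine assumes there are no tied scores, and an established library implementation would normally be used in practice.

```python
def auc_roc(labels, scores):
    """Probability that a random anomaly scores higher than a random normal point.

    Ties between an anomaly and a normal point count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true anomaly.

    Assumption: scores contain no ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auc_roc(labels, scores)
ap = average_precision(labels, scores)
```

Average precision depends only on how anomalies are ranked relative to the points scored above them, which is why it is less forgiving than AUC-ROC when normal points vastly outnumber anomalies.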

6.2 Baselines

Our proposed method is compared with 9 baselines in various categories: (i) contextual anomaly detection methods: ICL [11], ROCOD [8], CAD [6], and ConOut [12], (ii) traditional approaches: iForest [29], LOF [24], OC-SVM [30], and (iii) recent techniques: ECOD [31] and COPOD [32]. We tune the parameters of the baseline algorithms on each dataset, so that they are compared against ConQuest at their best reported performance. The details of the baseline algorithms and how we tune their parameters are given below.

ICL Internal Contrastive Learning (ICL) is a recent approach using self-supervision to detect anomalies in tabular datasets. This method assumes that the way in which a subset of the variables in the feature vector is related to the rest of the variables is class-dependent, and samples violating this dependency are anomalous. ICL also uses a sliding window with a size k to divide d features into two disjoint sets as k consecutive features and the rest of the features with a size \(d-k\). We set all the hyper-parameters and the neural network architecture as described by the authors in the paper.

ROCOD Robust Contextual Outlier Detection simultaneously considers local and global effects in outlier detection. We tune the number of nearest neighbors, k, between 10 and 200. The parameters of Decision Tree Regression are set by performing a grid search over ten-fold cross-validation. ROCOD uses a single predefined context, which is very difficult to obtain in practice. For our experiments, we train ROCOD with the best single context, i.e., the one resulting in the highest performance in each dataset, obtained through an exhaustive search.

CAD Conditional Anomaly Detection is a generative approach to model the relation between context and behavior. The context and the behavior are modeled separately, each as a mixture of multiple Gaussians. We vary the number of components between 2 and 20 for both mixtures and pick the values resulting in the highest performance. CAD is also a single-context method, so we follow the same strategy as for ROCOD to pick its best context.

ConOut ConOut is a contextual anomaly detection algorithm that combines multiple contexts identified automatically. First, it uses a measure to determine and eliminate redundant contexts, then builds an ensemble over the remaining contexts with a custom anomaly detection method called Contextual iForest. We tune ConOut’s parameter \(\gamma \) over \(\{0.0001,0.001,0.01,0.1,1,10,100,1000\}\). However, we set the parameters of the isolation forests within ConOut as suggested in the paper (i.e., number of trees \(t =100\) and sampling size \(\psi = 256\)) because it is computationally expensive to tune the parameters of each iForest in the ensemble.

iForest Isolation Forest is an isolation-based tree ensemble detector. We tune the number of trees over \(t = \{100, 200, 300, 400, 500\}\) and the max features over \(d=\{0.1, 0.2, \ldots , 1\}\). We set the sampling size \(\psi = 256\) as suggested in the paper.

LOF Local Outlier Factor compares the local density of each point to that of its neighbors. We tune the number of nearest neighbors, k, between 10 and 200 and pick the value resulting in the highest performance.

OC-SVM One-Class SVM is a popular outlier detection technique based on the principles of support vectors. We tune the kernel over {linear, polynomial, radial} and \(\nu \) over \(\{0.01, 0.1, 0.5, 1\}\).

COPOD Copula-based Outlier Detection first constructs an empirical copula, and then uses it to predict tail probabilities of each given data point to determine its level of “extremeness”. It is a parameter-free approach.

ECOD Empirical-Cumulative-distribution-based Outlier Detection is also a recent statistical technique. It estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. Then, it uses these empirical distributions to estimate tail probabilities per dimension for each data point and computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. It is a parameter-free approach.
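The per-dimension tail-probability idea behind ECOD can be sketched as follows. This is a simplified illustration under stated assumptions: it aggregates negative log tail probabilities over the left and right tails and takes the larger of the two, omitting the skewness-based variant used in the published method. The function name `ecod_like_scores` and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def ecod_like_scores(X: Sequence[Sequence[float]]) -> List[float]:
    """Simplified ECOD-style score: larger means more extreme."""
    n, d = len(X), len(X[0])
    left = [0.0] * n   # aggregated -log of left-tail probabilities
    right = [0.0] * n  # aggregated -log of right-tail probabilities
    for j in range(d):
        col = [row[j] for row in X]
        for i, v in enumerate(col):
            # Empirical CDF values estimated independently for dimension j.
            p_left = sum(c <= v for c in col) / n
            p_right = sum(c >= v for c in col) / n
            left[i] -= math.log(p_left)
            right[i] -= math.log(p_right)
    # A point is scored by its more extreme tail.
    return [max(l, r) for l, r in zip(left, right)]


# Made-up toy data: five points near the origin and one far away.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [-0.1, 0.1], [0.0, -0.1], [5.0, 5.0]]
ecod_scores = ecod_like_scores(X)
print(max(range(len(X)), key=lambda i: ecod_scores[i]))  # index of the most extreme point
```

The sketch runs in quadratic time per dimension for readability; sorting each column once would give the same empirical CDF values more efficiently.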

Table 1 Performance (AUC-PR) comparisons between Con Quest and baselines on 25 datasets
Table 2 Performance (AUC-ROC) comparisons between Con Quest and baselines on 25 datasets

6.3 Comparison against baselines

Here, we compare our method, Con Quest, with nine state-of-the-art methods in terms of detection performance. The experimental results for Con Quest and nine competitors across 25 datasets are presented in Tables 1 and 2, using AUC-PR and AUC-ROC scores, respectively. Con Quest Top-1 shows the performance obtained from the solution at the top of the Pareto front ranked by TOPSIS, and Top-5 shows the highest performance among the top five solutions.
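The TOPSIS ranking used to pick the Top-1 and Top-5 solutions can be sketched as below. This is a minimal illustration, not the implementation used in the experiments: the equal weights, the vector normalization, the objective values, and the direction of each objective (here dependency and discrimination are maximized and redundancy is minimized) are all assumptions.

```python
import math
from typing import List, Optional, Sequence


def topsis_rank(front: Sequence[Sequence[float]],
                maximize: Sequence[bool],
                weights: Optional[Sequence[float]] = None) -> List[int]:
    """Return solution indices ordered from best to worst TOPSIS closeness."""
    m = len(front[0])
    w = list(weights) if weights is not None else [1.0 / m] * m
    # Vector-normalize each objective column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in front)) or 1.0 for j in range(m)]
    V = [[w[j] * row[j] / norms[j] for j in range(m)] for row in front]
    cols = list(zip(*V))
    best = [max(c) if mx else min(c) for c, mx in zip(cols, maximize)]
    worst = [min(c) if mx else max(c) for c, mx in zip(cols, maximize)]
    closeness = []
    for v in V:
        d_best, d_worst = math.dist(v, best), math.dist(v, worst)
        total = d_best + d_worst
        closeness.append(d_worst / total if total > 0 else 0.0)
    return sorted(range(len(front)), key=lambda i: closeness[i], reverse=True)


# Hypothetical Pareto front: (dependency, redundancy, discrimination).
front = [(0.9, 0.5, 0.2), (0.6, 0.2, 0.6), (0.2, 0.1, 0.9), (0.7, 0.3, 0.7)]
ranking = topsis_rank(front, maximize=[True, False, True])
top1, top5 = ranking[0], ranking[:5]
print("Top-1 solution:", top1, "| Top-5 candidates:", top5)
```

Under this reading, Top-1 is the first entry of the ranking, while Top-5 reports the best detection performance among the first five entries.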

Table 1 demonstrates that the Con Quest methods achieve the highest AUC-PR scores, in terms of both average scores and rankings across all datasets. Specifically, Con Quest Top-5 is the best-performing approach overall, while Con Quest Top-1 also delivers competitive results, ranking second after Top-5. In particular, Con Quest Top-5 and Top-1 achieve average AUC-PR scores of 49.45% and 42.53%, respectively, which are 11.15 and 4.23 percentage points higher than the next best method, LOF.

In terms of AUC-ROC scores, Con Quest methods also achieve the highest average ranks (Table 2). However, while Con Quest (Top-5) still records the best average score, Con Quest (Top-1) ranks third, following LOF. As previously mentioned, AUC-ROC scores should not be solely relied upon to judge the quality of anomaly detection methods due to potentially over-optimistic results in datasets with a very low ratio of anomalies [27]. The significant performance variation in the PenDigits (Pens) dataset between the two metrics illustrates this issue well.

When comparing the results from the baseline methods in both Tables 1 and 2, we observe that traditional detectors such as iForest and LOF achieve better average performance than more recent approaches like ICL, COPOD, and ECOD in our setup.

The effectiveness of Con Quest in handling contextual anomalies is evident from the results on synthetic datasets. Con Quest successfully discovers useful contexts in a fully unsupervised manner and outperforms both contextual and non-contextual detectors in these datasets.

ROCOD and CAD are the other two contextual methods, and they both rely on a single predefined context, which is challenging to determine in practice. For these experiments, we trained these methods with the best context (i.e., the context resulting in the best performance) for each dataset, obtained through an exhaustive search. However, this procedure was not feasible for certain high-dimensional datasets due to the massive number of candidate contexts. Therefore, the results for these datasets are left empty. Nevertheless, Con Quest outperforms both ROCOD and CAD, even under the best-case scenario where they use the optimal context. This demonstrates that a user-defined, single-context setting is inferior to the effective combination of multiple relevant contexts.

Furthermore, the results from the Synthetic-Single dataset show that our MCAF algorithm can distinguish contextual anomalies more successfully than other contextual methods (ICL, ROCOD, CAD, and ConOut), even when there is a single known context.

The only other multi-context anomaly detector besides Con Quest is ConOut. ConOut uses a statistical measure to analyze dependencies among different attributes, forming contexts and behaviors by ensuring that highly correlated attributes are not grouped together. This method reduces the total number of contexts and eliminates redundant ones. However, it also cannot handle very high-dimensional datasets such as Arrhythmia, so its results for such datasets are omitted from Tables 1 and 2. Moreover, eliminating redundant contexts does not guarantee the inclusion of useful ones that reveal anomalies, causing ConOut to fall behind Con Quest in most cases. It performs worse than Con Quest, even in synthetic datasets containing contextual anomalies, because its automatic context formation algorithm fails to retain any “chosen” context in which the anomalies are injected.

Fig. 3 Statistical comparisons of the anomaly detection methods for two different metrics

We also statistically compare the average ranks of the algorithms using the nonparametric Friedman test [33]. The tests return p-values of 0.0059 and 0.0012 for AUC-PR and AUC-ROC results, respectively, indicating that we can reject the null hypothesis that there is no significant difference between the algorithms.
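The average ranks and the Friedman statistic can be computed as in the following sketch. The score table is made up for illustration and does not reproduce Tables 1 or 2; the statistic omits the tie correction, and the p-value (from a chi-square distribution with \(k-1\) degrees of freedom) is left to a statistics library such as `scipy.stats.friedmanchisquare`.

```python
from typing import List, Sequence, Tuple


def rank_row(scores: Sequence[float]) -> List[float]:
    """Rank one dataset's scores: best (highest) gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Average rank per method and the Friedman chi-square (no tie correction)."""
    n, k = len(table), len(table[0])
    rank_rows = [rank_row(row) for row in table]
    avg = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)
    return avg, chi2


# Made-up scores: rows are datasets, columns are three hypothetical methods.
table = [
    [0.90, 0.70, 0.50],
    [0.80, 0.60, 0.40],
    [0.95, 0.85, 0.30],
    [0.60, 0.55, 0.20],
]
avg_ranks, chi2 = friedman_statistic(table)
print(avg_ranks, chi2)  # [1.0, 2.0, 3.0] 8.0
```

The average ranks returned here are the quantities that a critical difference diagram places on its axis.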

Subsequently, we perform the Conover post-hoc test to compare the methods pairwise and plot the “critical difference” (CD) diagrams in Fig. 3a, b. These plots show whether the average ranks of the two methods differ by a “critical difference” at a significance level of \(\alpha =0.05\) [33].

These plots confirm that Con Quest (Top-5) significantly outperforms both contextual and non-contextual detectors, while Con Quest (Top-1) provides competitive results against these baselines. This demonstrates its capability to unveil anomalies of different types. Anomaly detection is a highly subjective task where the notion of an anomaly heavily depends on the assumptions made about the characteristics of the data and the application domain [1]. Various studies have shown that no single approach can detect all types of anomalies with the highest performance in every dataset [27, 34, 35]. Con Quest is particularly advantageous for use cases where data is suspected to involve multiple contexts from which anomalies arise. However, empirical results suggest that it can also be applied to datasets with unknown types of anomalies.

6.4 The impact of multi-objective optimization

Here, we study how effective our multi-objective function is in comparison to individual objectives, as well as random selection. Figure 4 shows the performances w.r.t. the number of contexts m. The “3 objectives” lines represent the complete Con Quest algorithm that optimizes multiple objectives. In “dependency,” “redundancy” and “discrimination,” we individually consider the objectives described in the previous section, using a traditional single-objective genetic algorithm. The “random” approach is based on uniformly sampling m contexts from the initial set. We run the random approach 10 times for each m and report the results from the best context set.

The results show the clear benefit of multi-objective optimization over single-objective optimization for the discovery of relevant contexts. The multi-objective case performs better than the other solutions in all datasets. Dependency and discrimination can also achieve competitive results, depending on the dataset and m. However, they show rather inconsistent performance across different datasets. Furthermore, “redundancy” alone or random combinations result in markedly sub-par performance.
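The difference between the two settings comes down to how candidate solutions are compared: a single-objective search orders them by one number, whereas a multi-objective search keeps every solution that no other solution dominates. The sketch below shows only this dominance check, not the genetic algorithm itself; the objective vectors are made up and, as an assumption, all objectives are expressed so that larger is better.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the solutions that no other solution dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical objective vectors for four candidate context sets.
candidates = [(0.9, 0.2, 0.3), (0.5, 0.6, 0.5), (0.4, 0.5, 0.4), (0.2, 0.9, 0.8)]
front_idx = pareto_front(candidates)
print(front_idx)  # [0, 1, 3]; candidate 2 is dominated by candidate 1
```

A search driven by the first objective alone would return only candidate 0, discarding the trade-off solutions 1 and 3.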

Fig. 4 The performance (AUC-PR) versus the number of contexts (m) for different datasets

6.5 Con Quest as an interpretable anomaly detector

The majority of anomaly detection methods provide only scores indicating whether an instance is an anomaly. However, knowledge of what types of anomalies or normal groups are present in the dataset, what their characteristics are, or how the anomalies resemble or differ from normal groups is rarely available. Here, we show how Con Quest can provide answers to such questions.

We use the Pendigits dataset containing samples of hand drawings of digits. Features are the x and y coordinates of the hand in 8 consecutive time ticks as a human draws each digit on paper. We assign all drawings of digit 0 as normal and sub-sample digits 3 and 6 to create anomalies with two different characteristics. We run Con Quest with \(m=2\), aiming for two different contexts.

Figure 5a, c show two different anomalous samples scored at the top by MCAF in two contexts discovered by NSGA-II. The blue and red markers represent contextual and behavioral attributes, respectively. Figure 5b, d show averages of reference groups found for “digit 3” and “digit 6”. Contextual and behavioral attributes explain why an anomalous digit belongs to a certain reference group and how this digit differs from the group. Consequently, they explain why each one is marked as anomalous. It can be seen that two types of anomalous digits are revealed under two different contexts, i.e., \(\{t1,t5\}\) and \(\{t2,t3,t4,t5,t6\}\). Although reference groups in both cases resemble “digit 0”, they show different characteristics that can be noticed in the hand positions at times t5, t6, t7, and t8. This suggests that there is more than one group among members of the “digit 0” normal class. Furthermore, behavioral attributes help explain how anomalies differ from their normal groups. For example, the behavior of “digit 6”, which is the positioning of the hand respectively at times t1, t7, and t8, is significantly different from hand positions for “digit 0”s represented in Fig. 5c. “Digit 3” (Fig. 5a), however, has more features that significantly deviate from and show less similarity with “digit 0”s. These explanations also help to understand the differences and commonalities between different kinds of anomalies.

Fig. 5 Examples of two different anomalies and their reference groups. Blue and red triangles correspond to points representing contextual and behavioral attributes, respectively

7 Limitations

Context discovery in high-dimensional tabular datasets is a computationally challenging problem, as one can generate \(2^d-2\) different contexts given a dataset with d features. For example, the Arrhythmia dataset with \(d=274\) leads to an astronomical \(2^{274}-2 \approx 3.04\times 10^{82}\) possible contexts. To tackle this, we require a strategy to maintain a manageable search space without sacrificing vital information.
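The \(2^d-2\) count follows from taking every non-empty proper subset of the features as a context and the complement as the behavior. A brute-force enumeration for a small, hypothetical feature set confirms the formula; the feature names are placeholders.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def enumerate_contexts(features: Sequence[str]) -> Iterator[Tuple[tuple, tuple]]:
    """Yield every (context, behavior) split with both sides non-empty."""
    d = len(features)
    for size in range(1, d):  # sizes 1 .. d-1 exclude the empty and the full set
        for context in combinations(features, size):
            behavior = tuple(f for f in features if f not in context)
            yield context, behavior


features = ["f1", "f2", "f3", "f4"]  # hypothetical feature names
splits = list(enumerate_contexts(features))
print(len(splits), 2 ** len(features) - 2)  # 14 14
```

Each additional feature doubles the number of subsets, so exhaustive enumeration of this kind is only practical for small d.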

One common approach involves employing dimensionality reduction techniques like PCA and autoencoders on the original features, generating contexts from the resulting lower-dimensional features. However, this reduces the interpretability of our approach, making it challenging to discern which original features are contextual or behavioral.

In our study, we adopt a strategy akin to ICL [11], using a sliding window to generate an initial pool of contexts as input for our algorithm. This method allows only consecutive features to be grouped together as a context or a behavior. Naturally, the implications of this constraint depend on the dataset. The authors of [11] thoroughly discuss this limitation and show that permuting the features sometimes harms accuracy and boosts performance only for datasets with a small number of features d and a small sample size n.

During our experiments, we have observed a similar trend: slight performance drops when randomly combining different feature permutations into contexts, even after multiple iterations. Thus, we contend that combining consecutive features in the original order is a reasonable strategy, often yielding informative contexts within the input set for efficient searching.
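A sliding-window pool of the kind described above can be sketched as follows. The details are assumptions made for illustration: the window is taken as the context and the remaining features as the behavior, every window size from 1 to \(d-1\) is used, and the window does not wrap around the end of the feature vector. The function name `sliding_window_contexts` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def sliding_window_contexts(
    d: int, sizes: Optional[Sequence[int]] = None
) -> List[Tuple[List[int], List[int]]]:
    """Build (context, behavior) index splits from windows of consecutive features."""
    window_sizes = sizes if sizes is not None else range(1, d)
    pool = []
    for k in window_sizes:
        for start in range(d - k + 1):
            context = list(range(start, start + k))  # k consecutive features
            behavior = [j for j in range(d) if j < start or j >= start + k]
            pool.append((context, behavior))
    return pool


pool = sliding_window_contexts(8)
print(len(pool), "windowed splits versus", 2 ** 8 - 2, "unrestricted splits")  # 35 versus 254
```

With all window sizes the pool grows quadratically in d, namely \((d-1)(d+2)/2\) splits, instead of exponentially, which is what keeps the search space manageable.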

8 Conclusion

In this work, we introduced Con Quest, an approach that automatically discovers multiple contexts that help to unveil contextual anomalies hidden in the global feature space. We defined a multi-objective function to search for contexts, in which individual objectives assessing the quality (i.e., relevance) of different contexts are derived from the desired properties in contextual anomaly detection. We designed the search procedure utilizing a multi-objective genetic algorithm that returns a Pareto front comprising diverse non-dominated solutions. Furthermore, we proposed a new contextual anomaly detection algorithm combining decisions from multiple contexts to detect anomalies standing out in those different contexts. Through experiments, we showed the effectiveness of Con Quest over state-of-the-art baselines and demonstrated the clear benefit of multi-objective optimization over single-objective counterparts. Finally, we showcased the interpretability aspect of Con Quest.