A New Dimensionality-Unbiased Score for Efficient and Effective Outlying Aspect Mining

The main aim of outlying aspect mining algorithms is to automatically detect the subspace(s) (a.k.a. aspect(s)) in which a given data point is dramatically different from the rest of the data. To rank the subspaces for a given data point, a scoring measure is required to compute the outlying degree of the given data point in each subspace. In this paper, we introduce a new measure of outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects outliers but also provides an explanation of why the selected point is an outlier. SiNNE is a dimensionality-unbiased measure in its raw form, which means the scores it produces can be compared directly across subspaces of different dimensionality. Thus, it does not require any normalization to make the score unbiased. Our experimental results on synthetic and publicly available real-world datasets revealed that (i) SiNNE produces better or at least the same results as existing scores, and (ii) it improves the run time of the existing outlying aspect mining algorithm based on beam search by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run on datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before.


Introduction
Outliers (a.k.a. anomalies) are data points that show dramatically different behavior from the remainder of the data points in the dataset. The process of finding such data points is known as Outlier Detection (OD). In the era of big data, OD is considered one of the vital tasks of data mining, with a wide range of application domains [21], e.g., (i) fraud detection, where an outlier corresponds to fraud such as credit card fraud [6] or insurance claim fraud [4]; and (ii) medical and public health, where an outlier corresponds to an unusual health condition of a patient arising from instrumental error or disease symptoms [14].
Recently, researchers have become interested in explaining why a data point is considered an outlier. The problem of finding these explanations leads to Outlying Aspect Mining (OAM) [8,22,27,28]. OAM is the task of identifying the feature subset(s) in which a given data point is dramatically inconsistent with the rest of the data. In the literature, the problem of OAM is also referred to as outlying subspace detection [31], outlier explanation [9,17,18], outlier interpretation [7,16,29], outlying property detection [1] and outlying aspect mining [8,22,23,26-28,30].
In many application scenarios, it is necessary to find out in which set of features a given point differs from the others. For example, in a bank, a fraud analyst collects information about various aspects of credit card fraud and is interested in knowing in which aspects the fraud does not conform with the remainder of the data. Likewise, when evaluating job applications, a panel member wants to know a job applicant's unique features. Another interesting application of OAM is in the medical domain [20]: as a doctor treating a specific patient, you may want to know how this patient differs from others. Existing OD methods cannot answer these questions.
To detect outlying aspects, OAM algorithms require a scoring measure to rank subspaces based on the outlying degree of the given query. Existing OAM algorithms such as HOSMiner [31], OAMiner [8], Density Z-Score [27] and sGrid [28] use a traditional distance- or density-based outlier score as the ranking measure. Because distance- and density-based outlier scores depend on the dimensionality of subspaces, they cannot be compared directly to rank subspaces. [27] proposed Z-Score normalization to make them comparable. This requires computing the outlier scores of all data points in each subspace, which adds significant computational overhead and makes OAM algorithms infeasible on large and/or high-dimensional datasets. Moreover, we discover that Z-Score normalization is not appropriate for OAM in some cases.
In this paper, we focus on two issues of existing scores used in OAM: (i) dimensionality unbiasedness, and (ii) computational complexity. It is worth noting that another computational issue in OAM is dealing with the exponentially large number of subspaces. Current OAM methods perform a systematic search, which is computationally prohibitive when the number of dimensions is high. This paper does not address that issue: it still uses the existing systematic search approach, but computes the score in each subspace efficiently. This paper makes the following contributions:
-Identify an issue with using Z-Score normalization of density-based outlier scores to rank subspaces, and show that it is biased towards subspaces having high density variance.
-Propose a new simple measure called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is useful for detecting outliers in a dataset as well as the outlying aspects of a given outlier point.
-Provide an objective measure to assess the quality of discovered outlying subspaces.
-Validate the effectiveness and efficiency of SiNNE in OAM. Our empirical results show that SiNNE detects more interesting outlying aspects than existing scores, and it allows the OAM algorithm to run orders of magnitude faster than with existing scoring measures.
The rest of the paper is organized as follows. Section 2 provides a summary of previous work on outlying aspect mining. The proposed outlier detector scoring measure is presented in Sect. 3. Experimental settings are provided in Sect. 4, and empirical evaluation results are provided in Sect. 5. Finally, conclusions are provided in Sect. 6.

Related Works
In this section, we first fix some notation for the rest of the paper, provide some basic definitions, and then discuss recent outlying aspect mining methods. The high-level process pipeline of OAM is shown in Fig. 1.

Fig. 1 The high-level process pipeline of OAM

Basic Notations and Definitions
Let X = {x_1, x_2, …, x_n} be a dataset of n data points in d-dimensional space. Let F be the full feature space and 𝕊 = {S_1, S_2, …, S_ð} be the set of all possible subspaces, where ð = 2^d − 1 is the number of possible subspaces. The key symbols and notations used in this paper are provided in Table 1.
The problem of outlier detection is to identify all x_i ∈ X which deviate remarkably from the others in the full feature space F, whereas the problem of outlying aspect mining is to identify the subspace(s) S_i ∈ 𝕊 in which a given data point x_i ∈ X is significantly different from the rest of the data. That given data point is referred to as a query q.

Definition 1 (Outlier) An outlier is a data instance that significantly deviates from the others in the full feature space F.

Definition 2 (Subspace) A subspace is a subset of the d dimensions of dataset X.

Definition 3 (Query point)
A query q is a data point of interest, which is used to find outlying aspects.

Definition 4 (Problem definition) Given a set of n instances X (|X| = n) in d-dimensional space and a query q ∈ X, a subspace S is called an outlying aspect of q iff the outlying degree of q in subspace S is higher than in other subspaces, and no other subspace has the same or a higher outlying degree.
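Since the number of candidate subspaces grows as 2^d − 1 with the dimensionality d, exhaustive enumeration quickly becomes infeasible. A minimal sketch of the enumeration (the function name is ours, purely for illustration):

```python
from itertools import combinations

def all_subspaces(d):
    # every non-empty subset of the d dimensions is a candidate subspace,
    # so there are 2^d - 1 of them
    dims = range(d)
    return [S for m in range(1, d + 1) for S in combinations(dims, m)]

print(len(all_subspaces(4)))  # 15 == 2**4 - 1
```

Even for a modest d = 100, this set has 2^100 − 1 members, which is why OAM methods resort to heuristic searches such as beam search.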

Outlying Aspect Mining
To the best of our knowledge, [31] is the earliest work that defines the problem of OAM. It introduced a framework to detect outlying subspaces called HOS-Miner (High-dimensional Outlying Subspace Miner). Therein, the authors used a distance-based measure called the Outlying Degree (OutD in short). The OutD of query q in subspace S is computed as:

OutD_S(q) = Σ_{x ∈ ℵ_S^k(q)} d_S(q, x),

where ℵ_S^k(q) is the set of k-nearest neighbors of q in subspace S, and d_S(a, b) is the Euclidean distance between a and b in subspace S, computed as d_S(a, b) = √(Σ_{i∈S} (a_i − b_i)²). In 2015, [8] introduced the Outlying Aspect Miner (OAMiner in short). Instead of using distance, the authors employed a kernel density estimation [24]-based scoring measure to compute the outlyingness of query q in subspace S:

f_S(q) = 1 / (n (2π)^{m/2} ∏_{i∈S} h_i) Σ_{x∈X} exp(−Σ_{i∈S} (q_i − x_i)² / (2 h_i²)),

where f_S(q) is the kernel density estimate of q in subspace S, m is the dimensionality of subspace S (m = |S|), and h_i is the kernel bandwidth in dimension i. [8] stated that f_S is biased towards high-dimensional subspaces: density tends to decrease as dimensionality increases. Thus, to remove the effect of dimensionality bias, they proposed to use the density rank of the query as the measure of outlyingness. [27] proposed two outlying scoring metrics: (i) the density Z-Score and (ii) the iPath score (isolation Path).
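As a hedged sketch of the distance-based OutD score described above (helper names are ours; a real implementation would use a spatial index rather than a full sort):

```python
import math

def d_S(a, b, S):
    # Euclidean distance between points a and b restricted to subspace S
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in S))

def outd(q, X, S, k=3):
    # OutD: sum of the distances from q to its k nearest neighbours in subspace S
    dists = sorted(d_S(q, x, S) for x in X if x != q)
    return sum(dists[:k])

# toy data: a tight cluster plus one faraway point
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(outd((5.0, 5.0), X, S=(0, 1)) > outd((0.05, 0.05), X, S=(0, 1)))  # True
```

The faraway point accumulates much larger nearest-neighbour distances than a point inside the cluster, so its OutD is higher.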
Therein, the density Z-Score is defined as:

Z(f_S(q)) = (f_S(q) − μ_{f_S}) / σ_{f_S},

where μ_{f_S} and σ_{f_S} are the mean and standard deviation of the densities of all data instances in subspace S, respectively. The iPath score is motivated by the Isolation Forest (iForest) anomaly detection approach [15]. The iPath score of query q in subspace S w.r.t. t sub-samples of the data is:

iPath_S(q) = (1/t) Σ_{i=1}^{t} l_i^S(q),

where l_i^S(q) is the path length of q in the i-th tree and subspace S. [27] were the first to coin the term dimensionality unbiasedness: "A dimensionality unbiased outlyingness measure (OM) is a measure of which the baseline value, i.e., average value for any data sample X = {x_1, x_2, …, x_n} drawn from a uniform distribution, is a quantity independent of the dimension of the subspace S." [28] introduced a simple grid-based density estimator called sGrid, a smoothed variant of a grid-based density estimator [24]. Let X be a collection of n data objects in d-dimensional space and x.S be the projection of a data object x ∈ X into subspace S. The sGrid density of a point is computed as the number of points that fall into the bin covering the point and its surrounding neighbors. The authors showed that replacing the kernel density estimator with sGrid has advantages in outlying aspect mining.
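A hedged sketch of the density Z-Score (function names are ours; a single fixed bandwidth h is used for simplicity). Note that it must estimate the density of all n points in each subspace, which is exactly the overhead discussed in the introduction:

```python
import math

def kde(q, X, S, h=0.5):
    # product-Gaussian kernel density estimate of q in subspace S
    m = len(S)
    norm = len(X) * (math.sqrt(2.0 * math.pi) * h) ** m
    s = sum(math.exp(-sum((q[i] - x[i]) ** 2 for i in S) / (2.0 * h * h)) for x in X)
    return s / norm

def density_zscore(q, X, S, h=0.5):
    # Z-Score normalization needs the densities of ALL points in subspace S,
    # making it O(n^2) per subspace with a naive KDE
    dens = [kde(x, X, S, h) for x in X]
    mu = sum(dens) / len(dens)
    sd = math.sqrt(sum((f - mu) ** 2 for f in dens) / len(dens))
    return (kde(q, X, S, h) - mu) / sd

# a low-density (outlying) query gets a negative Z-Score
X = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (0.1, 0.0), (9.0, 9.0)]
print(density_zscore((9.0, 9.0), X, S=(0, 1)) < 0)  # True
```

Under this normalization, the most outlying subspace for a query is the one with the most negative Z-Score.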
In recent work, [30] proposed a reconstruction-based method using completely random trees (RecForest in short). Reconstruction is done using the intersection of the bounding boxes in the completely random forest for each data point. The outlying score OS of each feature i = 1, 2, …, d for query q is defined as:

OS(q)_i = (q_i − rec_i)²,

where rec is the reconstructed sample of q. [29] proposed an Attention-guided Triplet deviation network for Outlier interpretatioN (ATON). Instead of searching subspaces, ATON learns an embedding space and how each dimension contributes to the outlyingness of the query.

The Framework
We first outline the motivation for our method, followed by the details of SiNNE. Figure 2 presents the flowchart of the complete framework.

Issue of Using Z-Score
Because Z-Score normalization uses the mean and standard deviation of the density values of all data instances in a subspace (μ_{f_{S_i}} and σ_{f_{S_i}}), it can be biased towards a subspace having high variation of density values (i.e., high σ_{f_{S_i}}). Let us take a simple example to demonstrate this. Assume that S_i and S_j (i ≠ j) are two different subspaces of the same dimensionality (i.e., |S_i| = |S_j|). Intuitively, because they have the same dimensionality, they can be ranked based on the raw (unnormalized) density values of a query q. Assuming the raw densities of q are the same in both subspaces, S_i can still be ranked higher than S_j by the density Z-Score just because of its higher density variance σ_{f_{S_i}}. To show this effect on a real-world dataset, consider the pendigits 1 dataset (n = 9868 and d = 16). Figure 3 shows the distribution of data in two three-dimensional subspaces, S_i = {7, 8, 13} and S_j = {2, 10, 13}. Visually, the query represented by the red square appears to be more of an outlier in S_j than in S_i, which is consistent with its raw density values; yet the density Z-Score ranks S_i higher because of the higher density variance in S_i. Apart from this, existing OAM scoring measures have two limitations:
-they are dimensionally biased and require normalization; and
-they are expensive to compute in each subspace.
Motivated by these limitations of density-based scores in OAM, we introduce a new measure that is dimensionality-unbiased in its raw form and can be computed efficiently.

Outlierness Computation
We now introduce a new scoring measure called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE in short). The scoring function is inspired by isolation-based anomaly detection using nearest-neighbor ensembles [2,3]. The proposed scoring function has two major steps:
-Building hyperspheres: hyperspheres are built in each subspace using nearest neighbors.
-Scoring the query: the current model is used to score the query.

Build Model
Let X be a dataset in which x_i represents the i-th data point, n is the number of data points in the dataset, and d is the number of dimensions. We randomly choose ψ data samples from X, t times, in each subspace. Our proposed scoring function follows the same procedure as iNNE [2] to build an ensemble of hyperspheres. However, in the context of OAM, the difference is that we create ensembles in subspaces instead of the full feature space.
Essentially, SiNNE creates an ensemble of hyperspheres. The ensemble is defined as t sets of hyperspheres, where each set consists of ψ hyperspheres.

Definition 5 (Hypersphere)
Given a sub-sample D, the hypersphere h(c) centered at c ∈ D has radius equal to the distance between c and its nearest neighbor in D.

Definition 6
Given t sub-samples D_1, D_2, …, D_t of size ψ each, an ensemble H contains t sets, and each set consists of ψ hyperspheres:

H = {H_1, H_2, …, H_t}, where H_i = {h(c) : c ∈ D_i}

and h(c) is the hypersphere centered at c with radius equal to the distance between c and its nearest neighbor in D_i. Note that the training processes of SiNNE and iNNE are the same; however, they differ in the computation of the outlier score (cf. Sect. 3.5 for more differences).
Definition 7 (Simple isolation score) The simple isolation score of q in subspace S based on sub-sample D_i is defined as:

SI(q | D_i) = 0 if q falls inside any hypersphere in H_i, and 1 otherwise.

SI takes the value 0 or 1. When q is covered by any of the hyperspheres, SI assigns 0; if q is not covered by any hypersphere, then SiNNE assumes that the point is far away from the data and assigns 1.

Definition 8
The outlier score of q in subspace S based on SiNNE is defined as the average of the simple isolation scores over the t sets:

SiNNE(q) = (1/t) Σ_{i=1}^{t} SI(q | D_i).
As SI takes only the values 0 and 1, SiNNE(q) lies in the range [0, 1].
The volume covered by each hypersphere decreases as the dimensionality of the space increases, and so does the actual data space covered by normal instances. Therefore, SiNNE is independent of the dimensionality of the space in its raw form, without any normalization, making it ideal for OAM. It adapts to the local data density because the sizes of the hyperspheres depend on the local density. It can be computed much faster than a k-NN distance or density. Moreover, it does not require computing the outlier scores of all n instances in each subspace (which is required by existing scores for Z-Score normalization), giving it a significant run-time advantage. The procedures for building an ensemble of models and for using them to compute the outlyingness of a given query in subspace S are provided in Algorithms 1 and 2.
Algorithm 1: Build Hyperspheres (X, t, ψ)
Input: X - given dataset; t - number of sets; ψ - number of sub-samples
Output: H - an ensemble of t sets of ψ hyperspheres
for i = 1 to t do:
  generate D_i by randomly selecting ψ data points from X without replacement;
  build a hypersphere h(c) for each c ∈ D_i, centered at c with radius equal to the distance from c to its nearest neighbor in D_i;
  H_i ← {h(c) : c ∈ D_i}
return H = {H_1, …, H_t}
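A minimal, self-contained sketch of Algorithms 1 and 2 in Python (names and structure are ours; this illustrates the idea rather than reproducing the authors' Java implementation):

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def project(x, S):
    return tuple(x[i] for i in S)

def build_hyperspheres(X, S, t=100, psi=8, seed=42):
    # Algorithm 1: t sets of psi hyperspheres; each hypersphere is centred on a
    # sub-sample point c with radius = distance from c to its nearest neighbour
    rnd = random.Random(seed)
    H = []
    for _ in range(t):
        D = [project(x, S) for x in rnd.sample(X, psi)]
        H.append([(c, min(dist(c, o) for o in D if o != c)) for c in D])
    return H

def sinne(q, H, S):
    # Algorithm 2: SI = 0 if q falls inside some hypersphere of a set, else 1;
    # the final score is the average SI over the t sets
    qs = project(q, S)
    si = [0 if any(dist(qs, c) <= r for c, r in spheres) else 1 for spheres in H]
    return sum(si) / len(si)

random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
H = build_hyperspheres(X, S=(0, 1))
print(sinne((20.0, 20.0), H, S=(0, 1)))  # 1.0: the far point is isolated in every set
```

Because scoring a query only checks membership in t · ψ hyperspheres, no per-subspace normalization pass over all n points is needed.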

Subspace Search
Apart from a scoring measure, the OAM framework requires a subspace search method. In this work, we use the Beam search method [27], because it is the latest search method used in the literature. We replicate the beam search procedure in Algorithm 3 for ease of reference. The overall time complexity of beam search is O(d² + W · d · ℓ), where W is the beam width and ℓ the maximum dimensionality of a subspace.
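The beam search over subspaces can be sketched as follows (a generic version under our own naming; `score(S)` stands for any outlyingness measure of the query in subspace S, e.g., SiNNE, and `dims` should be a sequence of dimension indices):

```python
def beam_search(dims, score, W=5, max_dim=3):
    # Beam search over subspaces: keep the best W subspaces at each level,
    # extend each by one unused dimension, re-score, repeat up to max_dim.
    beam = sorted(((score((i,)), (i,)) for i in dims), reverse=True)[:W]
    best = beam[0]
    for _ in range(max_dim - 1):
        cand = {}
        for _, S in beam:
            for i in dims:
                if i not in S:
                    T = tuple(sorted(S + (i,)))
                    if T not in cand:
                        cand[T] = score(T)
        beam = sorted(((v, S) for S, v in cand.items()), reverse=True)[:W]
        if beam and beam[0] > best:
            best = beam[0]
    return best[1]

# toy score that peaks exactly at subspace (1, 2)
score = lambda S: len(set(S) & {1, 2}) - 0.1 * len(S)
print(beam_search(range(5), score))  # (1, 2)
```

Only W subspaces survive each level, which is what keeps the search linear in d per level instead of exponential.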

An Example of Proposed Method
In this section, we present an illustrative example of the proposed method. Figure 4a shows 8 randomly selected sub-samples (highlighted in black) from a dataset with n = 50 in a 2-d subspace. Figure 4b shows how a hypersphere is built, centered at a sub-sample point c with radius equal to the distance to its nearest neighbor. Figure 4c shows all 8 hyperspheres created from the 8 sub-samples, which are used to compute the outlying degree of a data point. As shown in Fig. 4d, to compute the outlying degree of point x, the hypersphere that covers x needs to be determined. SI(x) = 0 because x falls in a hypersphere, while data point y does not fall in any hypersphere, and thus the outlying degree of y is 1.

Key Differences with Closely Related Work
In this subsection, we discuss the differences between SiNNE and iNNE.
Although they share the same training process, SiNNE and iNNE employ different scoring mechanisms. Specifically, iNNE employs a local, isolation-based score, computed as:

I(q) = 1 − η(nn(cnn(q))) / η(cnn(q)) if q is covered by some hypersphere, and I(q) = 1 otherwise,

where cnn(q) is the center of the smallest hypersphere covering q, nn(c) is the nearest-neighbor center of c, and η(c) is the radius of the hypersphere centered at c. Apart from this, iNNE creates its model in the full feature space, since its sole purpose is detecting outliers in the full feature space F, while the purpose of SiNNE is to detect subspaces for a given data point, and thus it creates a model in each subspace. Although iNNE [2] was previously used as an outlier detector, its use in the OAM context is new.

Theorem 1 The isolation score of iNNE with sub-sample size ψ = 2 is equivalent to SiNNE.

Proof Given an iNNE model H and sub-sample size ψ = 2, each set contains two hyperspheres with the same radius (cf. Definition 5). Thus, η(nn(cnn(q))) = η(cnn(q)). For ψ = 2, the isolation score is therefore 0 if q is covered and 1 otherwise, which is the same as Eq. 1. ◻

In terms of performance, SiNNE detects the ground truth for every query, while iNNE detects the ground truth for only 11 out of 15 queries (details are presented in "Appendix"). In addition, SiNNE is faster than iNNE because it does not need to find the smallest covering hypersphere and its neighboring hypersphere to compute the score.
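Theorem 1 can be checked mechanically. Below is a hedged sketch of the iNNE-style local score (our naming; cf. [2]), showing that with ψ = 2 both hyperspheres in a set share one radius, so the score collapses to SiNNE's 0/1 simple isolation score:

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def build_set(D):
    # one set of hyperspheres: centre c, radius = distance to c's nearest neighbour in D
    return [(c, min(dist(c, o) for o in D if o != c)) for c in D]

def inne_si(q, spheres):
    # iNNE-style local score: 1 - radius(nn(c))/radius(c) for the smallest
    # hypersphere (centre c) covering q; 1 when q is not covered at all
    covering = [(r, c) for c, r in spheres if dist(q, c) <= r]
    if not covering:
        return 1.0
    r, c = min(covering)
    nn = min((o for o, _ in spheres if o != c), key=lambda o: dist(c, o))
    r_nn = dict((cc, rr) for cc, rr in spheres)[nn]
    return 1.0 - r_nn / r

# with psi = 2 the two radii are equal, so the score is exactly 0 or 1
spheres = build_set([(0.0, 0.0), (1.0, 0.0)])
print(inne_si((0.4, 0.0), spheres), inne_si((5.0, 0.0), spheres))  # 0.0 1.0
```

With ψ > 2 the radii can differ, and the iNNE score takes fractional values between 0 and 1, which is where the two measures diverge.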
Experimental Setup

The characteristics of the datasets, in terms of data size and the dimensionality of the original input space, are provided in Table 2.
We used the default parameters suggested in the respective papers unless specified otherwise. For SiBeam, we set ψ = 8 and t = 100. Beam and RBeam employ KDE (kernel density estimation) to estimate density; KDE uses the Gaussian kernel with the default bandwidth. 4 To calculate the Gaussian kernel, we use the Euclidean distance. The block-size parameter w for the bit-set operation in sGBeam was set to 64 as suggested by the authors [28]. The beam width (W) and the maximum dimensionality of subspaces (ℓ) in the Beam search procedure were set to 100 and 3, respectively, as done in [27].

Evaluation Metric
As far as we know, there is no publicly available real-world dataset that offers ground truth to verify the quality of discovered subspaces. Therefore, in the absence of a better evaluation measure, we propose to use mean kernel embedding [19] to evaluate the quality of discovered subspaces. The intuition is that, in the most outlying aspect, the query is far away from the distribution of the data, i.e., it has the minimum average similarity to the rest of the data. The quality of a discovered subspace S for a query q using the kernel mean embedding method [19] is computed as:

fquality(q, S) = (1/(n − 1)) Σ_{x ∈ X, x ≠ q} K_S(q, x),

where K_S(q, x) is the kernel similarity of q and x in subspace S.
We use the Chi-square kernel [32] because it is parameter-free and widely used by the computer vision research community. The Chi-square kernel K_S(q, x) is computed as:

K_S(q, x) = Σ_{i∈S} 2 q_i x_i / (q_i + x_i).

In OAM, q is considered to be more of an outlier in subspace S_i than in S_j if fquality(q, S_i) < fquality(q, S_j).
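A hedged sketch of the quality measure (function names are ours; it assumes non-negative feature values, as the Chi-square kernel requires):

```python
def chi2_kernel(q, x, S):
    # additive Chi-square kernel restricted to subspace S (non-negative features)
    return sum(2.0 * q[i] * x[i] / (q[i] + x[i]) for i in S if q[i] + x[i] > 0)

def quality(q, X, S):
    # mean kernel similarity between q and the rest of the data;
    # lower values indicate a more outlying query in subspace S
    others = [x for x in X if x != q]
    return sum(chi2_kernel(q, x, S) for x in others) / len(others)

# a query near zero is dissimilar to the cluster at 0.5, so its quality is lower
X = [(0.5, 0.5), (0.45, 0.55), (0.55, 0.5), (0.5, 0.45), (0.01, 0.01)]
print(quality((0.01, 0.01), X, S=(0, 1)) < quality((0.5, 0.5), X, S=(0, 1)))  # True
```

In practice features would be rescaled to a common non-negative range first, since the Chi-square kernel is sensitive to feature scale.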

Implementation
All measures and the experimental setup were implemented in Java using the WEKA platform [10]. We made the required changes to the Java implementation of iNNE 5 provided by the authors to implement SiNNE. We used the Java implementation of sGrid made available by the authors [28].
All experiments were conducted on a machine with Intel 8-core i9 CPU and 16 GB main memory, running on macOS Monterey version 12.0.1.
We ran each job on a single CPU thread using GNU parallel [25]. All jobs were allowed up to 24 h; incomplete jobs were killed and marked as '⧫'.

Empirical Evaluation
In this section, we compare SiNNE and three contenders in four sets of experiments: (a) Experiment 1-dimensionality unbiasedness; (b) Experiment 2-performance on synthetic datasets; (c) Experiment 3-performance on real-world datasets; and (d) Experiment 4-run-time comparisons.

Experiment 1: Dimensionality Unbiasedness
We generated 19 synthetic datasets using the NumPy [12] library. Each dataset contains 1000 data points drawn from the uniform distribution U([0,1]^d), where d varies from 2 to 20. We computed the average score of all instances using SiNNE and KDE. The results are presented in Fig. 5. The flat line for SiNNE shows that it is dimensionality-unbiased, whereas KDE (without Z-Score normalization) is not. Note that [27] showed that ranks and Z-Score normalization make any score dimensionally unbiased; hence, we did not include them in this experiment.
3 Available at https://elki-project.github.io/datasets/outlier.
4 Note that a better rule of thumb [11] is h_i = 1.06 min(σ_i, (X_[0.75n] − X_[0.25n])/1.34) n^{−1/5}, where X_[0.25n] and X_[0.75n] are the first and third quartiles of the data X, respectively.
5 Available at https://github.com/tharindurb/iNNE.

Experiment 2: Performance on Synthetic Datasets

[13] provided several synthetic datasets, which have been used in previous studies [8,22,27,28]. These synthetic datasets have 1000 data points each, with 10, 20, 50, 75, and 100 dimensions. Each dataset has a fixed number of outliers for which the outlying subspaces are known (ground truth). synth_10D has 19 outliers; we passed all outliers one at a time as queries. Table 3 summarizes the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam for all 19 queries. In terms of exact matches, SiBeam is the best performing measure, detecting the ground truth as the top outlying aspect of each query. Beam and sGBeam perform similarly, producing 19 exact matches. RBeam is the worst performing measure, producing only five exact matches. Table 4 summarizes the mining results of SiBeam, RBeam, Beam, and sGBeam on four synthetic datasets: synth_20D, synth_50D, synth_75D, and synth_100D. SiBeam finds the ground truth as the top outlying subspace for each query (ten queries from each dataset). Beam and sGBeam perform similarly, producing 39 exact matches out of 40. RBeam is the worst performing measure, producing exact matches for 5 queries out of 40.

Experiment 3: Performance on Real-World Datasets
In real-world datasets, outliers and their outlying aspects are not available. Thus, we used the state-of-the-art outlier detector iForest 6 [15] to find the top k (k = 5) outliers, which were used as queries. We then use the fquality score (cf. Eq. 5) in the top-ranked subspace to measure the quality of the discovered subspace: the lower the value, the more likely the subspace is an outlying aspect of the given query. It is worth noting that SiBeam and sGBeam are the only methods able to finish the process for every query, while RBeam and Beam finished the process for only 10 queries. Table 5 shows the subspaces discovered by the four OAM methods (SiBeam, RBeam, Beam, and sGBeam) on six real-world datasets. Table 6 shows the quality of the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam; the highest-quality subspace for each query is highlighted in bold. SiBeam is the best performer on 28 out of 30 queries according to the proposed quality measure. sGBeam discovered the highest-quality subspace for only 5 queries out of 30. RBeam discovered the highest-quality subspace for only one query out of ten, whereas Beam was unable to discover the highest-quality subspace for even a single query.

6 We used the default parameters of iForest, ψ = 256 and t = 100.
The average run time over five queries for each dataset is presented in Table 5. Next, we visually compare the subspaces discovered by each measure for the top query from each dataset. Tables 7, 8, 9, 10, 11 and 12 show the subspaces discovered by SiBeam and the contending measures on wilt, pageblock, mnist, u2r, mulcross, and covertype, respectively. Visually, SiBeam detects better subspaces than its three contenders. Table 7 shows the average run time for 10 randomly chosen queries from each real-world dataset for SiBeam and its three contending measures. SiBeam and sGBeam were able to finish on all datasets, whereas RBeam and Beam were only able to finish on the wilt and pageblock datasets within 24 h. These results show that the proposed scoring measure enables the existing beam-search-based OAM approach to run orders of magnitude faster on large datasets. Specifically, SiBeam runs at least two and three orders of magnitude faster than RBeam and Beam on the wilt and pageblocks datasets, respectively. SiBeam runs at least two orders of magnitude faster than sGBeam on large datasets (n > 50K).

Conclusion
In this paper, we have introduced an efficient and effective scoring measure, Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is dimensionality-unbiased. Replacing the existing scoring measures with the proposed one brings three benefits. First, SiNNE is a dimensionality-unbiased measure that does not rely on any normalization, which means it can be used directly to compare subspaces of different dimensionality. Second, SiNNE allows the existing OAM algorithm (i.e., Beam) to run orders of magnitude faster than with the three state-of-the-art scoring measures; it is thus more suitable for mining huge datasets with thousands of dimensions. Third, we can now identify more interesting outlying subspaces for a given query. This is confirmed by the considerably better performance of SiNNE compared to the three state-of-the-art scoring measures in our empirical evaluation. In addition, we introduced a new performance measure for outlying aspect mining. Our experimental results on real-world datasets show that SiNNE performs better than the state-of-the-art measures.

Appendix: SiNNE Versus iNNE
This appendix provides additional results of the SiNNE and iNNE comparison from Sect. 3.5. Table 14 presents the subspaces discovered by SiNNE and iNNE on the synthetic datasets. In terms of exact matches, SiNNE detects the ground truth of each query as the outlying aspect, whereas iNNE detects the ground truth for only 11 out of 15 queries.