1 Introduction

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. The detection of outliers can identify system faults and frauds before they escalate with potentially catastrophic consequences; it turns out that, in some applications, the rare events can be more interesting than the regularly occurring ones [2, 26].

As outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data, it is important to justify why detected outliers are generated by some other mechanisms [9, 22, 23]. However, the border between data normality and abnormality is often not clear cut; consequently, while some outlier detection methods assign to each object in the input data set a label as either “normal” or “outlier”, in this paper we describe a method able to single out anomalous data bunches by working on attributes and associated values.

We deal with categorical data, where it is generally more difficult to devise criteria able to discriminate between normal and abnormal data [46, 48]. Moreover, the notion of outlierness in the field of numerical data has long been analysed [20] and many approaches have already been designed to define the exceptional nature of a property, but most of them cannot be easily generalized to deal with categorical or mixed categorical/numerical datasets.

Some algorithms exist that build an anomaly detection model specially devised for categorical variables and transform any numerical variable into a categorical space through a preliminary discretization phase [32, 49]; however, the main drawback of such a strategy is that the result of the analysis strongly depends on the outcome of the discretization process.

As an alternative, some mixed criteria techniques have been developed which manage numerical and categorical data separately and then merge the two by providing a method which encompasses the analysis of an element in both spaces [28, 30, 39].

In this paper, we specifically focus on problems which are typical of categorical data. We do that by taking the perspective of perceiving an attribute value as anomalous if its frequency occurrence is exceptionally typical or un-typical within the distribution of the frequency occurrences of the other attribute values.

However, within the categorical scenario the process of comparing frequencies poses several challenges. Indeed, if we take the point of view that the data at hand is the result of a sampling procedure in which data values are associated with some pre-defined occurrence probabilities, then the fact that a certain categorical value is observed exactly f times is a matter of chance rather than being a hard property of that value.

This has led us to the definition of the concept of soft frequency occurrence which, intuitively, consists in the estimate of the density associated with frequency occurrences. We obtain this measure by specializing the classical Kernel Density Estimation technique to the domain of frequency values.

As a second contribution, we leverage the cumulated frequency distribution of the above density estimate to decide if the frequency of a certain value is rare when compared to the frequencies associated with the other values. In particular, we are able to identify two kinds of anomalies, namely lower outliers and upper outliers. A lower outlier is a value whose frequency is small while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are large. An upper outlier is a value whose frequency is large while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are small.

Note that both these definitions look for unexpected behaviors by establishing a comparison between the frequency of the outlier value and that of the other values in the attribute domain. While the notion of lower outlier shares similarities with the classical concept of anomaly, the notion of upper outlier is conceptually different and should not be confused with the concept of mode. Indeed, for a value, a high frequency is not enough to be an upper outlier, as the rest of the values must appear with low frequencies.

Consider the example reported in Table 1: as for the attribute A1, value c is a lower outlier, since it occurs only once and the other values occur many times; as for the attribute A2, value e is an upper outlier, since it occurs many times and the other values occur once or at most twice; as for the attribute A3, although the value e has the same frequency both in A2 and A3 and it is the most frequent in A3, it is not an (upper) outlier since the other values in A3 have comparable frequencies.

Table 1 An example highlighting the notion of lower (A1) and upper (A2) outliers

Thus, we are able to single out, by one unified outlierness measure, both exceptionally infrequent and exceptionally frequent values; this peculiarity clearly differentiates our proposal from almost all the existing measures of outlierness.

Although values can show exceptional behavior with respect to the whole population, it must be pointed out that very often a value emerges as exceptional only when we restrict our attention to a subset of the whole population [7]. In particular, our method, differently from many others, is capable of returning to the user not only data bunches which are anomalous with respect to the entire data population, but also those records which are normal in general but anomalous only when contrasted with a sub-population. Therefore, our technique has been designed to output the so-called explanation-property pairs (E, p), where E, called explanation, denotes a condition used to determine the target subpopulation and p, called property, represents an attribute pa and a value pv such that pv is exceptionally frequent or infrequent within the subpopulation selected by the explanation E.

The output of the algorithm thus consists of explanation-property pairs, which allow us to provide an interpretable explanation for the abnormal values discovered.

However, it must be noticed that, when dealing with explanations, there exists a risk that the user is overwhelmed by a huge amount of redundant explaining patterns. Thus, as a further contribution, we define a mechanism that allows us to single out the explanations encoding the outstanding exceptionalities in the data, carrying no redundant information. Loosely speaking, for a condition to encode a significant explanation we require the frequency distribution of the associated sub-population to be unexpected given the knowledge of the frequency distribution of any of its super-populations, where unexpectedness is measured by means of the chi-squared goodness-of-fit test. Moreover, for an attribute and a value to encode a significant property, we require that the outlierness measured within the sub-population associated with a given significant explanation improves the one measured within any of its super-populations associated with significant explanations. The outlierness improvement must be greater than a factor which is inversely related to the unexpectedness of the sub-population frequency distribution. Maximal significant explanation-property pairs are said to be outstanding. The output of the algorithm precisely corresponds to the so-called outstanding explanation-property pairs.

Finally, our technique is knowledge-centric as the search space we visit is formed by explanation-property pairs and the outliers we provide can be seen as a product of the knowledge mined. This is clearly different from traditional outlier detection approaches which are object-centric.

The rest of the work is organised as follows. Section 2 discusses work related to the present one. Section 3 introduces the frequency occurrence function. Section 4 describes the outlierness function for ranking categorical values. Section 5 introduces the concept of outstanding explanation-property pair and describes the goal of our mining method. Section 6 describes experimental results.

2 Related work

Categorical data has received relatively little attention as compared to quantitative data because detecting anomalies in categorical data is a challenging problem [48]. Generally, traditional approaches do not handle categorical data in a satisfactory manner, due to the fact that, in most cases, there is no concept of ordering for the set of values a categorical variable can assume; so the development of specific techniques is needed. We start by noting that there is little literature about detecting anomalous properties, and/or related outlier objects, equipped with explanations, that is, about the identification of both the features and the subpopulations which characterize anomalies. Moreover, to the best of our knowledge, no technique is able to natively detect upper outliers.

There exist several approaches to detect outliers in the certain setting, namely statistical-based [14, 25], distance-based [4, 12, 13, 29, 34], density-based [18, 45], isolation-based [38], subspace-based [1, 6, 7], knowledge-based [5], neural network-based [31, 42], and many others [3, 19].

Some anomaly detection techniques depend on the identification of a representative pattern suggested by the majority of observations, so that objects that are far from it, according to a suitable distance measure, can be perceived as anomalies. However, designing such a measure in the presence of categorical data is challenging [17].

Different strategies have been proposed to address the above problem. In [21] some methods are presented to map categorical data on numerical data together with a framework for their analysis. However, the effectiveness of these techniques is strongly related to the choice of the mapping function. A different perspective is that of applying to categorical data some traditional approaches designed for the quantitative domain by choosing an appropriate distance measure. This is done in [8, 13], where anomalies are defined as the N observations whose average distance to the k nearest neighbors is the greatest; instead, [35] considers as anomalous those observations with fewer than p observations within a certain distance d.

Many other methods exploit the Hamming distance to identify anomalies among categorical data [15, 16, 37] together with a pruning strategy to cope with the quadratic complexity of evaluating distances.

Recently a new idea of distance suitable for the categorical domain has been introduced to detect and characterize outliers in a semi-supervised fashion [33]. The key intuition is that the distance between two values of a categorical attribute can be determined by the way in which they co-occur with the values of other attributes in the data set: if two values are similarly distributed with respect to a certain set of attributes, their distance must be low. A model defining the distances between categorical values is built on the training set and is used to evaluate the outlier score associated with each test instance t as the sum of the distances between t and a subset of objects known to be normal.

The family of density-based approaches includes those methods that identify observations having outlying behavior in local areas, that is, observations that are inconsistent with their neighborhood but not necessarily with the pattern suggested by the majority of all other observations.

Local anomaly detection methods for categorical data include the k-Local Anomalies Factor (k-LOF) [51], ROAD [47], and WATCH [36] methods.

The k-LOF is a local anomaly detection method for both categorical and quantitative data [51]. It extends the Local Anomalies Factor (LOF) [18] to the categorical domain. The k-LOF identifies an observation as a local outlier if its relationships with its neighbors are weaker than the relationships between its neighbors and its neighbors’ neighbors. It does that by building a similarity graph and by using the concept of k-walk, i.e., the paths of length k on the similarity graph joining two observations, to provide an outlierness score.

The ROAD algorithm [47] exploits both distances and densities. The Hamming distance is used to group objects into clusters and highlight observations located in sparse regions; then a density measure is calculated for each object on the basis of the frequencies of its values, in order to identify objects whose values are almost infrequent in the dataset.

Both measures present some limitations: as for the Hamming distance, outliers with few exceptional attributes are not captured; as for the density, frequencies are not compared with expected values, and attributes associated with almost all distinct values, such as primary keys, can affect the results.

The WATCH method [36] has been recently designed to find outliers in high-dimensional categorical datasets using feature grouping. First, it groups correlated features, then it looks for outliers in each feature group by calculating a weighting factor for each categorical variable that takes into account the correlation between this variable and the others in the same group.

A completely different perspective is taken by methods that exploit information-theoretic measures. The idea behind these approaches relies on the direct relationship between the existence of anomalies and the amount of noise in the dataset. This led some authors [32] to formulate the outlier detection task in terms of an optimization problem, i.e., finding a subset of k objects such that the expected entropy of the resultant dataset after the removal of this subset is minimized. This strategy has to be intended from a global point of view, as the outlierness measure involves all the attributes simultaneously and is neither able to detect outliers in sub-populations nor to identify outliers characterised by one (or a few) outlying attributes.

To overcome the latter issue, in [24] an outlier factor is designed on the basis of the ratio between the probability of co-occurrence of two sets of attributes and the product of the probabilities of occurrence of the two sets taken separately. Here, the authors are interested in properties consisting of at least two attributes and do not address sub-populations.

Learning value interactions has proven to be effective when handling categorical data, and some recent contributions in the literature exploit such a strategy to detect anomalies.

CBRW [40] estimates the outlierness of each feature value, which can be used either to detect outliers directly or to drive feature selection for subsequent outlier detection. The score is computed by comparing the frequency of each value with that of the most frequent value (the mode). However, this is just a measure of deviation: explanations are not provided and, by definition, upper outliers cannot be detected.

As a further drawback, noisy values may significantly influence the performance of CBRW, thus the same authors propose HOUR [41], a new outlier detection framework for data with noisy features. A noise-resilient outlier scoring function is defined to rank objects based on their outlierness in a given feature subset, and an outlier ranking evaluation function is proposed to evaluate the quality of the ranking w.r.t. the feature subset. Feature selection and ranking evaluation are iteratively performed until the best feature subset is obtained.

Nevertheless, this approach is not particularly suitable for high-dimensional data, thus two further improvements have been proposed, namely POP [44] and OUVAS [50].

The idea of investigating feature subsets has also been taken into account in [43], but in a completely different way. Here, the main intuition is that, given a random subsample of the main dataset, instances with rare combinations of values on any attribute subset also have a higher probability of having zero appearances in subsamples of any size.

Subspace management seems to meet our concept of explanation, but the semantics is completely different. The problem of outlier explanation [10] we deal with in this paper consists in finding features that can justify the outlierness of an object, in the sense that its anomalous state emerges only as a consequence of the selection we have made.

Some previous works [9, 10] follow this path and propose techniques for the categorical and numerical domains, respectively, that, given as input a single object known to be an outlier, provide features justifying its anomaly and subpopulations where its exceptionality is evident. A generalization is proposed in [11], where a small set of outliers is provided as input.

3 Frequency occurrence

In this section we give some preliminary definitions and introduce the notation employed throughout the paper.

A dataset \({\mathcal {D}}\) on a set of categorical attributes \({\mathcal {A}}\) is a set of objects o assuming values on the attributes in \({\mathcal {A}}\). By o[a] we denote the value of o on the attribute \(a\in {\mathcal {A}}\). \({\mathcal {D}}[a]\) denotes the multiset \(\{ o[a] \mid o\in {\mathcal {D}} \}\).

A condition \({\mathcal {C}}\) is a set of pairs (a, v) where each a is an attribute and each \(v \in {\mathcal {D}}[a]\). A singleton condition is said to be atomic. By \({\mathcal {D}}_{{\mathcal {C}}}\) we denote the new dataset \(\{o \in {\mathcal {D}} \mid o[a] = v, \forall (a, v) \in {\mathcal {C}}\}\).

A condition \({\mathcal {C}}\) is said to be valid if \({\mathcal {D}}_{{\mathcal {C}}} \neq \emptyset \). It follows from this definition that in a valid condition \({\mathcal {C}}\), for all \((a,v), (a,u) \in \mathcal {C}\), it holds that u = v. Thus, if \(\mathcal {C}\) is valid, the number \(|\mathcal {C}|\) of atomic conditions in \(\mathcal {C}\) is equal to the number of attributes involved in \(\mathcal {C}\). In the following, if not otherwise stated, we will take into account only valid conditions.

Definition 1 (Frequency distribution)

A frequency distribution \({{\mathscr{H}}}\) is a multiset of the form \({{\mathscr{H}}} = \{ f_{1}^{(1)}, \ldots ,f_{1}^{(w_{1})},\) \( \ldots , f_{n}^{(1)},\ldots ,f_{n}^{(w_{n})} \}\) where each \(f_{i}^{(j)}\in \mathbb {N}\) is a frequency, \(f_{i}^{(j)}=f_{i}^{(k)} = f_{i}\) for each \(1 \le j, k \le w_{i}\), and \(w_{i}\) denotes the number of occurrences of the distinct frequency \(f_{i}\). By \(N({{{\mathscr{H}}}})\) (or simply N whenever \({{\mathscr{H}}}\) is clear from the context) we denote \(w_{1}f_{1} + \ldots + w_{n}f_{n}\).

For the sake of simplicity, we will refer to a frequency distribution as a set \({{\mathscr{H}}} = \{ f_{1}, f_{2}, \ldots , f_{n} \}\) and to the number of occurrences wi of fi as w(fi). To ease the writing of expressions, we also assume that the dummy frequency f0 = 0 with w0 = 0 is always implicitly part of any frequency distribution.

Given a multiset V, the frequency \({f^{V}_{v}}\) of the value \(v \in V\) is the number of occurrences of v in V.

The frequency distribution of the dataset \({{\mathcal {D}}}\) on the attribute a is the multiset \({{\mathscr{H}}}^{{\mathcal {D}}}_{a} = \{ f^{{\mathcal {D}}[a]}_{v} \mid v \in {\mathcal {D}}[a] \}\). Note that \(N({{\mathscr{H}}}^{{\mathcal {D}}}_{a}) = |{\mathcal {D}}|\).

Theorem 1

Let \({{\mathscr{H}}} = \{f_{1},\ldots ,f_{n}\}\) be a frequency distribution. Then, \(n \le \sqrt {2\,N({{\mathscr{H}}})}\), that is, \(n = O\big(\sqrt {N({{\mathscr{H}}})}\big)\).

Proof

Since \(N({{\mathscr{H}}}) = w_{1}\cdot f_{1} + w_{2}\cdot f_{2} + {\dots } + w_{n} \cdot f_{n}\), n is maximized when (i) f1 = 1, (ii) ∀i, wi = 1, and (iii) \(\forall i > 1\), \(f_{i} = f_{i-1} + 1\). Thus, the maximum n is such that \(1+2+\dots +n = N({{\mathscr{H}}})\) and, since \(1+2+\dots +n = \frac {n(n+1)}{2}\), it follows that \(n\cdot (n+1) = 2\cdot N({{\mathscr{H}}})\) and, then, that \(n \le \sqrt{2\, N({{\mathscr{H}}})} = O\big (\sqrt {N({{\mathscr{H}}})}\big )\). □

From the above theorem, it immediately follows that the number of distinct frequencies in \({{\mathscr{H}}}^{D}\) is \(O\big(\sqrt {|{\mathcal {D}}|}\big)\).

Now we define the notion of frequency occurrence as a tool for quantifying how frequent a certain frequency is.

Definition 2 (Hard frequency occurrence)

Given a frequency distribution \({{\mathscr{H}}}\), the frequency occurrence \({\mathcal {F}}_{{{\mathscr{H}}}}(f_{i})\) of fi, also denoted by \({\mathcal {F}}(f_{i})\) whenever \({{\mathscr{H}}}\) is clear from the context, is the product wifi.

The above definition allows us to associate with each distinct value in \({\mathcal {D}}[a]\) a score that is related not only to its frequency in the dataset but also to how many other values have its same frequency.
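For illustration purposes only, the hard frequency occurrence can be computed in a few lines of Python; the snippet below is a minimal sketch (the function name is ours, not part of the proposal) mapping a single attribute column to the scores \(w_{i}\cdot f_{i}\).

```python
from collections import Counter

def hard_occurrences(column):
    """Hard frequency occurrence (Definition 2) for one categorical attribute column."""
    value_freqs = Counter(column)                 # frequency f_v of each distinct value v
    freq_counts = Counter(value_freqs.values())   # w_i: how many values occur exactly f_i times
    return {f: w * f for f, w in freq_counts.items()}

# Example: hard_occurrences(['a', 'a', 'b', 'b', 'c']) == {2: 4, 1: 1}
```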

A major drawback of the previous definition is that close frequency values do not interact with each other and, as a consequence, small variations of the frequency distribution may cause significant variations in the frequency occurrence values. E.g., consider the case in which the frequencies \(f_{i} = 49\), \(w_{i} = 1\) and \(f_{i+1} = 51\), \(w_{i+1} = 1\) are replaced with \(f^{\prime }_{i}=50, w^{\prime }_{i} = 2\). While in the former case \({\mathcal {F}}(f_{i})=49\) and \({\mathcal {F}}(f_{i+1})=51\), in the latter case we have that \({\mathcal {F}}(f^{\prime }_{i})=100\), which is about twice the frequency occurrence associated with \(f_{i}\) and \(f_{i+1}\). Intuitively, we do not want such a small variation in the frequency distribution to impact so heavily on the outcome of the measure. Indeed, if we take the point of view that the data at hand is the result of a sampling procedure in which data values are associated with some pre-defined occurrence probabilities, then the fact that a certain categorical value is observed exactly f times is a matter of chance, rather than being a hard property of that value.

Thus, we refine the previous definition of frequency occurrence in order to cope with the scenario depicted above. Specifically, to overcome the mentioned drawback, we need to force close frequency values to influence each other in order to jointly contribute to the frequency occurrence value. With this aim, we took inspiration from Kernel Density Estimation (KDE) methods to design an ad-hoc density estimation procedure.

First of all, we point out that we are working in a discrete domain composed of frequency values, a peculiarity that differentiates it from the standard framework of KDE. We start by illustrating the proposed density estimation procedure.

A (discrete) kernel function \(K_{f_{i}}\) with parameter fi is a probability mass function having the property that \(\sup _{f\ge 0}\) \(K_{f_{i}}(f)=K_{f_{i}}(f_{i})\).

Given an interval I = [fl, fu] of frequencies, a frequency fi, and a kernel function K, the volume of \(K_{f_{i}}\) in I, denoted as \(V_{I}(K_{f_{i}})\), is given by \({\sum }_{f=f_{l}}^{f_{u}} K_{f_{i}}(f)\). The following expression

$$ {\mathcal{F}}(f) = \underset{\varphi \in I(f)}{\sum} \left\{ \sum\limits_{i=1}^{n} w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi) \right\}. $$

where I(f) represents an interval of frequencies centred in f, provides the density estimate of the frequency occurrence of the frequency f.

Since \(K_{f_{i}}(\cdot )\) is a probability mass function, the frequency \(f_{i}\) provides a contribution to the frequency occurrence of f corresponding to the portion of the volume of \(K_{f_{i}}\) which is contained in I(f), that is \(V_{I(f)}(K_{f_{i}})\). Hence, if the interval I(f) contains the entire domain of \(K_{f_{i}}\), then \(f_{i}\) provides its maximal contribution \(w_{i}\cdot f_{i}\). Frequencies \(f_{i}\) whose kernel domain does not intersect I(f) do not contribute to the frequency occurrence of f at all.

The above definition requires the width of the interval I(f) centred in f to be properly calibrated. To eliminate the dependence of the formulation on an arbitrary interval, we resort to the following alternative formulation, in which the frequencies φ are not constrained to belong to the interval I(f). However, since the generic kernel \(K_{f_{i}}(\cdot )\) could be centred arbitrarily far from the frequency of interest f, its contribution now has to be properly weighted

$$ {\mathcal{F}}(f) = \underset{\varphi \ge 0}{\sum} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi)\cdot \frac{Pr[X_{f_{i}} = f]}{Pr[X_{f_{i}} = f_{i}]} \right] \right\}. $$

Here, \(X_{f_{i}}\) denotes the random variable distributed according to \(K_{f_{i}}\) and, hence, having \(f_{i}\) as the value that is most likely to be observed. The ratio \(\frac {Pr[X_{f_{i}} = f]}{Pr[X_{f_{i}} = f_{i}]} \le 1\) represents a weight factor for the kernel \(K_{f_{i}}(\cdot )\) which is maximal, in that it evaluates to 1, for f = fi. Hence, the closer the kernel \(K_{f_{i}}(\cdot )\) to the frequency of interest f, the larger its contribution to the frequency occurrence of f. Since the above probabilities can be directly obtained from the associated kernel, the expression can be rewritten as follows

$$ {\mathcal{F}}(f) = \underset{\varphi \ge 0}{\sum} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi)\cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \right] \right\}. $$
(1)

Equation (1) can be rewritten as

$${\mathcal{F}}(f) = \sum\limits_{i=1}^{n} \left[w_{i}\cdot f_{i} \cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \cdot \underset{\varphi \ge 0}{\sum} K_{f_{i}}(\varphi) \right].$$

Since \(K_{f_{i}}(\cdot )\) is a probability mass function, the summation over the domain of all its values is equal to 1 and, thus, the above expression can be finally simplified in

$$ {\mathcal{F}}(f) = \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \right]. $$
(2)

Since \({\mathcal {F}}\) represents a notion of density function associated with frequency occurrences, it is preferable that its volume, evaluated over the frequencies \({{\mathscr{H}}} = \{f_{1},\ldots ,f_{n}\}\), amounts to \(N({{\mathscr{H}}})\). This leads to the following final form of the frequency occurrence function.

Definition 3 (Soft occurrence function)

Given a frequency distribution \({{\mathscr{H}}}\), the frequency occurrence \({\mathcal {F}}_{{{\mathscr{H}}}}(f_{i})\) of fi, also denoted by \({\mathcal {F}}(f_{i})\) whenever \({{\mathscr{H}}}\) is clear from the context, is given by the following expression

$$ {\mathcal{F}}(f) = \frac{N({\mathcal{H}})}{N_{{\mathcal{F}}}({\mathcal{H}})}\cdot \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \widehat{K}_{f_{i}}(f) \right], $$
(3)

where

$$\widehat{K}_{f_{i}}(f) = \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} ~~\text{ and }~~ N_{{\mathcal{F}}}({\mathcal{H}}) = \sum\limits_{j=1}^{n} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \widehat{K}_{f_{i}}(f_{j}) \right] \right\}. $$

Figure 1 reports the frequency occurrence values according to the hard and soft definition. Note that the frequencies 48 and 50 are close to each other, and the soft frequency occurrence definition leads, for each of them, to a value that is more like the one we would get if we observed a frequency between 48 and 50 twice in the distribution. The values of the soft occurrences are not normalized to improve the intelligibility of the example.

Fig. 1 Comparison between hard and soft occurrence

As for the kernel selection, interestingly, we can take advantage of the peculiarity of the frequency domain to base our estimation on a very natural kernel definition. Indeed, as kernel \(K_{f_{i}}(\cdot )\) we will exploit the binomial distribution binopdf(f;n, p) with parameter n, denoting the number of independent trials, equal to \(N({{\mathscr{H}}})\), and parameter p, denoting the success probability, equal to \(p={f_{i}}/{N({{\mathscr{H}}})}\). We argue that this kind of kernel is particularly natural for our setting and, moreover, note that its use relieves us of the problem of selecting a suitable kernel bandwidth, a problem that affects almost all kernel density estimation procedures. Indeed, the fact that we observe a certain number \(f_{i}\) of occurrences for a given value v can be assimilated to the outcome of a sequence of Bernoulli trials each having success probability \(p_{v}\), which is modeled by a binomial random variable. Note that, for large sample sizes \(N({{\mathscr{H}}})\), the frequency \(f_{i}\) tends to the expected value \(N({{\mathscr{H}}})\cdot p_{v}\) of the above random variable. Hence, \({f_{i}}/{N({{\mathscr{H}}})}\) closely approximates \(p_{v}\) and the binomial function with parameters \(n=N({{\mathscr{H}}})\) and \(p={f_{i}}/{N({{\mathscr{H}}})}\) represents the distribution of that variable. This distribution clearly provides the probability of observing any other frequency \(f_{i}^{\prime }\neq f_{i}\) for the same value v.
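The following Python sketch shows one possible way of evaluating (3) with the binomial kernel; it is an illustration only (the helper name and the use of scipy are our own choices), not the authors' implementation.

```python
import numpy as np
from scipy.stats import binom

def soft_occurrence(freqs, weights):
    """Soft frequency occurrence (Eq. 3) with a binomial kernel.

    freqs   : distinct frequencies f_1, ..., f_n
    weights : their numbers of occurrences w_1, ..., w_n
    """
    freqs = np.asarray(freqs, dtype=int)
    weights = np.asarray(weights, dtype=float)
    N = int(np.sum(weights * freqs))                       # N(H)
    # K_hat[i, j] = K_{f_i}(f_j) / K_{f_i}(f_i), with K_{f_i} = binomial(N, f_i / N)
    K_hat = np.array([binom.pmf(freqs, N, fi / N) / binom.pmf(fi, N, fi / N)
                      for fi in freqs])
    raw = K_hat.T @ (weights * freqs)                      # inner summation of Eq. (3)
    return N * raw / raw.sum()                             # rescale so the values sum to N(H)
```

For instance, `soft_occurrence([49, 51], [1, 1])` smooths the hard occurrences 49 and 51 toward a common intermediate value, in line with the intuition discussed above.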

3.1 Computational cost

The computational complexity of the proposed measure can be estimated by taking into account the cost of computing the set of frequency occurrences for each attribute in \({\mathcal {A}}\) and the cost of evaluating \(\widehat {K}_{f_{i}}(\cdot )\) for each frequency in \({{\mathscr{H}}}^{D}\). The theorems discussed in the following provide some results about the time complexity.

Theorem 2

Let \({\mathcal {D}}\) be a dataset and let \({{\mathscr{H}}}^{D}=\{f_{1},\ldots ,f_{n}\}\). Then, the cost of computing the set of frequency occurrences \(\{ {\mathcal {F}}_{{{\mathscr{H}}}^{{\mathcal {D}}}}(f_{1}), \ldots , {\mathcal {F}}_{{{\mathscr{H}}}^{{\mathcal {D}}}}(f_{n}) \}\) is \(O(|{\mathcal {D}}|\cdot C_{K})\), where CK represents the cost of evaluating \(\widehat {K}_{f_{i}}(\cdot )\).

Proof

Consider (3). The cost of evaluating this Equation for a given f involves the computation of a summation of n terms, with n equal to the number of distinct frequencies. Due to Theorem 1, \(n = O\left (\sqrt {|{\mathcal {D}}|}\right )\). Since we have to evaluate (3) for any distinct f in the dataset, the Equation has to be computed n times. Then, since each evaluation costs \(C_{K}\), the overall cost is \(O\left (\sqrt {|{\mathcal {D}}|}\cdot \sqrt {|{\mathcal {D}}|} \cdot C_{K}\right ) = O(|{\mathcal {D}}|\cdot C_{K})\). □

Now we take into account the cost of computing the term \(\widehat {K}_{f_{i}}(f)\) when \(K_{f_{i}}(\cdot )\) is the binomial kernel.

Theorem 3

Given a dataset \({\mathcal {D}}\) and two frequencies fi and fj in \({{\mathscr{H}}}^{{\mathcal {D}}}\), the cost of computing \(\widehat {K}_{f_{i}}(f_{j})\) is O(1) with a pre-processing \(O(|{\mathcal {D}}|)\).

Proof

First of all, we prove that \(\widehat {K}_{f_{i}}(f_{j})\) can be obtained by evaluating the following expression:

$$ \exp\left[ \left( \sum\limits_{k=1}^{f_{i}}{\ln{k}} \right) + \left( \sum\limits_{k=1}^{N-f_{i}}{\ln{k}} \right) - \left( \sum\limits_{k=1}^{f_{j}}{\ln{k}} \right) - \left( \sum\limits_{k=1}^{N-f_{j}}{\ln{k}} \right) + (f_{i}-f_{j})\ln\left( \frac{N}{f_{i}}-1\right) \right]. $$

To get this equality, consider the logarithm of \(\widehat {K}_{f_{i}}(f_{j})\). Since \(K_{f_{i}}(f_{j})\) is a binomial probability function, by exploiting the properties of logarithms we obtain:

$$ \begin{array}{@{}rcl@{}} \ln{\widehat{K}_{f_{i}}(f_{j})} &=& \ln{\frac{K_{f_{i}}(f_{j})}{K_{f_{i}}(f_{i})}} = \ln{\frac{\binom{N}{f_{j}}\left( \frac{f_{i}}{N}\right)^{f_{j}}\left( 1 - \frac{f_{i}}{N} \right)^{N-f_{j}}}{\binom{N}{f_{i}}\left( \frac{f_{i}}{N}\right)^{f_{i}}\left( 1 - \frac{f_{i}}{N} \right)^{N-f_{i}}}} \\ &=& \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{j}-f_{i})\ln\left( \frac{f_{i}}{N}\right) + (f_{i}-f_{j})\ln\left( 1-\frac{f_{i}}{N}\right) \\ &=& \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{i}-f_{j})\ln\left( \frac{1-f_{i}/N}{f_{i}/N}\right) = \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{i}-f_{j})\ln\left( \frac{N}{f_{i}}-1\right). \end{array} $$

Since

$$ \begin{array}{@{}rcl@{}} \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}}&=& \ln\left( \frac{N!}{f_{j}!(N-f_{j})!}\cdot\frac{f_{i}!(N-f_{i})!}{N!}\right)\\&&= \sum\limits_{k=1}^{f_{i}}{\ln{k}} + \sum\limits_{k=1}^{N-f_{i}}{\ln{k}} - \sum\limits_{k=1}^{f_{j}}{\ln{k}} - \sum\limits_{k=1}^{N-f_{j}}{\ln{k}}, \end{array} $$

the statement follows. We note that all the above terms can be pre-computed. Indeed, during the pre-processing step we can build an array of N elements such that the generic ith entry stores \({\sum }_{k=1}^{i} \ln k\). □
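A possible realisation of this pre-processing (a sketch that assumes \(0 < f_{i} < N\), so that the logarithm in the last term is defined; the helper names are ours) stores the cumulative log-factorials once and reuses them for every evaluation of \(\widehat{K}_{f_{i}}(f_{j})\):

```python
import numpy as np

def make_khat(N):
    """O(N) pre-processing: logfact[k] = ln 1 + ... + ln k, then each evaluation costs O(1)."""
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, N + 1)))))

    def khat(fi, fj):
        # ln[ C(N, f_j) / C(N, f_i) ] + (f_i - f_j) * ln(N / f_i - 1), then exponentiate
        log_ratio = (logfact[fi] + logfact[N - fi]
                     - logfact[fj] - logfact[N - fj]
                     + (fi - fj) * np.log(N / fi - 1.0))
        return np.exp(log_ratio)

    return khat
```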

4 Categorical outlierness

In this section we introduce the concept of outlierness and discuss the measure we have designed to discover outlier properties in categorical datasets.

Definition 4 (Cumulated frequency distribution)

Given a frequency distribution \({{\mathscr{H}}}=\{f_{1},\ldots ,f_{n}\}\), the associated cumulated frequency distribution H is

$$ {H}(f) = \underset{f_{j}\le f}{\sum} {\mathcal{F}}_{{\mathcal{H}}}(f_{j}). $$

In the following, we also refer to the value H(fi) as Hi.

The idea behind the measure we will discuss in the following is that an object in a categorical dataset can be considered an outlier with respect to an attribute if the frequency of the value assumed by this object on such an attribute is rare if compared to the frequencies associated with the other values assumed on the same attribute by the other objects of the dataset.

We are interested in two relevant kinds of anomalies referring to two different scenarios.

Lower Outlier:

An object o is anomalous since for a given attribute a the value that o assumes in a is rare (its frequency is low) while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are large.

Upper Outlier:

An object o is anomalous since for a given attribute a the value that o assumes in a is usual (its frequency is high) while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are small.

In order to discover outliers, we exploit the cumulated frequency distribution associated with the dataset. With this aim, we use the area above and below the curve of the cumulated frequency distribution to quantify the degree of anomaly associated with a certain frequency.

Intuitively, the larger the area above the portion of the curve ranging from a certain frequency \(f_{i}\) to the maximum frequency \(f_{\max }\), the more \(f_{i}\) differs from the frequencies that are greater than \(f_{i}\). At the same time, the larger the area below the portion of the curve ranging from the minimum frequency \(f_{\min }\) to a certain frequency \(f_{i}\), the more \(f_{i}\) differs from the frequencies that are smaller than \(f_{i}\).

The contribution given by the area above the cumulated frequency distribution curve to the outlierness of a certain frequency \(f_{i}\) can be evaluated using the following expression

$$ A^{\uparrow}(f_{i}) = \underset{j> i}{\sum}(f_{j} - f_{j-1})\cdot \left( {H}_{n} - {H}_{j-1} \right). $$
(4)

The lower outlier score \(out^{\downarrow}(f_{i})\) is given by the normalised area

$$ out^{\downarrow}(f_{i}) = A^{\uparrow}(f_{i}) / A_{\max}^{\uparrow}(f_{i}), $$
(5)

obtained by dividing the area \(A^{\uparrow}(f_{i})\) by

$$ A_{\max}^{\uparrow}(f_{i}) = (A^{\uparrow}(f_{0}) - A^{\uparrow}(f_{i})) + (f_{n} - f_{i})\cdot ({H}_{n}-{H}_{i-1}) $$
(6)

corresponding to the area above the cumulated frequency histogram up to the frequency \(f_{i}\), represented by the term \((A^{\uparrow}(f_{0}) - A^{\uparrow}(f_{i}))\), plus an upper bound to the area above the cumulated frequency histogram starting from \(f_{i}\), represented by the term \((f_{n} - f_{i})\cdot ({H}_{n}-{H}_{i-1})\). Notice that the former term is minimised for \(f_{i} \rightarrow 1\), while the latter term tends to \(A^{\uparrow}(f_{i})\) for \(f_{n} \rightarrow \infty \) and, hence, in this case \(out^{\downarrow}(f_{i})\) tends to its maximum value 1.

The second scenario we are interested in aims to highlight the upper outliers, namely those objects that, for a given attribute, assume a value whose frequency is high, while typically, the dataset objects assume distinct values, that is the frequencies of the other values are low.

In order to discover such a kind of anomaly we take into account the area below the cumulated frequency distribution, starting from the lowest frequency up to the target frequency fi. The bigger this area, the more this frequency can be highlighted as anomalous. The contribution of the frequency fi is computed as

$$ A^{\downarrow}(f_{i}) = \underset{j\le i}{\sum}{(f_{j}- f_{j-1}) \cdot {H}_{j-1}}. $$
(7)

The upper outlier score \(out^{\uparrow}(f_{i})\) is given by the normalised area

$$ out^{\uparrow}(f_{i}) = A^{\downarrow}(f_{i}) / A_{\max}^{\downarrow}(f_{i}), $$
(8)

obtained by dividing the area \(A^{\downarrow}(f_{i})\) by the term

$$ A_{\max}^{\downarrow}(f_{i}) = (f_{i}-1) \cdot {H}_{i} $$
(9)

representing an upper bound to the area below the cumulated frequency histogram up to the frequency \(f_{i}\). Notice that \(A^{\downarrow}(f_{i})\) tends to \(A_{\max }^{\downarrow }(f_{i})\) for \(f_{i-1} \rightarrow 1\) and \({H}_{i} \rightarrow {H}_{i-1}\), or equivalently \({\mathcal {F}}(f_{i}) \ll {\mathcal {F}}(f_{i-1})\), so in this case \(out^{\uparrow}(f_{i})\) tends to its maximum value 1.

The outlierness, or abnormality score, associated with the frequency fi is a combined measure of the above two normalised areas:

$$ {\textit{out}}(f_{i}) = \frac{W^{\uparrow}_{i} \cdot out^{\uparrow}(f_{i}) + W^{\downarrow}_{i} \cdot out^{\downarrow}(f_{i})}{ W^{\uparrow}_{i} \cdot {\varDelta}(out^{\uparrow}(f_{i})) + W^{\downarrow}_{i} \cdot {\varDelta}(out^{\downarrow}(f_{i}))} $$
(10)

Specifically, the (global) outlierness score of \(f_{i}\) is the weighted mean of the upper and lower outliernesses associated with \(f_{i}\), with weights \(W^{\uparrow }_{i} = H_{i}\) and \(W^{\downarrow }_{i} = (H_{n}- H_{i-1})\), respectively. Note that \(H_{i}\) represents the fraction of the frequencies having value less than or equal to \(f_{i}\), while \((H_{n}-H_{i-1})\) represents the fraction of the frequencies having value greater than or equal to \(f_{i}\). Thus, when both the contributions \(out^{\uparrow}(f_{i})\) and \(out^{\downarrow}(f_{i})\) are greater than 0, the two weights provide their relative importance in terms of the fraction of the data population used to compute each of them. As for the function Δ(x), it evaluates to 0 if x = 0, and to 1 otherwise. Thus, it serves the purpose of ignoring the weight associated with the lower or upper outlierness if it evaluates to 0 and, otherwise, of taking it into account in its entirety.

In order to clarify the areas employed for the outlierness computation, let us refer to the following example. Consider a single attribute dataset whose associated set of distinct frequencies is {f1 = 1, f2 = 2, f3 = 3, f4 = 4, f5 = 5, f6 = 6} and the set of weights is {w1 = 3, w2 = 2, w3 = 1, w4 = 2, w5 = 1, w6 = 2}. Assume that we want to compute the outlierness associated with the frequency f3 = 3. Figure 2a and b represent the areas exploited to compute such outlierness.

Fig. 2 Outlierness computation example

On the left, the area \(A^{\downarrow }(f_{3})\) together with the area used for normalisation, \(A_{\max }^{\downarrow }(f_{3})\), is reported, while, on the right, the area \(A^{\uparrow }(f_{3})\) together with the area used for normalisation, \(A_{\max }^{\uparrow }(f_{3})\), is reported.
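The area computations of (4)-(10) translate almost literally into code. The sketch below is purely illustrative (names are ours, and either hard or soft occurrences can be passed); it assumes the distinct frequencies are given in increasing order.

```python
import numpy as np

def outlierness(freqs, occ):
    """Combined outlierness scores (Eqs. 4-10), one per distinct frequency."""
    f = np.concatenate(([0.0], np.asarray(freqs, dtype=float)))   # prepend the dummy f_0 = 0
    F = np.concatenate(([0.0], np.asarray(occ, dtype=float)))
    H = np.cumsum(F)                                              # cumulated frequency distribution
    n = len(freqs)

    def A_up(i):                                                  # Eq. (4): area above the curve from f_i
        return sum((f[j] - f[j - 1]) * (H[n] - H[j - 1]) for j in range(i + 1, n + 1))

    scores = []
    for i in range(1, n + 1):
        A_up_max = (A_up(0) - A_up(i)) + (f[n] - f[i]) * (H[n] - H[i - 1])        # Eq. (6)
        out_lo = A_up(i) / A_up_max if A_up_max > 0 else 0.0                      # Eq. (5)
        A_dn = sum((f[j] - f[j - 1]) * H[j - 1] for j in range(1, i + 1))         # Eq. (7)
        A_dn_max = (f[i] - 1) * H[i]                                              # Eq. (9)
        out_up = A_dn / A_dn_max if A_dn_max > 0 else 0.0                         # Eq. (8)
        W_up, W_lo = H[i], H[n] - H[i - 1]                                        # weights of Eq. (10)
        denom = W_up * (out_up > 0) + W_lo * (out_lo > 0)
        scores.append((W_up * out_up + W_lo * out_lo) / denom if denom > 0 else 0.0)
    return scores
```

With the distribution of the example above and the hard occurrences \(w_{i}\cdot f_{i}\), the score of f3 = 3 would be obtained as `outlierness([1, 2, 3, 4, 5, 6], [3, 4, 3, 8, 5, 12])[2]`.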

If \(W^{\downarrow }_{i} > W^{\uparrow }_{i}\) we say that the global score is an upper score. Conversely, if \(W^{\uparrow }_{i} \ge W^{\downarrow }_{i}\) we say that the global score is a lower score.

We will use the notation \(out_{{{\mathscr{H}}}}(\cdot )\) whenever it is needed to highlight the frequency distribution \({{\mathscr{H}}}\) used to compute the outlierness. The outlierness \(out_{a}(v, {\mathcal {D}})\) of the value \(v \in {\mathcal {D}}[a]\) with respect to the attribute a in the dataset \({\mathcal {D}}\) is given by \(out_{{{\mathscr{H}}}^{{\mathcal {D}}}_{a}}(f^{{\mathcal {D}}[a]}_{v})\).
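Gluing the previous sketches together, the outlierness of every value of a single attribute can be obtained as follows; again, this is a sketch with hypothetical names that reuses `soft_occurrence` and `outlierness` defined above.

```python
from collections import Counter

def attribute_outlierness(column):
    """out_a(v, D) for every value v appearing in the attribute column D[a] (sketch)."""
    value_freqs = Counter(column)                          # f_v for each value v
    freq_counts = Counter(value_freqs.values())            # w_i for each distinct frequency f_i
    freqs = sorted(freq_counts)
    occ = soft_occurrence(freqs, [freq_counts[f] for f in freqs])
    score_of_freq = dict(zip(freqs, outlierness(freqs, occ)))
    return {v: score_of_freq[fv] for v, fv in value_freqs.items()}
```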

5 Outstanding explanation-property pairs

Exceptional values v for an attribute a are those associated with large values of the outlierness \(out_{a}(v, {\mathcal {D}})\). Thus, we are interested in detecting such exceptional values. However, it must be pointed out that very often a value emerges as exceptional for a certain attribute only when we restrict our attention to a subset of the whole population.

This intuition leads to the definition of the notion of explanation-property pair.

Definition 5

An explanation-property pair (E, p), or simply pair for the sake of conciseness, consists of a condition E, also called explanation, and of an atomic condition p = (pa, pv), also called property. By pa (pv, resp.) we denote the attribute (value, resp.) involved in the atomic condition p.

Given a pair π = (E, p), \({\mathcal {D}}_{\pi }\) denotes the set of objects \({\mathcal {D}}_{E\cup \{p\}}\). The outlierness out(π) of an explanation-property pair π = (E, p) is the outlierness \(out_{p_{a}}(p_{v},{\mathcal {D}}_{E})\) of the value pv with respect to the attribute pa in the dataset \({\mathcal {D}}_{E}\).
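In code, out(π) simply applies the attribute-level score of Section 4 to the sub-population selected by E. The sketch below is a simplified illustration (objects are represented as dictionaries, `attribute_outlierness` is the sketch from the previous section, and returning 0 when \(p_{v}\) does not occur in \({\mathcal {D}}_{E}\) is our own convention).

```python
def pair_outlierness(dataset, E, p):
    """out(pi) of the pair pi = (E, p): outlierness of p_v on attribute p_a within D_E."""
    pa, pv = p
    selected = [o for o in dataset if all(o[a] == v for a, v in E)]   # the sub-population D_E
    return attribute_outlierness([o[pa] for o in selected]).get(pv, 0.0)
```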

To illustrate the definitions, we exploit a running example derived from the Breast Cancer Wisconsin dataset [32]. The example is based on the Clump Thickness attribute, referred to as CT. Specifically, Table 2a refers to the frequency distribution of the values in the domain of CT in the full dataset, before any explanation is taken into account. Conversely, Table 2b shows the frequency distribution of the same attribute when we focus on the subset of benign tumors, together with the scores gained by each value (fourth column). The row highlighted there concerns the pair (E1, p), with explanation E1 = {(Type,2)} and property p = (CT,7).

Table 2 Behaviour of Clump Thickness attribute in the dataset portions selected by different set of conditions (CT = Clump Thickness, UCS = Uniformity of Cell Shape, SECS = Single Epithelial Cell Size, T = Type)

We note that the number of possible explanation-property pairs is usually very large. So we need a mechanism that allows us to single out the subset of these pairs encoding the outstanding exceptionalities in the data and carrying no redundant information; we do that through the notions of unexpected, significant, and outstanding pairs defined in the following.

Given pairs (E, p) and \((E^{\prime },p)\), we say that (E, p) is more specific than \((E^{\prime },p)\), or equivalently that \((E^{\prime },p)\) is more general than (E, p), if \(E^{\prime }\subset E\). In this case, we also say that the two pairs are related.

Given two related pairs (E, p) and \((E^{\prime },p)\), with (E, p) more specific than \((E^{\prime },p)\), we wonder whether the frequency distribution \({{\mathscr{H}}}^{{\mathcal {D}}_{E}}_{p_{a}}\) (also called observed distribution) is statistically different from the frequency distribution \({{\mathscr{H}}}^{{\mathcal {D}}_{E^{\prime }}}_{p_{a}}\) (also called reference distribution). We can answer this question by leveraging a goodness-of-fit test, which establishes whether the observed frequency distribution is unexpected given the reference distribution.

A suitable test for categorical values is the chi-square test. The chi-square test relies on the chi-square statistic X2, which is the sum of the squared differences between the observed frequencies and the reference frequencies, each normalized by the corresponding expected frequency:

$$ X^{2}(E^{\prime},E,p) = \sum\limits_{i=1}^{r} \left[ \frac{\left( h_{i} - h_{i}^{\prime}\left( \frac{m}{n}\right) \right)^{2}}{h_{i}^{\prime}\left( \frac{m}{n}\right)} \right] $$
(11)

where: (i) \(h_{i}^{\prime }\) (hi, resp.) is the number of occurrences in \({\mathcal {D}}_{E^{\prime }}[p_{a}]\) (\({\mathcal {D}}_{E}[p_{a}]\), resp.) associated with the i-th distinct categorical value of \({\mathcal {D}}_{E^{\prime }}[p_{a}]\), (ii) r is the number of these distinct categorical values, and (iii) n and m represent the number of objects in \({\mathcal {D}}_{E^{\prime }}\) and \({\mathcal {D}}_{E}\), respectively.

It is known that the X2 statistic asymptotically approaches the χ2 distribution with r − 1 degrees of freedom; hence, the X2 value can be used to calculate a p-value by comparing it to the proper chi-squared distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis, namely that the observed distribution complies with the reference one, holds. Thus, the value \(1 - F_{\chi ^{2}_{r-1}}(X^{2})\), where \(F_{\chi ^{2}_{r-1}}\) denotes the cumulative distribution function of the χ2 distribution with r − 1 degrees of freedom, provides the desired p-value.

Definition 6 (Unexpected pair)

Given related pairs π = (E, p) and \(\pi ^{\prime }=(E^{\prime },p)\), with (E, p) more specific than \((E^{\prime },p)\), we say that (E, p) is unexpected given \((E^{\prime },p)\), if \(out(\pi ) \ge (1+\alpha ) \cdot out(\pi ^{\prime })\), where \(\alpha = 1 - F_{\chi ^{2}_{r-1}}\big (X^{2}(E^{\prime },E,p)\big )\).

Note that the smaller the p-value \(1 - F_{\chi ^{2}_{r-1}}\big (X^{2}(E^{\prime },E,p)\big )\), the less expected the distribution of categorical values in \({\mathcal {D}}_{E}[p_{a}]\) given the knowledge of the distribution of the categorical values in \({\mathcal {D}}_{E^{\prime }}[p_{a}]\); consequently, α ∈ [0,1] is inversely related to the unexpectedness of the former distribution. Hence, in order for the pair (E, p) to represent non-redundant information, we require that the outlierness of \((E^{\prime },p)\) be improved by a quantity which is inversely related to the unexpectedness of the distribution of categorical values in \({\mathcal {D}}_{E}[p_{a}]\).

Therefore, the concept of unexpected pair serves the purpose of preventing statistical fluctuations of the frequencies from slightly favoring the un-typicality of the property value. Indeed, while outlierness improvements might be observed by augmenting the current explanation with some attribute-value pairs, this circumstance does not necessarily imply that the augmented explanation-property pair is more relevant than the original one, since the improvement could be so slight as to be ascribed to statistical fluctuations of the property distribution within the two explanations. Hence, when an explanation is augmented with some attribute-value pairs, the more the property distribution is preserved with respect to the original explanation, the larger the outlierness improvement required to declare the augmented explanation-property pair as unexpected.
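As a small computational sketch (assuming the two count vectors are aligned on the r distinct values of \({\mathcal {D}}_{E^{\prime }}[p_{a}]\), with zero entries where a value does not occur in \({\mathcal {D}}_{E}\); the function name is ours), the quantity α of Definition 6 can be obtained with `scipy.stats.chi2`:

```python
import numpy as np
from scipy.stats import chi2

def alpha(obs_counts, ref_counts):
    """alpha = 1 - F_{chi2, r-1}(X^2(E', E, p)), as in Eq. (11) and Definition 6."""
    h = np.asarray(obs_counts, dtype=float)       # h_i: occurrences in D_E[p_a]
    h_ref = np.asarray(ref_counts, dtype=float)   # h'_i: occurrences in D_E'[p_a]
    m, n = h.sum(), h_ref.sum()
    expected = h_ref * (m / n)                    # reference frequencies rescaled to |D_E|
    x2 = np.sum((h - expected) ** 2 / expected)   # chi-square statistic X^2
    return 1.0 - chi2.cdf(x2, df=len(h_ref) - 1)  # p-value, used as alpha
```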

Consider the running example and suppose we want to extend the explanation E1 = {(Type,2)} with a further condition to get the more specific E2 = {(Type,2), (Uniformity of Cell Shape, 2)}.

To verify if the pair (E2, p) is unexpected given the pair (E1, p), the frequency distributions \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) and \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{1}}}_{CT}\) have to be compared. The plots reported in Fig. 3 show the above distributions, and the associated α value demonstrates that a very weak correlation exists. Thus, for the pair (E2, p) to be considered unexpected given (E1, p), just a slight increase in the outlierness score is necessary to satisfy Definition 6:

$$ \begin{array}{@{}rcl@{}} out_{CT}(7,{\mathcal{D}}_{E_{2}}) &=& 0.6866 > (1+0.0065)\cdot out_{CT}(7,{\mathcal{D}}_{E_{1}})\\& &= 1.0065 \cdot 0.6530. \end{array} $$
Fig. 3 \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) (yellow) vs \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{1}}}_{CT}\) (blue)

Consider now the property \(p^{\prime }=(CT,2)\). The pair \((E_{2},p^{\prime })\) is also unexpected given \((E_{1},p^{\prime })\); indeed,

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{2}})& =& 0.5629 > (1+0.0065)\cdot out_{CT}(2,{\mathcal{D}}_{E_{1}}) \\&=& 1.0065 \cdot 0.3325. \end{array} $$

However, if we consider the explanation E3 = E2 ∪{(Single Epithelial Cell Size,2)}, although the score of the pair \((E_{3}, p^{\prime })\) is greater than that of the pair \((E_{2}, p^{\prime })\), the former pair is not unexpected given the latter since (Fig. 4):

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{3}}) &=& 0.6031 < (1+0.9871)\cdot out_{CT}(2,{\mathcal{D}}_{E_{2}}) \\&=& 1.9871 \cdot 0.5629. \end{array} $$
Fig. 4 \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{3}}}_{CT}\) (yellow) vs \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) (blue)

It is important to note that, given any pair \((E^{\prime },p)\), it often suffices to augment \(E^{\prime }\) with a random atomic condition r to observe a slight outlierness improvement, that is, to have \(out_{p_{a}}(p_{v},{\mathcal {D}}_{E^{\prime }\cup \{r\}}) \ge out_{p_{a}}(p_{v},{\mathcal {D}}_{E^{\prime }})\). To understand why this is not unusual, assume that r is uncorrelated with the attributes in \(E^{\prime }\); then the distribution \({\mathcal {D}}_{E^{\prime }}[p_{a}]\) will be almost preserved in \({\mathcal {D}}_{E^{\prime }\cup \{r\}}[p_{a}]\). Thus, we compute α since it precisely provides a measure of the unexpectedness of the sub-population distribution: since the value pv is fixed, larger explanations that preserve the original distribution represent redundant information. Moreover, since statistical fluctuations of the frequencies may slightly favor the un-typicality of the property value, the more preserved the above distribution, the larger the outlierness improvement required. In the scenario depicted above, this almost certainly filters out the expanded pair \((E^{\prime } \cup \{r\}, p)\), even if the associated outlierness gets larger.

Definition 7 (Significant pair)

A pair \((\emptyset, p)\) is significant by definition. Moreover, a pair (E, p) with \(E \neq \emptyset\) is said to be significant if there exists a more general significant pair \((E^{\prime },p)\) such that (E, p) is unexpected given \((E^{\prime },p)\).

Thus, a pair is significant if it is unexpected given at least one more general significant pair. The notion of significant pair is implemented by requiring unexpectedness of a pair with respect to a subset of the given explanation which is itself significant. To better understand this notion, consider the explanations \(E^{\prime \prime }\), \(E^{\prime }=E^{\prime \prime }\cup \{(a^{\prime },v^{\prime })\}\), and \(E=E^{\prime }\cup \{(a,v)\}\) and the three related pairs \((E^{\prime \prime },p)\), \((E^{\prime },p)\), and (E, p), with \((E^{\prime \prime },p)\) a significant pair. Often, neither (E, p) is unexpected given \((E^{\prime },p)\) nor is \((E^{\prime },p)\) unexpected given \((E^{\prime \prime },p)\). This can be understood since single atomic conditions may not, in general, carry sufficient additional correlation.

However, a more complex condition, like \(E\setminus E^{\prime \prime }\), could significantly alter the initial distribution of values. Thus, if (E, p) is unexpected given the pair \((E^{\prime \prime },p)\), and \((E^{\prime \prime },p)\) is itself significant, we say that (E, p) is also significant, regardless of the expectedness of (E, p) given \((E^{\prime },p)\), which is more specific than \((E^{\prime \prime },p)\); otherwise, chains such as the one described above would prevent almost all pairs from being identified as significant.

Definition 8 (Strongly significant pair)

A pair (E, p) is said to be strongly significant if, for any more general significant pair \((E^{\prime },p)\), the pair (E, p) is unexpected given \((E^{\prime },p)\).

In the running example, the pair \((E_{3},p^{\prime })\) is significant since it is unexpected given the pair \((\emptyset ,p^{\prime })\), which is significant by definition, indeed:

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{3}}) &=& 0.6032 > (1+0.0288)\cdot out_{CT}(2,{\mathcal{D}}) \\&=& 1.0288 \cdot 0.3416. \end{array} $$

However, it is not strongly significant, since \((E_{3},p^{\prime })\) is not unexpected given the more general significant pair \((E_{2},p^{\prime })\). Note that the pair \((E_{2},p^{\prime })\) is also significant (since it is unexpected given \((\emptyset ,p^{\prime })\)) and, hence, it will be strongly significant provided that it is unexpected given any other more general significant pair.

Strongly significant pairs are those unexpected given any other more general significant pair. While the definition of significant pair serves the purpose of identifying those pairs that represent interesting information, the definition of strongly significant pair serves the purpose of identifying those pairs that carry no redundant information. Indeed, if a significant pair (E, p) is not unexpected given a more general significant pair \((E^{\prime },p)\) then, from what has been stated above, it follows that the same information can be obtained from the more general pair \((E^{\prime },p)\).

Definition 9 (Outstanding)

Maximal strongly significant pairs (E, p) are said to be outstanding pairs. Given a threshold 𝜃, a pair is said to be 𝜃-outstanding if it is outstanding and has outlierness not smaller than 𝜃.

Outstanding pairs can be thought of as the maximal interesting non-redundant explanation-property pairs in the dataset at hand. Outstanding pairs form the output of the technique. Specifically, we are interested in outstanding pairs associated with the highest outlierness values. Thus, given a parameter N and a threshold 𝜃, the technique outputs the top-N 𝜃-outstanding explanation-property pairs π = (E, p) together with the associated sets of objects \({\mathcal {D}}_{\pi }\).

This strategy greatly reduces the sensitivity of our outlierness measure to variations of the explanation. Moreover, by means of the strongly significant and outstanding definitions, we present to the analyst only the explanation-property pairs that are unexpected with respect to any other more general pair, thus avoiding overwhelming them with useless information.

5.1 The FDEOut algorithm

In this section we describe the FDEOut algorithm. In order to detect outstanding pairs, the algorithm performs the exploration of the set enumeration tree associated with the explanations according to a depth-first strategy.

At a given iteration, the algorithm analyses pairs composed of an explanation E and, simultaneously, all the possible properties p concerning attributes not involved in E. The depth-first strategy is adopted for reducing the cost of the search.

In order to evaluate the score of a given pair (E, p), the dataset objects have to be grouped according to E; thanks to the depth-first visit, this grouping can be directly exploited also to group the objects according to E ∪ {e}, for any atomic condition e.

It follows from the adopted strategy that the algorithm works with partial information, since when the algorithm analyzes the pair δ = (E, p) it has not yet explored all the pairs \((E^{\prime }, p)\) with \(E^{\prime } \subset E\).


The algorithm builds the result set composed of outstanding pairs, denoted as \(\mathcal {O}\mathcal {P}\), as follows. When δ = (E, p) is evaluated, consider the set \({\varDelta }^{\subset }\) of pairs \((E^{\prime }, p)\) with \(E^{\prime }\subset E\) currently in \(\mathcal {O}\mathcal {P}\) and the set \({\varDelta }^{\supset }\) of pairs \((E^{\prime }, p)\) with \(E^{\prime }\supset E\) currently in \(\mathcal {O}\mathcal {P}\). Note that each pair in \({\varDelta }^{\subset }\) is more general than δ and that each pair in \({\varDelta }^{\supset }\) is more specific than δ; thus, a pair \(\delta ^{\prime }\in {\varDelta }^{\subset }\) can be exploited to evaluate the significance of δ, and δ can be exploited to evaluate the significance of a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\). In more detail (a compact sketch of these update rules is reported after the list):

(i):

if the score of δ is lower than the threshold 𝜃 or lower than the score of at least one pair in \({\varDelta }^{\subset }\), then δ is dropped since it is not significant;

(ii):

if δ is significant with respect to at least one pair in \({\varDelta }^{\subset }\), then δ is a candidate to be strongly significant and is inserted in \(\mathcal {O}\mathcal {P}\) as marked;

(iii):

if δ is significant with respect to no pair in \({\varDelta }^{\subset }\), then δ is not strongly significant and is inserted in \(\mathcal {O}\mathcal {P}\) as not marked; note that it has to be inserted since, being significant, it can be relevant to disprove the significance of another pair;

(iv):

if the score of δ is larger than that of a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\), then \(\delta ^{\prime }\) is dropped since it is no longer significant;

(v):

if a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\) is marked and is not significant with respect to δ, then the mark of \(\delta ^{\prime }\) is removed, since δ disproves the strong significance of \(\delta ^{\prime }\).

Once the explanation set enumeration tree is explored, only the maximal marked pairs are kept in \(\mathcal {O}\mathcal {P}\).
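The following sketch shows one possible way of turning rules (i)-(v) into an update of the result set when a single pair is evaluated; it is an illustration only, with `unexpected(a, b)` an assumed predicate implementing Definition 6 and explanations represented as frozensets of atomic conditions.

```python
def update_result_set(OP, delta, score_delta, theta, unexpected):
    """Apply rules (i)-(v) when the pair delta = (E, p) is evaluated.

    OP maps each retained pair to a (score, marked) entry.
    """
    E, p = delta
    more_general = [d for d in OP if d[1] == p and d[0] < E]    # Delta^subset
    more_specific = [d for d in OP if d[1] == p and d[0] > E]   # Delta^superset

    # (i) drop delta if its score is below theta or below that of a more general pair
    if score_delta < theta or any(OP[d][0] > score_delta for d in more_general):
        return
    # (ii)/(iii) insert delta; mark it iff it is unexpected given some more general pair
    OP[delta] = (score_delta, any(unexpected(delta, d) for d in more_general))
    for d in more_specific:
        if OP[d][0] < score_delta:          # (iv) drop a more specific pair with a lower score
            del OP[d]
        elif OP[d][1] and not unexpected(d, delta):
            OP[d] = (OP[d][0], False)       # (v) delta disproves its strong significance
```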


6 Experimental results

In this section, we describe experimental results obtained by using the FDEOut algorithm.

First of all, to study the applicability of our method to real datasets, we have tested its scalability by varying the number of objects, the number of attributes, and the depth of the analysis.

Then, to clarify the different nature of the anomalies we detect w.r.t. those returned by classical outlier detection methods, we have performed two families of experiments to test whether classical techniques are able to detect anomalies pointed out by our approach and to compare the detection ability of our method with related ones on known outliers.

Specifically, in Section 6.2 we employ as target the top-10 lower and upper anomalies detected by our approach for different datasets and compute their outlier scores according to the distance-based and density-based detection approaches we choose as competitors.

In Section 6.3 we inject outliers into the datasets by selecting objects within the majority class and replacing, for each attribute, the value each selected object assumes with a different value randomly picked from the attribute domain. Then, we evaluate the change in the outlierness score of the altered objects to highlight the sensitivity of each technique in identifying such anomalies.


Finally, in Section 6.4 we discuss knowledge mined by means of our approach.

6.1 Scalability

Figure 5 shows the scalability analysis of our method on the Mushrooms dataset from the UCI Machine Learning repository [27]. In the experiment reported in Fig. 5a, we varied the number of objects n in {500, 1000, 2000, 5000, 8000} and the number of attributes m in {7, 14, 22}, while the depth parameter was held fixed at δ = 3. The dashed lines represent the linear growth trend estimated via regression. This estimate confirms that the algorithm scales linearly with respect to the dataset size. As for the number of attributes, as expected, for a given number of objects the execution time increases with m due to the growth of the associated search space. On the full dataset the execution time is rather limited, as it amounts to about 2 minutes. In the experiment reported in Fig. 5b, we varied both the number of objects n and the depth parameter δ in {1,2,3,4}, while considering the full feature space. Also in this case, the linear growth is represented by the dashed lines, and similar considerations can be drawn.
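For reference, the dashed trend lines of Fig. 5 can be reproduced with a simple least-squares fit of the measured running times against the number of objects; the sketch below uses placeholder measurements, not the actual timings of the experiment.

```python
import numpy as np

# Placeholder measurements: running times (in seconds) for increasing n.
sizes = np.array([500, 1000, 2000, 5000, 8000])
times = np.array([8.0, 15.5, 31.0, 77.0, 122.0])

# Degree-1 least-squares fit giving the dashed linear trend of the plots.
slope, intercept = np.polyfit(sizes, times, deg=1)
print(f"estimated trend: time ~= {slope:.4f} * n + {intercept:.2f} s")
```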

Fig. 5 Scalability analysis

Table 3 reports the number of outstanding explanation-property pairs returned by the algorithm on the following datasets: Zoo (n = 101 objects and m = 18 attributes), Mushrooms (n = 8,124 objects and m = 22 attributes), Cars (n = 1,728 objects and m = 8 attributes).

Table 3 Outstanding pairs vs depth analysis δ

We consider increasing values of the depth parameter δ ∈ {1,2,3,4,5}. The column #Pairs reports the total number of pairs forming the search space up to the depth level δ. The column #Outstanding reports the number of outstanding pairs and their percentage of the total number of pairs (within brackets). The last column reports the number of outstanding pairs of size δ, that is, the novel pairs introduced by exploring the last level of the current search space, together with their percentage of the total number of pairs (within brackets). From the table, it can be seen that the fraction of novel patterns rapidly decreases with the depth, thus suggesting that meaningful analyses do not require large values for the parameter δ. Indeed, the number of outstanding pairs settles at depth δ = 3 on all the datasets. Moreover, it can be seen that the notion of outstanding pair greatly reduces the number of potential explaining patterns to be presented to the user, since the percentage of these patterns over the whole search space rapidly decreases with the depth. E.g., for Mushrooms the outstanding pairs represent 0.96% of the pairs of size up to δ = 5.

We further note that outstanding pairs are not necessarily associated with large values of outlierness. Thus, it is sensible to determine how many of these outstanding pairs are indeed 𝜃-outstanding, for suitable values of the threshold parameter 𝜃. Figure 6 reports the distribution of the outliernesses associated with the outstanding pairs obtained for δ = 3. Plots on the left concern the lower scores (that we recall are outlierness scores having weights \(W^{\uparrow }_{i} \ge W^{\downarrow }_{i}\) in (10)), while plots on the right concern the upper scores (having weights \(W^{\downarrow }_{i} > W^{\uparrow }_{i}\) in (10)). We can notice that the top lower scores are always likely to reach high values of outlierness, close to 1. The upper scores are in general less pronounced, due to the different nature of these two kinds of anomalies. In any case, it is interesting to notice that only a small fraction of the outstanding pairs is associated with the largest score values and, hence, only a reduced fraction of the outstanding pairs are indeed 𝜃-outstanding pairs. Usually, only 0.1% to 1% of the outstanding pairs are associated with outlierness values comparable to the maximum outlierness score achieved by any pair. It is easy to check that the number of these pairs amounts to some hundreds in general, being about 100 for Zoo and Cars, and about 1,000 for Mushrooms.

Fig. 6 Outstanding pairs outlierness distribution

6.2 Comparison with classic distance- and density-based outlier detection methods

We compare our method with two of the main categories of outlier detection methods: (i) distance-based approaches, which are used to discover global outliers, i.e. objects showing abnormal behaviour when compared with the whole dataset population; (ii) density-based approaches, which are able to single out local outliers, i.e. objects showing abnormal behaviour when compared with a certain subset of the data, namely their neighbourhood.

As the distance-based definition, we use the average KNN score, representing the average distance of an object from its k nearest neighbours [13]. As the density-based definition, we use the Local Outlier Factor (LOF) [18]. Both methods employ the Hamming distance. Moreover, we compare our method with the ROAD algorithm [47], which exploits both densities and distances, namely it establishes two independent rankings: (i) each data object is assigned a frequency score and objects with low scores are considered outliers (Type-1 outliers); (ii) k-modes clustering is performed in order to isolate those objects that are far from big clusters (i.e. clusters containing at least α% of the whole dataset) according to the Hamming distance (Type-2 outliers). The goal of these experiments is to highlight that we are able to detect anomalies of a different nature and to provide evidence that our method is knowledge-centric, since it concentrates on anomalous values, as opposed to classical methods, which are instead object-centric.
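For concreteness, a possible implementation of the two baseline scores over the Hamming distance is sketched below using scikit-learn; the ordinal encoding step is an assumption made to handle categorical values, and ROAD is omitted since it is not available in standard libraries.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

def knn_and_lof_scores(categorical_rows, k):
    """Average-kNN and LOF outlier scores under the Hamming distance.

    `categorical_rows` is a list of lists of categorical values; in both
    cases, larger scores correspond to more anomalous objects.
    """
    X = OrdinalEncoder().fit_transform(categorical_rows)

    # Average distance from the k nearest neighbours; the query point is
    # returned as its own first neighbour, hence k + 1 and the [:, 1:].
    nn = NearestNeighbors(n_neighbors=k + 1, metric="hamming").fit(X)
    distances, _ = nn.kneighbors(X)
    knn_scores = distances[:, 1:].mean(axis=1)

    # LOF: negative_outlier_factor_ is about -1 for inliers and more
    # negative for outliers, so its sign is flipped.
    lof = LocalOutlierFactor(n_neighbors=k, metric="hamming")
    lof.fit(X)
    lof_scores = -lof.negative_outlier_factor_

    return knn_scores, lof_scores
```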

To compare the approaches, we ranked the dataset objects o by assigning to each of them the largest outlierness of a pair π such that \(o\in {\mathcal {D}}_{\pi }\), and determined our top-10 outliers by selecting the objects associated with the largest outliernesses. These objects, which contain values deemed exceptional by our method, were then used to verify how popular object-centric techniques rank them. Hence, we computed their outlier scores according to the KNN, LOF and ROAD definitions.
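The object ranking induced by our pairs can be computed as in the following short sketch, where each pair is assumed to be given together with the indices of the objects it covers.

```python
def top_objects(pairs, n_objects, top=10):
    """Rank objects by the largest outlierness of a pair covering them.

    `pairs` is assumed to be an iterable of (covered_indices, outlierness).
    Returns the indices of the `top` highest-ranked objects.
    """
    best = [0.0] * n_objects
    for covered, outlierness in pairs:
        for i in covered:
            best[i] = max(best[i], outlierness)
    return sorted(range(n_objects), key=lambda i: -best[i])[:top]
```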

All the chosen competitors require an input parameter k, representing the number of nearest neighbours or the number of clusters to be taken into account. Since selecting the right value of k is a challenging task, we computed the KNN, LOF and ROAD outlier scores for all the possible values of k and determined the ranking positions associated with our top-10 outliers. In particular, all the integers from 1 to the number of objects n have been considered for KNN, while 30 log-spaced values between 1 and n have been considered for LOF due to its higher computational cost. For the ROAD algorithm, we stopped at the value of k such that at least one big cluster is obtained, and used the frequency score to rank those objects having the same distance from their nearest big cluster.
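The sweep over k and the collection of the ranking positions of our top-10 objects can be organized as follows; the sketch is generic in the scoring function, so it can be instantiated, e.g., with the kNN or LOF helpers of the previous sketch.

```python
import numpy as np

def rank_positions(scores, target_indices):
    """1-based ranking positions (1 = most anomalous) of the target objects."""
    order = np.argsort(-np.asarray(scores))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks[np.asarray(target_indices)]

def sweep_k(score_fn, rows, target_indices, k_values):
    """Ranking positions of the target objects for every value of k.

    `score_fn(rows, k)` must return one outlier score per object.
    """
    return np.array([rank_positions(score_fn(rows, k), target_indices)
                     for k in k_values])

# Hypothetical usage, mirroring the protocol described above:
# knn_ranks = sweep_k(lambda r, k: knn_and_lof_scores(r, k)[0],
#                     rows, top10, range(1, len(rows)))
# lof_grid = np.unique(np.geomspace(1, len(rows) - 1, 30).astype(int))
# lof_ranks = sweep_k(lambda r, k: knn_and_lof_scores(r, k)[1],
#                     rows, top10, lof_grid)
```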

Figures 7 and 8 report the box-plots, for k varying in [1, n], of the KNN, LOF and ROAD Type-2 outlier rankings associated with our top-10 outliers. Plots on the top concern lower outliers, while plots on the bottom concern upper outliers. From these plots it can be seen that the median ranking associated with our outliers can be far from the top and also that, within the whole ranking distribution, the same outlier can be ranked in very different positions. In general, it seems that lower outliers are likely to be ranked better than upper outliers by our competitors, and this witnesses the peculiar nature of upper outliers. On the Zoo dataset there is no apparent correlation between our outliers and KNN, LOF and ROAD outliers. On the Mushrooms dataset some of our lower outliers are, on average, ranked very high also by the other algorithms. Some of them are almost always top outliers for all methods (see the top 1st, 2nd, 5th, and 7th outliers), thus witnessing that these outliers have both a global and a local nature. However, most of our outliers are not detected by these techniques.

Fig. 7 Comparison with KNN, LOF and ROAD on Zoo

Fig. 8 Comparison with KNN, LOF and ROAD on Mushrooms

Before concluding this comparison, it must be pointed out that the best rankings associated with the selected objects are obtained for very different values of the parameter k. Since the output of the KNN, LOF and ROAD methods is determined for a selected value of k, it is very unlikely that, even in the presence of some agreement between our top outliers and local and global outliers, they are simultaneously ranked in high positions for the same provided value of k.

6.3 Comparison with other techniques

In this section, the proposed technique is compared with other methods.

To this aim, for a given dataset, we selected the objects within its majority class and then generated a family of altered datasets as follows. For each of the above objects and for each attribute, we generated a novel dataset by replacing the value the object assumes on that attribute with a different value randomly picked from the attribute domain. Thus, the total number of datasets in the family is nd, where n is the number of objects within the majority class and d is the number of dataset attributes.
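A minimal sketch of the perturbation procedure follows; the list-of-dicts representation of the dataset and the explicit class attribute are assumptions made for illustration.

```python
import copy
import random

def altered_family(dataset, class_attribute):
    """Yield one altered copy of `dataset` per (majority-class object, attribute).

    In each copy, the value of a single attribute of a single object is
    replaced by a different value drawn at random from the attribute domain.
    """
    classes = [obj[class_attribute] for obj in dataset]
    majority = max(set(classes), key=classes.count)
    attributes = [a for a in dataset[0] if a != class_attribute]
    domains = {a: sorted({obj[a] for obj in dataset}) for a in attributes}

    for i, obj in enumerate(dataset):
        if obj[class_attribute] != majority:
            continue
        for a in attributes:
            candidates = [v for v in domains[a] if v != obj[a]]
            if not candidates:
                continue                      # degenerate domain, nothing to alter
            altered = copy.deepcopy(dataset)
            altered[i][a] = random.choice(candidates)
            yield altered, i, a
```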

Due to the heavy computations required by this kind of experiment, we selected the following datasets from the UCI Machine learning repository [27]: Zoo (n = 101 objects and m = 18 attributes), Breast cancer (n = 286 objects and m = 9 attributes), House votes (n = 232 objects and m = 16 attributes).

On each family, we then ran the FDEOut algorithm and KNN [13], LOF [18], ROAD [47], CBRW [40], WATCH [36], and KLOF [51].

To make scores comparable, we determined the standardized outlier score \(z = (sc - \mu _{sc})/\sigma _{sc}\) of the altered object both before (call this value z0) and after (call this value z1) the alteration, where \(\mu _{sc}\) and \(\sigma _{sc}\) are, respectively, the mean and standard deviation of the outlier score distribution before the alteration. As for FDEOut, we used as outlier score the maximum outlierness associated with an outstanding pair involving the object. Moreover, since no other method is designed to detect upper outliers, in the comparison we considered only our lower outliers.

In order to evaluate the ability of a method to detect the contaminated data, we measured the increase \({\varDelta } z = z_{1} - z_{0}\) of the standardized outlier score associated with altered objects and compared the Δz of each method with that of FDEOut. We note that objects with a large standardized outlier score z0 in at least one of the two compared methods are less interesting for this analysis, since they already show an exceptional value of outlierness and their Δz is unlikely to achieve large values. Clearly this would unfairly favor one of the two methods; hence, objects whose standardized outlier score exceeds the mean by more than one standard deviation, i.e. such that z0 > 1, are not considered in the comparison. We further note that this corresponds to focusing on the normal objects that become anomalous due to the performed alteration.
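The standardization and the filtering step can be summarized as in the following sketch; the score arrays are assumed to come from any of the compared detectors, computed on the dataset before and after the alteration.

```python
import numpy as np

def delta_z(scores_before, scores_after, altered_index):
    """Increase of standardized outlier score of the altered object.

    Standardization uses the mean and standard deviation of the score
    distribution *before* the alteration; objects with z0 > 1 are skipped,
    as they are already anomalous before the perturbation.
    """
    mu = np.mean(scores_before)
    sigma = np.std(scores_before)
    z0 = (scores_before[altered_index] - mu) / sigma
    z1 = (scores_after[altered_index] - mu) / sigma
    return None if z0 > 1 else z1 - z0
```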

Figures 9, 10 and 11 show the comparison between the distribution of the Δz values associated with FDEOut and the same distribution associated with each other method. On the ordinate there are the Δz values sorted in decreasing order, while each value on the abscissa corresponds to a dataset of the family.

Fig. 9 Comparison on the Zoo dataset

Fig. 10 Comparison on the Breast cancer dataset

Fig. 11 Comparison on the House votes dataset

The plots highlight that FDEOut is more sensitive to perturbations of the data, since in all cases the most pronounced variations of standardized score associated with FDEOut amount to about 3 standard deviations, while the other methods rarely exceed 1.5. As for the other methods, their quality varies with the data characteristics and, overall, they show comparable performance on the altered data. This seems to suggest that our method is able to detect even subtle anomalies.

6.4 Knowledge mined

In this section we present some knowledge mined by our method. For ease of interpretation, we report the outstanding pairs mined on the Zoo and Breast cancer datasets. It is worth pointing out that our measure is defined at the value level, thus it does not allow us to state that an object is anomalous in an absolute sense. As our score refers to 〈Explanation, Property〉 pairs, we label as anomalous all objects, in the dataset portion isolated by the explanation, whose value pv for attribute pa receives a reasonably high score in such a dataset projection. Note that these objects turn out to be anomalous w.r.t. that specific pair and not in a general sense.

Information provided by the top lower pairs we get from the Zoo dataset is summarized below:

  • The scorpion is the only invertebrate with a tail.

  • Among vertebrates without fins, the seasnake is the only one that does not breathe.

  • Among non-aquatic animals, the clam is the only one that breathes.

  • The platypus lays eggs although it provides milk.

  • Among predators without feathers, the ladybird is the only airborne one.

  • Among catsized animals, the octopus is the only invertebrate.

  • The stingray is a catsize animal, but it is venomous.

  • Among animals which don’t lay eggs, the seasnake and the scorpion are the only ones that do not breastfeed offspring.

  • The crab is the only invertebrate having four legs.

  • Among vertebrate breathing animals, the pitviper and the frog are the only venomous ones in the dataset.

Table 4 clarifies how such knowledge is mined. We report the outstanding pairs from which we have deduced the information above, together with the objects selected by each pair.

Table 4 Lower Outlier pairs detected on the Zoo dataset

It is interesting to note that the same object can be considered anomalous with respect to different pairs, as in the case of the scorpion and the seasnake. Furthermore, a pair can isolate multiple objects, as for the 8th and 10th pairs.

As for the upper outliers, we find out that the dataset contains the frog twice: one entry is venomous and the other is non-venomous. Since the animal names act as primary keys for the dataset, having the same name twice can be pointed out as anomalous, and our technique is able to highlight such a situation. Other curiosities about the animal world are spotted by our upper outliers, including the following:

  • Among breathing not catsized predators, the most frequent are non-flying birds.

  • Most no-feathers no-toothed animals have six legs.

  • Among no-flying breathing catsized animals, the most frequent are mammals.

  • Most gastropods have no legs.

  • Most no-toothed animals have two legs.

To infer this type of knowledge we consider the best-ranked outstanding pairs identifying upper outliers. Table 5 outlines the 〈Explanation, Property〉 pairs we take into account. Note that in this case several objects share the value reported as anomalous, but what makes them special is the fact that the frequency of such values is atypical within the distribution.

Table 5 Upper Outlier pairs detected on the Zoo dataset

As for the Breast Cancer Wisconsin dataset, the information provided by the top lower outlier pairs is summarized below:

  • When uniformity of cell size is 1, the value of the attribute mitoses is almost always equal to 1, except for three samples having mitoses equal to 5, 7, and 8.

  • When uniformity of cell shape is 1, the tumor is always benign except for two samples.

  • Among samples having bare nuclei equal to 1 and bland chromatin equal to 2, only one is malignant.

  • Among samples having both bare nuclei and marginal adhesion equal to 1, only three are malignant.

  • Among samples having bare nuclei and normal nucleoli equal to 1, only three are malignant.

When mining upper outliers, the technique identified that there are two duplicated identifiers. Other upper outliers are discussed below:

  • Among samples having uniformity of cell shape equal to 10, most have normal nucleoli equal to 10.

  • Among malignant tumors having marginal adhesion equal to 1, most have bare nuclei equal to 10.

  • Among samples with uniformity of cell size equal to 10, most have normal nucleoli equal to 10.

  • Among samples with clump thickness equal to 8, most have bare nuclei equal to 10.

  • Among samples with uniformity cell size equal to 10, most have marginal adhesion equal to 10.

The explanation-property pairs we take into account to discuss the knowledge are reported in Table 6 for lower outliers and in Table 7 for the upper ones.

Table 6 Lower Outlier pairs detected on the Breast Cancer Wisconsin dataset (UCZ = Uniformity Cell Size; UCS = Uniformity Cell Shape; BN = Bare Nuclei; BC = Bland Chromatin; MA = Marginal Adhesion; NN = Normal Nucleoli)
Table 7 Upper Outlier pairs detected on the Breast Cancer Wisconsin dataset (UCZ = Uniformity Cell Size; UCS = Uniformity Cell Shape; BN = Bare Nuclei; BC = Bland Chromatin; MA = Marginal Adhesion; NN = Normal Nucleoli; CT = Clump Thickness)

7 Conclusions

In this work we have provided a contribution to single out and explain anomalous values in categorical domains. We perceive the frequencies of attribute values as samples of a distribution whose density has to be estimated. This leads to the notion of soft frequency occurrence, which we exploit to build our definition of outlier: an attribute value is suspected to be an outlier if its frequency occurrence is exceptionally typical or un-typical within the distribution of the frequency occurrences of the other attribute values. As a second contribution, our technique is able to provide interpretable explanations for the abnormal values discovered. Thus, the outliers we provide can be seen as a product of the knowledge mined, making the approach knowledge-centric rather than object-centric.

The performance of our approach has been evaluated on some popular benchmark categorical datasets and a comparative analysis has been presented.

We notice that our method could possibly be combined with techniques that deal with outlier detection in numerical domains, in the spirit of what was done in [28, 30, 39], and we leave this as a subject of future research.