1 Introduction

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. The detection of outliers can identify system faults and frauds before they escalate with potentially catastrophic consequences; it turns out that, in some applications, the rare events can be more interesting than the regularly occurring ones [2, 26].

As outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data, it is important to justify why detected outliers are generated by some other mechanisms [9, 22, 23]. However, the border between data normality and abnormality is often not clear cut; consequently, while some outlier detection methods assign to each object in the input data set a label as either “normal” or “outlier”, in this paper we describe a method able to single out anomalous data bunches by working on attributes and associated values.

We deal with categorical data, where it is generally more difficult to devise criteria able to discriminate between normal and abnormal data [46, 48]. Moreover, the notion of outlierness in the field of numerical data has long been analysed [20] and many approaches have already been designed to define the exceptional nature of a property, but most of them cannot be easily generalized to deal with categorical or mixed categorical/numerical datasets.

Some algorithms exist that build an anomaly detection model specially devised for categorical variables and transform any numerical variable into a categorical space through a preliminary discretization phase [32, 49]; however, the main drawback of such a strategy is that the result of the analysis strongly depends on the outcome of the discretization process.

As an alternative, some mixed criteria techniques have been developed which manage numerical and categorical data separately and then merge the two by providing a method which encompasses the analysis of an element in both spaces [28, 30, 39].

In this paper, we specifically focus on problems which are typical of categorical data. We do that by taking the perspective of perceiving an attribute value as anomalous if its frequency occurrence is exceptionally typical or un-typical within the distribution of the frequency occurrences of the other attribute values.

However, within the categorical scenario the process of comparing frequencies poses several challenges. Indeed, if we take the point of view that the data at hand is the result of a sampling procedure in which data values are associated with some pre-defined occurrence probabilities, then the fact that a certain categorical value is observed exactly f times is a matter of chance rather than being a hard property of that value.

This has led us to the definition of the concept of soft frequency occurrence which, intuitively, consists in the estimate of the density associated with frequency occurrences. We obtain this measure by specializing the classical Kernel Density Estimation technique to the domain of frequency values.

As a second contribution, we leverage the cumulated frequency distribution of the above density estimate to decide if the frequency of a certain value is rare when compared to the frequencies associated with the other values. In particular, we are able to identify two kinds of anomalies, namely lower outliers and upper outliers. A lower outlier is a value whose frequency is small while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are large. An upper outlier is a value whose frequency is large while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are small.

Note that both these definitions look for unexpected behaviors by establishing a comparison between the frequency of the outlier value and that of the other values in the attribute domain. While the notion of lower outlier shares similarities with the classical concept of anomaly, the notion of upper outlier is conceptually different and should not be confused with the concept of mode. Indeed, for a value, a high frequency is not enough to be an upper outlier, as the rest of the values must appear with low frequencies.

Consider the example reported in Table 1: as for the attribute A1, value c is a lower outlier, since it occurs only once and the other values occur many times; as for the attribute A2, value e is an upper outlier, since it occurs many times and the other values occur once or at most twice; as for the attribute A3, although the value e has the same frequency both in A2 and A3 and it is the most frequent in A3, it is not an (upper) outlier since the other values in A3 have comparable frequencies.

Table 1 An example highlighting the notion of lower (A1) and upper (A2) outliers

Thus, we are able to single out, by one unified outlierness measure, both exceptionally infrequent and exceptionally frequent values; this peculiarity clearly differentiates our proposal from almost all the existing measures of outlierness.

Although values can show exceptional behavior with respect to the whole population, it must be pointed out that very often a value emerges as exceptional only when we restrict our attention to a subset of the whole population [7]. In particular, our method, differently from many others, is capable of returning to the user not only data bunches which are anomalous with respect to the entire data population, but also those records which are normal in general but anomalous only when contrasted with a sub-population. Therefore, our technique has been designed to output the so-called explanation-property pairs (E, p), where E, called explanation, denotes a condition used to determine the target subpopulation and p, called property, represents an attribute pa and a value pv such that pv is exceptionally frequent or infrequent within the subpopulation selected by the explanation E.

The output of the algorithm thus consists of explanation-property pairs, which allow us to provide an interpretable explanation for the abnormal values discovered.

However, it must be noticed that, when dealing with explanations, there exists a risk that the user is overwhelmed by a huge amount of redundant explaining patterns. Thus, as a further contribution, we define a mechanism that allows us to single out the explanations encoding the outstanding exceptionalities in the data, carrying no redundant information. Loosely speaking, for a condition to encode a significant explanation we require the frequency distribution of the associated sub-population to be unexpected given the knowledge of the frequency distribution of any of its super-populations, where unexpectedness is measured by means of the chi-squared goodness-of-fit test. Moreover, for an attribute and a value to encode a significant property, we require that the outlierness measured within the sub-population associated with a given significant explanation improves the one measured within any of its super-populations associated with significant explanations. The outlierness improvement must be greater than a factor which is inversely related to the unexpectedness of the sub-population frequency distribution. Maximal significant explanation-property pairs are said to be outstanding. The output of the algorithm precisely corresponds to the so-called outstanding explanation-property pairs.

Finally, our technique is knowledge-centric as the search space we visit is formed by explanation-property pairs and the outliers we provide can be seen as a product of the knowledge mined. This is clearly different from traditional outlier detection approaches which are object-centric.

The rest of the work is organised as follows. Section 2 discusses work related to the present one. Section 3 introduces the frequency occurrence function. Section 4 describes the outlierness function for ranking categorical values. Section 5 introduces the concept of outstanding explanation-property pair and describes the goal of our mining method. Section 6 describes experimental results.

2 Related work

Categorical data has received relatively little attention as compared to quantitative data because detecting anomalies in categorical data is a challenging problem [48]. Generally, traditional approaches do not handle categorical data in a satisfactory manner, due to the fact that, in most cases, there is no concept of ordering for the set of values a categorical variable can assume; so the development of specific techniques is needed. We start by noting that there is little literature about detecting anomalous properties, and/or related outlier objects, equipped with explanations, that is, about the identification of both the features and the subpopulations which characterize anomalies. Moreover, to the best of our knowledge, no technique is able to natively detect upper outliers.

There exist several approaches to detect outliers in the certain setting, namely statistical-based [14, 25], distance-based [4, 12, 13, 29, 34], density-based [18, 45], isolation-based [38], subspace-based [1, 6, 7], knowledge-based [5], neural network-based [31, 42], and many others [3, 19].

Some anomaly detection techniques depend on the identification of a representative pattern suggested by the majority of observations, so that objects that are far from it, according to a suitable distance measure, can be perceived as anomalies. However, designing such a measure in the presence of categorical data is challenging [17].

Different strategies have been proposed to address the above problem. In [21] some methods are presented to map categorical data on numerical data together with a framework for their analysis. However, the effectiveness of these techniques is strongly related to the choice of the mapping function. A different perspective is that of applying to categorical data some traditional approaches designed for the quantitative domain by choosing an appropriate distance measure. This is done in [8, 13], where anomalies are defined as the N observations whose average distance to the k nearest neighbors is the greatest; instead, [35] considers as anomalous those observations with fewer than p observations within a certain distance d.

Many other methods exploit the Hamming distance to identify anomalies among categorical data [15, 16, 37] together with a pruning strategy to cope with the quadratic complexity of evaluating distances.

Recently a new idea of distance suitable for the categorical domain has been introduced to detect and characterize outliers in a semi-supervised fashion [33]. The key intuition is that the distance between two values of a categorical attribute can be determined by the way in which they co-occur with the values of other attributes in the data set: if two values are similarly distributed with respect to a certain set of attributes, their distance must be low. A model defining the distances between categorical values is built on the training set and is used to evaluate the outlier score associated with each test instance t as the sum of the distances between t and a subset of objects known to be normal.

The family of density-based approaches includes those methods that identify observations having outlying behavior in local areas, that is, observations that are inconsistent with their neighborhood but not necessarily with the pattern suggested by the majority of all other observations.

Local anomaly detection methods for categorical data include the k-Local Anomalies Factor (k-LOF) [51], ROAD [47], and WATCH [36] methods.

The k-LOF is a local anomaly detection method for both categorical and quantitative data [51]. It extends the Local Anomalies Factor (LOF) [18] to the categorical domain. The k-LOF identifies an observation as a local outlier if its relationships with its neighbors are weaker than the relationships between its neighbors and its neighbors’ neighbors. It does that by building a similarity graph and by using the concept of k-walk, i.e., the paths of length k on the similarity graph joining two observations, to provide an outlierness score.

The ROAD algorithm [47] exploits both distances and densities. The Hamming distance is used to group objects into clusters and highlight observations located in sparse regions; then a density measure is calculated for each object on the basis of the frequencies of its values, in order to identify objects whose values are almost infrequent in the dataset.

Both measures present some limitations: as for the Hamming distance, outliers with few exceptional attributes are not captured; as for the density, frequencies are not compared with expected values, and attributes associated with almost all distinct values, such as primary keys, can affect the results.

The WATCH method [36] has been recently designed to find outliers in high-dimensional categorical datasets using feature grouping. First, it groups correlated features, then it looks for outliers in each feature group by calculating a weighting factor for each categorical variable that takes into account the correlation between this variable and the others in the same group.

A completely different perspective is taken by methods that exploit information-theoretic measures. The idea behind these approaches relies on the direct relationship between the existence of anomalies and the amount of noise in the dataset. This led some authors [32] to formulate the outlier detection task in terms of an optimization problem, i.e., finding a subset of k objects such that the expected entropy of the resultant dataset after the removal of this subset is minimized. This strategy has to be intended from a global point of view, as the outlierness measure involves all the attributes simultaneously and is neither able to detect outliers in sub-populations nor to identify outliers characterised by one (or a few) outlying attributes.

To overcome the latter issue, in [24] an outlier factor is designed on the basis of the ratio between the probability of co-occurrence of two sets of attributes and the product of the probabilities of occurrence of the two sets taken separately. Here, the authors are interested in properties consisting of at least two attributes and do not address sub-populations.

Learning value interactions has proven to be effective when handling categorical data, and some recent contributions in the literature exploit such a strategy to detect anomalies.

CBRW [40] estimates the outlierness of each feature value, which can be used either to detect outliers directly or to drive feature selection for subsequent outlier detection. The score is computed by comparing the frequency of each value with that of the most frequent value (the mode). However, this is just a measure of deviation: explanations are not provided and, by definition, upper outliers cannot be detected.

As a further drawback, noisy values may significantly influence the performance of CBRW, thus the same authors propose HOUR [41], a new outlier detection framework for data with noisy features. A noise-resilient outlier scoring function is defined to rank objects based on their outlierness in a given feature subset, and an outlier ranking evaluation function is proposed to evaluate the quality of the ranking w.r.t. the feature subset. Feature selection and ranking evaluation are iteratively performed until the best feature subset is obtained.

Nevertheless, this approach is not particularly suitable for high-dimensional data, thus two further improvements have been proposed, namely POP [44] and OUVAS [50].

The idea of investigating feature subsets has also been taken into account in [43], but in a completely different way. Here, the main intuition is that, given a random subsample of the main dataset, instances with rare combinations of values on any attribute subset also have a higher probability of having zero appearances in subsamples of any size.

Subspace management seems to meet our concept of explanation, but the semantics is completely different. The problem of outlier explanation [10] we deal with in this paper consists in finding features that can justify the outlierness of an object, in the sense that its anomalous state emerges only as a consequence of the selection we have made.

Some previous works [9, 10] follow this path and propose techniques for the categorical and numerical domains, respectively, that, given as input a single object known to be an outlier, provide features justifying its anomaly and subpopulations where its exceptionality is evident. A generalization is proposed in [11], where a small set of outliers is provided as input.

3 Frequency occurrence

In this section we give some preliminary definitions and introduce the notation employed throughout the paper.

A dataset \({\mathcal {D}}\) on a set of categorical attributes \({\mathcal {A}}\) is a set of objects o assuming values on the attributes in \({\mathcal {A}}\). By o[a] we denote the value of o on the attribute \(a\in {\mathcal {A}}\). \({\mathcal {D}}[a]\) denotes the multiset \(\{ o[a] \mid o\in {\mathcal {D}} \}\).

A condition \({\mathcal {C}}\) is a set of pairs (a, v) where each a is an attribute and each \(v \in {\mathcal {D}}[a]\). A singleton condition is said to be atomic. By \({\mathcal {D}}_{{\mathcal {C}}}\) we denote the new dataset \(\{o \in {\mathcal {D}} \mid o[a] = v, \forall (a, v) \in {\mathcal {C}}\}\).

A condition \({\mathcal {C}}\) is said to be valid if \({\mathcal {D}}_{{\mathcal {C}}} \neq \emptyset \). It follows from this definition that in a valid condition \({\mathcal {C}}\), for all \((a,v), (a,u) \in \mathcal {C}\), it holds that u = v. Thus, if \(\mathcal {C}\) is valid, the number \(|\mathcal {C}|\) of atomic conditions in \(\mathcal {C}\) is equal to the number of attributes involved in \(\mathcal {C}\). In the following, if not otherwise stated, we will take into account only valid conditions.

Definition 1 (Frequency distribution)

A frequency distribution \({{\mathscr{H}}}\) is a multiset of the form \({{\mathscr{H}}} = \{ f_{1}^{(1)}, \ldots ,f_{1}^{(w_{1})},\) \( \ldots , f_{n}^{(1)},\ldots ,f_{n}^{(w_{n})} \}\) where each \(f_{i}^{(j)}\in \mathbb {N}\) is a frequency, \(f_{i}^{(j)}=f_{i}^{(k)} = f_{i}\) for each \(1 \le j, k \le w_{i}\), and \(w_{i}\) denotes the number of occurrences of the distinct frequency \(f_{i}\). By \(N({{{\mathscr{H}}}})\) (or simply N whenever \({{\mathscr{H}}}\) is clear from the context) we denote \(w_{1}f_{1} + \ldots + w_{n}f_{n}\).

For the sake of simplicity, we will refer to a frequency distribution as a set \({{\mathscr{H}}} = \{ f_{1}, f_{2}, \ldots , f_{n} \}\) and to the number of occurrences wi of fi as w(fi). To ease the writing of expressions, we also assume that the dummy frequency f0 = 0 with w0 = 0 is always implicitly part of any frequency distribution.

Given a multiset V, the frequency \({f^{V}_{v}}\) of the value \(v \in V\) is the number of occurrences of v in V.

The frequency distribution of the dataset \({{\mathcal {D}}}\) on the attribute a is the multiset \({{\mathscr{H}}}^{{\mathcal {D}}}_{a} = \{ f^{{\mathcal {D}}[a]}_{v} \mid v \in {\mathcal {D}}[a] \}\). Note that \(N({{\mathscr{H}}}^{{\mathcal {D}}}_{a}) = |{\mathcal {D}}|\).

Theorem 1

Let \({{\mathscr{H}}} = \{f_{1},\ldots ,f_{n}\}\) be a frequency distribution. Then, \(n \le \sqrt {2\,N({{\mathscr{H}}})}\), that is, \(n = O\big(\sqrt {N({{\mathscr{H}}})}\big)\).

Proof

Since \(N({{\mathscr{H}}}) = w_{1}\cdot f_{1} + w_{2}\cdot f_{2} + {\dots } + w_{n} \cdot f_{n}\), n is maximized when (i) f1 = 1, (ii) ∀i, wi = 1, and (iii) \(\forall i > 1\), \(f_{i} = f_{i-1} + 1\). Thus, the maximum n is such that \(1+2+\dots +n = N({{\mathscr{H}}})\) and, since \(1+2+\dots +n = \frac {n(n+1)}{2}\), it follows that \(n\cdot (n+1) = 2\cdot N({{\mathscr{H}}})\) and, then, that \(n \le \sqrt{2\, N({{\mathscr{H}}})} = O\big (\sqrt {N({{\mathscr{H}}})}\big )\). □

From the above theorem, it immediately follows that the number of distinct frequencies in \({{\mathscr{H}}}^{D}\) is \(O\big(\sqrt {|{\mathcal {D}}|}\big)\).

Now we define the notion of frequency occurrence as a tool for quantifying how frequent a certain frequency is.

Definition 2 (Hard frequency occurrence)

Given a frequency distribution \({{\mathscr{H}}}\), the frequency occurrence \({\mathcal {F}}_{{{\mathscr{H}}}}(f_{i})\) of fi, also denoted by \({\mathcal {F}}(f_{i})\) whenever \({{\mathscr{H}}}\) is clear from the context, is the product wifi.

The above definition allows us to associate with each distinct value in \({\mathcal {D}}[a]\) a score that is related not only to its frequency in the dataset but also to how many other values have its same frequency.
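For illustration purposes only, the hard frequency occurrence can be computed in a few lines of Python; the snippet below is a minimal sketch (the function name is ours, not part of the proposal) mapping a single attribute column to the scores \(w_{i}\cdot f_{i}\).

```python
from collections import Counter

def hard_occurrences(column):
    """Hard frequency occurrence (Definition 2) for one categorical attribute column."""
    value_freqs = Counter(column)                 # frequency f_v of each distinct value v
    freq_counts = Counter(value_freqs.values())   # w_i: how many values occur exactly f_i times
    return {f: w * f for f, w in freq_counts.items()}

# Example: hard_occurrences(['a', 'a', 'b', 'b', 'c']) == {2: 4, 1: 1}
```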

A major drawback of the previous definition is that close frequency values do not interact with each other and, as a consequence, small variations of the frequency distribution may cause significant variations in the frequency occurrence values. E.g., consider the case in which the frequencies \(f_{i} = 49\), \(w_{i} = 1\) and \(f_{i+1} = 51\), \(w_{i+1} = 1\) are replaced with \(f^{\prime }_{i}=50, w^{\prime }_{i} = 2\). While in the former case \({\mathcal {F}}(f_{i})=49\) and \({\mathcal {F}}(f_{i+1})=51\), in the latter case we have that \({\mathcal {F}}(f^{\prime }_{i})=100\), which is about twice the frequency occurrence associated with \(f_{i}\) and \(f_{i+1}\). Intuitively, we do not want such a small variation in the frequency distribution to impact so heavily on the outcome of the measure. Indeed, if we take the point of view that the data at hand is the result of a sampling procedure in which data values are associated with some pre-defined occurrence probabilities, then the fact that a certain categorical value is observed exactly f times is a matter of chance, rather than being a hard property of that value.

Thus, we refine the previous definition of frequency occurrence in order to cope with the scenario depicted above. Specifically, to overcome the mentioned drawback, we need to force close frequency values to influence each other in order to jointly contribute to the frequency occurrence value. With this aim, we took inspiration from Kernel Density Estimation (KDE) methods to design an ad-hoc density estimation procedure.

First of all, we point out that we are working in a discrete domain composed of frequency values, a peculiarity that differentiates it from the standard framework of KDE. We start by illustrating the proposed density estimation procedure.

A (discrete) kernel function \(K_{f_{i}}\) with parameter fi is a probability mass function having the property that \(\sup _{f\ge 0}\) \(K_{f_{i}}(f)=K_{f_{i}}(f_{i})\).

Given an interval I = [fl, fu] of frequencies, a frequency fi, and a kernel function K, the volume of \(K_{f_{i}}\) in I, denoted as \(V_{I}(K_{f_{i}})\), is given by \({\sum }_{f=f_{l}}^{f_{u}} K_{f_{i}}(f)\). The following expression

$$ {\mathcal{F}}(f) = \underset{\varphi \in I(f)}{\sum} \left\{ \sum\limits_{i=1}^{n} w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi) \right\}. $$

where I(f) represents an interval of frequencies centred in f, provides the density estimate of the frequency occurrence of the frequency f.

Since \(K_{f_{i}}(\cdot )\) is a probability mass function, the frequency \(f_{i}\) provides a contribution to the frequency occurrence of f corresponding to the portion of the volume of \(K_{f_{i}}\) which is contained in I(f), that is \(V_{I(f)}(K_{f_{i}})\). Hence, if the interval I(f) contains the entire domain of \(K_{f_{i}}\), then \(f_{i}\) provides its maximal contribution \(w_{i}\cdot f_{i}\). Frequencies \(f_{i}\) whose kernel domain does not intersect I(f) do not contribute to the frequency occurrence of f at all.

The above definition requires the width of the interval I(f) centred in f to be properly calibrated. To eliminate the dependence of the formulation on an arbitrary interval, we resort to the following alternative formulation, in which the frequencies φ are not constrained to belong to the interval I(f). However, since the generic kernel \(K_{f_{i}}(\cdot )\) could be centred arbitrarily far from the frequency of interest f, its contribution now has to be properly weighted

$$ {\mathcal{F}}(f) = \underset{\varphi \ge 0}{\sum} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi)\cdot \frac{Pr[X_{f_{i}} = f]}{Pr[X_{f_{i}} = f_{i}]} \right] \right\}. $$

Here, \(X_{f_{i}}\) denotes the random variable distributed according to \(K_{f_{i}}\) and, hence, having \(f_{i}\) as the value that is most likely to be observed. The ratio \(\frac {Pr[X_{f_{i}} = f]}{Pr[X_{f_{i}} = f_{i}]} \le 1\) represents a weight factor for the kernel \(K_{f_{i}}(\cdot )\) which is maximal, in that it evaluates to 1, for f = fi. Hence, the closer the kernel \(K_{f_{i}}(\cdot )\) to the frequency of interest f, the larger its contribution to the frequency occurrence of f. Since the above probabilities can be directly obtained from the associated kernel, the expression can be rewritten as follows

$$ {\mathcal{F}}(f) = \underset{\varphi \ge 0}{\sum} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot K_{f_{i}}(\varphi)\cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \right] \right\}. $$
(1)

Equation (1) can be rewritten as

$${\mathcal{F}}(f) = \sum\limits_{i=1}^{n} \left[w_{i}\cdot f_{i} \cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \cdot \underset{\varphi \ge 0}{\sum} K_{f_{i}}(\varphi) \right].$$

Since \(K_{f_{i}}(\cdot )\) is a probability mass function, the summation over the domain of all its values is equal to 1 and, thus, the above expression can be finally simplified in

$$ {\mathcal{F}}(f) = \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} \right]. $$
(2)

Since \({\mathcal {F}}\) represents a notion of density function associated with frequency occurrences, it is preferable that its volume, evaluated over the frequencies \({{\mathscr{H}}} = \{f_{1},\ldots ,f_{n}\}\), amounts to \(N({{\mathscr{H}}})\). This leads to the following final form of the frequency occurrence function.

Definition 3 (Soft occurrence function)

Given a frequency distribution \({{\mathscr{H}}}\), the frequency occurrence \({\mathcal {F}}_{{{\mathscr{H}}}}(f_{i})\) of fi, also denoted by \({\mathcal {F}}(f_{i})\) whenever \({{\mathscr{H}}}\) is clear from the context, is given by the following expression

$$ {\mathcal{F}}(f) = \frac{N({\mathcal{H}})}{N_{{\mathcal{F}}}({\mathcal{H}})}\cdot \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \widehat{K}_{f_{i}}(f) \right], $$
(3)

where

$$\widehat{K}_{f_{i}}(f) = \frac{K_{f_{i}}(f)}{K_{f_{i}}(f_{i})} ~~\text{ and }~~ N_{{\mathcal{F}}}({\mathcal{H}}) = \sum\limits_{j=1}^{n} \left\{ \sum\limits_{i=1}^{n} \left[ w_{i}\cdot f_{i}\cdot \widehat{K}_{f_{i}}(f_{j}) \right] \right\}. $$

Figure 1 reports the frequency occurrence values according to the hard and soft definition. Note that the frequencies 48 and 50 are close to each other, and the soft frequency occurrence definition leads, for each of them, to a value that is more like the one we would get if we observed a frequency between 48 and 50 twice in the distribution. The values of the soft occurrences are not normalized to improve the intelligibility of the example.

Fig. 1 Comparison between hard and soft occurrence

As for the kernel selection, interestingly, we can take advantage of the peculiarity of the frequency domain to base our estimation on a very natural kernel definition. Indeed, as kernel \(K_{f_{i}}(\cdot )\) we will exploit the binomial distribution binopdf(f;n, p) with parameter n, denoting the number of independent trials, equal to \(N({{\mathscr{H}}})\), and parameter p, denoting the success probability, equal to \(p={f_{i}}/{N({{\mathscr{H}}})}\). We argue that this kind of kernel is particularly natural for our setting and, moreover, note that its use relieves us of the problem of selecting a suitable kernel bandwidth, a problem that affects almost all kernel density estimation procedures. Indeed, the fact that we observe a certain number \(f_{i}\) of occurrences for a given value v can be assimilated to the outcome of a sequence of Bernoulli trials each having success probability \(p_{v}\), which is modeled by a binomial random variable. Note that, for large sample sizes \(N({{\mathscr{H}}})\), the frequency \(f_{i}\) tends to the expected value \(N({{\mathscr{H}}})\cdot p_{v}\) of the above random variable. Hence, \({f_{i}}/{N({{\mathscr{H}}})}\) closely approximates \(p_{v}\) and the binomial function with parameters \(n=N({{\mathscr{H}}})\) and \(p={f_{i}}/{N({{\mathscr{H}}})}\) represents the distribution of that variable. This distribution clearly provides the probability of observing any other frequency \(f_{i}^{\prime }\neq f_{i}\) for the same value v.
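The following Python sketch shows one possible way of evaluating (3) with the binomial kernel; it is an illustration only (the helper name and the use of scipy are our own choices), not the authors' implementation.

```python
import numpy as np
from scipy.stats import binom

def soft_occurrence(freqs, weights):
    """Soft frequency occurrence (Eq. 3) with a binomial kernel.

    freqs   : distinct frequencies f_1, ..., f_n
    weights : their numbers of occurrences w_1, ..., w_n
    """
    freqs = np.asarray(freqs, dtype=int)
    weights = np.asarray(weights, dtype=float)
    N = int(np.sum(weights * freqs))                       # N(H)
    # K_hat[i, j] = K_{f_i}(f_j) / K_{f_i}(f_i), with K_{f_i} = binomial(N, f_i / N)
    K_hat = np.array([binom.pmf(freqs, N, fi / N) / binom.pmf(fi, N, fi / N)
                      for fi in freqs])
    raw = K_hat.T @ (weights * freqs)                      # inner summation of Eq. (3)
    return N * raw / raw.sum()                             # rescale so the values sum to N(H)
```

For instance, `soft_occurrence([49, 51], [1, 1])` smooths the hard occurrences 49 and 51 toward a common intermediate value, in line with the intuition discussed above.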

3.1 Computational cost

The computational complexity of the proposed measure can be estimated by taking into account the cost of computing the set of frequency occurrences for each attribute in \({\mathcal {A}}\) and the cost of evaluating \(\widehat {K}_{f_{i}}(\cdot )\) for each frequency in \({{\mathscr{H}}}^{D}\). The theorems discussed in the following provide some results about the time complexity.

Theorem 2

Let \({\mathcal {D}}\) be a dataset and let \({{\mathscr{H}}}^{D}=\{f_{1},\ldots ,f_{n}\}\). Then, the cost of computing the set of frequency occurrences \(\{ {\mathcal {F}}_{{{\mathscr{H}}}^{{\mathcal {D}}}}(f_{1}), \ldots , {\mathcal {F}}_{{{\mathscr{H}}}^{{\mathcal {D}}}}(f_{n}) \}\) is \(O(|{\mathcal {D}}|\cdot C_{K})\), where CK represents the cost of evaluating \(\widehat {K}_{f_{i}}(\cdot )\).

Proof

Consider (3). The cost of evaluating this Equation for a given f involves the computation of a summation of n terms, with n equal to the number of distinct frequencies. Due to Theorem 1, \(n = O\left (\sqrt {|{\mathcal {D}}|}\right )\). Since we have to evaluate (3) for any distinct f in the dataset, the Equation has to be computed n times. Then, since each evaluation costs \(C_{K}\), the overall cost is \(O\left (\sqrt {|{\mathcal {D}}|}\cdot \sqrt {|{\mathcal {D}}|} \cdot C_{K}\right ) = O(|{\mathcal {D}}|\cdot C_{K})\). □

Now we take into account the cost of computing the term \(\widehat {K}_{f_{i}}(f)\) when \(K_{f_{i}}(\cdot )\) is the binomial kernel.

Theorem 3

Given a dataset \({\mathcal {D}}\) and two frequencies fi and fj in \({{\mathscr{H}}}^{{\mathcal {D}}}\), the cost of computing \(\widehat {K}_{f_{i}}(f_{j})\) is O(1) with a pre-processing \(O(|{\mathcal {D}}|)\).

Proof

First of all, we prove that \(\widehat {K}_{f_{i}}(f_{j})\) can be obtained by evaluating the following expression:

$$ \exp\left[ \left( \sum\limits_{k=1}^{f_{i}}{\ln{k}} \right) + \left( \sum\limits_{k=1}^{N-f_{i}}{\ln{k}} \right) - \left( \sum\limits_{k=1}^{f_{j}}{\ln{k}} \right) - \left( \sum\limits_{k=1}^{N-f_{j}}{\ln{k}} \right) + (f_{i}-f_{j})\ln\left( \frac{N}{f_{i}}-1\right) \right]. $$

To get this equality, consider the logarithm of \(\widehat {K}_{f_{i}}(f_{j})\). Since \(K_{f_{i}}(f_{j})\) is a binomial probability function, by exploiting the properties of logarithms we obtain:

$$ \begin{array}{@{}rcl@{}} \ln{\widehat{K}_{f_{i}}(f_{j})} &=& \ln{\frac{K_{f_{i}}(f_{j})}{K_{f_{i}}(f_{i})}} = \ln{\frac{\binom{N}{f_{j}}\left( \frac{f_{i}}{N}\right)^{f_{j}}\left( 1 - \frac{f_{i}}{N} \right)^{N-f_{j}}}{\binom{N}{f_{i}}\left( \frac{f_{i}}{N}\right)^{f_{i}}\left( 1 - \frac{f_{i}}{N} \right)^{N-f_{i}}}} \\ &=& \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{j}-f_{i})\ln\left( \frac{f_{i}}{N}\right) + (f_{i}-f_{j})\ln\left( 1-\frac{f_{i}}{N}\right) \\ &=& \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{i}-f_{j})\ln\left( \frac{1-f_{i}/N}{f_{i}/N}\right) = \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}} + (f_{i}-f_{j})\ln\left( \frac{N}{f_{i}}-1\right). \end{array} $$

Since

$$ \begin{array}{@{}rcl@{}} \ln{\frac{\binom{N}{f_{j}}}{\binom{N}{f_{i}}}}&=& \ln\left( \frac{N!}{f_{j}!(N-f_{j})!}\cdot\frac{f_{i}!(N-f_{i})!}{N!}\right)\\&&= \sum\limits_{k=1}^{f_{i}}{\ln{k}} + \sum\limits_{k=1}^{N-f_{i}}{\ln{k}} - \sum\limits_{k=1}^{f_{j}}{\ln{k}} - \sum\limits_{k=1}^{N-f_{j}}{\ln{k}}, \end{array} $$

the statement follows. We note that all the above terms can be pre-computed. Indeed, during the pre-processing step we can build an array of N elements such that the generic ith entry stores \({\sum }_{k=1}^{i} \ln k\). □
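A possible realisation of this pre-processing (a sketch that assumes \(0 < f_{i} < N\), so that the logarithm in the last term is defined; the helper names are ours) stores the cumulative log-factorials once and reuses them for every evaluation of \(\widehat{K}_{f_{i}}(f_{j})\):

```python
import numpy as np

def make_khat(N):
    """O(N) pre-processing: logfact[k] = ln 1 + ... + ln k, then each evaluation costs O(1)."""
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, N + 1)))))

    def khat(fi, fj):
        # ln[ C(N, f_j) / C(N, f_i) ] + (f_i - f_j) * ln(N / f_i - 1), then exponentiate
        log_ratio = (logfact[fi] + logfact[N - fi]
                     - logfact[fj] - logfact[N - fj]
                     + (fi - fj) * np.log(N / fi - 1.0))
        return np.exp(log_ratio)

    return khat
```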

4 Categorical outlierness

In this section we introduce the concept of outlierness and discuss the measure we have designed to discover outlier properties in categorical datasets.

Definition 4 (Cumulated frequency distribution)

Given a frequency distribution \({{\mathscr{H}}}=\{f_{1},\ldots ,f_{n}\}\), the associated cumulated frequency distribution H is

$$ {H}(f) = \underset{f_{j}\le f}{\sum} {\mathcal{F}}_{{\mathcal{H}}}(f_{j}). $$

In the following, we also refer to the value H(fi) as Hi.

The idea behind the measure we will discuss in the following is that an object in a categorical dataset can be considered an outlier with respect to an attribute if the frequency of the value assumed by this object on such an attribute is rare if compared to the frequencies associated with the other values assumed on the same attribute by the other objects of the dataset.

We are interested in two relevant kinds of anomalies referring to two different scenarios.

Lower Outlier:

An object o is anomalous since for a given attribute a the value that o assumes in a is rare (its frequency is low) while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are large.

Upper Outlier:

An object o is anomalous since for a given attribute a the value that o assumes in a is usual (its frequency is high) while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are small.

In order to discover outliers, we exploit the cumulated frequency distribution associated with the dataset. With this aim, we use the area above and below the curve of the cumulated frequency distribution to quantify the degree of anomaly associated with a certain frequency.

Intuitively, the larger the area above the portion of the curve ranging from a certain frequency \(f_{i}\) to the maximum frequency \(f_{\max }\), the more \(f_{i}\) differs from the frequencies that are greater than \(f_{i}\). At the same time, the larger the area below the portion of the curve ranging from the minimum frequency \(f_{\min }\) to a certain frequency \(f_{i}\), the more \(f_{i}\) differs from the frequencies that are smaller than \(f_{i}\).

The contribution given by the area above the cumulated frequency distribution curve to the outlierness of a certain frequency \(f_{i}\) can be evaluated using the following expression

$$ A^{\uparrow}(f_{i}) = \underset{j> i}{\sum}(f_{j} - f_{j-1})\cdot \left( {H}_{n} - {H}_{j-1} \right). $$
(4)

The lower outlier score \(out^{\downarrow}(f_{i})\) is given by the normalised area

$$ out^{\downarrow}(f_{i}) = A^{\uparrow}(f_{i}) / A_{\max}^{\uparrow}(f_{i}), $$
(5)

obtained by dividing the area \(A^{\uparrow}(f_{i})\) by

$$ A_{\max}^{\uparrow}(f_{i}) = (A^{\uparrow}(f_{0}) - A^{\uparrow}(f_{i})) + (f_{n} - f_{i})\cdot ({H}_{n}-{H}_{i-1}) $$
(6)

corresponding to the area above the cumulated frequency histogram up to the frequency \(f_{i}\), represented by the term \((A^{\uparrow}(f_{0}) - A^{\uparrow}(f_{i}))\), plus an upper bound to the area above the cumulated frequency histogram starting from \(f_{i}\), represented by the term \((f_{n} - f_{i})\cdot ({H}_{n}-{H}_{i-1})\). Notice that the former term is minimised for \(f_{i} \rightarrow 1\), while the latter term tends to \(A^{\uparrow}(f_{i})\) for \(f_{n} \rightarrow \infty \) and, hence, in this case \(out^{\downarrow}(f_{i})\) tends to its maximum value 1.

The second scenario we are interested in aims to highlight the upper outliers, namely those objects that, for a given attribute, assume a value whose frequency is high, while typically, the dataset objects assume distinct values, that is the frequencies of the other values are low.

In order to discover such a kind of anomaly we take into account the area below the cumulated frequency distribution, starting from the lowest frequency up to the target frequency fi. The bigger this area, the more this frequency can be highlighted as anomalous. The contribution of the frequency fi is computed as

$$ A^{\downarrow}(f_{i}) = \underset{j\le i}{\sum}{(f_{j}- f_{j-1}) \cdot {H}_{j-1}}. $$
(7)

The upper outlier score \(out^{\uparrow}(f_{i})\) is given by the normalised area

$$ out^{\uparrow}(f_{i}) = A^{\downarrow}(f_{i}) / A_{\max}^{\downarrow}(f_{i}), $$
(8)

obtained by dividing the area \(A^{\downarrow}(f_{i})\) by the term

$$ A_{\max}^{\downarrow}(f_{i}) = (f_{i}-1) \cdot {H}_{i} $$
(9)

representing an upper bound to the area below the cumulated frequency histogram up to the frequency \(f_{i}\). Notice that \(A^{\downarrow}(f_{i})\) tends to \(A_{\max }^{\downarrow }(f_{i})\) for \(f_{i-1} \rightarrow 1\) and \({H}_{i} \rightarrow {H}_{i-1}\), or equivalently \({\mathcal {F}}(f_{i}) \ll {\mathcal {F}}(f_{i-1})\), so in this case \(out^{\uparrow}(f_{i})\) tends to its maximum value 1.

The outlierness, or abnormality score, associated with the frequency fi is a combined measure of the above two normalised areas:

$$ {\textit{out}}(f_{i}) = \frac{W^{\uparrow}_{i} \cdot out^{\uparrow}(f_{i}) + W^{\downarrow}_{i} \cdot out^{\downarrow}(f_{i})}{ W^{\uparrow}_{i} \cdot {\varDelta}(out^{\uparrow}(f_{i})) + W^{\downarrow}_{i} \cdot {\varDelta}(out^{\downarrow}(f_{i}))} $$
(10)

Specifically, the (global) outlierness score of \(f_{i}\) is the weighted mean of the upper and lower outliernesses associated with \(f_{i}\), with weights \(W^{\uparrow }_{i} = H_{i}\) and \(W^{\downarrow }_{i} = (H_{n}- H_{i-1})\), respectively. Note that \(H_{i}\) represents the fraction of the frequencies having value less than or equal to \(f_{i}\), while \((H_{n}-H_{i-1})\) represents the fraction of the frequencies having value greater than or equal to \(f_{i}\). Thus, when both the contributions \(out^{\uparrow}(f_{i})\) and \(out^{\downarrow}(f_{i})\) are greater than 0, the two weights provide their relative importance in terms of the fraction of the data population used to compute each of them. As for the function Δ(x), it evaluates to 0 if x = 0, and to 1 otherwise. Thus, it serves the purpose of ignoring the weight associated with the lower or upper outlierness if it evaluates to 0 and, otherwise, of taking it into account in its entirety.

In order to clarify the areas employed for the outlierness computation, let us refer to the following example. Consider a single attribute dataset whose associated set of distinct frequencies is {f1 = 1, f2 = 2, f3 = 3, f4 = 4, f5 = 5, f6 = 6} and the set of weights is {w1 = 3, w2 = 2, w3 = 1, w4 = 2, w5 = 1, w6 = 2}. Assume that we want to compute the outlierness associated with the frequency f3 = 3. Figure 2a and b represent the areas exploited to compute such outlierness.

Fig. 2 Outlierness computation example

On the left, the area \(A^{\downarrow }(f_{3})\) together with the area used for normalisation, \(A_{\max }^{\downarrow }(f_{3})\), is reported, while, on the right, the area \(A^{\uparrow }(f_{3})\) together with the area used for normalisation, \(A_{\max }^{\uparrow }(f_{3})\), is reported.
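The area computations of (4)-(10) translate almost literally into code. The sketch below is purely illustrative (names are ours, and either hard or soft occurrences can be passed); it assumes the distinct frequencies are given in increasing order.

```python
import numpy as np

def outlierness(freqs, occ):
    """Combined outlierness scores (Eqs. 4-10), one per distinct frequency."""
    f = np.concatenate(([0.0], np.asarray(freqs, dtype=float)))   # prepend the dummy f_0 = 0
    F = np.concatenate(([0.0], np.asarray(occ, dtype=float)))
    H = np.cumsum(F)                                              # cumulated frequency distribution
    n = len(freqs)

    def A_up(i):                                                  # Eq. (4): area above the curve from f_i
        return sum((f[j] - f[j - 1]) * (H[n] - H[j - 1]) for j in range(i + 1, n + 1))

    scores = []
    for i in range(1, n + 1):
        A_up_max = (A_up(0) - A_up(i)) + (f[n] - f[i]) * (H[n] - H[i - 1])        # Eq. (6)
        out_lo = A_up(i) / A_up_max if A_up_max > 0 else 0.0                      # Eq. (5)
        A_dn = sum((f[j] - f[j - 1]) * H[j - 1] for j in range(1, i + 1))         # Eq. (7)
        A_dn_max = (f[i] - 1) * H[i]                                              # Eq. (9)
        out_up = A_dn / A_dn_max if A_dn_max > 0 else 0.0                         # Eq. (8)
        W_up, W_lo = H[i], H[n] - H[i - 1]                                        # weights of Eq. (10)
        denom = W_up * (out_up > 0) + W_lo * (out_lo > 0)
        scores.append((W_up * out_up + W_lo * out_lo) / denom if denom > 0 else 0.0)
    return scores
```

With the distribution of the example above and the hard occurrences \(w_{i}\cdot f_{i}\), the score of f3 = 3 would be obtained as `outlierness([1, 2, 3, 4, 5, 6], [3, 4, 3, 8, 5, 12])[2]`.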

If \(W^{\downarrow }_{i} > W^{\uparrow }_{i}\) we say that the global score is an upper score. Conversely, if \(W^{\uparrow }_{i} \ge W^{\downarrow }_{i}\) we say that the global score is a lower score.

We will use the notation \(out_{{{\mathscr{H}}}}(\cdot )\) whenever it is needed to highlight the frequency distribution \({{\mathscr{H}}}\) used to compute the outlierness. The outlierness \(out_{a}(v, {\mathcal {D}})\) of the value \(v \in {\mathcal {D}}[a]\) with respect to the attribute a in the dataset \({\mathcal {D}}\) is given by \(out_{{{\mathscr{H}}}^{{\mathcal {D}}}_{a}}(f^{{\mathcal {D}}[a]}_{v})\).
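Gluing the previous sketches together, the outlierness of every value of a single attribute can be obtained as follows; again, this is a sketch with hypothetical names that reuses `soft_occurrence` and `outlierness` defined above.

```python
from collections import Counter

def attribute_outlierness(column):
    """out_a(v, D) for every value v appearing in the attribute column D[a] (sketch)."""
    value_freqs = Counter(column)                          # f_v for each value v
    freq_counts = Counter(value_freqs.values())            # w_i for each distinct frequency f_i
    freqs = sorted(freq_counts)
    occ = soft_occurrence(freqs, [freq_counts[f] for f in freqs])
    score_of_freq = dict(zip(freqs, outlierness(freqs, occ)))
    return {v: score_of_freq[fv] for v, fv in value_freqs.items()}
```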

5 Outstanding explanation-property pairs

Exceptional values v for an attribute a are those associated with large values of the outlierness \(out_{a}(v, {\mathcal {D}})\). Thus, we are interested in detecting such exceptional values. However, it must be pointed out that very often a value emerges as exceptional for a certain attribute only when we restrict our attention to a subset of the whole population.

This intuition leads to the definition of the notion of explanation-property pair.

Definition 5

An explanation-property pair (E, p), or simply pair for the sake of conciseness, consists of a condition E, also called explanation, and of an atomic condition p = (pa, pv), also called property. By pa (pv, resp.) we denote the attribute (value, resp.) involved in the atomic condition p.

Given a pair π = (E, p), \({\mathcal {D}}_{\pi }\) denotes the set of objects \({\mathcal {D}}_{E\cup \{p\}}\). The outlierness out(π) of an explanation-property pair π = (E, p) is the outlierness \(out_{p_{a}}(p_{v},{\mathcal {D}}_{E})\) of the value pv with respect to the attribute pa in the dataset \({\mathcal {D}}_{E}\).
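In code, out(π) simply applies the attribute-level score of Section 4 to the sub-population selected by E. The sketch below is a simplified illustration (objects are represented as dictionaries, `attribute_outlierness` is the sketch from the previous section, and returning 0 when \(p_{v}\) does not occur in \({\mathcal {D}}_{E}\) is our own convention).

```python
def pair_outlierness(dataset, E, p):
    """out(pi) of the pair pi = (E, p): outlierness of p_v on attribute p_a within D_E."""
    pa, pv = p
    selected = [o for o in dataset if all(o[a] == v for a, v in E)]   # the sub-population D_E
    return attribute_outlierness([o[pa] for o in selected]).get(pv, 0.0)
```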

To illustrate the definitions, we exploit a running example derived from the Breast Cancer Wisconsin dataset [32]. The example is based on the Clump Thickness attribute, referred to as CT. Specifically, Table 2a refers to the frequency distribution of the values in the domain of CT in the full dataset, before any explanation is taken into account. Conversely, Table 2b shows the frequency distribution of the same attribute when we focus on the subset of benign tumors, together with the scores gained by each value (fourth column). The row highlighted there concerns the pair (E1, p), with explanation E1 = {(Type,2)} and property p = (CT,7).

Table 2 Behaviour of Clump Thickness attribute in the dataset portions selected by different set of conditions (CT = Clump Thickness, UCS = Uniformity of Cell Shape, SECS = Single Epithelial Cell Size, T = Type)

We note that the number of possible explanation-property pairs is usually very large. So we need a mechanism that allows us to single out the subset of these pairs encoding the outstanding exceptionalities in the data and carrying no redundant information; we do that through the notions of unexpected, significant, and outstanding pairs defined in the following.

Given pairs (E, p) and \((E^{\prime },p)\), we say that (E, p) is more specific than \((E^{\prime },p)\), or equivalently that \((E^{\prime },p)\) is more general than (E, p), if \(E^{\prime }\subset E\). In this case, we also say that the two pairs are related.

Given two related pairs (E, p) and \((E^{\prime },p)\), with (E, p) more specific than \((E^{\prime },p)\), we wonder whether the frequency distribution \({{\mathscr{H}}}^{{\mathcal {D}}_{E}}_{p_{a}}\) (also called observed distribution) is statistically different from the frequency distribution \({{\mathscr{H}}}^{{\mathcal {D}}_{E^{\prime }}}_{p_{a}}\) (also called reference distribution). We can answer this question by leveraging a goodness-of-fit test, which establishes whether the observed frequency distribution is unexpected given the reference distribution.

A suitable test for categorical values is the chi-square test. The chi-square test relies on the chi-square statistic X2, which is the sum of the squared differences between the observed frequencies and the reference frequencies, each normalized by the corresponding expected frequency:

$$ X^{2}(E^{\prime},E,p) = \sum\limits_{i=1}^{r} \left[ \frac{\left( h_{i} - h_{i}^{\prime}\left( \frac{m}{n}\right) \right)^{2}}{h_{i}^{\prime}\left( \frac{m}{n}\right)} \right] $$
(11)

where: (i) \(h_{i}^{\prime }\) (hi, resp.) is the number of occurrences in \({\mathcal {D}}_{E^{\prime }}[p_{a}]\) (\({\mathcal {D}}_{E}[p_{a}]\), resp.) associated with the i-th distinct categorical value of \({\mathcal {D}}_{E^{\prime }}[p_{a}]\), (ii) r is the number of these distinct categorical values, and (iii) n and m represent the number of objects in \({\mathcal {D}}_{E^{\prime }}\) and \({\mathcal {D}}_{E}\), respectively.

It is known that the X2 statistic asymptotically approaches the χ2 distribution with r − 1 degrees of freedom; hence, the X2 value can be used to calculate a p-value by comparing it to the proper chi-squared distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis, namely that the observed distribution complies with the reference one, holds. Thus, the value \(1 - F_{\chi ^{2}_{r-1}}(X^{2})\), where \(F_{\chi ^{2}_{r-1}}\) denotes the cumulative distribution function of the χ2 distribution with r − 1 degrees of freedom, provides the desired p-value.

Definition 6 (Unexpected pair)

Given related pairs π = (E, p) and \(\pi ^{\prime }=(E^{\prime },p)\), with (E, p) more specific than \((E^{\prime },p)\), we say that (E, p) is unexpected given \((E^{\prime },p)\), if \(out(\pi ) \ge (1+\alpha ) \cdot out(\pi ^{\prime })\), where \(\alpha = 1 - F_{\chi ^{2}_{r-1}}\big (X^{2}(E^{\prime },E,p)\big )\).

Note that the smaller the p-value \(1 - F_{\chi ^{2}_{r-1}}\big (X^{2}(E^{\prime },E,p)\big )\), the less expected the distribution of categorical values in \({\mathcal {D}}_{E}[p_{a}]\) given the knowledge of the distribution of the categorical values in \({\mathcal {D}}_{E^{\prime }}[p_{a}]\); consequently, α ∈ [0,1] is inversely related to the unexpectedness of the former distribution. Hence, in order for the pair (E, p) to represent non-redundant information, we require that the outlierness of \((E^{\prime },p)\) be improved by a quantity which is inversely related to the unexpectedness of the distribution of categorical values in \({\mathcal {D}}_{E}[p_{a}]\).

Therefore, the concept of unexpected pair serves the purpose of preventing statistical fluctuations of the frequencies from slightly favoring the un-typicality of the property value. Indeed, while outlierness improvements might be observed by augmenting the current explanation with some attribute-value pairs, this circumstance does not necessarily imply that the augmented explanation-property pair is more relevant than the original one, since the improvement could be so slight as to be ascribed to statistical fluctuations of the property distribution within the two explanations. Hence, when an explanation is augmented with some attribute-value pairs, the more the property distribution is preserved with respect to the original explanation, the larger the outlierness improvement required to declare the augmented explanation-property pair as unexpected.
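As a small computational sketch (assuming the two count vectors are aligned on the r distinct values of \({\mathcal {D}}_{E^{\prime }}[p_{a}]\), with zero entries where a value does not occur in \({\mathcal {D}}_{E}\); the function name is ours), the quantity α of Definition 6 can be obtained with `scipy.stats.chi2`:

```python
import numpy as np
from scipy.stats import chi2

def alpha(obs_counts, ref_counts):
    """alpha = 1 - F_{chi2, r-1}(X^2(E', E, p)), as in Eq. (11) and Definition 6."""
    h = np.asarray(obs_counts, dtype=float)       # h_i: occurrences in D_E[p_a]
    h_ref = np.asarray(ref_counts, dtype=float)   # h'_i: occurrences in D_E'[p_a]
    m, n = h.sum(), h_ref.sum()
    expected = h_ref * (m / n)                    # reference frequencies rescaled to |D_E|
    x2 = np.sum((h - expected) ** 2 / expected)   # chi-square statistic X^2
    return 1.0 - chi2.cdf(x2, df=len(h_ref) - 1)  # p-value, used as alpha
```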

Consider the running example and suppose we want to extend the explanation E1 = {(Type,2)} with a further condition to get the more specific E2 = {(Type,2), (Uniformity of Cell Shape, 2)}.

To verify if the pair (E2, p) is unexpected given the pair (E1, p), the frequency distributions \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) and \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{1}}}_{CT}\) have to be compared. The plots reported in Fig. 3 show the above distributions, and the associated α value demonstrates that a very weak correlation exists. Thus, for the pair (E2, p) to be considered unexpected given (E1, p), just a slight increase in the outlierness score is necessary to satisfy Definition 6:

$$ \begin{array}{@{}rcl@{}} out_{CT}(7,{\mathcal{D}}_{E_{2}}) &=& 0.6866 > (1+0.0065)\cdot out_{CT}(7,{\mathcal{D}}_{E_{1}})\\& &= 1.0065 \cdot 0.6530. \end{array} $$
Fig. 3 \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) (yellow) vs \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{1}}}_{CT}\) (blue)

Consider now the property \(p^{\prime }=(CT,2)\). The pair \((E_{2},p^{\prime })\) is also unexpected given \((E_{1},p^{\prime })\); indeed,

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{2}})& =& 0.5629 > (1+0.0065)\cdot out_{CT}(2,{\mathcal{D}}_{E_{1}}) \\&=& 1.0065 \cdot 0.3325. \end{array} $$

However, if we consider the explanation E3 = E2 ∪{(Single Epithelial Cell Size,2)}, although the score of the pair \((E_{3}, p^{\prime })\) is greater than that of the pair \((E_{2}, p^{\prime })\), the former pair is not unexpected given the latter since (Fig. 4):

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{3}}) &=& 0.6031 < (1+0.9871)\cdot out_{CT}(2,{\mathcal{D}}_{E_{2}}) \\&=& 1.9871 \cdot 0.5629. \end{array} $$
Fig. 4 \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{3}}}_{CT}\) (yellow) vs \({{\mathscr{H}}}^{{\mathcal {D}}_{E_{2}}}_{CT}\) (blue)

It is important to note that, given any pair \((E^{\prime },p)\), it often suffices to augment \(E^{\prime }\) with a random atomic condition r to observe a slight outlierness improvement, that is, to have \(out_{p_{a}}(p_{v},{\mathcal {D}}_{E^{\prime }\cup \{r\}}) \ge out_{p_{a}}(p_{v},{\mathcal {D}}_{E^{\prime }})\). To understand why this is not unusual, assume that r is uncorrelated with the attributes in \(E^{\prime }\); then the distribution \({\mathcal {D}}_{E^{\prime }}[p_{a}]\) will be almost preserved in \({\mathcal {D}}_{E^{\prime }\cup \{r\}}[p_{a}]\). Thus, we compute α since it precisely provides a measure of the unexpectedness of the sub-population distribution: since the value pv is fixed, larger explanations that preserve the original distribution represent redundant information. Moreover, since statistical fluctuations of the frequencies may slightly favor the un-typicality of the property value, the more preserved the above distribution, the larger the outlierness improvement required. In the scenario depicted above, this almost certainly filters out the expanded pair \((E^{\prime } \cup \{r\}, p)\), even if the associated outlierness gets larger.

Definition 7 (Significant pair)

A pair \((\emptyset, p)\) is significant by definition. Moreover, a pair (E, p) with \(E \neq \emptyset\) is said to be significant if there exists a more general significant pair \((E^{\prime },p)\) such that (E, p) is unexpected given \((E^{\prime },p)\).

Thus, a pair is significant if it is unexpected given at least one more general significant pair. The notion of significant pair is implemented by requiring unexpectedness of a pair with respect to a subset of the given explanation which is itself significant. To better understand this notion, consider the explanations \(E^{\prime \prime }\), \(E^{\prime }=E^{\prime \prime }\cup \{(a^{\prime },v^{\prime })\}\), and \(E=E^{\prime }\cup \{(a,v)\}\) and the three related pairs \((E^{\prime \prime },p)\), \((E^{\prime },p)\), and (E, p), with \((E^{\prime \prime },p)\) a significant pair. Often, neither (E, p) is unexpected given \((E^{\prime },p)\) nor is \((E^{\prime },p)\) unexpected given \((E^{\prime \prime },p)\). This can be understood since single atomic conditions may not, in general, carry sufficient additional correlation.

However, a more complex condition, like \(E\setminus E^{\prime \prime }\), could significantly alter the initial distribution of values. Thus, if (E, p) is unexpected given the pair \((E^{\prime \prime },p)\), and \((E^{\prime \prime },p)\) is itself significant, we say that (E, p) is also significant, regardless of the expectedness of (E, p) given \((E^{\prime },p)\), which is more specific than \((E^{\prime \prime },p)\); otherwise, chains such as the one described above would prevent almost all pairs from being identified as significant.

Definition 8 (Strongly significant pair)

A pair (E, p) is said to be strongly significant if, for any more general significant pair \((E^{\prime },p)\), the pair (E, p) is unexpected given \((E^{\prime },p)\).

In the running example, the pair \((E_{3},p^{\prime })\) is significant since it is unexpected given the pair \((\emptyset ,p^{\prime })\), which is significant by definition, indeed:

$$ \begin{array}{@{}rcl@{}} out_{CT}(2,{\mathcal{D}}_{E_{3}}) &=& 0.6032 > (1+0.0288)\cdot out_{CT}(2,{\mathcal{D}}) \\&=& 1.0288 \cdot 0.3416. \end{array} $$

However, it is not strongly significant, since \((E_{3},p^{\prime })\) is not unexpected given the more general significant pair \((E_{2},p^{\prime })\). Note that the pair \((E_{2},p^{\prime })\) is also significant (since it is unexpected given \((\emptyset ,p^{\prime })\)) and, hence, it will be strongly significant provided that it is unexpected given any other more general significant pair.

Strongly significant pairs are those unexpected given any other more general significant pair. While the definition of significant pair serves the purpose of identifying those pairs that represent interesting information, the definition of strongly significant pair serves the purpose of identifying those pairs that carry no redundant information. Indeed, if a significant pair (E, p) is not unexpected given a more general significant pair \((E^{\prime },p)\) then, from what has been stated above, it follows that the same information can be obtained from the more general pair \((E^{\prime },p)\).

Definition 9 (Outstanding)

Maximal strongly significant pairs (E, p) are said to be outstanding pairs. Given a threshold 𝜃, a pair is said to be 𝜃-outstanding if it is outstanding and has outlierness not smaller than 𝜃.

Outstanding pairs can be thought of as the maximal interesting non-redundant explanation-property pairs in the dataset at hand. Outstanding pairs form the output of the technique. Specifically, we are interested in outstanding pairs associated with the highest outlierness values. Thus, given a parameter N and a threshold 𝜃, the technique outputs the top-N 𝜃-outstanding explanation-property pairs π = (E, p) together with the associated sets of objects \({\mathcal {D}}_{\pi }\).

This strategy greatly reduces the sensitivity of our outlierness measure to variations of the explanation. Moreover, by means of the strongly significant and outstanding definitions, we present to the analyst only the explanation-property pairs that are unexpected with respect to any other more general pair, thus avoiding overwhelming them with useless information.

5.1 The FDEOut algorithm

In this section we describe the FDEOut algorithm. In order to detect outstanding pairs, the algorithm performs the exploration of the set enumeration tree associated with the explanations according to a depth-first strategy.

At a given iteration, the algorithm analyses pairs composed of an explanation E and, simultaneously, all the possible properties p concerning attributes not involved in E. The depth-first strategy is adopted for reducing the cost of the search.

In order to evaluate the score of a given pair (E, p), the dataset objects have to be grouped according to E; thanks to the depth-first visit, this grouping can be directly exploited also to group the objects according to E ∪ {e}, for any atomic condition e.

It follows from the adopted strategy that the algorithm works with partial information, since when the algorithm analyzes the pair δ = (E, p) it has not yet explored all the pairs \((E^{\prime }, p)\) with \(E^{\prime } \subset E\).


The algorithm builds the result set composed of outstanding pairs, denoted as \(\mathcal {O}\mathcal {P}\), as follows. When δ = (E, p) is evaluated, consider the set \({\varDelta }^{\subset }\) of pairs \((E^{\prime }, p)\) with \(E^{\prime }\subset E\) currently in \(\mathcal {O}\mathcal {P}\) and the set \({\varDelta }^{\supset }\) of pairs \((E^{\prime }, p)\) with \(E^{\prime }\supset E\) currently in \(\mathcal {O}\mathcal {P}\). Note that each pair in \({\varDelta }^{\subset }\) is more general than δ and that each pair in \({\varDelta }^{\supset }\) is more specific than δ; thus, a pair \(\delta ^{\prime }\in {\varDelta }^{\subset }\) can be exploited to evaluate the significance of δ, and δ can be exploited to evaluate the significance of a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\). In more detail (a compact sketch of these update rules is reported after the list):

(i):

if the score of δ is lower than the threshold 𝜃 or lower than the score of at least one pair in \({\varDelta }^{\subset }\), then δ is dropped since it is not significant;

(ii):

if δ is significant with respect to at least one pair in \({\varDelta }^{\subset }\), then δ is a candidate to be strongly significant and is inserted in \(\mathcal {O}\mathcal {P}\) as marked;

(iii):

if δ is significant with respect to no pair in \({\varDelta }^{\subset }\), then δ is not strongly significant and is inserted in \(\mathcal {O}\mathcal {P}\) as not marked; note that it has to be inserted since, being significant, it can be relevant to disprove the significance of another pair;

(iv):

if the score of δ is larger than that of a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\), then \(\delta ^{\prime }\) is dropped since it is no longer significant;

(v):

if a pair \(\delta ^{\prime }\in {\varDelta }^{\supset }\) is marked and is not significant with respect to δ, then the mark of \(\delta ^{\prime }\) is removed, since δ disproves the strong significance of \(\delta ^{\prime }\).

Once the explanation set enumeration tree is explored, only the maximal marked pairs are kept in \(\mathcal {O}\mathcal {P}\).
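The following sketch shows one possible way of turning rules (i)-(v) into an update of the result set when a single pair is evaluated; it is an illustration only, with `unexpected(a, b)` an assumed predicate implementing Definition 6 and explanations represented as frozensets of atomic conditions.

```python
def update_result_set(OP, delta, score_delta, theta, unexpected):
    """Apply rules (i)-(v) when the pair delta = (E, p) is evaluated.

    OP maps each retained pair to a (score, marked) entry.
    """
    E, p = delta
    more_general = [d for d in OP if d[1] == p and d[0] < E]    # Delta^subset
    more_specific = [d for d in OP if d[1] == p and d[0] > E]   # Delta^superset

    # (i) drop delta if its score is below theta or below that of a more general pair
    if score_delta < theta or any(OP[d][0] > score_delta for d in more_general):
        return
    # (ii)/(iii) insert delta; mark it iff it is unexpected given some more general pair
    OP[delta] = (score_delta, any(unexpected(delta, d) for d in more_general))
    for d in more_specific:
        if OP[d][0] < score_delta:          # (iv) drop a more specific pair with a lower score
            del OP[d]
        elif OP[d][1] and not unexpected(d, delta):
            OP[d] = (OP[d][0], False)       # (v) delta disproves its strong significance
```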


6 Experimental results

In this section, we describe experimental results obtained by using the FDEOut algorithm.

First of all, to study the applicability of our method to real datasets, we have tested its scalability by varying the number of objects, the number of attributes, and the depth of the analysis.

Then, to clarify the different nature of the anomalies we detect w.r.t. those returned by classical outlier detection methods, we have performed two families of experiments to test whether classical techniques are able to detect anomalies pointed out by our approach and to compare the detection ability of our method with related ones on known outliers.

Specifically, in Section 6.2 we employ as target the top-10 lower and upper anomalies detected by our approach for different datasets and compute their outlier scores according to the distance-based and density-based detection approaches we choose as competitors.

In Section 6.3 we inject outliers into the datasets by selecting objects within the majority class and replacing, for each attribute, the value each selected object assumes with a different value randomly picked from the attribute domain. Then, we evaluate the change in the outlierness score of the altered objects to highlight the sensitivity of each technique in identifying such anomalies.


Finally, in Section 6.4 we discuss knowledge mined by means of our approach.

6.1 Scalability

Figure 5 shows the scalability analysis of our method on the Mushrooms dataset from the UCI Machine Learning repository [27]. In the experiment reported in Fig. 5a, we varied the number of objects n in {500, 1000, 2000, 5000, 8000} and the number of attributes m in {7, 14, 22}, while the depth parameter was held fixed at δ = 3. The dashed lines represent the linear growth trend estimated via regression. This estimate confirms that the algorithm scales linearly with respect to the dataset size. As for the number of attributes, as expected, for a given number of objects the execution time increases with m due to the growth of the associated search space. On the full dataset the execution time is rather limited, as it amounts to about 2 minutes. In the experiment reported in Fig. 5b, we varied both the number of objects n and the depth parameter δ in {1,2,3,4}, while considering the full feature space. Also in this case, the linear growth is represented by the dashed lines, and similar considerations can be drawn.
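For reference, the dashed trend lines of Fig. 5 can be reproduced with a simple least-squares fit of the measured running times against the number of objects; the sketch below uses placeholder measurements, not the actual timings of the experiment.

```python
import numpy as np

# Placeholder measurements: running times (in seconds) for increasing n.
sizes = np.array([500, 1000, 2000, 5000, 8000])
times = np.array([8.0, 15.5, 31.0, 77.0, 122.0])

# Degree-1 least-squares fit giving the dashed linear trend of the plots.
slope, intercept = np.polyfit(sizes, times, deg=1)
print(f"estimated trend: time ~= {slope:.4f} * n + {intercept:.2f} s")
```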

Fig. 5 Scalability analysis

Table 3 reports the number of outstanding explanation-property pairs returned by the algorithm on the following datasets: Zoo (n = 101 objects and m = 18 attributes), Mushrooms (n = 8,124 objects and m = 22 attributes), Cars (n = 1,728 objects and m = 8 attributes).

Table 3 Outstanding pairs vs depth analysis δ

We consider increasing values of the depth parameter δ ∈ {1,2,3,4,5}. The column #Pairs reports the total number of pairs forming the search space up to the depth level δ. The column #Outstanding reports the number of outstanding pairs and their percentage of the total number of pairs (within brackets). The last column reports the number of outstanding pairs of size δ, that is, the novel pairs introduced by exploring the last level of the current search space, together with their percentage of the total number of pairs (within brackets). From the table, it can be seen that the fraction of novel patterns rapidly decreases with the depth, thus suggesting that meaningful analyses do not require large values for the parameter δ. Indeed, the number of outstanding pairs settles at depth δ = 3 on all the datasets. Moreover, it can be seen that the notion of outstanding pair greatly reduces the number of potential explaining patterns to be presented to the user, since the percentage of these patterns over the whole search space rapidly decreases with the depth. E.g., for Mushrooms the outstanding pairs represent 0.96% of the pairs of size up to δ = 5.

We further note that outstanding pairs are not necessarily associated with large values of outlierness. Thus, it is sensible to determine how many of these outstanding pairs are indeed 𝜃-outstanding, for suitable values of the threshold parameter 𝜃. Figure 6 reports the distribution of the outliernesses associated with the outstanding pairs obtained for δ = 3. Plots on the left concern the lower scores (that we recall are outlierness scores having weights \(W^{\uparrow }_{i} \ge W^{\downarrow }_{i}\) in (10)), while plots on the right concern the upper scores (having weights \(W^{\downarrow }_{i} > W^{\uparrow }_{i}\) in (10)). We can notice that the top lower scores are always likely to reach high values of outlierness, close to 1. The upper scores are in general less pronounced, due to the different nature of these two kinds of anomalies. In any case, it is interesting to notice that only a small fraction of the outstanding pairs is associated with the largest score values and, hence, only a reduced fraction of the outstanding pairs are indeed 𝜃-outstanding pairs. Usually, only 0.1% to 1% of the outstanding pairs are associated with outlierness values comparable to the maximum outlierness score achieved by any pair. It is easy to check that the number of these pairs amounts to some hundreds in general, being about 100 for Zoo and Cars, and about 1,000 for Mushrooms.

Fig. 6 Outstanding pairs outlierness distribution

6.2 Comparison with classic distance- and density-based outlier detection methods

We compare our method with two of the main categories of outlier detection methods: (i) distance-based approaches, which are used to discover global outliers, i.e. objects showing abnormal behaviour when compared with the whole dataset population; (ii) density-based approaches, which are able to single out local outliers, i.e. objects showing abnormal behaviour when compared with a certain subset of the data, namely their neighbourhood.

As the distance-based definition, we use the average KNN score, representing the average distance of an object from its k nearest neighbours [13]. As the density-based definition, we use the Local Outlier Factor (LOF) [18]. Both methods employ the Hamming distance. Moreover, we compare our method with the ROAD algorithm [47], which exploits both densities and distances, namely it establishes two independent rankings: (i) each data object is assigned a frequency score and objects with low scores are considered outliers (Type-1 outliers); (ii) k-modes clustering is performed in order to isolate those objects that are far from big clusters (i.e. clusters containing at least α% of the whole dataset) according to the Hamming distance (Type-2 outliers). The goal of these experiments is to highlight that we are able to detect anomalies of a different nature and to provide evidence that our method is knowledge-centric, since it concentrates on anomalous values, as opposed to classical methods, which are instead object-centric.
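For concreteness, a possible implementation of the two baseline scores over the Hamming distance is sketched below using scikit-learn; the ordinal encoding step is an assumption made to handle categorical values, and ROAD is omitted since it is not available in standard libraries.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

def knn_and_lof_scores(categorical_rows, k):
    """Average-kNN and LOF outlier scores under the Hamming distance.

    `categorical_rows` is a list of lists of categorical values; in both
    cases, larger scores correspond to more anomalous objects.
    """
    X = OrdinalEncoder().fit_transform(categorical_rows)

    # Average distance from the k nearest neighbours; the query point is
    # returned as its own first neighbour, hence k + 1 and the [:, 1:].
    nn = NearestNeighbors(n_neighbors=k + 1, metric="hamming").fit(X)
    distances, _ = nn.kneighbors(X)
    knn_scores = distances[:, 1:].mean(axis=1)

    # LOF: negative_outlier_factor_ is about -1 for inliers and more
    # negative for outliers, so its sign is flipped.
    lof = LocalOutlierFactor(n_neighbors=k, metric="hamming")
    lof.fit(X)
    lof_scores = -lof.negative_outlier_factor_

    return knn_scores, lof_scores
```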

To compare the approaches, we ranked the dataset objects o by assigning to each of them the largest outlierness of a pair π such that \(o\in {\mathcal {D}}_{\pi }\), and determined our top-10 outliers by selecting the objects associated with the largest outliernesses. These objects, which contain values deemed exceptional by our method, were then used to verify how popular object-centric techniques rank them. Hence, we computed their outlier scores according to the KNN, LOF and ROAD definitions.
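The object ranking induced by our pairs can be computed as in the following short sketch, where each pair is assumed to be given together with the indices of the objects it covers.

```python
def top_objects(pairs, n_objects, top=10):
    """Rank objects by the largest outlierness of a pair covering them.

    `pairs` is assumed to be an iterable of (covered_indices, outlierness).
    Returns the indices of the `top` highest-ranked objects.
    """
    best = [0.0] * n_objects
    for covered, outlierness in pairs:
        for i in covered:
            best[i] = max(best[i], outlierness)
    return sorted(range(n_objects), key=lambda i: -best[i])[:top]
```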

All the chosen competitors require an input parameter k, representing the number of nearest neighbours or the number of clusters to be taken into account. Since selecting the right value of k is a challenging task, we computed the KNN, LOF and ROAD outlier scores for all the possible values of k and determined the ranking positions associated with our top-10 outliers. In particular, all the integers from 1 to the number of objects n have been considered for KNN, while 30 log-spaced values between 1 and n have been considered for LOF due to its higher computational cost. For the ROAD algorithm, we stopped at the value of k such that at least one big cluster is obtained, and used the frequency score to rank those objects having the same distance from their nearest big cluster.
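The sweep over k and the collection of the ranking positions of our top-10 objects can be organized as follows; the sketch is generic in the scoring function, so it can be instantiated, e.g., with the kNN or LOF helpers of the previous sketch.

```python
import numpy as np

def rank_positions(scores, target_indices):
    """1-based ranking positions (1 = most anomalous) of the target objects."""
    order = np.argsort(-np.asarray(scores))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks[np.asarray(target_indices)]

def sweep_k(score_fn, rows, target_indices, k_values):
    """Ranking positions of the target objects for every value of k.

    `score_fn(rows, k)` must return one outlier score per object.
    """
    return np.array([rank_positions(score_fn(rows, k), target_indices)
                     for k in k_values])

# Hypothetical usage, mirroring the protocol described above:
# knn_ranks = sweep_k(lambda r, k: knn_and_lof_scores(r, k)[0],
#                     rows, top10, range(1, len(rows)))
# lof_grid = np.unique(np.geomspace(1, len(rows) - 1, 30).astype(int))
# lof_ranks = sweep_k(lambda r, k: knn_and_lof_scores(r, k)[1],
#                     rows, top10, lof_grid)
```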

Figures 7 and 8 report the box-plots, for k varying in [1, n], of the KNN, LOF and ROAD Type-2 outlier rankings associated with our top-10 outliers. Plots on the top concern lower outliers, while plots on the bottom concern upper outliers. From these plots it can be seen that the median ranking associated with our outliers can be far from the top and also that, within the whole ranking distribution, the same outlier can be ranked in very different positions. In general, it seems that lower outliers are likely to be ranked better than upper outliers by our competitors, and this witnesses the peculiar nature of upper outliers. On the Zoo dataset there is no apparent correlation between our outliers and KNN, LOF and ROAD outliers. On the Mushrooms dataset some of our lower outliers are, on average, ranked very high also by the other algorithms. Some of them are almost always top outliers for all methods (see the top 1st, 2nd, 5th, and 7th outliers), thus witnessing that these outliers have both a global and a local nature. However, most of our outliers are not detected by these techniques.

Fig. 7 Comparison with KNN, LOF and ROAD on Zoo

Fig. 8 Comparison with KNN, LOF and ROAD on Mushrooms

Before concluding this comparison, it must be pointed out that the best rankings associated with the selected objects are obtained for very different values of the parameter k. Since the output of the KNN, LOF and ROAD methods is determined for a selected value of k, it is very unlikely that, even in the presence of some agreement between our top outliers and local and global outliers, they are simultaneously ranked in high positions for the same provided value of k.

6.3 Comparison with other techniques

In this section, the proposed technique is compared with other methods.

To this aim, for a given dataset, we selected the objects within its majority class and then generated a family of altered datasets as follows. For each of the above objects and for each attribute, we generated a novel dataset by replacing the value the object assumes on that attribute with a different value randomly picked from the attribute domain. Thus, the total number of datasets in the family is nd, where n is the number of objects within the majority class and d is the number of dataset attributes.
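A minimal sketch of the perturbation procedure follows; the list-of-dicts representation of the dataset and the explicit class attribute are assumptions made for illustration.

```python
import copy
import random

def altered_family(dataset, class_attribute):
    """Yield one altered copy of `dataset` per (majority-class object, attribute).

    In each copy, the value of a single attribute of a single object is
    replaced by a different value drawn at random from the attribute domain.
    """
    classes = [obj[class_attribute] for obj in dataset]
    majority = max(set(classes), key=classes.count)
    attributes = [a for a in dataset[0] if a != class_attribute]
    domains = {a: sorted({obj[a] for obj in dataset}) for a in attributes}

    for i, obj in enumerate(dataset):
        if obj[class_attribute] != majority:
            continue
        for a in attributes:
            candidates = [v for v in domains[a] if v != obj[a]]
            if not candidates:
                continue                      # degenerate domain, nothing to alter
            altered = copy.deepcopy(dataset)
            altered[i][a] = random.choice(candidates)
            yield altered, i, a
```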

Due to the heavy computations required by this kind of experiment, we selected the following datasets from the UCI Machine learning repository [27]: Zoo (n = 101 objects and m = 18 attributes), Breast cancer (n = 286 objects and m = 9 attributes), House votes (n = 232 objects and m = 16 attributes).

On each family, we then ran the FDEOut algorithm and KNN [13], LOF [18], ROAD [47], CBRW [40], WATCH [36], and KLOF [51].

To make scores comparable, we determined the standardized outlier score \(z = (sc - \mu _{sc})/\sigma _{sc}\) of the altered object both before (call this value z0) and after (call this value z1) the alteration, where \(\mu _{sc}\) and \(\sigma _{sc}\) are, respectively, the mean and standard deviation of the outlier score distribution before the alteration. As for FDEOut, we used as outlier score the maximum outlierness associated with an outstanding pair involving the object. Moreover, since no other method is designed to detect upper outliers, in the comparison we considered only our lower outliers.

In order to evaluate the ability of a method to detect the contaminated data, we measured the increase \({\varDelta } z = z_{1} - z_{0}\) of the standardized outlier score associated with altered objects and compared the Δz of each method with that of FDEOut. We note that objects with a large standardized outlier score z0 in at least one of the two compared methods are less interesting for this analysis, since they already show an exceptional value of outlierness and their Δz is unlikely to achieve large values. Clearly this would unfairly favor one of the two methods; hence, objects whose standardized outlier score exceeds the mean by more than one standard deviation, i.e. such that z0 > 1, are not considered in the comparison. We further note that this corresponds to focusing on the normal objects that become anomalous due to the performed alteration.
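The standardization and the filtering step can be summarized as in the following sketch; the score arrays are assumed to come from any of the compared detectors, computed on the dataset before and after the alteration.

```python
import numpy as np

def delta_z(scores_before, scores_after, altered_index):
    """Increase of standardized outlier score of the altered object.

    Standardization uses the mean and standard deviation of the score
    distribution *before* the alteration; objects with z0 > 1 are skipped,
    as they are already anomalous before the perturbation.
    """
    mu = np.mean(scores_before)
    sigma = np.std(scores_before)
    z0 = (scores_before[altered_index] - mu) / sigma
    z1 = (scores_after[altered_index] - mu) / sigma
    return None if z0 > 1 else z1 - z0
```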

Figures 9, 10 and 11 show the comparison between the distribution of the Δz values associated with FDEOut and the same distribution associated with each other method. On the ordinate there are the Δz values sorted in decreasing order, while each value on the abscissa corresponds to a dataset of the family.

Fig. 9 Comparison on the Zoo dataset

Fig. 10 Comparison on the Breast cancer dataset

Fig. 11 Comparison on the House votes dataset

The plots highlight that FDEOut is more sensitive to perturbations of the data, since in all cases the most pronounced variations of standardized score associated with FDEOut amount to about 3 standard deviations, while the other methods rarely exceed 1.5. As for the other methods, their quality varies with the data characteristics and, overall, they show comparable performance on the altered data. This seems to suggest that our method is able to detect even subtle anomalies.

6.4 Knowledge mined

In this section we present some knowledge mined by our method. For ease of interpretation, we report the outstanding pairs mined on the Zoo and Breast cancer datasets. It is worth pointing out that our measure is defined at the value level, thus it does not allow us to state that an object is anomalous in an absolute sense. As our score refers to 〈Explanation, Property〉 pairs, we label as anomalous all objects, in the dataset portion isolated by the explanation, whose value pv for attribute pa receives a reasonably high score in such a dataset projection. Note that these objects turn out to be anomalous w.r.t. that specific pair and not in a general sense.

Information provided by the top lower pairs we get from the Zoo dataset is summarized below:

  • The scorpion is the only invertebrate with a tail.

  • Among vertebrates without fins, the seasnake is the only one that does not breathe.

  • Among non-aquatic animals, the clam is the only one that breathes.

  • The platypus lays eggs although it provides milk.

  • Among predators without feathers, the ladybird is the only airborne one.

  • Among catsized animals, the octopus is the only invertebrate.

  • The stingray is a catsize animal, but it is venomous.

  • Among animals which don’t lay eggs, the seasnake and the scorpion are the only ones that do not breastfeed offspring.

  • The crab is the only invertebrate having four legs.

  • Among vertebrate breathing animals, the pitviper and the frog are the only venomous ones in the dataset.

Table 4 clarifies how such knowledge is mined. We report the outstanding pairs from which we have deduced the information above, together with the objects selected by each pair.

Table 4 Lower Outlier pairs detected on the Zoo dataset

It is interesting to note that the same object can be considered anomalous with respect to different pairs, as in the case of the scorpion and the seasnake. Furthermore, a pair can isolate multiple objects, as for the 8th and 10th pairs.

As for the upper outliers, we find out that the dataset contains the frog twice: one entry is venomous and the other is non-venomous. Since the animal names act as primary keys for the dataset, having the same name twice can be pointed out as anomalous, and our technique is able to highlight such a situation. Other curiosities about the animal world are spotted by our upper outliers, including the following:

  • Among breathing not catsized predators, the most frequent are non-flying birds.

  • Most no-feathers no-toothed animals have six legs.

  • Among no-flying breathing catsized animals, the most frequent are mammals.

  • Most gastropods have no legs.

  • Most no-toothed animals have two legs.

To infer this type of knowledge we consider the best-ranked outstanding pairs identifying upper outliers. Table 5 outlines the 〈Explanation, Property〉 pairs we take into account. Note that in this case several objects share the value reported as anomalous, but what makes them special is the fact that the frequency of such values is atypical within the distribution.

Table 5 Upper Outlier pairs detected on the Zoo dataset

As for the Breast Cancer Wisconsin dataset, the information provided by the top lower outlier pairs is summarized below:

  • When uniformity of cell size is 1, the value of the attribute mitoses is almost always equal to 1, except for three samples having mitoses equal to 5, 7, and 8.

  • When uniformity of cell shape is 1, the tumor is always benign except for two samples.

  • Among samples having bare nuclei equal to 1 and bland chromatin equal to 2, only one is malignant.

  • Among samples having both bare nuclei and marginal adhesion equal to 1, only three are malignant.

  • Among samples having bare nuclei and normal nucleoli equal to 1, only three are malignant.

When mining upper outliers, the technique identified that there are two duplicated identifiers. Other upper outliers are discussed below:

  • Among samples having uniformity of cell shape equal to 10, most have normal nucleoli equal to 10.

  • Among malignant tumors having marginal adhesion equal to 1, most have bare nuclei equal to 10.

  • Among samples with uniformity of cell size equal to 10, most have normal nucleoli equal to 10.

  • Among samples with clump thickness equal to 8, most have bare nuclei equal to 10.

  • Among samples with uniformity cell size equal to 10, most have marginal adhesion equal to 10.

The explanation-property pairs we take into account to discuss the knowledge are reported in Table 6 for lower outliers and in Table 7 for the upper ones.

Table 6 Lower Outlier pairs detected on the Breast Cancer Wisconsin dataset (UCZ = Uniformity Cell Size; UCS = Uniformity Cell Shape; BN = Bare Nuclei; BC = Bland Chromatin; MA = Marginal Adhesion; NN = Normal Nucleoli)
Table 7 Upper Outlier pairs detected on the Breast Cancer Wisconsin dataset (UCZ = Uniformity Cell Size; UCS = Uniformity Cell Shape; BN = Bare Nuclei; BC = Bland Chromatin; MA = Marginal Adhesion; NN = Normal Nucleoli; CT = Clump Thickness)

7 Conclusions

In this work we have provided a contribution to single out and explain anomalous values in categorical domains. We perceive the frequencies of attribute values as samples of a distribution whose density has to be estimated. This leads to the notion of soft frequency occurrence, which we exploit to build our definition of outlier: an attribute value is suspected to be an outlier if its frequency occurrence is exceptionally typical or un-typical within the distribution of the frequency occurrences of the other attribute values. As a second contribution, our technique is able to provide interpretable explanations for the abnormal values discovered. Thus, the outliers we provide can be seen as a product of the knowledge mined, making the approach knowledge-centric rather than object-centric.

The performance of our approach has been evaluated on some popular benchmark categorical datasets and a comparative analysis has been presented.

We notice that our method could possibly be combined with techniques that deal with outlier detection in numerical domains, in the spirit of what was done in [28, 30, 39], and we leave this as a subject of future research.