1 Introduction

The majority of processes, structures and phenomena that are of scientific or technological interest generate large amounts of data. Each observation can be an ensemble of multiple attributes, which makes the data high-dimensional. The dataset representing the structure or phenomenon under study contains relevant information that has to be extracted by the application of pertinent algorithms. The nature of those algorithms is aligned with the scientific questions that are relevant for a particular case. Efforts to systematically study datasets to find answers to particular scientific questions are found in several disciplines [1]. Each field of science that has to deal with large datasets has contributed its own (partial) solutions and techniques, and in many cases those solutions are similar to each other, or follow the same ideas.

In a recent development of scientific ideas, the field of data science has emerged with the objective of unifying existing techniques across several fields, and of offering new tools to understand data under a rational perspective [2, 3]. Those new tools are firmly based on Mathematics, Computer Science, and Statistics, and elements of each field are taken in order to provide new lines of thought [4].

Data science has as one of its goals the identification of patterns or structures defined by observations, commonly referred to as vectors or points, located in usually high-dimensional feature or input spaces. Based on those patterns researchers may grasp some properties of the data under analysis, and make decisions regarding the next steps of analyses. The starting point of data science is usually an unbiased and preliminary exploration of data. This stage is known as exploratory data analysis [5], and it aims to identify structure in the data to pinpoint possible relevant patterns present in it.

In exploratory data analysis, algorithms usually do not assume external labels or classes assigned to the vectors in the datasets. Indeed, one of its objectives is to actually compute a label and assign it to each vector, based on intrinsic properties of the data. Some of the most common label-generating tasks include the identification of clusters [6, 7], the detection of anomalies [8], the identification of manifolds [9], and the generation of centrality-oriented metrics [10].

An anomaly is a point that does not resemble the rest of the elements in a dataset [11]. Vectors in a dataset are characterized in terms of a certain property, and then those elements that do not fulfill that characterization are labelled as potential anomalies [12,13,14,15]. There are a few loose elements in this definition, which allows for the existence of a large number of algorithms for anomaly detection. Since each element in a dataset can be thought of as a point in a high-dimensional space, the problem can be framed from a more geometric perspective. A geometric approach allows, for instance, the use of certain aspects of Information Geometry in order to improve the capabilities of certain anomaly detection algorithms.

Since in general the generating process of the data is not known, it is implicitly approximated by one or several attributes. In particular, the concept of neighborhood is heavily relied on, since it allows for a characterization of each vector in terms of its surroundings [16, 17]. By describing the neighborhood of a vector, for example, in terms of density, a direct comparison between that vector and a subset of the remaining vectors may offer an idea of how different that vector is with respect to the others [18]. One of the assumptions behind anomaly detection algorithms is that the majority of vectors in a dataset are usual, and thus, the probability of detecting an anomaly by chance is rather low. Therefore, anomaly detection has to be approached under rational perspectives [19].

Anomaly detection is an unsupervised learning task whose main goal is the identification of vectors that significantly differ from the rest of the observations in a dataset [11, 19]. The final objective of anomaly detection algorithms is to label, or classify, each vector in the dataset as either usual or anomalous [12], or to grade its anomaly level. The candidate anomalies do not satisfy a certain property derived from the entire group of observations, whereas the usual observations comply with that property [11]. Again, it is generally assumed that the number of anomalies in a dataset is small in comparison to the number of expected, usual, or common observations.

The identification of observations that differ from the majority of instances within a dataset is a rather important task. Vectors that are different from the majority of elements under study are called anomalies, outliers or novelties [20]. Anomalies appear in several contexts, and a prompt detection is always desirable. In biology, for example, certain genes expressed in specific tissues have been detected as anomalies, which highlights specific metabolic pathways involved in diseases [21]. In human health, for instance, some heart diseases show anomalies in the signal produced by the heart as an early sign of their appearance. An automatic detection of those early anomalies could be of enormous benefit [22]. In aviation, anomalies appear when there is a mismatch between the information measured by sensors and the actions taken by human or robotic actuators [23, 24]. An early detection of such anomalies would prevent the aircraft from performing an unsafe maneuver. In all cases, the common denominator is that specialists do not know beforehand which observations are to be considered anomalies.

Traditionally, anomalies have been identified as noise in several scientific disciplines [25]. Although some anomalies may indeed be noise caused by a malfunction of the sensors, by human error along the analysis stages, or by an error in some other stage of the data processing steps, not all anomalies are the result of errors in data processing. The common practice was to discard those anomalies, relabelling them as outliers, a synonym of something undesirable [15, 20]. In a more recent perspective, anomalies are considered relevant observations that may reveal hidden or changing aspects of the phenomena under study [19].

Of particular interest to data science is a type of data known as compositional. In compositional data the attributes represent the relative frequency or proportion of the components of the system [26]. The sum of all components adds up to a constant value, which is fixed for all elements in the sample. For example, all foods can be characterized in terms of their content of fat, carbohydrates and protein (disregarding other components such as water). In all cases, the composition of these three constituents in food adds up to 100%. The property that the relative abundance or frequency of the components must add to a fixed value is referred to as the closure constraint [27]. This constraint makes compositional data peculiar in geometrical terms, which leads to the need of specific statistical tools for its analysis and interpretation. Compositional data can be embedded in the probability simplex [28], and by taking its properties into account, algorithms are expected to attain more reliable and interpretable results.

The field of Information Geometry offers relevant analysis methods and tools that can be applied to compositional data [29]. Compositional data, such as normalized histograms, may also contain anomalous observations. Anomaly detection in compositional data is often performed by algorithms that do not take into account the aforementioned peculiarities of such data. However, the geometry of compositional data can be considered in anomaly detection algorithms in order to improve the identification of true positives, that is, of anomalies. In a first step to study anomaly detection when the constraints of compositional data are explicitly considered, we studied the impact of distance functions in the identification of anomalies. For that, we selected a distance-based anomaly detection algorithm.

The vast majority of anomaly detection algorithms assume no peculiarities about the distribution of data in the high-dimensional attribute space. This is of course of great benefit since it allows an extended application of those algorithms in any context. However, the constraints of compositional data may be exploited to identify anomalies in a more efficient way. In this contribution, we report our efforts to study the impact of Information Geometry-related aspects on anomaly detection in compositional data. From Information Geometry, we applied several distance functions, and, as we will show in the next sections, when anomaly detection algorithms rely on some of those functions, the results improve. In Information Geometry, originally developed to understand the links between statistical manifolds and geometry, the concept of distance functions and their inherent geometries are of particular relevance [30].

The core idea developed here can be described as follows. Given a set of points in the probability simplex, can a subset of them be identified as anomalies? Several anomaly detection algorithms are based, directly or indirectly, on the concept of distance. A distance function compares objects [31]. Based on that comparison, some anomaly detection algorithms compute an anomaly index for each vector in the dataset. In general, vectors are ranked according to that index, or, following a binary classification, labelled as either anomalies or as common or usual vectors. The hypothesis we aim to prove in this contribution is that anomalies in the probability simplex can be more accurately detected by relying on distance functions commonly applied in Information Geometry, rather than on the commonly applied Euclidean or \(L_{1}\) distances.

Several families of anomaly detection algorithms have been created in more than two decades of active research. Of particular interest are those algorithms that rely on the characterization of vectors in terms of their nearest neighbors. This is relevant since the relative size of the neighborhood is determined by the applied distance function. Since we are interested in characterizing the effects of distance functions in anomaly detection, it is a natural choice to focus on this family of algorithms. Local Outlier Factor (LOF) [18] is one of the best-known anomaly detection algorithms. LOF takes into account the vicinity of each vector in order to compute an anomaly index. Here, a vector v is characterized in terms of its k nearest neighbors. Each of those k neighbors is in turn characterized in terms of its own k nearest neighbors. Once the characterizations are concluded, the descriptions obtained from v are compared to those obtained from its k neighbors. In the original version of the algorithm, the chosen distance function is the Euclidean distance. We implemented de novo a version of LOF in which the distance can be selected by the user. More details of LOF will be offered in Sect. 3.

In order to define our aim in more concise words, suppose there is a group of normalized histograms, each one representing, for example, an approximation of a certain probability function. If all histograms in that group come from a single distribution described by the same parameters, then there are no anomalies within that group. On the other hand, if a minority of elements in that group come from a different distribution, the elements in that reduced subset should be identified as anomalies. The path to conclude that in the former case there are no anomalies, whereas in the latter case there are, is the problem we tackle here. Once again, the hypothesis we aim to prove is that anomalies in the probability simplex can be more accurately detected by relying on distance functions commonly applied in Information Geometry.

The vast majority of distance-based anomaly detection algorithms assume, directly or indirectly, that the feature space can be described by Euclidean geometry. Derived from that geometry, the Euclidean distance is the most commonly applied one. Here, we ask whether other geometries may be more suitable to detect anomalies in compositional data. We explore in this contribution the impact of the applied distance in a well-known anomaly detection algorithm, LOF, for the case of points in the probability simplex. We are most interested in exploring the capabilities of LOF on compositional data. In particular, since the algorithm is based on detecting changes as a function of distance, we systematically explored the effects caused by different distance functions on the detection rates of anomalies.

In the present contribution, we present a direct application of Information Geometry to data science. We base our approach on some of the most prominent results found in the field of Information Geometry. In this contribution, we put our efforts into applying some of the outstanding results from the field [29] in an area in which, to our knowledge, they have not been applied explicitly. We are interested in verifying whether the distances that operate under Riemannian and other geometries are a better choice to work with when detecting anomalies within compositional data.

The rest of the contribution continues as follows. In the next section we briefly describe some of the relevant aspects of compositional data. In Sect. 3 we describe the anomaly detection algorithm to be applied along this contribution. In Sect. 4 the geometries and distances that are to be tested over the anomaly detection algorithm are described. We present the main results of applying anomaly detection algorithms in the probability simplex in Sect. 5, and we offer some conclusions and discuss the main aspects of our approach in Sect. 6.

2 Compositional data and the probability simplex

Compositional data has attracted attention in research since the seminal works by Aitchison [27, 32]. In compositional data analysis, points with N components are constrained to a region of the N–dimensional space, referred to as the probability simplex. Figure 1 shows an example of the simplex for \(N=3\) components, displaying in pink a subset of all possible foods in terms of their percentage of protein, carbohydrates and fat. Three specific groups of food are shown as an example: red meat, fish and seafood, and vegetables. The visual capabilities of the probability simplex, putting aside its formal aspects, are clear in this type of visualization. For instance, it is possible to appreciate how these three types of food scatter over the simplex.

Fig. 1
figure 1

A depiction of compositional data and its representation in the probability simplex. All instances of compositional data can be represented as points in the probability simplex \(\Delta ^{3}\). In the example, points in pink represent possible foods in terms of their percentage of protein, carbohydrates and fat. Red meat foods are represented by red squares, the blue circles indicate vegetables and green triangles show some types of fish and seafood

In a recent perspective, the peculiarities of the probability simplex have been considered when computing clusters in compositional data [33]. There, the authors systematically study the effects of different distances involved in the clustering of normalized histograms. Histograms can be useful as a visual representation of data, where the range of a data sample is divided into a discrete number of non-overlapping intervals called bins. A frequency is assigned to each bin by counting the number of points in the dataset that lie within the corresponding interval. Such frequencies may be visualized as a bar plot, where the height of each bar indicates the frequency of a bin. The application of Information Geometry to compositional data has been approached in [34]. There, the authors make use of the Bregman divergence to normalize data and ultimately propose a generalization of PCA able to deal with compositional data.

When working with histograms, it is common practice to normalize the frequencies. In this way, the sum of all frequencies will be equal to 1, and the compositional or closure property is achieved. That is, the sum of all frequencies \(f_{i}\) adds up to 1.0 (see Eq. 1).

$$\begin{aligned} \sum _{i = 1}^{N} f_{i} = 1 \end{aligned}$$
(1)

By convention, when referring to histograms, the assumption in this work is that frequencies are normalized. As a result of Eq. (1), the representation of a sample as a histogram provides a way of viewing data as compositional, which is the kind of data the methods investigated in this study focus on. Under this perspective, each histogram can be thought of as a point in an N-dimensional space.
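As an illustration of this normalization step, the following Python sketch bins a one-dimensional sample and enforces the closure constraint of Eq. (1); the function name and the fallback for an empty sample are our own illustrative choices.

```python
import numpy as np

def to_simplex_histogram(sample, n_bins):
    """Bin a one-dimensional sample and normalize the counts so they sum to 1."""
    counts, _ = np.histogram(sample, bins=n_bins)
    total = counts.sum()
    # Enforce the closure constraint of Eq. (1); fall back to the uniform
    # histogram if the sample happens to be empty.
    return counts / total if total > 0 else np.full(n_bins, 1.0 / n_bins)

# Example: a Gaussian sample represented as a point in the 10-bin simplex
rng = np.random.default_rng(0)
h = to_simplex_histogram(rng.normal(size=1000), n_bins=10)
assert np.isclose(h.sum(), 1.0)
```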

Histograms are useful because, according to the frequentist view of probability, as the sample size grows large \(f_{i}\) converges to the probability that a point sampled from the population lies in the i-th bin or interval; that is, a histogram is an approximation of the distribution of the data [35].

The probability simplex is a geometric object that represents all possible probability distributions over a finite set of outcomes. For a set of N outcomes, the probability simplex is an \((N-1)\)-dimensional convex polytope in N–dimensional Euclidean space. The probability simplex is defined as the set of all non-negative vectors \(x = (x_{1}, x_{2},..., x_{N})\) whose components sum to one. Thus, the probability simplex is a subset of the N-dimensional Euclidean space, and it is bounded by the hyperplanes defined by \(x_{i} = 0\) for \(i=1, 2,..., N\).

The geometry of the probability simplex is characterized by its shape and structure. The simplex has N vertices, where N is also the number of bins in a histogram. Each vertex \(V_{i}\) corresponds to a probability distribution that assigns 1 to a single outcome and 0 to all others. That is, there is a bin in the histogram with probability 1, and all the remaining bins have a probability of 0. The edges of the probability simplex connect the vertices and represent the possible mixtures of these pure distributions. The interior of the simplex represents all other possible probability distributions over the set of outcomes.

The probability simplex, denoted as \(\Delta ^{d}\), is a convex polytope, which means that any two points on the simplex can be connected by a line segment that lies entirely within the simplex [27]. This property has important implications for optimization and statistical inference, as many optimization problems and statistical models involve finding the point on the probability simplex that maximizes or minimizes a certain criterion [36].

3 Anomaly detection algorithms: the case of local outlier factor

An anomaly is an observation, or a small subset of observations, which is different from the remainder of the set to which it belongs. The identification or detection of those particular instances or peculiar observations is studied with the tools and perspectives known, collectively, as anomaly detection. In data science, a given dataset, which collects observations from the process, structure or phenomenon of interest, is to be dissected in particular ways. That dissection includes the identification of anomalies.

There are various approaches followed by anomaly detection algorithms, such as statistical methods [15], machine learning algorithms [20], and pattern recognition techniques [37]. Some common approaches include clustering-based methods, density-based methods, distance-based methods, and machine learning-based methods. The choice of method depends on the specific application and the characteristics of the dataset under analysis [11, 19, 38].

In particular, anomaly detection aims to answer whether there are peculiar vectors or observations within a collection of instances. Given a dataset D, it is of interest to identify two mutually exclusive subsets U and A. The elements in U constitute the common, normal, or expected observations, whereas the elements in A are known as anomalies, outliers, novelties, or any other synonym. The development of algorithms that allow the identification of A and U is an open field of research, since many assumptions may vary in specific datasets.

An anomaly detection task comprises two stages. In the first stage, the goal is to identify a certain description that is common to the majority of the data. Then, in the subsequent stage, all vectors or observations are compared to the description obtained in the first stage [38]. Depending on the nature of the existing data, there are two instances of anomaly detection. The first one is closely related to the problem of classification under unbalanced classes. In this scenario, each observation is labelled as either common, normal or expected, or, on the other hand, labelled as an anomaly or an outlier. In other words, each observation is known to be in either set U or A. The former class is in general much more abundant than the latter, and thus, there is an imbalance between the classes. The classification algorithm computes a specific description that is common to observations in the normal class (U), and, at the same time, that description is not present in the vectors of the second or anomalous class (A). Once the algorithm has been trained, it can be tested over the same data, or over new instances. The description of a vector is inspected by the function inferred by the algorithm in the training stage, and a decision of whether it belongs to U or A is taken.

In the second scenario for anomaly detection, it is not known which vectors, if any, are anomalies. In technical terms, there is no external label or class (U or A) assigned to the elements in D, as opposed to what is found in the first scenario [39]. In such conditions of lack of external labels, the algorithm has to infer a certain property that is common to the majority of the observations, and based on it, decide whether those observations that do not fit that property are indeed anomalies [40]. This scenario of unlabelled data is of high relevance, since in many applications of data science it is not known beforehand which instances constitute anomalies and which ones are common observations [41, 42]. In this scenario, referred to as unsupervised anomaly detection, the parameters and rules obtained by the algorithm assign a degree of anomaly to each element \(v \in D\). The anomaly level of v, a(v), is one of the parameters from which the algorithm takes the decision to assign v to U or to A. An additional parameter to reach a decision may be the expected anomaly level of all elements in D. This allows for a global comparison of v to the rest of the elements in D [39]. Another possibility to reach a decision concerning the class assigned to v is the comparison of a(v) with the corresponding characterization of a certain subset of D. For instance, when v is compared to its k–nearest neighbors, the approach is focused on the identification of local anomalies [38].

We are interested in the unsupervised scenario for anomaly detection. Again, what characterizes this scenario is that vectors are not labelled and thus, the algorithm has to infer the most likely class of the vectors, or assign an anomaly degree to them, based on undisclosed properties of the data [43, 44]. Since the properties of the data that are to be taken into consideration for telling apart anomalies from common vectors are not unique, several alternatives exist. Some vectors can be identified as anomalies under certain assumptions, and not under a different set of premises.

Several families of anomaly detection algorithms for the unlabelled case have been created in more than two decades of active research [21, 45,46,47,48,49,50,51,52]. Among them, those focused on the analysis of nearest neighbors are of particular relevance, since the relative size of the neighborhood is affected by the properties of the selected distance function [38]. This makes this approach suitable to explore different geometries and distances. LOF is one of the best-known anomaly detection algorithms that take into account the surroundings of each vector in order to compute an anomaly index. Here, a vector v is characterized in terms of its k nearest neighbors. Each of those k neighbors is in turn characterized in terms of its own k nearest neighbors. Once the characterizations are concluded, the descriptions obtained from v are compared to those obtained from its k neighbors.

In more technical terms, a vector v is described by means of its k-distance. Let k-distance(v) be the distance from v to its k-th nearest neighbor. The set of neighbors within reach of v based on k-distance(v) is denoted as \(N_{k}(v)\), and it is also referred to as the context of v. The reachability distance from v to a second vector w is given by:

$$\begin{aligned} \mathrm {reachability\text{-}distance}_{k}(v,w) = \max \{k\text{-}\mathrm {distance}(w), d(v,w)\} \end{aligned}$$
(2)

Where d is assumed to be the Euclidean distance. The reachability distance is the maximum between the actual distance from v to w and the k-distance of vector w. Note that for the k-distance, it is the context (neighborhood) of w that is considered. It may be the case that k-distance\((w) > d(v,w)\).

As stated above, the distance d is assumed to be Euclidean. However, it is in this parameter that we invoke Information Geometry. By considering alternative geometries and distances, we aim to understand their effect on the performance of anomaly detection algorithms based on distances.

Continuing with our description of LOF, all vectors within the k–neighborhood of w are characterized by the same reachability distance to w. It should be noted that the reachability distance may be greater than the actual distance. The benefit of this substitution is that it offers more stability for certain distributions.

From the reachability distance, vector v is further described by its local reachability density, defined as:

$$\begin{aligned} lrd_{k}(v) = 1/\left( \frac{\sum _{w \in N_{k}(v)}\mathrm {reachability-distance}_{k}(v,w) }{ |N_{k}(v)|}\right) \end{aligned}$$
(3)

\(lrd_{k}(v)\) is a measure of the density of points around v. In particular, it is the inverse of the average reachability distance computed over all the elements in \(N_{k}(v)\), that is, its k–neighbors. From this quantity, the local outlier factor of vector v, denoted as \(lof_{k}(v)\), is computed:

$$\begin{aligned} lof_{k}(v) = \frac{\sum _{w \in N_{k}(v)} lrd_{k}(w)}{ |N_{k}(v)| \times lrd_{k}(v)} \end{aligned}$$
(4)

Algorithm 1 displays the LOF algorithm. It takes as input the dataset to be inspected and the parameter k, which indicates the number of nearest neighbors to be considered. In our contribution, it also takes the distance d to be applied. The output of the algorithm is the local outlier factor for each vector x in the dataset, denoted as lof(x). For simplicity, the parameter k is not displayed when the context allows it. Note that LOF(X, k, d) refers to the Local Outlier Factor algorithm, with parameters X, k, and d (with k sometimes not shown), whereas lof(x) refers to the actual value assigned to vector x by the algorithm.

figure a
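As a complement to Algorithm 1, the following Python sketch gives a minimal implementation of the lof computation with a user-supplied distance function. It is a simplified version of LOF that ignores ties in the k-distance, so that \(N_{k}(v)\) has exactly k elements; the function name and the small numerical guard are our own choices, not part of the original algorithm.

```python
import numpy as np

def lof_scores(X, k, dist):
    """Local Outlier Factor with a user-supplied distance function dist(a, b).

    Simplified version: ties in the k-distance are ignored, so each vector has
    exactly k neighbors.
    """
    n = len(X)
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    np.fill_diagonal(D, np.inf)                 # a vector is not its own neighbor
    order = np.argsort(D, axis=1)
    neighbors = order[:, :k]                    # indices of the k nearest neighbors
    k_dist = D[np.arange(n), order[:, k - 1]]   # k-distance of each vector

    lrd = np.empty(n)
    for i in range(n):
        # Eq. (2): reachability distance from vector i to each of its neighbors
        reach = np.maximum(k_dist[neighbors[i]], D[i, neighbors[i]])
        lrd[i] = 1.0 / (reach.mean() + 1e-12)   # Eq. (3), with a small guard

    # Eq. (4): average lrd of the neighbors divided by the lrd of the vector
    return np.array([lrd[neighbors[i]].mean() / lrd[i] for i in range(n)])

# Example usage with the Euclidean distance on random points of the simplex
X = np.random.default_rng(1).dirichlet(np.ones(3), size=100)
scores = lof_scores(X, k=10, dist=lambda a, b: np.linalg.norm(a - b))
```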

When \(lof_{k}(v) > 1\), the local density of v is lower than that of its neighbors \(N_{k}(v)\). On the other hand, if \(lof_{k}(v) < 1\), vector v presents a higher density than the expected density of its neighbors. The former case defines v as an outlier, whereas the latter defines it as an inlier. In this contribution we will refer to both cases as anomalies. The more distant from 1, the higher the anomaly level. The control parameter k allows for an increase of the neighborhood size; in the extreme case, when k equals the number of elements in the dataset, the comparison becomes global. There is not, however, a formal criterion to identify the correct value of k. As in any other anomaly detection algorithm, if the criterion, in this case defined by the neighborhood size, changes, the outcome can also change. This leads to instabilities, but it is a problem not tackled in this contribution.

Figure 2 shows the k-neighborhood of a vector, based on the k-distance, for \(k=3\). A vector v is characterized in terms of its context or neighborhood. That characterization is compared to the context of the neighbors of v. The result of the comparison is a measure of the similitude of v with its neighbors.

Fig. 2
figure 2

The k-neighborhood of a vector. LOF identifies the k-nearest neighbors for each vector based on the k-distance. Vector v has as its neighbors vectors \(w_{1}, w_{2}\), and \(w_{3}\). The expected distance from v to its neighbors serves as the basis for the characterization of v. The neighbors of v have to be characterized in terms of their own neighbors. The characterization of v and those in its context (neighborhood) are to be compared to compute the local outlier factor of v

4 Anomaly detection in the probability simplex

The probability simplex \(\Delta ^{d}\) is a \((d-1)\)-dimensional object embedded in a d-dimensional space. Thus, existing algorithms can be applied to the points in \(\Delta ^{d}\) to identify those that constitute anomalies. However, since the geometry in \(\Delta ^{d}\) is sensitive to the selection of the distance function, a relevant question is what is the most adequate geometry to consider, and, derived from it, what is the most suitable distance function to apply in order to identify anomalies in the probability simplex. In this contribution we explore the effect of the distance function in the LOF algorithm when applied to compositional data, that is, to points in the probability simplex \(\Delta ^{d}\).

In Fig. 4, upper panel, the blue points are expected or normal vectors, since they are part of the annulus. The red points are anomalies, precisely because they do not fall into the definition of the annulus. However, since the procedure that defines the usual or expected data is not to be assumed, an alternative description has to be computed so that the red points are revealed as anomalies. LOF does this by finding a description of the neighborhood of each vector and comparing it to the corresponding descriptions from the neighbors of that vector. LOF relies on the characterization of neighborhoods, which are defined by distances. Since we are interested in the way LOF is affected by different distance functions, we introduce in the next paragraphs the distances that were considered in the algorithm.

We focus our efforts on several relevant distances, both metric and non-metric, and their associated geometries. The first one is Information Geometry, represented by the Jensen–Shannon distance. The second approach of interest is Riemannian geometry, represented by the Fisher–Hotelling–Rao metric distance. A third geometry of interest is the Hilbert projective geometry, achieved by the Hilbert metric distance. The fourth geometry is the norm geometry, represented by the \(L_{1}\) metric distance. Besides these distances and divergences, we explored the effect of the Hellinger, Aitchison, and Wasserstein distances on the anomaly detection algorithm.

The Aitchison distance is defined as the Euclidean distance once all points are log-ratio transformed and centered. Formally, the Aitchison distance between points A and B is [27, 53]:

$$\begin{aligned} D_{\textrm{Aitch}}(A,B) = D_{\textrm{Eucl}}( \textrm{clr}(A), \textrm{clr}(B) ) \end{aligned}$$
(5)

Where \(D_{\textrm{Eucl}}\) is the Euclidean distance and the \(\textrm{clr}\) operation is defined as:

$$\begin{aligned} \textrm{clr}(A_{0}, A_{1},..., A_{N-1}) = ( \textrm{log}(A_{0}/g(A)), \textrm{log}(A_{1}/g(A)), ..., \textrm{log}(A_{N-1}/g(A)) ) \end{aligned}$$
(6)

This last equation is the inverse of the softmax function, commonly applied in machine learning. \(A_{i}\) is the i-th component of the compositional point A, and \(g(A) = (\prod _{i} A_{i})^{1/N}\) is the geometric mean of A.
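A minimal sketch of Eqs. (5) and (6) follows; the small constant added to the components is our own assumption to guard against zero entries, which the clr transform does not handle by itself.

```python
import numpy as np

def clr(a, eps=1e-12):
    """Centered log-ratio transform, Eq. (6); eps guards against zero components."""
    a = np.asarray(a, dtype=float) + eps
    g = np.exp(np.mean(np.log(a)))          # geometric mean of the composition
    return np.log(a / g)

def aitchison_distance(a, b):
    """Aitchison distance, Eq. (5): Euclidean distance between clr-transformed points."""
    return np.linalg.norm(clr(a) - clr(b))
```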

The Wasserstein distance, also known as the Earth Mover’s Distance, compares pairs of distributions in terms of the effort of transforming one into the other [54, 55]. At a high level of abstraction, the Wasserstein distance measures the minimum amount of work required to transform one probability distribution into another. This quantity can be thought of as the amount of soil that needs to be moved from one distribution to another to transform it. Since the effort is affected both by the size of the column of soil and by how far it has to be transported (from the first bin to the last one, for example), it is a natural way to compare histograms, and thus, compositional data.

More specifically, given two probability distributions A and B, the Wasserstein distance between them is defined as the minimum amount of work required to transform A into B, where work is defined as the product of the distance between each point in A and its corresponding point in B, weighted by the amount of probability mass being moved [56].

The Wasserstein distance for the continuous case is expressed as:

$$\begin{aligned} W_p(A, B) = \left( \inf _{\gamma \in \Gamma (A, B)} \int _{X \times Y} d(x, y)^p d\gamma (x, y)\right) ^{1/p} \end{aligned}$$
(7)

where p is a parameter that determines the order of the distance (usually \(p=1\) or \(p=2\)), d(x, y) is the distance between points x and y in the support of the distributions, and \(\gamma \) is a transport plan that specifies how much mass is moved from each point in A to each point in B.

The 1-Wasserstein distance provides a metric for the comparison of probability distributions. It is computed as the minimal cost of transport expended in transforming one distribution into a second one [56]. This transformation can be computed by means of optimal transport.

The optimal transport problem can be stated in the following manner. Let A and B be two given points in the probability simplex. In the probability simplex, every point can be represented by a histogram. Let x be the histogram associated to A, with bins indexed by i. Let y be the histogram associated to B, with bins indexed by j. If \(f_{ij}\) is the amount being transported from bin i to j, we want to find the value for the flows that minimizes the cost shown in Eq. (8).

$$\begin{aligned} \sum _{i}\sum _{j} f_{ij} d_{ij} \end{aligned}$$
(8)

Where \(d_{ij}\) is the distance between the values of the random variables for bins i and j in their respective histograms.

The optimization of the coefficients \(f_{ij}\), subject to mass conservation constraints, is computed in practice using the network simplex algorithm. Once the coefficients are found, the 1-Wasserstein distance is obtained by Eq. (9).

$$\begin{aligned} W(x,y) = \frac{\sum _{i}\sum _{j}f_{ij} d_{ij}}{\sum _{i} \sum _{j} f_{ij}} \end{aligned}$$
(9)
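In practice, the 1-Wasserstein distance between two histograms sharing a common binning can be computed with an off-the-shelf routine; the sketch below uses scipy.stats.wasserstein_distance and, as a simplifying assumption of our own, takes the bin indices as the bin positions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_hist(x, y):
    """1-Wasserstein distance between two normalized histograms on common bins.

    Bin positions are taken as the integer bin indices; in an application they
    would be the actual bin midpoints.
    """
    bins = np.arange(len(x))
    return wasserstein_distance(bins, bins, u_weights=x, v_weights=y)
```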

The Hellinger distance, closely related to the Bhattacharyya distance, is a measure of distance between two probability distributions. It belongs to the family of f-divergences. f-divergences, also known as statistical or information divergences, constitute a family of mathematical functions that measure the difference between two probability distributions. They were introduced by Csiszár in the 1960s and have since been widely used in information theory, statistics, and machine learning [57].

The general form of an f-divergence between two probability distributions A and B is:

$$\begin{aligned} D_f(A \Vert B) = \int f\left( \frac{dA}{dB}(x)\right) dB(x) \end{aligned}$$
(10)

The Hellinger distance, for discrete distributions, is defined as

$$\begin{aligned} D_{\textrm{Hellinger}}(A,B) = 1/\sqrt{2} \sqrt{ \sum _{i = 0}^{d} \left( \sqrt{A_{i}} - \sqrt{B_{i}}\right) ^{2} } \end{aligned}$$
(11)

The Hellinger distance has been widely applied in data science. For example, it has been used in the comparison of species distribution maps [58]. One of its advantages in certain contexts is that \(D_{\textrm{Hellinger}}\) emphasizes the differences in individual attributes. This is a rather relevant aspect for compositional data.
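A direct implementation of Eq. (11) is straightforward; the following sketch assumes the two arguments are normalized histograms of the same length.

```python
import numpy as np

def hellinger(a, b):
    """Hellinger distance between two discrete distributions, Eq. (11)."""
    return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)) / np.sqrt(2.0)
```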

The Jensen–Shannon distance is a measure of the similarity between two probability distributions. It is based on a symmetrized version of the Kullback–Leibler divergence. The Jensen–Shannon distance is calculated as the square root of the Jensen–Shannon divergence, which is in turn defined as the average of the Kullback–Leibler divergences between the two distributions and their average distribution. The Jensen–Shannon distance is given by the equation:

$$\begin{aligned} D_{\textrm{JS}}(A,B) = \left( \frac{1}{2} KL(A||C) + \frac{1}{2} KL(B||C)\right) ^{\frac{1}{2}} \end{aligned}$$
(12)

Where \(C = \frac{1}{2}(A+B)\) and KL is the Kullback–Leibler divergence.

In simpler terms, the Jensen–Shannon distance measures the distance between two probability distributions by comparing how much they differ from their average. When the base-2 logarithm is used, it has a value between 0 (indicating that the distributions are identical) and 1 (indicating that the distributions are completely dissimilar) [59].
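In practice the Jensen–Shannon distance can be obtained from scipy.spatial.distance.jensenshannon; the base-2 logarithm in the sketch below keeps the value in the [0, 1] range mentioned above.

```python
from scipy.spatial.distance import jensenshannon

def js_distance(a, b):
    """Jensen-Shannon distance, Eq. (12); the base-2 logarithm keeps it in [0, 1]."""
    return jensenshannon(a, b, base=2)
```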

The Hilbert distance is a useful distance measure for high-dimensional data, where traditional Euclidean distance measures can become less informative due to the curse of dimensionality. The Hilbert distance is defined as [33]:

$$\begin{aligned} D_{\textrm{Hilbert}}(A,B) = \log \frac{\max _{i \in \{1,\ldots ,d\}} \frac{A_{i}}{B_{i}} }{ \min _{j \in \{1,\ldots ,d\}} \frac{A_{j}}{B_{j}} } \end{aligned}$$
(13)

This distance is computed over convex domains, which makes it convenient to work with in the probability simplex. The Hilbert distance is also relatively fast and efficient to compute, especially compared to other distance measures that have a higher computational complexity [60].
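A minimal sketch of Eq. (13) follows; the small constant added to both arguments is our own smoothing choice to keep the component-wise ratios finite when a coordinate is zero.

```python
import numpy as np

def hilbert_distance(a, b, eps=1e-12):
    """Hilbert projective distance on the simplex, Eq. (13).

    eps keeps the component-wise ratios finite when a coordinate is zero.
    """
    r = (np.asarray(a, dtype=float) + eps) / (np.asarray(b, dtype=float) + eps)
    return np.log(r.max() / r.min())
```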

The Fisher–Hotelling–Rao distance is a statistical distance used to compare two multivariate distributions. It is defined as [33]:

$$\begin{aligned} D_{\textrm{FHR}}(A,B) = 2 \times \arccos \left( \sum _{i = 0}^{d} \sqrt{A_{i}B_{i}}\right) \end{aligned}$$
(14)

The FHR distance approximates the manifold in which the points seem to be embedded [61]. This metric distance is an instance of a Riemannian metric on the space of probability distributions. In the probability simplex, it allows for a comparison between points, since it provides a geodesic distance.
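Equation (14) translates directly into code; the clipping in the sketch below is a numerical safeguard of our own, since rounding may push the sum slightly above 1.

```python
import numpy as np

def fhr_distance(a, b):
    """Fisher-Hotelling-Rao geodesic distance on the simplex, Eq. (14)."""
    s = np.sum(np.sqrt(np.asarray(a, dtype=float) * np.asarray(b, dtype=float)))
    return 2.0 * np.arccos(np.clip(s, -1.0, 1.0))   # clip guards against rounding
```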

The effect of the distance function is shown in the probability simplex \(\Delta ^{2}\) in Fig. 3. Several concentric annuli were created so that the number of points on each of them is constant. Although the definition of an annulus does not vary, the effect of the distance function is clear, since each distance function produces annuli of rather distinct shapes. Since several anomaly detection algorithms rely on the concept of distance, it is only natural to wonder what is the most adequate choice of distance function, if any.

Fig. 3
figure 3

Annuli under different distances and divergences. The shape of the obtained annuli is affected by the applied distance

5 Results

Anomaly detection algorithms identify a certain attribute that is present in the majority of the vectors in a dataset, while being absent in a small proportion of the observations. Several anomaly detection algorithms rely, as discriminant attributes, on distance-related aspects, such as density and neighborhood. In order to examine the effect of the distance function on the capabilities of LOF (see Sect. 3) to detect anomalies in compositional data, we conducted three sets of experiments.

In the first group of experiments, points describing an annulus in the n–dimensional simplex were created. Then, a few additional points not fulfilling the criteria of being part of the annulus were added. The latter are to be identified as anomalies (see Fig. 4, top). In the second set of experiments, several normalized histograms derived from a fixed probability distribution were created and embedded in the probability simplex. Then, a few histograms from a different probability distribution were added to the dataset (see Fig. 4, bottom). Again, the latter points are to be identified as anomalies. This line of testing follows the ideas of Hawkins in his relevant book [62]. A third group of experiments comes from the codon usage problem. The core idea is that the DNA of an organism can be represented as a histogram of the relative frequency of use of each of the 64 possible triplets or codons. Organisms from the same family, say, primates, define the base or usual set. A few organisms from a different category, viruses, for instance, are to be considered anomalies. Across the three sets of experiments, we are interested in quantifying the impact of the assumed geometry of the data on the detection of anomalies. Since we know beforehand the label of each vector, we are in a position to evaluate the impact of the geometry, relying on a fixed anomaly detection algorithm.

Fig. 4
figure 4

A sketch of the test datasets. Top: in the first group, several annuli were created within the probability simplex (blue points). A few points were added to the probability simplex without fulfilling the pattern criteria (red). The latter are to be identified as anomalies. Bottom: several histograms were generated from a fixed probability function. Each histogram consists of n bins and is embedded into the probability simplex. A few histograms obtained from a different probability function are included to act as anomalies. In both groups, the usual or regular class is from 5 to 10 times more abundant than the anomaly class

In all three sets of experiments, LOF was tested under the distance functions described in the previous section; to remind the reader, the considered distances are: Euclidean, \(L_{1}\), Wasserstein, cosine, Aitchison, Hellinger, Jensen–Shannon, Fisher–Hotelling–Rao (FHR), and Hilbert. From the lof score, a further step is needed in order to decide whether a point is an anomaly or not. Since in our controlled experiments vectors are either anomalies or expected (usual) vectors, the decision is based on the expected value of the lof score. Let U be the set of all common or usual vectors, and A be the set of all anomalies. Let E(U) be the expected lof score of vectors in U, and E(A) be the expected lof score of vectors in A. These are computed in the traditional manner. Now, in order to compute the true positive TP, true negative TN, false positive FP and false negative FN rates, we need to compare the lof score of vector v, namely lof(v), with both E(A) and E(U).

Let d(v, A) be the absolute difference between the lof score of vector v and the expected value of anomalous vectors (A): \(d(v,A) = |E(A) - lof(v)|\). Correspondingly, let d(v, U) be the absolute difference between the lof score of vector v and the expected value of usual or common vectors (U): \(d(v,U) = |E(U) - lof(v)|\). The smallest difference will give the estimated class of vector v. If \(v \in A\), we should expect that \(d(v,A) < d(v,U)\). If this is the case, then the number of TP is increased, otherwise, FN is incremented. If, on the other hand, \(v \in U\), it is expected that \(d(v,U) < d(v,A)\). If this inequality is satisfied, TN is incremented, otherwise, FP is increased. From these rates, the performance metrics are computed as follows. Precision is given by \(TP/(TP + FP)\), recall is given by \(TP/(TP + FN)\), and accuracy is given by \((TP + TN) / (TP + TN + FP + FN)\).
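The decision rule and the resulting metrics can be summarized in the following sketch, where labels marks the known anomalies of our controlled experiments; the function name is our own.

```python
import numpy as np

def evaluate(lof, labels):
    """Classify each vector by the closest expected lof score and compute metrics.

    lof    : array of lof scores, one per vector.
    labels : boolean array, True for the known (planted) anomalies.
    """
    e_a, e_u = lof[labels].mean(), lof[~labels].mean()
    pred = np.abs(lof - e_a) < np.abs(lof - e_u)    # True: predicted anomaly
    tp = np.sum(pred & labels)
    tn = np.sum(~pred & ~labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(lof)
    return precision, recall, accuracy
```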

Algorithm 2 displays the steps to evaluate the capabilities of LOF to detect anomalies, under different distance functions.

figure b

The first group of experiments was conducted as follows. Points in \(\Delta ^{d}\) were generated with similar statistical properties, such as density, and a few points that do not fulfill those properties were added. The common or expected data were obtained from an annulus. A simple case is shown in Fig. 4, top, where the majority of the points define an annulus, and a few additional points are included, located in different regions of the simplex. The points in the annulus have the common property of being located within a certain distance range to a fixed point, namely, the center. The latter points are to be considered as anomalies, since their distance to the center does not fall within that range. We created different geometric structures in several dimensions, as explained below.
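A minimal sketch of how such a dataset can be generated follows. Rejection sampling from a uniform Dirichlet distribution is our own simple choice, and the radii and sample sizes are illustrative, not necessarily those used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_annulus(n, d, r_min, r_max):
    """Rejection-sample n points of the simplex whose Euclidean distance to the
    barycenter lies in [r_min, r_max]."""
    center = np.full(d, 1.0 / d)
    points = []
    while len(points) < n:
        p = rng.dirichlet(np.ones(d))
        if r_min <= np.linalg.norm(p - center) <= r_max:
            points.append(p)
    return np.array(points)

usual = sample_annulus(300, 3, 0.15, 0.30)       # points forming the annulus (U)
anomalies = sample_annulus(3, 3, 0.00, 0.05)     # points near the center (A)
```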

Figure 5 shows the results for the first set of experiments: the expected precision, recall and accuracy for points defining an annulus in the probability simplex \(\Delta ^{d}\), with \(d = 3, 5, 10, 20\). The number of points in the annulus was \(100 \times d\), that is, \(\#(U) = 100 \times d\), and the number of anomalies was \(\#(A) = d\) for each case. In all cases, the parameter k for LOF was fixed to \(k = \#(U) + 2\). This value was fixed for two reasons. The first one was to reduce the search space, and the second and most important one was to focus our efforts on the effect of the assumed geometry in the detection of anomalies, and not on identifying the best choice of parameters. The latter would be an extension of this contribution, but it is not relevant at this point. What is relevant, for this first set, is how the metrics are affected by the distance. It is observed that the Fisher–Hotelling–Rao and the Jensen–Shannon distances present the highest recall, followed by the Hellinger, Aitchison and cosine distances. Specificity has a less stable pattern along the four considered cases. Specificity was higher in general than precision and recall, and again, Jensen–Shannon presented the highest rates. The same pattern is maintained for precision.

An interesting observation is that the Euclidean distance shows the lowest capabilities. This is hardly a surprise, since it is well known that in many machine learning and data science applications the assumption of a Euclidean geometry is not the best choice. However, it is relevant to observe how far behind the performance of this distance lags as compared to more adequate choices.

Fig. 5
figure 5

Precision, recall and accuracy of anomaly detection for the dataset of points defining an annulus in the probability simplex \(\Delta ^{d}\), for \(d = 3, 5, 10, 20\)

Note that for the case of the annulus, it is quite clear which attribute tells apart anomalies from expected or regular vectors. An annulus is defined as the set of points located between two concentric metric balls of radius \(\theta _{1}\) and \(\theta _{2}\), respectively. That is, all points located at a distance r from a center such that \(\theta _{1}< r < \theta _{2}\), assuming the center is at the origin, are part of the annulus. In the vast majority of relevant applications, however, it is almost never clear which property of the data should be considered to tell apart anomalies from usual or regular data. What anomaly detection algorithms aim to do is to approximate such a property via a discriminant function, which is implicitly represented in the assumptions behind the algorithm. Following this criterion, two more demanding experiments were conducted.

The second set of experiments, following [62], consisted of creating histograms from a pure or base distribution. This distribution is fixed, and a histogram is created by sampling from it. More than 50 such histograms were created. Then, a few histograms from a different distribution were added to the dataset. The latter are to be identified as anomalies. Figure 4, lower panel, depicts the creation of this dataset. The probability functions that were considered are the Gaussian, geometric, uniform, gamma, Gumbel, and Weibull distributions, with different parameters, except for the uniform distribution.
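The construction of this dataset can be sketched as follows; the binning range, sample sizes and distribution parameters are illustrative choices of our own.

```python
import numpy as np

rng = np.random.default_rng(3)

def histogram_dataset(base_sampler, anomaly_sampler, n_usual, n_anom, n_bins,
                      n_samples=1000):
    """n_usual histograms from the base distribution plus n_anom planted anomalies."""
    edges = np.linspace(-5, 5, n_bins + 1)          # common binning for all histograms

    def hist(sampler):
        counts, _ = np.histogram(sampler(n_samples), bins=edges)
        return counts / max(counts.sum(), 1)

    X = np.array([hist(base_sampler) for _ in range(n_usual)]
                 + [hist(anomaly_sampler) for _ in range(n_anom)])
    labels = np.array([False] * n_usual + [True] * n_anom)   # True marks an anomaly
    return X, labels

# Example: Gaussian base class with Gumbel anomalies, b = 10 bins
X, y = histogram_dataset(lambda n: rng.normal(0, 1, n),
                         lambda n: rng.gumbel(0, 1, n),
                         n_usual=50, n_anom=10, n_bins=10)
```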

The results for the second set of experiments are shown in Fig. 6. There, the base or pure probability function was sampled and a normalized histogram of b bins was created, with \(b = 10, 20, 30, 40\). The parameter b defines the dimension of the probability simplex. The number of histograms for each case in U (the expected or usual class) was \(5 \times b\), whereas the number of anomalies was set to b. The rationale behind this selection of parameters is that we are interested in comparing the effects of the assumed geometry and not the algorithm itself. That is, since the algorithm is held constant and only the distance function is varied, it is fair to vary the relative sizes of U and A as specified. The parameter k was set to \(|U| + 2\). By making the number of anomalies equal to the number of bins, that is, by making the number of anomalies equal to the dimension of the probability simplex, we reduce the search space and focus on what we aim to prove: whether or not the assumed geometry has an effect on the anomaly detection process.

The distributions that were considered in this experiment were Gaussian, geometric, Gumbel, Weibull, Gamma, and uniform. For each case (except for the uniform), the relevant parameters were selected at random for every experiment. Figure 6 shows the aggregate results over all cases.

Fig. 6
figure 6

Precision, recall and accuracy for anomaly detection of the normalized histograms dataset. Each point in the simplex represents a histogram computed from a given probability function. A collection of U histograms is selected from a base or pure distribution (shown as the label of each panel), and a few histograms from a different distribution define the anomaly class A. LOF is then applied to the points in the simplex (\(A + U\)). The expected precision, recall and accuracy over all numbers of bins and all cases are shown

Figure 6 summarizes the effect of the distance function on the performance of LOF. Several hundreds of experiments were conducted. For each one, specific U and A sets were randomly selected. When the probability distribution was the same for the base and anomalous cases, different parameters were forced (except for the uniform distribution). As before, U defines the base class (non-anomaly) and A defines the anomaly class. For a fixed U class, several cases were conducted varying A and the relevant parameters. LOF was applied to the union of sets A and U, and from the lof value of each vector, the accuracy, recall and precision were computed. Then, for each base distribution, the expected recall, precision and accuracy were computed over all cases and bins.

It is observed that the uniform distribution reached the highest performance metrics for all considered distances. This is expected, since this distribution is the least similar to the rest and comes from a completely different family. Since anomaly detection algorithms aim to identify data that are not similar to the rest within a dataset, when the base case consists of histograms from a uniform distribution, almost any histogram from almost any other distribution will be detected as not similar to the base ones. In other words, detecting as anomalies those histograms derived from a distribution different from the uniform is a relatively easy task under the considered algorithm and for all the considered distances. On the other hand, the Gaussian distribution shows the lowest performance metrics for all distances.

We are not interested in the specific details of telling apart distributions; rather, once again, we aim to identify whether a given distance is in general more suitable to be applied in order to detect anomalies. In that sense, the Fisher–Hotelling–Rao and Jensen–Shannon distances are the ones with the best performance for all distributions. In short, the anomalies can be better spotted, at least in this context, by distances that are derived from Information Geometry.

The third experiment is perhaps the most interesting one. We applied LOF to the codon usage of thousands of organisms. In this experiment, a relatively large group of elements (organisms) was sampled from the same phylum and linked to the expected set U, whereas the anomaly candidates that constitute the set A were obtained from a different type of organism. The expected or usual organisms may be, for example, the codon usage of several primates, whereas the anomalies are obtained from bacteria. By doing this, we assume that the second group is the anomaly class, and we are thus able to test the performance of LOF based on different distance functions.

For this group of experiments, we show the results of the analysis over the codon usage database [63, 64]. In nature, there are 20 amino acids and four nucleotides. Amino acids are the building blocks of proteins [65]. In order for organisms to code all possible amino acids, sequences of length three are needed. In this way, 64 (\(4^{3}\)) possible triplets are available from an alphabet of four symbols (nucleotides), enough to encode the 20 amino acids. Sequences of shorter lengths cannot represent all 20 amino acids: sequences of length one can code for only four amino acids, and sequences of length two can code for up to 16 amino acids, so the minimum length of nucleotide sequences is three [66].

The equivalence between the nucleotide code and the amino acid code is known as the genetic code [63]. Each of the 64 sequences is described by the relative frequency of its appearance within the genome. For example, the sequence ATC, which indicates adenine followed by thymine followed by cytosine, appears on average 20.8 times for every 1000 triplets in the human genome. From this perspective, each organism can be thought of as a point in a probability simplex embedded in a space of dimension 64. The relative frequency of appearance of each possible codon or triplet is known as codon usage [67]. Phylogenetically related organisms tend to have a similar codon usage, whereas unrelated organisms are more likely to show a rather different codon usage. Since codon usage can be represented by normalized histograms, it is possible to validate the approach described in this contribution on that dataset.

Several relevant questions arise once biology is put in the framework of data science and Information Geometry. In this contribution, however, we focus our efforts on detecting the geometries that are best suited to identify anomalies within the codon usage of several organisms. Figure 7 shows the performance metrics of applying LOF under the specified distances to the codon usage dataset.

Fig. 7
figure 7

Precision, recall and accuracy for anomaly detection of the codon usage dataset. Each point in the simplex represents the codon usage histogram of an organism. A collection of U histograms is selected from a base group of organisms, and a few histograms from a different group define the anomaly class A. LOF is then applied to the points in the simplex (\(A + U\)). The expected precision, recall and accuracy over all cases are shown

Figure 7 shows the performance metrics of LOF under the considered distances. Four groups of organisms were considered: primates, bacteria, viruses, and invertebrates. The sizes of these sets are 418, 4918, 4097, and 3536 organisms, respectively. Several Monte Carlo iterations were performed. For each iteration and base or usual class, selected from the four phyla, a sample of size one tenth of the size of that group was selected. This defined the set U. Then, between \(\frac{|U|}{20}\) and \(\frac{|U|}{10}\) organisms from any of the remaining three classes were selected at random to be included in the set A. LOF was applied to the union of A and U, under the different distances, as in the case of the two previous experiments. 25 Monte Carlo experiments were performed for each of the four categories of organisms, and the expected performance metrics are shown in the figure.
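A single Monte Carlo trial of this protocol can be sketched as follows. The groups dictionary is a hypothetical structure holding the 64-dimensional codon-usage vectors per taxon (it would have to be built from the public dataset [63, 64]), and lof_scores and evaluate refer to the sketches given earlier.

```python
import numpy as np

rng = np.random.default_rng(4)

def codon_usage_trial(groups, base_name, dist, k):
    """One Monte Carlo trial: sample U from the base taxon, plant anomalies from
    another taxon, run LOF and return (precision, recall, accuracy).

    groups maps a taxon name to an array of 64-dimensional codon-usage vectors
    (a hypothetical pre-loaded structure).
    """
    base = groups[base_name]
    U = base[rng.choice(len(base), size=len(base) // 10, replace=False)]
    other = str(rng.choice([g for g in groups if g != base_name]))
    n_anom = rng.integers(len(U) // 20, len(U) // 10 + 1)
    A = groups[other][rng.choice(len(groups[other]), size=n_anom, replace=False)]
    X = np.vstack([U, A])
    labels = np.array([False] * len(U) + [True] * len(A))
    scores = lof_scores(X, k=k, dist=dist)      # LOF sketch from Sect. 3
    return evaluate(scores, labels)             # metrics helper given above
```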

Aligned with the previous two datasets, the Jensen–Shannon and Fisher–Hotelling–Rao distances achieved the best results. For this case, the Hellinger and Wasserstein distances offered results almost as good as those achieved by the Jensen–Shannon and the Fisher–Hotelling–Rao distances. Interestingly, the Euclidean distance does not perform well compared to the rest. Also interesting is the observation that primates as the base or expected class were the case with the lowest performance metrics for all distances. This is of biological relevance, but it is also a good opportunity to show what we think is an important corroboration via a mathematical approach: the primates are distributed in the probability simplex in such a manner that organisms belonging to other families cannot be identified as anomalies, no matter what geometrical assumptions are maintained. This is also the case, although to a lower extent, for invertebrates. Figure 8 shows the results of applying Principal Component Analysis (PCA) to the codon usage data of all 12,969 organisms, after a log-centered transformation. It is observed that, indeed, primates are located in at least two groups, which makes it difficult for LOF to properly detect anomalies when the base class U is composed of heterogeneous instances.
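The transformation used for Fig. 8 can be sketched as follows; the small constant added before taking logarithms is our own guard against zero frequencies.

```python
import numpy as np
from sklearn.decomposition import PCA

def clr_pca(X, n_components=2, eps=1e-12):
    """Log-centered (clr) transform followed by PCA, as used for Fig. 8."""
    logX = np.log(np.asarray(X, dtype=float) + eps)
    Z = logX - logX.mean(axis=1, keepdims=True)     # center each composition
    return PCA(n_components=n_components).fit_transform(Z)
```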

Fig. 8
figure 8

PCA, after a log-center transformation, of codon usage data. All organisms in the dataset reported in [63] were included in the visualization

The results described in this section characterize the capabilities of LOF as an anomaly detection algorithm under different distances. We have shown that the performance metrics of the same algorithm vary substantially depending on the selected distance function. In particular, and giving evidence for our initial hypothesis, the Euclidean distance is not the best choice when trying to identify anomalies in compositional data.

6 Discussion and conclusions

What we know nowadays as data science has been around for at least three decades, but its origins can be traced back to at least the works of Pearson in Statistics [68], Poincaré in Geometry and Topology [69], Lloyd in vector quantization [70], Shannon in Information Theory [71], and many more contributors who have enriched the field and applied it to several practical problems. What is causing, in our opinion, a deep change in data science is the inclusion of a different set of mathematical tools with two main objectives. The first one is to conduct exploratory data analysis on more robust grounds. The second goal is to achieve more explainable models.

Exploratory data analysis is an instance of unsupervised learning. Unsupervised learning explores the distribution of vectors in the (high-dimensional) attribute space with the aim of finding relevant patterns and structures. One such pattern is the set of vectors that do not fulfill a certain criterion that is common to the vast majority of the analyzed vectors. The vectors that are different from the majority are candidate anomalies. The anomaly detection problem is an open task basically because it is an ill-posed problem: the number of solutions exceeds the number of free parameters, and, moreover, those parameters are not known. A vector can be an anomaly under certain assumptions, and not under a different set of constraints. In other words, it is not clear what the correct criterion to compare vectors is, and it is not clear how to identify the threshold to decide whether vectors are different enough to be considered anomalies or not.

Along an equally critical path, the majority of anomaly detection algorithms operate under certain assumptions about the underlying geometry of the feature space that do not always hold. Almost all anomaly detection algorithms that operate in terms of distance-based criteria assume a Euclidean geometry, regardless of the nature of the data. In this contribution, we have investigated the role of different geometries and the associated distance functions for one anomaly detection algorithm applied to compositional data. Compositional data is defined as a collection of elements whose descriptions or attributes add up to a fixed quantity, such as normalized histograms. Compositional data can be embedded in a probability simplex, and, derived from the properties of such a structure, different geometries can be assumed. We tested Local Outlier Factor, a well-known anomaly detection algorithm, under different distance functions, over different compositional datasets.

We investigated the effect of several distances and divergences on the detection capabilities of Local Outlier Factor. This is a well-known anomaly detection algorithm, and its distance-based nature served our purpose of quantifying the effect of the chosen distance on the detection of anomalies. In particular, we tested the performance of this algorithm under different geometries and distances or divergences. First, we considered Information Geometry (IG) under the Jensen–Shannon distance. The second approach of interest was Riemannian geometry, represented by the Fisher–Hotelling–Rao metric distance. A third geometry of interest was the Hilbert projective geometry, achieved by the Hilbert metric distance. The fourth geometry was the norm geometry, represented by the \(L_{1}\) metric distance. Besides these distances and divergences, we explored the effect of the Hellinger, Aitchison, and Wasserstein geometries.

We conducted three sets of experiments in which the task was to identify anomalies via Local Outlier Factor under different geometries. In particular, the datasets consisted of compositional data, which can be embedded in the probability simplex. From there, we answered the question of which points within the probability simplex are to be considered anomalies. In all three groups of experiments, it was observed that the Jensen–Shannon and the Fisher–Hotelling–Rao distances led to the best performance metrics. The Wasserstein, Hellinger, and Aitchison distances also displayed good results, almost all better than those obtained when the Euclidean distance was considered.

Our hypothesis in this contribution was that distances obtained from Information Geometry were a better choice than the usual Euclidean distance to detect anomalies for compositional data. We have provided evidence that the hypothesis can be accepted, although, of course, a more formal proof is the next step.