Comparison-based centrality measures

Recently, learning only from ordinal information of the type “item x is closer to item y than to item z” has received increasing attention in the machine learning community. Such triplet comparisons are particularly well suited for learning from crowdsourced human intelligence tasks, in which workers make statements about the relative distances in a triplet of items. In this paper, we systematically investigate comparison-based centrality measures on triplets and theoretically analyze their underlying Euclidean notion of centrality. Two such measures already appear in the literature under opposing approaches, and we propose a third measure, which is a natural compromise between these two. We further discuss their relation to statistical depth functions, which comprise desirable properties for centrality measures, and conclude with experiments on real and synthetic datasets for medoid estimation and outlier detection.


Introduction
Assume we are given a finite dataset D = {x 1 , . . . , x n } in a metric space (X, d), but do not have access to an explicit representation or the pairwise distances. Instead, the only information available for any triplet (x, y, z) in D is the answer to the triplet comparison
$$d(x, y) < d(x, z). \tag{1}$$
In many applications of data analysis, such as human intelligence tasks for crowdsourcing, only ordinal information as in Eq. (1) is naturally available. It is difficult for humans to determine the distance between items in absolute terms, and their answers will have a large variance. Relative statements, however, are easier to make and more consistent. For example, in Fig. 1, image x is obviously central in the presented triplet, because it contains both snow and trees, but it is hard to quantify this relationship.
Comparison-based settings have recently received increasing attention in the machine learning community; for an overview, see Kleindessner and von Luxburg [18, Section 5.1].
An important task in statistics is to summarize the distribution of the data into a few descriptive values. In this paper, we focus on measures of centrality, the most prominent examples in a standard setting being the mean, the median, and the mode. While those summary statistics are readily computed given an explicit representation of the dataset, it is much less obvious how to do so when only ordinal information is available, or whether it is possible at all.
Learning a centrality measure based on triplet comparisons already appears in the literature under two opposing approaches: Heikinheimo and Ukkonen [12] propose a score that penalizes points for being the outlier in a triplet, while Kleindessner and von Luxburg [18] reward points for being central in a triplet. The latter motivate their central-based score by relating it to k-relative neighborhood graphs and a statistical depth function, which cannot be done for the outlier-based score. Figure 1 shows a triplet with its corresponding central point and outlier. Both scores are intuitively appealing as a measure of centrality, but it is not obvious how they relate to other known centrality notions. Furthermore, there is a third kind of point in a triplet next to central point and outlier, namely the remaining point that opposes the middle side (compare y in Fig. 1). This point scores 0 for both scores, because they only use a binary distinction with respect to outlier or central point.
Our main contributions are summarized as follows:
• Introduction of rank score. We propose the rank score, a natural compromise between outlier and central score that considers all three possible scenarios in a triplet.
• Characterization of central points. We establish a theoretical connection between the three scores and their corresponding scoring probabilities. Based on this connection, we characterize the most central point with respect to each score for one-dimensional distributions. In doing so, we improve on a result from Heikinheimo and Ukkonen [12] for the outlier score.
• Statistical depth functions. We show that the rank score is not a statistical depth function, but nevertheless satisfies a weaker version of the violated properties.
• Experiments. On two image datasets, we evaluate the three scores for the tasks of medoid estimation and outlier identification and find that their estimations are similar to the ones given by the average Euclidean distance. On synthetic datasets, we highlight two respective weaknesses of outlier and central score, which are mitigated by the rank score.
After a brief summary of related work in Sect. 2, we define the scores in Sect. 3 and establish the connection to their scoring probabilities. In Sect. 4, we characterize the central points for one-dimensional distributions, and then in Sect. 5 investigate the relation between our proposed rank score and statistical depth functions. We conclude with experiments on real and synthetic datasets in Sect. 6.

Related work
There exist various different approaches for learning from triplet comparisons on a multitude of tasks. An indirect approach is to first learn a Euclidean embedding for the dataset that satisfies the ordinal constraints and then use standard machine learning algorithms on the explicit representation. This can be done by a max-margin approach [1] or by maximizing the probabilities that all triplet constraints are satisfied under a stochastic selection rule using a Student-t kernel [30]. Other variants of this problem include the active setting [26] and learning hidden attributes in multiple maps [2]. There exists some theory for the consistency [3,14,15] of such embeddings, also in the case of noisy [13] and local [27] comparisons. This approach has a number of drawbacks: existing embedding algorithms do not scale well, they make the implicit assumption that the points lie in a Euclidean space, the quality of the embedding depends on the embedding dimension, and this intermediate step in the learning procedure introduces additional distortion; for a discussion, see Kleindessner and von Luxburg [18,Section 5.1.2]. For instance, our problem of determining the mean (or similarly the median or the mode) of the underlying distribution of a dataset could be tackled this way: we embed the points using one of the methods above, compute the mean in the Euclidean space, and then return the datapoint whose embedding is closest.
A more direct approach than embedding the dataset is the quantization of the ordinal information by learning a distance metric from a parametrized family with a max-margin approach [24] or by using kernels that only depend on triplet comparisons [17]. Lately, however, the trend is to avoid any such intermediate steps: Heikinheimo and Ukkonen [12] and Kleindessner and von Luxburg [18], which we consider in this work, find the medoid with score functions that are computed directly from triplet comparisons. Ukkonen et al. [29] build on this idea for nonparametric density estimation. As an alternative to a generic embedding approach, both scores provide parallelizable direct approaches to learning from triplet comparisons that are tractable on medium-sized datasets, where the embedding approach is not. For clustering, Ukkonen [28] formulates and approximates a variant of correlation-clustering [4] that takes a set of relative comparisons as input; hierarchical clustering is addressed in a setting where one can actively query ordinal comparisons [7,33], and also in the passive setting [9]. To estimate the intrinsic dimension of the dataset, Kleindessner and von Luxburg [16] propose two statistically consistent estimators that only require the k nearest neighbors for each point. Haghiri et al. [10] construct decision trees based on triplet comparisons to find those nearest neighbors and then extend this idea for classification and regression [11]. Another approach to classification is to aggregate triplet comparisons into weak classifiers and boost them to a strong classifier [23].

Preliminaries
Let (X, d) be a metric space and D = {x 1 , . . . , x n } ⊂ X a finite set of n ∈ N points. For x ∈ X, let T x = {(x, y, z) | y, z ∈ D, x, y, z distinct} denote the set of all ordered triplets in D with x in the first position. For simplicity, we assume all distances between points in D to be distinct. We say that x is an outlier in (x, y, z) ∈ T x if it lies opposite the shortest side in the triangle given by the three datapoints, that is,
$$d(y, z) < \min\{d(x, y),\, d(x, z)\}.$$
Similarly, x is central in (x, y, z) ∈ T x if it lies opposite the longest side in the triangle, that is,
$$d(y, z) > \max\{d(x, y),\, d(x, z)\}.$$
We consider three different score functions that aim to measure the centrality of any datapoint with respect to a given dataset.
Definition 1 (Scores) Let x ∈ X be any point in the input space. With respect to a finite dataset D = {x 1 , . . . , x n } ⊂ X, we define the outlier score S O , the central score S C , and the rank score S R as
$$S_O(x) = \frac{1}{|T_x|} \sum_{(x,y,z) \in T_x} \mathbb{1}\{d(x,y) > d(y,z),\ d(x,z) > d(y,z)\},$$
$$S_C(x) = \frac{1}{|T_x|} \sum_{(x,y,z) \in T_x} \mathbb{1}\{d(x,y) < d(y,z),\ d(x,z) < d(y,z)\},$$
$$S_R(x) = \frac{1}{|T_x|} \sum_{(x,y,z) \in T_x} \mathbb{1}\{d(x,y) < d(y,z)\}.$$

The outlier score S O is a normalized version of the score considered in Heikinheimo and Ukkonen [12] and counts the number of triplets (x, y, z) ∈ T x in which x is the outlier; similarly, the central score S C is a normalized version of the score considered in Kleindessner and von Luxburg [18] and counts the number of triplets (x, y, z) ∈ T x in which x is the central point. The rank score S R , which we propose, compares the distance d(x, y) between the fixed point x and one test point y to the distance d(y, z) between both test points y and z. A corresponding crowdsourcing task would present the workers with two items A and B and then let them assign a third item C to the one which is more similar. From the perspective of an unordered triplet {x, y, z}, a point x scores 0 for being the outlier, 2 for being the central point (once in (x, y, z) and once in (x, z, y)), and 1 otherwise. With that, S R is the only one of the three scores that depends on the order of the triplet and distinguishes between all three possible cases. The name rank score was chosen because the score of x in a triplet {x, y, z} corresponds to the rank of d(y, z).
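As a minimal sketch of how the three scores could be computed, the following Python snippet evaluates them for every point of a small dataset. It assumes that each score is the fraction of ordered triplets in T x for which the corresponding event holds (outlier: d(x, y) and d(x, z) both exceed d(y, z); central: both are smaller; rank: d(x, y) < d(y, z)). The function names `triplet_scores` and `scores_for_dataset` are hypothetical, and Euclidean distances stand in for the answers to triplet comparisons; only comparisons between distances are ever used.

```python
import numpy as np

def triplet_scores(dist_x, dist_D):
    """Return (S_O, S_C, S_R) for one point x.

    dist_x : (n,) distances d(x, y) from x to every test point y
    dist_D : (n, n) pairwise distances d(y, z) between the test points
    """
    n = len(dist_x)
    dxy = dist_x[:, None]            # d(x, y), varies along rows
    dxz = dist_x[None, :]            # d(x, z), varies along columns
    valid = ~np.eye(n, dtype=bool)   # ordered pairs (y, z) with y != z
    outlier = (dxy > dist_D) & (dxz > dist_D) & valid
    central = (dxy < dist_D) & (dxz < dist_D) & valid
    rank = (dxy < dist_D) & valid
    m = valid.sum()                  # |T_x| = n (n - 1)
    return outlier.sum() / m, central.sum() / m, rank.sum() / m

def scores_for_dataset(points):
    """Scores of every datapoint, with the point itself removed from the test points."""
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    n = len(points)
    out = np.empty((n, 3))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        out[i] = triplet_scores(dist[i, others], dist[np.ix_(others, others)])
    return out

# Three points on a line: 1 is central, 3 is the outlier, 0 is the remaining point.
print(scores_for_dataset(np.array([[0.0], [1.0], [3.0]])))
```

In this toy example, the middle point 1 obtains (S O , S C , S R ) = (0, 1, 1), the point 3 obtains (1, 0, 0), and the point 0 obtains (0, 0, 0.5), which matches the scoring of 0, 2, and 1 per unordered triplet described above.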
To get an intuition for the behavior of the three scores, we plot them as a function of x ∈ R 2 with respect to three different two-dimensional datasets in Fig. 2. As expected, points in the middle of the dataset score low for the outlier score S O and high for the central score S C . The rank score S R behaves similarly to the central score S C , but transitions more softly from high to low scores.
The rank score is closely related to the outlier and the central score, as they are all based on triplets. The following proposition shows that it can be viewed as a compromise between the two: points are rewarded for being the central point in a triplet, but also penalized for being the outlier.
This yields the claimed equality
$$S_R(x) = \tfrac{1}{2} \bigl( S_C(x) - S_O(x) + 1 \bigr).$$

From now on, we will assume D = {X 1 , . . . , X n } to consist of i.i.d. (independent and identically distributed) samples drawn from some distribution P on X. That is, the samples are jointly independent and every X i is distributed as P. This allows us to treat the scores on D as random variables in order to analyze their statistical properties and relate them to other centrality measures such as mean, median, or mode. In our running example of a crowdsourcing task, the workers could be presented with random, independent samples from a large database of images X.
The i.i.d. assumption is strong, but central in statistical learning theory, because it allows for strong consistency results. For example in classification, i.i.d. data ensure that nearest neighbor classifiers and support vector machines asymptotically achieve the lowest possible risk [25,31].
An important quantity for our analysis is the probability of a fixed point scoring in a random triplet:

Definition 2 (Scoring probabilities) Let x ∈ X and Y , Z ∼ P be two independent random variables distributed according to P. Then, for each score, we define the corresponding scoring probabilities
$$q_O(x) = P\bigl(d(x,Y) > d(Y,Z),\ d(x,Z) > d(Y,Z)\bigr),$$
$$q_C(x) = P\bigl(d(x,Y) < d(Y,Z),\ d(x,Z) < d(Y,Z)\bigr),$$
$$q_R(x) = P\bigl(d(x,Y) < d(Y,Z)\bigr).$$

Because of the normalization, the scoring probabilities are equal to the expected score in a random set.
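Proposition 2 (Expected score) For a fixed x ∈ X and a set of i.i.d. random variables D = {X 1 , . . . , X n }, it holds that
$$E[S_O(x)] = q_O(x), \quad E[S_C(x)] = q_C(x), \quad E[S_R(x)] = q_R(x),$$
where the expectation is taken with respect to D. Furthermore,
$$q_R(x) = \tfrac{1}{2} \bigl( q_C(x) - q_O(x) + 1 \bigr).$$

Proof Follows directly from the linearity of the expectation and Proposition 1.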
The following theorem shows that all three scores concentrate around their mean in the large sample size limit.

Theorem 1 Let S ∈ {S O , S C , S R } be one of the three scores on a set of i.i.d. random variables D = {X 1 , . . . , X n } and q ∈ {q O , q C , q R } the corresponding scoring probability. Then, for n ≥ 2, ε > 0, and any x ∈ X, it holds that Equivalently, with probability greater than 1 − δ it holds that

Proof We obtain Eq. (2) by using Chebyshev's inequality and bounding the variance with simple combinatorial arguments, and Eq. (3) is merely a reformulation. We refer to Section A of the Appendix for a full version of the proof.
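The concentration of a score around its scoring probability can be illustrated with a small simulation. The sketch below is not part of the analysis: it assumes a two-dimensional standard Gaussian for P, a fixed query point x outside the dataset, and the rank score as the fraction of ordered pairs (y, z) with d(x, y) < d(y, z). The helper names `rank_score` and `rank_probability` and all sample sizes are our own choices.

```python
import numpy as np

def rank_score(x, data):
    """S_R(x): fraction of ordered pairs (y, z), y != z, with d(x, y) < d(y, z)."""
    n = len(data)
    dx = np.linalg.norm(data - x, axis=1)
    dd = np.linalg.norm(data[:, None] - data[None, :], axis=-1)
    hits = (dx[:, None] < dd) & ~np.eye(n, dtype=bool)
    return hits.sum() / (n * (n - 1))

def rank_probability(x, rng, m=200_000):
    """Monte Carlo estimate of q_R(x) for Y, Z i.i.d. standard Gaussian."""
    y = rng.standard_normal((m, len(x)))
    z = rng.standard_normal((m, len(x)))
    return np.mean(np.linalg.norm(y - x, axis=1) < np.linalg.norm(y - z, axis=1))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.0])
q = rank_probability(x, rng)

# Average absolute deviation |S_R(x) - q_R(x)| over 50 random datasets per size.
mean_dev = {}
for n in (20, 80, 320):
    devs = [abs(rank_score(x, rng.standard_normal((n, 2))) - q) for _ in range(50)]
    mean_dev[n] = float(np.mean(devs))
    print(n, mean_dev[n])
```

Under these assumptions, the average deviation should shrink as n grows, in line with a Chebyshev-type bound.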
Our analysis of the scores as centrality measures requires that we restrict our attention to rotationally invariant (symmetric) distributions P. Intuitively, a rotationally invariant distribution has a center such that rotations around this center do not change the distribution. For example, the standard Gaussian distribution N(0, I d ) on R d is rotationally invariant around the origin. Section 4 requires this additional assumption to make the analysis of outlier and rank score tractable (yet still generalizes previous results from Heikinheimo and Ukkonen [12], which only considers univariate Gaussians); Sect. 5 investigates whether the scores are statistical depth functions, whose definition requires rotational invariance. Note, however, that this assumption is only needed for the theoretical analysis, which justifies the scores as centrality measures in simple settings. The scores can also produce meaningful results on more general data such as images, see Sect. 6.

Definition 3 (Rotational invariance)
Let X ∼ P be a random variable on R d for some d ∈ N. We say that P (or X ) is rotationally invariant around c if X ∼ RX for any rotation R around c, where "∼" denotes "equal in distribution." A function f : R d → R is rotationally invariant around c if f(x) = f(Rx) for all x ∈ R d and any rotation R around c.

Since the scoring probabilities are based on distances, it is not surprising that they inherit the rotational invariance from rotationally invariant distributions. Without loss of generality, the center of rotation is assumed to be the origin, because distances are invariant under translations.

Proposition 3 (Rotational invariance of scoring probabilities)
Let P be a distribution on R d for some d ∈ N, which is rotationally invariant around the origin according to Definition 3. Then, the scoring probabilities q O , q C , and q R are also rotationally invariant around the origin.
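Proof Follows directly from the invariance of distances under rotations and the rotational invariance of P. For example, consider q O and any rotation R. For fixed x ∈ R d and Y , Z independently drawn from P, it holds that
$$q_O(Rx) = P\bigl(d(Rx,Y) > d(Y,Z),\ d(Rx,Z) > d(Y,Z)\bigr) = P\bigl(d(Rx,RY) > d(RY,RZ),\ d(Rx,RZ) > d(RY,RZ)\bigr) = q_O(x),$$
where the second equality uses that (Y , Z) and (RY , RZ) are equal in distribution, and the third uses that rotations preserve distances.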

Characterization of central points
The goal of this section is to understand the behavior of the score functions with respect to the underlying distribution of the data. In particular, we characterize the points that are declared as most central by the scores. For the outlier score, the most central point is the one that is penalized least often, that is, arg min x∈D S O (x). For central and rank score, this point is the one that is rewarded most often, that is, arg max x∈D S C (x) and arg max x∈D S R (x). Theorem 1 allows us to do the analysis for the corresponding scoring probabilities instead of the scores. Unfortunately, the scoring probabilities are tractable enough to do so only for one-dimensional distributions.
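The central scoring probability q C can be computed in closed form and analyzed without any further assumptions:

Proposition 4 (Central scoring probability) Let P be a distribution with density function p > 0, F its cumulative distribution function, and m its median, i.e., F(m) = 1/2. Then, it holds that
$$q_C(x) = 2 F(x) \bigl( 1 - F(x) \bigr). \tag{4}$$
In particular, q C is increasing on (−∞, m], decreasing on [m, ∞), and m is the unique global maximum of q C .

Proof By definition, q C (x) is the probability that x is the central point in the random triplet (x, Y , Z). In our particular case of d = 1, this can be reformulated to
$$q_C(x) = P(Y < x < Z) + P(Z < x < Y),$$
and since Y and Z are i.i.d. with cumulative distribution function F, we obtain Eq. (4). The derivative is given by
$$q_C'(x) = 2 p(x) \bigl( 1 - 2 F(x) \bigr),$$
and since p > 0 and F is increasing with F(m) = 1/2, the sign is
$$\operatorname{sign} q_C'(x) = \begin{cases} +1 & \text{if } x < m, \\ 0 & \text{if } x = m, \\ -1 & \text{if } x > m. \end{cases}$$
In particular, q C is increasing on (−∞, m] and decreasing on [m, ∞), wherefore the unique global maximum of q C is given by m as claimed.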
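The one-dimensional argument can be checked numerically. In one dimension, x is central in (x, Y, Z) exactly when it lies strictly between Y and Z, which for a continuous distribution has probability 2F(x)(1 − F(x)). The sketch below compares this expression with a Monte Carlo estimate; the standard Gaussian for P, the function names, and the sample size are assumptions made only for illustration.

```python
import math
import numpy as np

def normal_cdf(x):
    """CDF of the standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def q_central_closed_form(x):
    """2 F(x) (1 - F(x)) for the standard Gaussian."""
    F = normal_cdf(x)
    return 2.0 * F * (1.0 - F)

def q_central_monte_carlo(x, rng, m=400_000):
    """Estimate P(d(x,Y) < d(Y,Z) and d(x,Z) < d(Y,Z)) from m random pairs."""
    y = rng.standard_normal(m)
    z = rng.standard_normal(m)
    dyz = np.abs(y - z)
    return float(np.mean((np.abs(x - y) < dyz) & (np.abs(x - z) < dyz)))

rng = np.random.default_rng(0)
for x in (-1.0, 0.0, 0.7):
    print(x, q_central_closed_form(x), q_central_monte_carlo(x, rng))
```

The closed form equals 1/2 at the median x = 0 and decreases on both sides, and the Monte Carlo estimates should agree with it up to sampling error.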
Our next result covers the outlier scoring probability q O in a more restrictive setting, yet generalizes Theorem 1 in Heikinheimo and Ukkonen [12], which only treats univariate Gaussian distributions.

Proof Without loss of generality, we can assume m = 0 as argued before Proposition 3, which then yields the symmetry of q O . As Heikinheimo and Ukkonen [12] derived, the derivative of q O for fixed x ∈ R is given by For x < 0, we can lower bound a part of the second integral Λ : and with the symmetry of p around 0, we get that Using this lower bound in Eq. (5) yields that q O is decreasing on (−∞, 0]. By the symmetry of q O around 0, it also has to be increasing on [0, ∞) and thus 0 is the unique global minimum as claimed.
Using Proposition 2, the previous two propositions can be combined to a corresponding statement about the rank scoring probability.

Corollary 1 Under the assumptions of Proposition 5, q R is increasing on (−∞, m], decreasing on [m, ∞), and m is the unique global maximum of q R .

Proof Follows directly from the decomposition of q R in Proposition 2 and the results on the other scoring probabilities q O and q C in Proposition 4 and Proposition 5.
We tried extending this approach of directly computing the scoring probabilities by definition to distributions on R d ; although closed-form representations for the rank scoring probability q R and its derivative for rotationally invariant distributions P are available, they are not tractable enough to draw conclusions about the monotonicity or the global maximum of q R .

The rank score and statistical depth functions
The main motivation for the central score is its relation to the lens depth function [21], which is assumed to be a statistical depth function. It is important to note that this has yet to be proven, as Kleindessner and von Luxburg [18, Section 5.2] pointed out a mistake in the proof of properties P2 and P3 (defined below). The outlier score, on the other hand, is provably not related to a statistical depth function, because its scoring probability does not satisfy those two properties. A counterexample for this was already given by Heikinheimo and Ukkonen [12] for symmetric bimodal distributions in one dimension.
In this section, we show that the rank scoring probability also does not satisfy P2 and P3. However, Proposition 6 shows that it satisfies a weaker version for rotationally invariant distributions, which extends to the rank score in Theorem 2. For the remainder of this paper, we only consider X = R d for some d ∈ N.

Statistical depth functions
Denote by F the class of distributions on the Borel sets of R d and by F ξ the distribution of a given random vector ξ . The four desirable properties that an ideal depth function D : R d × F → R should possess, as defined in Zuo and Serfling [35], are:

P1. Affine invariance. The depth of a point x ∈ R d should not depend on the underlying coordinate system or, in particular, on the scales of the underlying measurements. For any random vector X in R d , any non-singular A ∈ R d×d and any b ∈ R d , it should hold that D(Ax + b, F AX+b ) = D(x, F X ).
P2. Maximality at center. For a distribution having a uniquely defined "center" (e.g., the point of symmetry with respect to some notion of symmetry), the depth function should attain maximum value at this center. This means D(θ, F) = sup x∈R d D(x, F) holds for any F ∈ F having center θ .
P3. Monotonicity relative to deepest point. As a point x ∈ R d moves away from the "deepest point" (the point at which the depth function attains maximum value; in particular, for a symmetric distribution, the center) along any fixed ray through the center, the depth at x should decrease monotonically. This means for any F ∈ F having deepest point θ , D(x, F) ≤ D(θ + α(x − θ), F) holds for all α ∈ [0, 1].
P4. Vanishing at infinity. The depth of a point x should approach zero as ‖x‖ approaches infinity. This means D(x, F) → 0 as ‖x‖ → ∞ for each F ∈ F.

Property P1 is not satisfied by any of the scores in this general form, but it holds for similarity transformations, that is, whenever A = r Q for a scalar r > 0 and an orthogonal matrix Q ∈ R d×d , because such transformations preserve inequalities between distances. P1 also holds for non-singular A ∈ R d×d with respect to the Mahalanobis distance, which itself depends on the distribution [21, Theorem 1]. In this case, the transformation preserves the distances.
The rank scoring probability does not satisfy P2 and P3 and is therefore not a statistical depth function. As a counterexample, consider the uniform distribution on {−5, −3, 3, 5}, which is rotationally invariant around 0 according to Definition 3. For this distribution, we have q R (0) = 8/16 < 9/16 = q R (2). This violates P2, because q R is not maximal at the center of rotation 0, and P3, because q R increases between 0 and 2 and is therefore not monotonically decreasing along the ray from 0 through 2.
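The two values in this counterexample can be verified by exhaustive enumeration, since Y and Z take only four values each. The following sketch uses q R (x) = P(|x − Y| < |Y − Z|) for independent Y and Z, uniform on the support; the function name `q_rank_discrete` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def q_rank_discrete(x, support):
    """Exact q_R(x) = P(|x - Y| < |Y - Z|) for Y, Z i.i.d. uniform on `support`."""
    hits = sum(abs(x - y) < abs(y - z) for y, z in product(support, repeat=2))
    return Fraction(hits, len(support) ** 2)

support = [-5, -3, 3, 5]
print(q_rank_discrete(0, support))  # 1/2, i.e. 8/16
print(q_rank_discrete(2, support))  # 9/16
```

The additional pair at x = 2 is (Y, Z) = (3, 5): the point 2 is closer to 3 than 3 is to 5, whereas the center 0 is not.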
The rank scoring probability satisfies P4 by Lebesgue's dominated convergence theorem, because 1 {d(x,y)<d(y,z)} → 0 as ‖x‖ → ∞ for any y, z ∈ R d .

Rank score is approximately decreasing
We now show that the rank scoring probability at least satisfies a weaker version of P2 and P3 in Proposition 6, which translates to the rank score itself in Theorem 2. First, we give an alternative formula for the rank scoring probability:

Lemma 1 (Formula for rank scoring probability) Let x ∈ R d , Y a random variable with distribution P, which is rotationally invariant around the origin, and q R defined on P. Let S and R x be two independent random variables, distributed

Proof Here, we provide a sketch of the proof; for a full version, we refer to Section B in the Appendix. The idea is to use the definition of q R and rotate the appearing areas of integration such that they can be described in terms of the marginal distribution Y 1 alone. Because P is rotationally invariant by assumption, doing so does not change the value of q R . Analyzing the involved rotations, which determine the distribution of R x , concludes the proof.
Our next result is a weak version of properties P2 and P3 from Sect. 5.1 and states that the rank scoring probability q R is at least approximately decreasing away from the center of rotation.
Proposition 6 (q R is approximately decreasing) Let Y 1 be the first coordinate of Y , which is distributed according to a distribution P that is rotationally invariant around the origin. Then, for any

Proof To derive the upper bound (7), we use the reformulation of q R provided in Eq. (6) in combination with bounds for the auxiliary function R x . For a complete proof, we refer to Section C of the Appendix.
The bound in Proposition 6 uses the tail of the marginal distribution of P. To control these tail probabilities, we consider a special class of distributions: Definition 4 (sub-Gaussian, [32,Section 3.4]) Let X ∼ P be a random variable on R d for some d ∈ N. We say that for all t ≥ 0 and suitable c x > 0.
Under the assumption that P is sub-Gaussian, we obtain a more specific bound for the rank scoring probability q R .
Corollary 2 (sub-Gaussian bound for q R ) Let Y 1 be the first coordinate of Y , which is distributed according to a sub-Gaussian distribution P that is rotationally invariant around the origin. Then, for any for all t ≥ 0 and suitable c y > 0. Under our assumption that P is rotationally invariant around the origin, Y being sub-Gaussian is even equivalent to only requiring condition Eq. (8) for y = e 1 , because the distribution of any one-dimensional marginal depends only on ‖y‖. Furthermore, Y 1 is necessarily symmetric and we obtain Denoting c := c e 1 in Eq. (8) yields We use this upper bound in Eq. (7) to complete the proof.
Combining Corollary 2 and Theorem 1 yields the main result of this section, which shows that under reasonable assumptions on the distribution P, the rank score S R is approximately decreasing for points x far away from the center of symmetry.

Theorem 2 (S R is approximately decreasing) Let P be a sub-Gaussian distribution on R d that is rotationally invariant around the origin. Then, for suitable c > 0, any 0 < δ < 1 and x, x′ ∈ R d with ‖x‖ ≤ ‖x′‖, the rank score S R , defined on a dataset D = {x 1 , . . . , x n } of n ≥ 2 i.i.d. samples from P, satisfies with probability greater than 1 − δ that Therefore, with probability greater than 1 − δ (on the set as claimed.

The bound in Eq. (9) depends on ‖x‖ and n, which are decoupled in the two terms exp(−c‖x‖²) and √(40/(δn)). The latter goes to 0 as n^(−0.5) in the large sample size limit and accounts for the error made by using a finite set of samples D. Although we would like the first term to be 0, the counterexample in Sect. 5.1 shows that q R is not always decreasing as a function of the norm, wherefore some dependency on ‖x‖ is necessary. It vanishes exponentially for datapoints of large norm; in this case, however, we already have the statement S R (x) → 0 by the property P4 ("vanishing at infinity") of q R and Theorem 1. There is no explicit dependency on the other datapoint x′ besides ‖x‖ ≤ ‖x′‖, because this is sufficient for bounding the auxiliary function R x in Lemma 2 in the Appendix.
As a consequence of Theorem 2, the score S R cannot blow up far away from the center of rotation with high probability. Since it uses high scores to infer centrality, this means that no obvious outliers are estimated to be central. A decreasing score S R would imply that our score always recovers the center of rotation as the most central point; in that sense, this proposition controls the distance of our estimation to the center.
We would like to improve the proposition by removing the ‖x‖ term for dimensions d > 1. As mentioned above, this term is necessary for d = 1, but simulations in higher dimensions suggest that the scoring probability q R is actually always decreasing.

Experiments
We have shown in Sect. 4 that the scores recover the median for one-dimensional symmetric distributions. But to what notion of centrality do the scores correspond in more general settings? We also observed in Sect. 5 that only the central score comes with the desirable property of being a statistical depth function. But does this imply that it performs better in practice? To answer these questions, we evaluate the scores on medoid estimation and outlier detection tasks on real and synthetic datasets. Medoid estimation aims to find the most central point in a dataset, which minimizes the average distance to other points. Outlier detection has the opposing goal of identifying points that are considerably different from the majority of the dataset. The experiments on real datasets in Sect. 6.1 show no notable difference between the three scores and suggest that they recover the Euclidean notion of centrality given by the average distance to the dataset. We then highlight some of the more subtle weaknesses of and differences between the scores on synthetic datasets in Sect. 6.2.

Image datasets
Datasets and preprocessing. Our first dataset is the well-known MNIST database of handwritten digits [20] reduced to 300 randomly chosen images per digit. Since these random subsets are unlikely to include clearly visible outliers, each subset was expanded by the 5 images with largest average Euclidean distance within each class. Our other dataset NATURE [22] consists of outdoor scene photographs for 8 landscape and urban categories (coast, forest, highway, inside city, mountain, open country, street, tall building). The number of images per category ranges from 260 to 410.
Ideally, we would like to compute the scores within each category based on triplets obtained in a crowdsourcing task. However, the cubic number of triplets makes this infeasible even for medium-sized datasets (300 images yield ≈ 4,500,000 triplets). Because we want to avoid any finite sampling effects induced by using only a subset of triplets, we use two rough proxies for triplets labeled by humans: triplets are computed based on the Euclidean distances (1) in the pixel space and (2) for a feature embedding given by a neural network. For the latter, we use the AlexNet architecture [19] pre-trained on ImageNet [6]. The feature embedding is then given by the network after removing the last three fully connected and softmax layers. This transfer-learning approach is motivated by the generality of representations learned in the first few network layers across different domains [5,34]. Note that although a Euclidean representation of the data is available, the scores can still only use the resulting triplet information to access the data.

Experiment setup and evaluation metric. We compare the scores in medoid estimation and outlier detection tasks for each class of both datasets and for both Euclidean spaces. To quantify the distance between the top 5 rankings, we use a normalized version of the averaging footrule distance F avg proposed by Fagin et al. [8], where low distance implies that the rankings are similar. This distance generalizes Spearman's footrule, which is the L 1 -distance between two permutations, to top k lists. It considers both the order and the number of shared elements, and values range from 0 for identical rankings to 1 for rankings on disjoint elements. As an example, it is F avg ( [1,2,3,4,5], [1,3,4,5,8]

Table 1: Numbers are bold if the respective unordered rankings agree on at least 4 out of the 5 images, and italic if they agree on at most one. In every case, the rankings agree on at least one image.
The low values in this table imply that all rankings are very similar; the expected distance between any fixed top 5 ranking and a random one on a domain with 300 elements is ≈ 0.98.

Evaluation. Figure 3 shows the results on two classes for each dataset. The most striking observation is that the predictions for central points and outliers are almost consistent across all four centrality measures. This shows that all three scores have comparable performance in medoid estimation and outlier detection tasks, despite their differences in theoretical guarantees. Since the score-induced rankings agree with those of the average Euclidean distance baseline, this observation also suggests that the scores recover the Euclidean notion of centrality. This is in line with our one-dimensional findings of Sect. 4: the scores recover the median of a distribution P as the most central point, and the median of a continuous distribution minimizes the expected distance E X ∼P [|x − X |] over all x ∈ R. These findings are supported by Table 1, which quantifies the distance between the rankings for all datasets. The small distances in Table 1a show that the top 5 rankings produced by the three scores are very similar, and the small distances in Table 1b show that these rankings agree with those given by the average Euclidean distance.
A second observation is that these findings are consistent across both feature representations of the datasets, the pixel space and the AlexNet embedding. Therefore, they cannot be dismissed as an artifact of the pixel space. Since triplets based on the AlexNet embedding might resemble the human notion of similarity more closely, it is possible that rankings on crowdsourced triplets display similar behavior.
The third observation concerns the quality of the estimated medoids and outliers and has already been made for the outlier score [12] and the central score [18]: estimated medoids are generic images of a class, while estimated outliers are more diverse. This also holds true for the rank score. For example, in Fig. 3d for the class forest, the central images are thematically homogeneous and all show green trees. The outliers contain different colors like green, red, and yellow and include diverse motifs like a person, trees, and water. The notion of centrality captured by the scores changes with the feature space, but the general concept of generic and diverse stays the same. For example, in Fig. 3a, the pixel space representation identifies almost exclusively thick digits as outliers, while the AlexNet embedding in Fig. 3b also includes thin, skewed digits.
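The ranking distance used for Table 1 can be sketched as follows. This is an illustration under two assumptions that are not confirmed by the text: that F avg places each item missing from a top k list at the average of the free ranks k + 1, . . . , 2k − z (with z the number of shared items), and that the normalization divides by 2k², the value attained for two disjoint lists. The function name `averaging_footrule` is hypothetical.

```python
def averaging_footrule(top1, top2):
    """Normalized averaging footrule between two top-k lists (assumed normalization)."""
    k = len(top1)
    if len(top2) != k or len(set(top1)) != k or len(set(top2)) != k:
        raise ValueError("both lists need k distinct items")
    pos1 = {item: r for r, item in enumerate(top1, start=1)}
    pos2 = {item: r for r, item in enumerate(top2, start=1)}
    shared = pos1.keys() & pos2.keys()
    # Items missing from a list get the average of the ranks k+1, ..., 2k - z.
    ell = (3 * k - len(shared) + 1) / 2
    total = 0.0
    for item in pos1.keys() | pos2.keys():
        total += abs(pos1.get(item, ell) - pos2.get(item, ell))
    return total / (2 * k * k)  # assumption: divide by the value for disjoint lists

print(averaging_footrule([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))   # 0.0
print(averaging_footrule([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))  # 1.0
```

With this choice, identical rankings have distance 0 and rankings on disjoint elements have distance 1, matching the range stated for the normalized distance.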

Synthetic datasets
In this section, we partly repeat experiments from Kleindessner and von Luxburg [18] on two simple synthetic datasets and additionally include the rank score. First, we demonstrate the relation between the scores and the average distance. We then highlight some of their more subtle differences for the tasks of medoid estimation and outlier identification.
Relation to average distance. Figure 4 shows the scores for a Gaussian dataset plotted against the average distance. As expected, the outlier score is roughly increasing as a function of the average distance, while the central and rank scores are roughly decreasing. Hence the scores serve as a proxy for the average distance, as already observed in Sect. 6.1. We can quantify this relationship with Spearman's rank correlation coefficient, which measures how well the relationship between two variables is described by a monotonic function. The respective values for outlier, central, and rank score are 0.997, −0.995, and −0.999, which implies an almost perfect monotonic relationship. The corresponding tail ends of the scores therefore contain candidates for the medoid and for outliers. However, this does not tell us how many outliers there are, if any at all, because this would require identifying gaps in the average distance. The scores are less sensitive to such gaps, because they are based on relative rather than absolute information. This can be observed in Fig. 4 for the two rightmost average distance values, whose gap is not accompanied by a corresponding gap in the scores. This behavior is further discussed in the next paragraph under outlier identification.
Medoid estimation and outlier identification. This paragraph highlights one weakness each of the outlier score and the central score. The rank score shows neither of them, as it is a compromise between the two.
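The rank correlation reported under "Relation to average distance" can be reproduced in spirit with a small simulation. The following is a minimal sketch, not the setup behind Fig. 4. It assumes that the outlier, central, and rank score of a point x are the fractions of pairs (y, z) with d(y, z) < min{d(x, y), d(x, z)}, d(y, z) > max{d(x, y), d(x, z)}, and d(y, x) < d(y, z), respectively. The dataset of 100 standard Gaussian points in two dimensions and all function names are hypothetical.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix for an (n, dim) array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def triplet_scores(dist):
    """Outlier, central, and rank score of every point (assumed definitions)."""
    n = dist.shape[0]
    outlier, central, rank = np.zeros(n), np.zeros(n), np.zeros(n)
    distinct = ~np.eye(n - 1, dtype=bool)  # keep only pairs with y != z
    for i in range(n):
        others = np.delete(np.arange(n), i)
        d_xy = dist[i, others][:, None]      # d(x, y), constant along z
        d_xz = dist[i, others][None, :]      # d(x, z), constant along y
        d_yz = dist[np.ix_(others, others)]  # d(y, z)
        # x is the outlier of (x, y, z): y and z are closest to each other.
        outlier[i] = (d_yz < np.minimum(d_xy, d_xz))[distinct].mean()
        # x is central in (x, y, z): y and z are farthest from each other.
        central[i] = (d_yz > np.maximum(d_xy, d_xz))[distinct].mean()
        # Seen from y, the point x is closer than z.
        rank[i] = (d_xy < d_yz)[distinct].mean()
    return outlier, central, rank


def average_ranks(values):
    """Ranks 0..n-1, with tied values sharing their average rank."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    return (tie_sums / counts)[inverse]


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]


rng = np.random.default_rng(0)
points = rng.standard_normal((100, 2))  # hypothetical Gaussian dataset
dist = pairwise_distances(points)
avg_dist = dist.sum(axis=1) / (len(points) - 1)
outlier, central, rank = triplet_scores(dist)
for name, score in [("outlier", outlier), ("central", central), ("rank", rank)]:
    print(name, round(spearman(score, avg_dist), 3))
```

The exact values depend on the sample, so the printed correlations should be read qualitatively: positive for the outlier score and negative for the other two.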
For the first task of medoid estimation, we consider the circular dataset shown in Fig. 6. Each score predicts the medoid by returning the point at its corresponding tail end. That is, the outlier score returns the point with the lowest score, whereas the central and rank scores return the point with the highest score. Central and rank scores correctly predict the origin, but the outlier score predicts the point indicated by the red circle. The heatmap for the outlier score shows that the area around the medoid is in fact a local maximum instead of the global minimum. This effect was already observed in one dimension for bimodal symmetric distributions, as discussed in Sect. 5, and attests to the fact that the outlier score is not a statistical depth function. A statistical depth function would declare points at the heart of the dataset as central, even if this region is sparse, because it ignores multimodal aspects of distributions. Another example in which the outlier score respects the multimodality of a dataset is given by the mixture of Gaussians in Fig. 2.
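The one-dimensional effect mentioned above can be illustrated on a toy example. This is a minimal sketch under stated assumptions, not the dataset of Fig. 6. It assumes the outlier score of x is the fraction of pairs (y, z) with d(y, z) < min{d(x, y), d(x, z)}, and it uses a hypothetical bimodal, symmetric dataset with a single point in the sparse region between two tight clusters. Only the outlier score is computed here.

```python
import numpy as np


def outlier_scores(dist):
    """Fraction of pairs {y, z} for which x is the outlier (assumed definition)."""
    n = dist.shape[0]
    scores = np.zeros(n)
    upper = np.triu(np.ones((n - 1, n - 1), dtype=bool), k=1)  # each pair once
    for i in range(n):
        others = np.delete(np.arange(n), i)
        d_x = dist[i, others]
        d_yz = dist[np.ix_(others, others)]
        is_outlier = d_yz < np.minimum(d_x[:, None], d_x[None, :])
        scores[i] = is_outlier[upper].mean()
    return scores


# Hypothetical toy data: two tight clusters and one isolated point between them.
left = np.array([-10.2, -10.1, -10.0, -9.9, -9.8])
points = np.concatenate([left, [0.0], -left[::-1]])
dist = np.abs(points[:, None] - points[None, :])

# True medoid: the point with minimal total distance to all other points.
medoid = points[np.argmin(dist.sum(axis=1))]

# The outlier score predicts the medoid as the point with the lowest score.
scores = outlier_scores(dist)
predicted = points[np.argmin(scores)]

print("true medoid:", medoid)
print("outlier-score prediction:", predicted)
print("score at the medoid:", scores[5], "largest score:", scores.max())
```

In this toy example the isolated point at 0 is the medoid, yet it is the outlier in every pair drawn from a single cluster. It therefore receives the largest outlier score in the dataset, and the prediction falls into one of the clusters.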
For the second task of outlier identification, Fig. 5a shows points from a standard Gaussian with two clearly visible outliers added by hand. Similar to medoid estimation, each score now proposes candidates for outliers at its other tail end, that is, the points with the highest outlier score and the points with the lowest central and rank scores. Figure 5b shows that both outliers are correctly placed at the corresponding tail end for all three scores. However, only the outlier and rank scores show a clear gap between the scores of the outliers and those of the other points. The lack of such a gap makes it hard to estimate the number of outliers from the central score.
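The outlier identification experiment can be mimicked with a few lines of code. The sketch below is not the data of Fig. 5. It assumes the same score definitions as in the cited literature (x is the outlier of (x, y, z) if d(y, z) is the smallest of the three distances, x is central if d(y, z) is the largest, and the rank score counts pairs with d(y, x) < d(y, z)). The positions of the two planted outliers are hypothetical.

```python
import numpy as np


def triplet_scores(points):
    """Outlier, central, and rank score of every point (assumed definitions)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n = len(points)
    outlier, central, rank = np.zeros(n), np.zeros(n), np.zeros(n)
    distinct = ~np.eye(n - 1, dtype=bool)  # keep only pairs with y != z
    for i in range(n):
        others = np.delete(np.arange(n), i)
        d_xy = dist[i, others][:, None]
        d_xz = dist[i, others][None, :]
        d_yz = dist[np.ix_(others, others)]
        outlier[i] = (d_yz < np.minimum(d_xy, d_xz))[distinct].mean()
        central[i] = (d_yz > np.maximum(d_xy, d_xz))[distinct].mean()
        rank[i] = (d_xy < d_yz)[distinct].mean()
    return outlier, central, rank


def gap(score, outliers_score_high):
    """Gap between the two planted points (last two rows) and the bulk."""
    bulk, planted = score[:-2], score[-2:]
    if outliers_score_high:
        return planted.min() - bulk.max()
    return bulk.min() - planted.max()


rng = np.random.default_rng(1)
bulk = rng.standard_normal((100, 2))
planted = np.array([[15.0, 15.0], [-15.0, 15.0]])  # hypothetical, placed by hand
points = np.vstack([bulk, planted])
outlier, central, rank = triplet_scores(points)

# Candidates sit at the tail ends: highest outlier score, lowest central/rank score.
print("outlier score:", np.argsort(outlier)[-2:].tolist(), gap(outlier, True))
print("central score:", np.argsort(central)[:2].tolist(), gap(central, False))
print("rank score:   ", np.argsort(rank)[:2].tolist(), gap(rank, False))
```

All three scores place the planted points (indices 100 and 101) at the expected tail end. The printed gaps make it possible to compare how clearly each score separates them from the bulk.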

Conclusion and future work
In this paper, we consider three comparison-based centrality measures on triplets, one of which we propose as a natural compromise between the other two. We provide a theoretical analysis of the scores for one-dimensional distributions to characterize their most central points and investigate their connection to statistical depth functions. We conclude with experiments for the tasks of medoid estimation and outlier detection: on image datasets, we demonstrate the behavior of the three scores and hint toward their connection to the average Euclidean distance. On synthetic datasets, we highlight two respective weaknesses of the existing two scores, which are mitigated by our proposed score.
As for future work, it is an open question whether the rank score is a statistical depth function for dimensions d ≥ 2. Similarly, fixing the proof for the lens depth function would solidify the motivation for the central score. We further plan to investigate and formalize the connection between the scores and the average Euclidean distance.

A Proof of Theorem 1
In this section, we prove the convergence of the scores to their corresponding scoring probabilities.

Theorem 1 (Concentration inequality for scores) Let S ∈ {S_O, S_C, S_R} be any score on a set of i.i.d. random variables D = {X_1, ..., X_n} and q ∈ {q_O, q_C, q_R} the corresponding scoring probability. Then, for n ≥ 2, ε > 0, and any x ∈ X, it holds that Equivalently, with probability greater than 1 − δ it holds that
Proof All three scores are of the generic form where I(x, X_i, X_j) is the corresponding indicator function taking values in {0, 1}. In order to use Chebyshev's inequality, we first upper bound the variance of S(x) using the known formula The variance and covariances in the formula above are upper bounded by 1, because the involved random variables only take values in {0, 1}. Since D consists of independent samples, the covariance is 0 if i, i′, j, j′ are distinct. Therefore, the covariances are only summed over the set A of all remaining pairs. Denoting by B := {(i, j) ∈ {1, ..., n}² | i ≠ j} the set of all possible pairs, we have
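The convergence stated in Theorem 1 can be observed empirically. The following minimal sketch makes several assumptions that are not part of the theorem: it looks only at the central score in one dimension, assumes that x is central for a pair if it lies strictly between the two sample points, and uses a standard Gaussian, for which the scoring probability is then 2Φ(x)(1 − Φ(x)). The function names are illustrative.

```python
import math
import numpy as np


def normal_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def central_probability(x):
    """Assumed q_C(x) in one dimension: P(x lies between Y and Z)."""
    return 2.0 * normal_cdf(x) * (1.0 - normal_cdf(x))


def central_score(x, sample):
    """Fraction of pairs {X_i, X_j} from the sample with x strictly between them."""
    n = len(sample)
    below = np.count_nonzero(sample < x)
    above = np.count_nonzero(sample > x)
    return below * above / (n * (n - 1) / 2)


def mean_deviation(n, x, repetitions, rng):
    """Average of |S(x) - q(x)| over independent datasets of size n."""
    q = central_probability(x)
    devs = [abs(central_score(x, rng.standard_normal(n)) - q)
            for _ in range(repetitions)]
    return float(np.mean(devs))


rng = np.random.default_rng(0)
for n in (25, 100, 400, 1600):
    print(n, round(mean_deviation(n, x=0.5, repetitions=300, rng=rng), 4))
```

The printed deviations shrink as n grows, which is the qualitative behavior the concentration inequality describes. The sketch does not check the constants of the bound.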

B Proof of Lemma 1
Lemma 1 (Formula for rank scoring probability) Let x ∈ R^d, Y a random variable with distribution P, which is rotationally invariant around the origin, and q_R defined on P. Let S and R_x be two independent random variables, distributed as
Proof Let x ∈ R^d and denote the marginal density of Y_1 by p_1. For clarity, we explicitly indicate the corresponding random variable at the density as in p = p_Y. By definition, Z )). Since Y and Z are i.i.d. and δ is symmetric in its arguments, the right hand side is equal to P(d(x, Y) < d(Y, Z)), and marginalizing over Y yields Because P is rotationally invariant around the origin, we can rotate the half-space H_x(y) for fixed y ∈ R^d, the area of integration for the inner integral, without changing its mass under P. In order to express q_R(x) solely by a marginal distribution of P, we rotate the half-space with the unique rotation R that yields the rotated set {z ∈ R^d | z_1 < R_x(y)} for an appropriate value of R_x(y). By doing so, we obtain A change of variables under the function y → R_x(y) yields which we describe in terms of independent random variables S ∼ Y_1 and R_x as , which we achieve by determining a closed form for R_x(y).
Let ξ_x(y) denote the unique point on ∂H_x(y) that satisfies R(ξ_x(y)) = R_x(y)e_1; this situation is depicted in Fig. 7. Because (y − x)/‖y − x‖ is the outward pointing normal vector for H_x(y) before rotating and e_1 is the outward pointing normal vector after rotating, we get that Because ξ_x(y) is the unique point that satisfies R(ξ_x(y)) = R_x(y)e_1, we obtain Since ξ_x(y), (x + y)/2 ∈ ∂H_x(y) are both points on the dividing hyperplane, they have to satisfy the equation 0 = ⟨ξ_x(y) − (x + y)/2, y − x⟩. Combined with Eq. (10), this yields or equivalently
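The rotation argument can be checked numerically. The sketch below rests on two assumptions that are not displayed in this excerpt: that H_x(y) is the set of points closer to x than to y (which matches the normal vector (y − x)/‖y − x‖ and the boundary point (x + y)/2 used above), and that the resulting offset has the closed form R_x(y) = (‖y‖² − ‖x‖²)/(2‖y − x‖). A standard Gaussian serves as a hypothetical rotationally invariant P, and the points x and y are arbitrary.

```python
import math
import numpy as np


def offset(x, y):
    """Assumed closed form R_x(y) = (|y|^2 - |x|^2) / (2 |y - x|)."""
    return (y @ y - x @ x) / (2.0 * np.linalg.norm(y - x))


def standard_normal_cdf(t):
    """Distribution function of the first coordinate of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 2.0])         # arbitrary illustrative points
y = np.array([0.3, 1.2, -0.7])
z = rng.standard_normal((200_000, 3))  # samples from the hypothetical P

# Mass of the half-space of points that are closer to x than to y.
closer_to_x = np.linalg.norm(z - x, axis=1) < np.linalg.norm(z - y, axis=1)
mass = closer_to_x.mean()

# After rotating the normal vector onto e_1, the same mass is P(Z_1 < R_x(y)).
rotated_mass = (z[:, 0] < offset(x, y)).mean()

print(mass, rotated_mass, standard_normal_cdf(offset(x, y)))
```

The three printed numbers agree up to Monte Carlo error, which is what rotating the half-space without changing its mass amounts to in this special case.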

C Proof of Proposition 6
The proof of Proposition 6 is based on the reformulation Eq. (6) of q_R, which uses an auxiliary function R_x. In order to bound q_R, we first provide bounds for R_x in the following lemma:
Lemma 2 (Bounds for R_x) Let x, x′ ∈ R^d with x′ = αx for some 0 ≤ α ≤ 1 and y ∈ R^d with y ≠ x, x′. Then, it holds that
⟨x, y⟩ ≤ ⟨x, x⟩ ⇒ R_x(y) ≤ R_{x′}(y),
⟨x, y⟩ ≥ 3/2 (1 + α) ⟨x, x⟩ ⇒ R_x(y) > R_{x′}(y).
The first part of this lemma yields a tractable bound for the set {R_x(Y) ≤ R_{x′}(Y)}, whereas the second part tells us that the bound cannot be improved substantially.
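The first implication of Lemma 2 lends itself to a randomized sanity check. The sketch below assumes the closed form R_x(y) = (‖y‖² − ‖x‖²)/(2‖y − x‖), which is not displayed in this excerpt, and draws x, y, and α at random. It is a plausibility check under that assumption, not a proof, and the function names are illustrative.

```python
import numpy as np


def offset(x, y):
    """Assumed closed form R_x(y) = (|y|^2 - |x|^2) / (2 |y - x|)."""
    return (y @ y - x @ x) / (2.0 * np.linalg.norm(y - x))


def check_first_implication(trials, dim, rng):
    """Count how often the premise holds and how often the conclusion then fails."""
    premise_holds, violations = 0, 0
    for _ in range(trials):
        x = rng.standard_normal(dim)
        y = 2.0 * rng.standard_normal(dim)
        alpha = rng.uniform(0.0, 1.0)
        if x @ y <= x @ x:  # premise <x, y> <= <x, x>
            premise_holds += 1
            # Conclusion R_x(y) <= R_{alpha x}(y), up to rounding error.
            if offset(x, y) > offset(alpha * x, y) + 1e-12:
                violations += 1
    return premise_holds, violations


rng = np.random.default_rng(0)
print(check_first_implication(trials=20_000, dim=3, rng=rng))
```

A violation count of zero is consistent with the lemma. A nonzero count would indicate that the assumed closed form differs from the one used in the paper.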

Proof For given x and y, we abbreviate c := ⟨x, x⟩ and d := ⟨y, y⟩. By definition of R_x, we have We distinguish between three different cases depending on the signs of both sides.
Case I: α²c < d < c. Here, the left hand side in Eq. (11) is negative, whereas the right hand side is positive, and the inequality holds trivially.
Next, we want to use the inequality ⟨x, y⟩ ≤ c. We have and therefore ⟨x, y⟩ ≤ c yields which completes this case.
Case III: d ≤ α²c. For the remaining case, we consider the function β → R_{βx}(y) and show that it is decreasing on [√(d/c), ∞). Since d ≤ α²c implies √(d/c) ≤ α, and α ≤ 1 by assumption, this yields the desired inequality R_{1x}(y) ≤ R_{αx}(y). We have Expanding the terms in f yields f(β, x, y) = −β³c² + (3β²c + d)⟨x, y⟩ − 3βcd (Cauchy-Schwarz) Therefore, by Eq. (14), the function β → R_{βx}(y) is decreasing on [√(d/c), ∞), which completes the proof of this remaining case, and thus the first implication.
For the second implication ⟨x, y⟩ ≥ 3/2 (1 + α)⟨x, x⟩ ⇒ R_x(y) > R_{x′}(y), we go back to Eq. (13) and show f(α, x, y) ≥ 0. As before, we are in the case d ≥ ⟨x, y⟩²/c (Cauchy-Schwarz) thus it is g(α, c, d) ≥ 0 and using the inequality ⟨x, y⟩ ≥ 3/2 (1 + α)c in Eq. (13) yields where the inequality at (∗) holds, because d ≥ (9/4)(1 + α)²c. By Eq. (12), this completes the proof.
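The Cauchy-Schwarz step of Case III can be written out as follows. This is a sketch of one way the step may go, assuming the closed form R_x(y) = (‖y‖² − ‖x‖²)/(2‖y − x‖) with c = ⟨x, x⟩ > 0 and d = ⟨y, y⟩; the displayed equations of the original proof are not available in this excerpt and may differ in form.

```latex
\begin{align*}
\frac{\partial}{\partial \beta} R_{\beta x}(y)
  &= \frac{\partial}{\partial \beta}\,
     \frac{d - \beta^2 c}{2\,\lVert y - \beta x\rVert}
   = \frac{f(\beta, x, y)}{2\,\lVert y - \beta x\rVert^{3}}, \\
f(\beta, x, y)
  &= -\beta^3 c^2 + (3\beta^2 c + d)\,\langle x, y\rangle - 3\beta c d \\
  &\le -\beta^3 c^2 + (3\beta^2 c + d)\,\sqrt{c d} - 3\beta c d
     && \text{(Cauchy-Schwarz, using } 3\beta^2 c + d \ge 0\text{)} \\
  &= \sqrt{c}\,\bigl(\sqrt{d} - \beta\sqrt{c}\bigr)^{3}
   \;\le\; 0
     && \text{for } \beta \ge \sqrt{d/c}.
\end{align*}
```

Under these assumptions the derivative is nonpositive on [√(d/c), ∞), which matches the monotonicity claimed for β → R_{βx}(y).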
With these bounds, we are now prepared to prove Proposition 6.
Proposition 6 (q_R is approximately decreasing) Let Y_1 be the first coordinate of Y, whose distribution P is rotationally invariant around the origin. Then for any x, x′ ∈ R^d with ‖x′‖ ≤ ‖x‖, it holds that
Proof Let x ∈ R^d and S, R_x, and R_0 as in Lemma 1, where R_x and R_0 are coupled via Y. Define E_x := {R_x ≤ R_{x′}}. By Lemma 1, it holds that By the first inequality of Lemma 2, we can upper bound P(E_x^c) ≤ P(⟨x, Y⟩ > ⟨x, x⟩). Lastly, Proposition 3 allows replacing x by ‖x‖e_1 to obtain which completes the proof.
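The qualitative content of Proposition 6 can be illustrated by Monte Carlo simulation. The sketch below assumes q_R(x) = P(d(x, Y) < d(Y, Z)) for i.i.d. Y and Z, as in the proof of Lemma 1, and uses a two-dimensional standard Gaussian as a hypothetical rotationally invariant P. It does not evaluate the error term of the proposition, and the function name is illustrative.

```python
import numpy as np


def rank_probability(x, n_samples, rng):
    """Monte Carlo estimate of q_R(x) = P(|x - Y| < |Y - Z|) for i.i.d. Gaussian Y, Z."""
    y = rng.standard_normal((n_samples, len(x)))
    z = rng.standard_normal((n_samples, len(x)))
    return float(np.mean(np.linalg.norm(y - x, axis=1)
                         < np.linalg.norm(y - z, axis=1)))


rng = np.random.default_rng(0)
e1 = np.array([1.0, 0.0])
u = np.array([0.6, 0.8])  # another unit vector

# Estimates along e_1 for increasing norm of x.
for radius in (0.0, 1.0, 2.0, 4.0):
    print(radius, round(rank_probability(radius * e1, 100_000, rng), 3))

# Rotational invariance: the estimate depends on x only through its norm.
print(round(rank_probability(2.0 * u, 100_000, rng), 3))
```

For this choice of P the estimates decrease as ‖x‖ grows, and the values at 2e_1 and 2u coincide up to Monte Carlo error, in line with replacing x by ‖x‖e_1.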