Simplicial and Minimal-Variance Distances in Multivariate Data Analysis

In this paper, we study the behaviour of the so-called k-simplicial distances and k-minimal-variance distances between a point and a sample. The family of k-simplicial distances includes the Euclidean distance, the Mahalanobis distance, Oja’s simplex distance and many others. We give recommendations about the choice of parameters used to calculate the distances, including the size of the sub-sample of simplices used to improve computation time, if needed. We introduce a new family of distances which we call k-minimal-variance distances. Each of these distances is constructed using polynomials in the sample covariance matrix, with the aim of providing an alternative to the inverse covariance matrix, that is applicable when data is degenerate. We explore some applications of the considered distances, including outlier detection and clustering, and compare how the behaviour of the distances is affected for different parameter choices.


Introduction
The Mahalanobis distance is one of the most useful tools in multivariate data science, underpinning a huge variety of practical data analysis methods. This distance measures the proximity of a point x ∈ R d to a d-dimensional set of points X = {x 1 , . . . , x N } ⊂ R d×N . It was introduced in Mahalanobis [27]. The Mahalanobis distance corresponds to the Euclidean distance in the standardized space where variables are uncorrelated and have unit variance. It has applications in cluster analysis [16,44], outlier detection [35], financial settings [40], text classification [39] and image processing [32,45], to name a few.
The Mahalanobis distance requires the inversion of the d × d sample covariance matrix, which we denote by W . This covariance matrix is commonly singular (or very close to being singular) in high-dimensional settings [13], which often causes instability when computing the Mahalanobis distance and sometimes makes it practically unusable.
The Moore-Penrose pseudo-inverse W − is commonly used in cases where the covariance matrix is not invertible, see Wei et al. [41] and Lahav et al. [22], for example. This pseudo-inverse is constructed using the nonzero eigenvalues and corresponding eigenvectors of the covariance matrix W , and satisfies the four Moore-Penrose conditions [17]. However, by using all nonzero eigenvalues, it can be adversely affected by very small eigenvalues, creating discrepancies between W − and the true inverse, if it exists [6,19]. The Moore-Penrose pseudo-inverse is unique, and so has no flexibility in these cases where eigenvalues close to zero can cause issues.
Knowing how to deal with high-dimensional data is extremely important, as our ability to collect and store large quantities of data continues to grow. Although there are benefits to having datasets with large amounts of information, analysing and understanding these datasets can present many challenges. For example, the concentration of points in high dimensions can cause distances between them to be (approximately) the same [5], yielding difficulties in determining whether points are 'close' or 'far away' to each other [1]. Given a high enough dimension, there will exist at least one subspace such that a significant number of points can be classified as being 'close' together [46]. Huge datasets create combinatorially intractable search spaces and extracting information or patterns may not be computationally possible. Such datasets also increase the probability of detecting spurious relations [38]. Existing rudimentary data processing algorithms do not scale to scenarios of degenerate data in big dimensions [2].
High-dimensional data is usually highly correlated [9,33]. Measures of proximity that typically work well in low-dimensional settings, such as the Euclidean and Manhattan distance, do not take correlations into consideration when measuring the scatter of the data. Using a distance measure like the Mahalanobis distance accounts for these correlations and therefore produces more accurate and meaningful results in such settings. However, the correlated variables also result in an intrinsic dimensionality that is lower than the given dimensionality of the data [46], causing degeneracy. As such, it is important to have methods that can account for correlations, that can effectively deal with the possibility of singular covariance matrices, and that are not limited by the problems of the Moore-Penrose pseudo-inverse.
A common solution to some of the problems outlined is to use an estimator which produces an invertible sample covariance matrix, rather than using the sample covariance matrix produced by the maximum likelihood estimator (MLE). Examples include methods which shrink the MLE sample covariance matrix towards some other target matrix: Ledoit and Wolf [25] use the identity matrix as the target, whereas Schäfer and Strimmer [37] allow for a user-defined target matrix. However, it is hard to know what the ideal target matrix is [24]. Other methods include shrinking only the largest and smallest eigenvalues [43], tapering off-diagonal elements to zero [3,15], applying 1 penalties directly to covariance matrix entries [10] and many others. We recommend Fan et al. [12] and Lam [23] for detailed overviews of such methods.
Methods which estimate the inverse covariance matrix directly, such as neighbourhood selection [29], modified Cholesky decompositions [21] and 1 regularization approaches [7,14], prevent inversion errors and can improve computation time. However, these methods still suffer from the potentially impractical assumption that the inverse covariance matrix should be sparse.
Pronzato et al. [34] introduced a family of so-called k-simplicial distances which, according to the claim in that paper, resemble the squared Mahalanobis distance and could be used in situations where the data is degenerate or close to being degenerate. These k-simplicial distances are defined as averaged volumes of all possible k-dimensional simplices formed by x and all points from the set X , for a user-defined k ∈ {1, . . . , d}. The distances are raised to a user-defined power δ > 0. For general δ > 0, the choice k = 1 gives the δ -distance in R d . For δ = 2, the parameter choices k = 1 and k = d give distances proportional to the squared Euclidean and Mahalanobis distances, respectively. The choice of parameter k indicates the dimension of the simplices used, and can be chosen in such a way as to avoid the problems of degeneracy and small eigenvalues, which cause issues in other methods, such as the Moore-Penrose pseudo-inverse. When δ = 2, there is a more efficient method of finding the k-simplicial distance using a matrix polynomial in W , rather than through computation of volumes of simplices, which is outlined in Sect. 2. For other values of the parameter δ, we demonstrate a sub-sampling method which improves computational speed, with very little observed change to the distances measured.
We also introduce a new distance, the k-minimal-variance distance, which constructs a matrix polynomial in W of degree k − 1. In this method, the constructed polynomial yields a squared distance with minimum variance. Like the k-simplicial distance, the k-minimal-variance distance can be used when W is singular, and produces a distance measure which accounts for correlations in the dataset. We explore the choice of the parameter k, and show that k can be relatively low to produce good results, making the k-minimal-variance distance a quick and viable alternative to the Mahalanobis distance. We will use examples to show how minimizing the variance of the distances can provide a more intuitive distance measure for use with correlated data.
We consider the performance of the k-minimal-variance distance and compare it to the k-simplicial distance, the Euclidean distance and the Mahalanobis distance. We also show in Sect. 5.2 that both distances proposed may produce more accurate results than the Euclidean and Mahalanobis distances when used for clustering applications.
The structure of this paper is as follows: Sect. 2 introduces and further studies the k-simplicial distance formulated in Pronzato et al. [34]. Section 3.1 explores the effects of the parameters k and δ in the k-simplicial distance through numerical examples. Section 3.2 introduces the sub-sampling method which reduces computation time and allows us to consider large sample sizes. Section 3.3 uses outlier detection examples to produce parameter recommendations for the k-simplicial distance. Section 4 introduces the k-minimal-variance distance. We compare the k-simplicial and k-minimal-variance distances to each other in both their efficiency at minimizing variance in Sect. 5.1 and compare their performance at clustering some real-life datasets against the Mahalanobis and Euclidean distances in Sect. 5.2. We give our conclusions about the distance measures in Sect. 6.

k-Simplicial Distances
Assume we are given a set of N points in d-dimensions X = {x 1 , . . . , x N } ⊂ R d×N , with no assumptions on how this set of points has been generated. The sample mean and covariance matrix associated with X are defined, respectively, as follows: The squared Euclidean distance between a point x ∈ R d and the set X is The squared Mahalanobis distance between x ∈ R d and X is defined by where W −1 is the inverse of the sample covariance matrix W . If the matrix W is singular then its inverse does not exist, and so the Moore-Penrose pseudo-inverse W − is often used in place of W −1 in (2). A pseudo-inverse is a type of generalized inverse, satisfying some but not necessarily of all the properties of an inverse matrix; the Moore-Penrose pseudo-inverse satisfies the four Moore-Penrose conditions [17]. When we use W − in the Mahalanobis distance we will call it the pseudo-Mahalanobis distance.
To compute the k-simplicial distance between x ∈ R d and X = {x 1 , . . . , x N } ⊂ R d×N , we calculate volumes of k-dimensional simplices raised to a given power δ > 0. Let r ≤ d be the intrinsic dimension of the dataset X , which is the rank of X when X is considered as a d × N matrix. The volumes of all k-dimensional simplices are zero for k > r and so it makes no sense to use k > r .
Let V k (x, z 1 , . . . , z k ) be the volume of a k-dimensional simplex with vertices x and z 1 , . . . z k ∈ R d . This volume can be computed by where det(A) is the determinant of the matrix A and |a| is the absolute value of the scalar a [34].
be the set of all ordered k-combinations of the indices in {1, . . . , N }. Define which is the average volume of all k-dimensional simplices created by the query point x and points in X , raised to the power of a user-defined scalar δ > 0. For given δ > 0 and 1 ≤ k ≤ r , the centre of the set X (that is, the k-simplicial multidimensional median) is defined as μ k,δ = arg min which may not be uniquely defined for δ ≤ 1. Recommendations for these parameters will be discussed in Sects. 3 and 5. We then define the k-simplicial outlyingness function by The function (5) is non-negative, has value 0 at the centre of the sample and is unitless; these are the required properties that an outlyingness function must possess, see Wilks [42]. For any δ > 0, we define the k-simplicial distance (here to the power of δ) from the query point x to the dataset X by where the constant c k,δ is chosen so that The normalization (7) is introduced to ensure consistency of the k-simplicial distances for different k. In the next section we see that, for δ = 2 and all k ≤ r , this normalization constant is c k,2 = 1/k. For δ = 2, we find constants c k,δ numerically from (7).
As shown in [34,Theorem 5], for δ = 2 and any eligible k we get μ k,δ = μ, the sample mean. Moreover, we obtain, similarly to [34,Sect. 3.1] We then define the squared k-simplicial distance (of order δ = 2) from x to the dataset X as The difference between (9) and the corresponding definition in [34,Eq. 17] is the introduction of the normalizing constant 1/k, which provides consistency of the distances for different k in the sense that (7) holds for δ = 2 and all k = 1, 2 . . . , r . The equality in (7) with δ = 2 directly follows from (8).
Direct evaluation of the distances given by (9), i.e. by empirical calculation of the volumes of all N k simplices, can be computationally time-consuming. Below, we consider an alternative method for the case with δ = 2, which is much faster and easier. Reducing this computational time when δ = 2 is considered in Sect. 3.2.
Let = {λ 1 , . . . , λ d } be the set of eigenvalues of the sample covariance matrix W defined in (1). The elementary symmetric function of degree k ≤ d associated with the set is given by with e 0 ( ) = 1. If k > r = rank(X ) then e k ( ) = 0 and the k-simplicial distance is always 0.
For k ≤ r , define the function and the associated matrix S k = q k (W ) e k ( ) .
As follows from [34,Sect. 3.2], for any k ≤ r , Note that the matrices S k , k = 1, . . . , r , are polynomials in the covariance matrix W . Since S 1 = I d /trace(W ), where I d is the identity d × d matrix, for k = 1 the squared distance (11) is equal to the squared Euclidean distance divided by the trace of the covariance matrix W : When k = d and W is invertible, we have S d = W −1 and therefore the squared k-simplicial distance (11) is equal to the squared Mahalanobis distance multiplied by a factor 1/d (for details see [34,Sect. 3.1]): We prove the following theorem comparing the variance of the squared Euclidean distance, Mahalanobis distance and k-simplicial distance with k = 2 and δ = 2.
Theorem 1 Assume X = {x 1 , . . . , x N } is a set of N normally distributed ddimensional vectors (data points) with sample mean μ and sample covariance matrix W , as defined in (1). Let = {λ 1 , . . . , λ d } be the set of eigenvalues of the matrix W , and assume rank(X ) = r ≤ d. Then where ρ 2 k,2 (x, X ) is the squared k-simplicial distance between the point x and set X as defined in (6) with δ = 2.
Proof We can write the k-simplicial distance between a point x and set X with k = 2, δ = 2 as (10). From (24) in Appendix A, the variance of the k-simplicial distance with k = 2, δ = 2 may be written as Let Consider the second-order elementary symmetric polynomial: Then (12) can be rewritten as: and similarly, using (25) from Appendix A, we may write Consider the denominator in (14). By the Cauchy-Schwartz inequality, and so it follows that Again using the Cauchy-Schwartz inequality for the denominator in (13), we have and so it follows It remains to show that The validity of the inequality in (15) does not depend on the change λ i → cλ i for all i and for any constant c > 0. Therefore, we can choose λ 1 , λ 2 , . . . , λ r such that r i=1 λ i = 1 and express quantities in (15) as moments of a random variable ξ concentrated on [0, 1] having values λ i with probabilities λ i .
Let τ j = E[ξ j ]. In this notation, we have the following properties: 500 points are sampled from a multivariate normal distribution with zero mean and diagonal covariance matrix, with eigenvalues given in the table. The rank r of the dataset is also given Using these properties, the inequality in (15) has the form Rearranging gives τ 3 1 + 2τ 2 − 2τ 2 1 − τ 3 ≥ 0, which is true for all probability measures on [0, 1].

Choosing k in the k-Simplicial Distance
The choice of the parameter k is integral to the performance of the k-simplicial distance. In this section, we show how different choices of k affect the distance through experimental results. We use three examples, in each of which we generate N = 500 points from a d-dimensional multivariate normal distribution, with zero mean and diagonal covariance matrix. The value of d and the eigenvalues of the covariance matrix used to generate the points are given in Table 1. We use the sample covariance matrix when computing our distance measures, and so the true eigenvalues of the matrix will differ slightly from those in the table.
For given values of k ≤ r = rank(X ), we find the k-simplicial distances between all points in the dataset to the dataset X itself, for both δ = 2 and δ = 1. Note that for distances using δ = 1, sub-sampling is used to find the distance, using the method described in Sect. 3.2. We compare the empirical cumulative distribution functions (CDF) produced by the k-simplicial distances in Figs. 1, 2 and 3. For examples with δ = 2 we also consider the squared pseudo-Mahalanobis distance multiplied by 1/r , which is equal to the k-simplicial distance with k = r .
Example 1 Eigenvalues = A . The CDFs for the distances measured for Dataset A using the k-simplicial distance with δ = 2 are given in Fig. 1a and indicate that the squared Euclidean distance (proportional to the k-simplicial distance with k = 1, δ = 2) produces a large range of distances with high variance, when compared to the distances produced when using other values of k. We see, in the δ = 2 case, low values of k (compared to the rank r = 9) begin to converge away from the squared Euclidean distance, and towards the squared pseudo-Mahalanobis distance quickly. Figure 1b shows a similar pattern for the distance with δ = 1. Example 2 Eigenvalues = B . The CDFs for the k-simplicial distances using δ = 2 on Dataset B are given in Fig. 2a. For relatively low values of k (compared to the rank r = 40), such as k = 10, we see the distances converging to those produced when k = r , i.e. the pseudo-Mahalanobis distance in the case where δ = 2. A similar profile is observed for δ = 1 in Fig. 2b, with CDFs of distances converging towards the CDF with k = r = 40 as k increases.
Example 3 Eigenvalues = C . The CDFs for the distances with δ = 2 are given in Fig. 3a. Again, we see for relatively low values of k (compared to rank r = 22) the distance measure converges towards the distance where k = r . Note that for the k-simplicial distances with δ = 1, k = 10 in Fig. 3b, the CDF lies underneath that of k = 22, as the distances produced are so similar. Figures 1, 2 and 3 all demonstrate that the k-simplicial distance transitions from the squared Euclidean distance multiplied by 1/trace(W ) to the squared Mahalanobis distance multiplied by 1/r for δ = 2 as k increases. A similar monotonic behaviour is shown for δ = 1. The eigenvalues of the covariance matrix have an effect on what an appropriate choice of k may be. It is important to ensure the most influential dimensions (that is, those with the largest eigenvalues) are all considered, by taking k larger than the number of large eigenvalues.
For example, consider Fig. 3. The two large eigenvalues in C result in k = 2 behaving similarly to k = 1, particularly in the δ = 2 case, whereas in Fig. 1, the CDF produced using the distance with k = 2 is very different to the CDF where k = 1, as there is only one large eigenvalue.
In general, we recommend using a value of k that is larger than the number of 'large' eigenvalues the covariance matrix W has, relative to the size of the other eigenvalues. This is easier to see when there is a clear elbow or 'drop-off' in the value of the eigenvalues. Otherwise, it can be appropriate to find the k-simplicial distances with several values of k and measure the best value according to some metric appropriate to the task. This is a common method for choosing a parameter value in many parameterdependent tasks, such as K -means clustering.
Not much performance gain is made by choosing a value of k that also encompasses the smaller eigenvalues. As an example of this, see Fig. 2, where there are seven 'large' eigenvalues, 33 'small' eigenvalues and 10 zero eigenvalues. Using k = 10 does not give a huge improvement in performance compared to using k = 5 (where performance is measured here by the minimizing of variance) but it is computationally more expensive.

Numerical Computation of k-Simplicial Distances Using Sub-Sampling
When δ = 2, the k-simplicial distance is calculated by averaging the volumes of all N k simplices formed with x and X . This can be computationally intensive, particularly for large N and d. To circumvent this problem, we can sample a subset of the simplices to reduce computation time to milliseconds. The size of the sub-sample of simplices depends on the user's wish for precision. This size does not have to be large to achieve practically accurate approximations, which will be demonstrated in the examples that follow, where we use less than 0.05% of all possible simplices when we use k = 3, and less than 0.0004% when using k = 4.
Let J be as defined in (3). To compute the k-simplicial distances, we have to compute the values of P k,δ (x, X ) defined in (4). The procedure to approximate these values is as follows. For any sampling proportion γ ∈ [0, 1], we form J (γ ) , a subset of J of size |J (γ ) | = γ × N k and approximate (4) with A simple but efficient way of constructing J (γ ) consists of taking random samples of size k without replacement from the set {1, 2, . . . , N }, see Blom [4]. This reduces computation time dramatically, and in examples that follow we see such sub-sampling is highly effective in producing results extremely close to those of the 'full' distance measure, in which we average the volumes over all available simplices.
In the following examples we revisit the sets of eigenvalues given in Table 1 and calculate the distances from all points in a dataset to the dataset itself, using the full sample (where possible) and then a smaller sample using 10,000 simplices. We generate data according to the procedure outlined in the beginning of Sect. 3.1. We compare the effect that different sampling sizes have on the distribution of distances calculated though investigating histograms and moments of these distances. Our analysis in Sect. 3.1 indicates that using low values of k gives good performance, and is less computationally intensive than using higher values, so we will use k = 3 and k = 4 in the sub-sampling examples that follow.

Example 1 Eigenvalues
= A . Figure 4a shows histograms of the distances between all points of dataset A to the dataset itself, as produced by the k-simplicial distance with δ = 2, k = 3. The blue solid histogram shows the 'full' distances with no sub-sampling (using polynomials), and the orange dotted histogram shows the subsampled distance with 10,000 simplices. This is repeated for other parameters in the rest of Fig. 4, as detailed in the captions.
These histograms show that the distribution of distances produced using a small sample of simplices is extremely similar to the distribution of distances produced using the full sample of simplices available. For the examples using δ = 1, we cannot produce the full distance directly as it requires the computation of the volume of 500 k simplices, so we compare the distances produced using a sub-sample of 10,000 simplices to the distances when using a larger sub-sample of simplices (1% of the total amount of simplices in the k = 3 case, 0.01% in the k = 4 case). In both cases, the distribution of the larger sample and the 10,000 simplex sample remain very similar. Table 2a and b also demonstrate this, with the summary statistics remaining close even for small samples.
Example 2 Eigenvalues = B . We again see in Fig. 5 and Table 3a and b that using a low number of simplices (compared to the full amount of simplices available, or a large sample) produces distances that are mostly the same as the full distance measure. This example illustrates that the sampling method is effective even in cases with a lot of small and zero eigenvalues.
Example 3 Eigenvalues = C . Figure 6 and Table 4a and b show that the number of small or zero eigenvalues does not influence the performance of the sub-sampling. Overall, we see sub-sampling is an effective way to drastically reduce computation Fig. 4 Histograms to compare the distribution of the k-simplicial distances from all points to the mean for eigenvalues A with different parameters k and δ for different sampling amounts. Blue, solid histograms show the distances produced using the full sample of simplices when k = 3, and a larger sample when k = 4. The orange dotted histograms show the distances produced when using a sample of 10,000 simplices.  time while maintaining the same results as the full k-simplicial distance. This means that using the distance with δ = 2 is much more accessible than it otherwise would be.

Outlier Labelling Example
In this section, we illustrate one potential application of the k-simplicial distance measure. The k-simplicial distance could be a useful tool in identifying outlying points in high-dimensional degenerate datasets, where the Euclidean distance struggles to measure distance meaningfully due to the sparse and correlated nature of the data, and the Mahalanobis relies on the inversion of a matrix possessing many small (and potentially zero) eigenvalues. We investigate how different values of the parameter k and the scalar power δ perform in identifying outliers. We perform the following experiments. We consider three 10-dimensional examples, each with different sets of data. Each dataset D i , i = {I , I I , I I I }, is made up of two clusters: D i = D i,1 + D i,2 . The first cluster D i,1 has 450 points, mean μ 1 as specified in Table 5 and covariance matrix produced by a matrix with eigenvalues as specified in the table, rotated by a rotation matrix. The second cluster D i,2 has 50 points, a different mean μ 2 but the same covariance matrix as D i,1 . By doing this, we  Datasets are made of two clusters of different sizes and different means, but the same covariance matrix test the robustness of the distances against rotations and correlations in the data, as well as its ability to tell two similar but separate clusters apart. We measure the distance of all points in the dataset D i to the largest cluster, D i,1 . We label the furthest 50 points from this cluster D i,1 as outliers for each dataset i. We consider how many of the points the k-simplicial distances correctly label as outliers from D i,1 for different values of k and δ. If it were to incorrectly label all the outlying points as inliers, we would have a minimum value of 400. If the distance correctly labels all points, we will get a value of 500. Table 6 contains the number of points correctly labelled by the distance measures using different values of k and δ for the k-simplicial distances. In Table 7, we provide the Area Under the Receiver Operating Characteristic Curve (AUC) score for the labels produced by the distances, for different values of k and δ. The AUC score measures the overall performance of a binary classifier, where a score of 1 indicates a perfect labelling and 0.5 is the minimum score [18].
Considering the k-simplicial distance with δ = 2, we see that the values of k which perform best are those slightly larger than the number of 'large' eigenvalues. For Dataset I , we have 4 'large' eigenvalues and values of k = 5, 6 perform best when using δ = 2. Similar results are shown in Datasets I I and I I I . Larger values of k begin  to break down when δ = 2 as they require the use of the smaller eigenvalues when forming the simplices. This indicates that lower values of k outperform the squared pseudo-Mahalanobis distance multiplied by 1/r . Distances using δ = 1 are more robust to the effect of degeneracy. The performance improves as k increases, but unlike the δ = 2 case, there is no breakdown in success once k encompasses the smaller eigenvalues too, making it less sensitive to the choice of k than the distance with δ = 2. These distances were computed using very low subsampling amounts, and so there is not considerable computational time disadvantage in using δ = 1 over δ = 2. Overall, this example illustrates that using δ = 1, even with sub-sampling, can give a more stable distance measure than using δ = 2 as k increases, particularly for outlier detection applications.

k-Minimal-Variance Distances
Assume that X is a normally distributed d-dimensional dataset, with sample mean μ and sample covariance matrix W . We now introduce a family of generalized squared distances from a point x to the set X , in the form where A is a matrix polynomial in W of user-defined degree k − 1 ≤ r , where r = rank(X ). From the moments of a quadratic form (see (23) in Appendix A), we have For given k ≤ r , we wish to find the matrix A such that trace(AW ) = d holds and Var ρ 2 A (x, X ) is minimized. The condition trace(AW ) = d ensures the identifiability of a solution from the minimization of the Lagrange function (20) and weights the solution towards W −1 (if W −1 exists).
We can motivate minimizing the variance of the distances by considering the CDFs in Figs. 1, 2 and 3. The Mahalanobis distance has the smallest variance out of all distances considered, and can be written as a (d − 1)-degree polynomial when the covariance matrix is non-singular. We aim to replicate the minimization of the variance of the distances produced through lower degree polynomials that also work in the case of a singular covariance matrix, making for a quicker and more versatile method.
Let A be a polynomial in W , expressed as The first moment of the distance (16)  Using this notation, the variance of the distance (16) can be written using (23) from Appendix A as Var For the matrix A * , the variance of the distance (16) is given by Proof The Lagrange function used to minimize the variance (18) with the constraint Note that θ S = trace(AW ).
Differentiating the Lagrange function with respect to θ and setting the result equal to 0, we get Y Y θ = ωS, which gives: θ * = ω(Y Y ) − S. The required value of ω is found from the unbiasedness condition trace(AW ) = d, Substituting θ * into (18) gives the variance of the distance as in (19).

Efficiency of k-Minimal-Variance Distances Compared to k-Simplicial Distances
We consider the efficiency of the k-minimal-variance distances (16) with A = A * as derived in Theorem 1 and k-simplicial distances (11) with δ = 2 relative to the (sometimes pseudo-)Mahalanobis distance. As we consider the Mahalanobis distance multiplied by 1/r to align with the k-simplicial distance, we must also consider the k-minimal-variance distance multiplied by 1/d for comparability.
We define the efficiency of the k-minimal-variance distances as Var ρ 2 r ,2 (x, X ) with Var ρ 2 A * (x, X ) derived in (19). We define the efficiency of the k-simplicial distances with δ = 2 as with Var ρ 2 k,2 (x, X ) stated in (24) in Appendix A. We generate N = 500 points X = {x 1 , . . . , x N } ⊂ R d×N from a d-dimensional multivariate normal distribution with zero mean and diagonal covariance matrix W with eigenvalues = {λ 1 , . . . , λ d }. We take the eigenvalues to be: Table 8 demonstrates the high-efficiency of the k-minimal-variance distances even for small k. Note that k − 1 is the order of the polynomial in W minimizing (18); in this example, linear and quadratic polynomials perform well. The efficiency of the k-simplicial distances improves as k gets larger but also has variance tolerably close to that of the Mahalanobis distance even for k significantly smaller than r .
For larger dimensions, with covariance matrices possessing a number of zero eigenvalues, the examples are more striking. Table 9 gives the results of performing the same exercise on the datasets described in Sect. 3.1, with eigenvalues given in Table  1. Table 9 shows that the k-minimal-variance distances start to have similar variance to the squared (pseudo-)Mahalanobis distance using much lower values of k than in the k-simplicial distance with δ = 2. For values as low as k = 2, we see the variance of the k-minimal-variance distance is much closer to that of the Mahalanobis distance than the k-simplicial distances. Table 9 Efficiencies (21) and (22) for different k, with three different sets of eigenvalues of the covariance matrix W as given in Table 1 in Sect. 3.1. We note little performance gain when choosing k > 3 in these examples, indicating that A * even with small k is often a good enough approximation to the inverse of the covariance matrix from the viewpoint of the distance generated by this matrix. From an efficiency perspective, the k-minimal-variance produces better results at a lower computational cost than the k-simplicial distance.

Comparison of Performances of the K-Means Clustering Algorithm with Different Distances
We compare the performance of the K -Means clustering algorithm [26] when applied using the Euclidean distance, the Mahalanobis distance, the k-minimal-variance distance and the k-simplicial distance (with δ = 2). We do this by applying K -Means to 5 real datasets, obtained from the UCI Machine Learning Repository [11], with the exception of the 'Digits' dataset, which was obtained through the Python package sklearn's data loading functions [31]. The details of these datasets are given in Table  10.
Each dataset was appropriately preprocessed: rows with missing values were removed, and the data was normalized such that each variable has values in range [0, 1]. It is important to note that the K used in the K -Means clustering algorithm is used to indicate how many clusters we seek, and is different to the k used in our distance measures. For each dataset, the choice of K in the K -Means algorithm is used as the 'true' number of clusters, given in Table 10, as these datasets are all fully labelled.
The K -Means algorithm is classically applied using the Euclidean distance, but research has shown success in applying the algorithm with the Mahalanobis distance to exploit the covariance structure of a dataset (see Gnanadesikan et al. [16], or more recently [30]). The method of applying K -Means with the Mahalanobis, k-minimalvariance or k-simplicial distances is given in Algorithm 1, following the algorithm given in Melnykov and Melnykov [8]. These distance measures require the covariance matrix of each cluster; and as such we require initial estimates of the clusters. We obtain these initial estimates by performing a few iterations of the K -Means algorithm using the Euclidean distance. Clearly, this initial estimate can have a large influence on the resulting clusters found by the other distances, and so we run the K -Means algorithm 1000 times for each distance.
Algorithm 1: Applying K -Means with the Mahalanobis, k-minimal-variance or k-simplicial distance Apply K -Means clustering with Euclidean distance for a small number of iterations; Calculate the centroids μ i of each cluster C i by taking the mean of all points in the cluster while Distance between old centroids and new centroids is less than some small chosen tolerance do Calculate the covariance matrix W i of each cluster C i ; Calculate the distance from each point to each cluster C i using the Mahalanobis, k-minimal-variance or k-simplicial distance with the corresponding covariance matrix W i and cluster centroid μ i ; Assign each point to its closest cluster; Calculate new centroids μ i for each new cluster end As this is a supervised task, we can compare the labels given by K -Means to the 'true' labels to assess the performance of the clustering algorithm. We do this using two external evaluation methods, namely the adjusted rand (AR) score [20,31] and the purity score [28]. We use these two different evaluation methods to corroborate the results. The adjusted rand score is calculated as follows.
Let L T be the vector of true labels, and let L P be the labels assigned by the K -Means clustering. Define a as the number of pairs of points in the same set in L T and in the same set in L P , i.e. the number of points whose labels are the same in L T and L P . Define b as the number of pairs of points in different sets in L T and in different sets in L P , i.e. the number of points whose labels are different in L T and L P . The unadjusted rand score is given by where ν is the total number of possible pairs in the dataset, without ordering. The unadjusted rand score does not account for the possibility that random label assignments can perform well, so we discount the expected rand score E[R] of random labellings by defining the adjusted rand score as .   To find the purity score, we proceed as follows: let T = {t 1 , t 2 , . . . , t m } be the set of 'true' clusters in the data, and let P = {p 1 , p 2 , . . . , p K } be the set of predicted clusters. The purity score measures the extent to which a predicted cluster p i only contains points from a single 'true' cluster t j : where N is the total number of points. That is, for each predicted cluster p i , we count the highest number of points from a single true cluster t j predicted to be in p i . These counts are summed and divided by the total number of observations. The purity score takes values in [0, 1], with 1 being a perfect clustering. Figure 7 gives the AR and purity scores for the K -Means clustering of each dataset using the varying distance measures being considered. For the k-minimal-variance and k-simplicial distances, we show the distances with values of k which produced the highest scores. Note that the pseudo-Mahalanobis distance is used in cases where the data is degenerate. The eigenvalues for each of the datasets in Table 10 can be found in Appendix B. The influence of these eigenvalues on the performance of the different distance measures is important, particularly when choosing values of k. When we discuss eigenvalues being 'close to zero', this is in relation to the largest eigenvalue of the dataset. There is no specific threshold for being 'close to zero', but the examples should give some intuition about choosing the parameter k. Table 11, 12 and 13 give the median AR score for each dataset by each method of clustering, and the standard deviation of these scores, with bold values denoting the highest score(s) out of all methods used. Although the AR score and the purity score measure distinct aspects of the success of a clustering, Fig. 7 shows that the same patterns emerge for both evaluation methods, and so we only consider the AR score in the tables.
Iris is a 4-dimensional dataset, with no extreme small eigenvalues in comparison to its largest eigenvalue. Figure 7a, b and Table 11 show that the Mahalanobis distance performs best, joint with the k-simplicial and k-minimal-variance distances when k = 4 (recall that these distances are equal to the Mahalanobis distance when k = d). The Iris dataset is low-dimensional and full-rank, and hence the Mahalanobis distance can use the true inverse of the covariance matrix. This example illustrates the performance gains that can be made in cluster analysis by taking the correlations in the data into consideration. Even when we use k = 3 in the k-simplicial and k-minimal-variance distances, we achieve better results than the Euclidean distance. Figure 7c, d and Table 12a consider the Wine dataset, and show that the Mahalanobis, k-minimal-variance and k-simplicial distances outperforms the Euclidean distance, again highlighting the importance of accounting for correlation. The Wine dataset has some very small eigenvalues compared to its largest eigenvalue, and as such the Moore-Penrose pseudo-inverse is likely to have been adversely impacted [19]. Choosing the k-minimal-variance or k-simplicial distance avoids this impact, and as such produces better clustering results. This example also highlights that the k-minimal-variance distance performs very well with lower values of k, whereas the k-simplicial distance requires a higher value of k to achieve its best results, as seen before in the efficiency evaluations in Sect. 5.1. This gives better computational-time for the best results when using the k-minimal-variance distance, but does make the distance more sensitive to a too-high choice of k, as seen by the decrease in AR scores in Table 12a.
The Image Segmentation dataset has a number of very large eigenvalues, some eigenvalues very close to zero, and five zero eigenvalues. Figure 7e and f shows that the Mahalanobis distance performs worse than the Euclidean distance, perhaps due to the effect of very small eigenvalues on the Moore-Penrose pseudo-inverse. The kminimal-variance and k-simplicial distance outperform the Mahalanobis distance here, as they are less likely to be adversely affected by these small eigenvalues. Table 12b shows that the k-minimal-variance distance attains the highest AR score out of all the distances, but does not improve greatly on the Euclidean distance.
The Digits and Protein datasets (Table 13a, and b, respectively) both have a substantial number of small and zero eigenvalues (see Appendix B), indicating why our distances perform better than the pseudo-Mahalanobis distance. The Mahalanobis distance does not add much performance gain compared to the Euclidean distance in these examples, but the correct choice of k in the k-minimal-variance or k-simplicial distances provides improvement.
In these examples, we see that K -Means with the k-minimal-variance distance reaches its best adjusted rand score with relatively low k, whereas the k-simplicial distance needs higher values of k to reach this. However, the k-simplicial distance is less likely to breakdown for too-high a choice of k, as we see in Table 12a and 13a with the k-minimal-variance. For the k-simplicial distance, the values of k that produce the best adjusted rand score roughly match with the number of 'larger' eigenvalues in the datasets.

Conclusion
In this paper, we have continued the work done by Pronzato et al. [34] in researching the k-simplicial distance. This distance considers the covariance structure of the data, but is less adversely affected by the presence of small eigenvalues than the Mahalanobis distance.
We have studied the choice of the parameter k in detail, through the use of numerical examples, illustrating that too low a choice of k does not give much improvement on the Euclidean distance, and too high a choice of k doesn't give much performance benefit over lower values, but does increase computational complexity. We recommend a choice of k influenced by the number of 'large' eigenvalues present, with respect to the other eigenvalues of the sample covariance matrix. We also discuss the benefits and limitations of different choices of the parameter δ: δ = 1 is more robust, and well suited to applications such as outlier detection, but does not have a fast method of full computation. If δ = 2, we have a significantly faster and easier method of producing the distance, but this is more likely to be influenced by the presence of outliers. We discuss the implementation of sub-sampling simplices, which greatly improves computation time when we are using the distance with δ = 1 with minimal changes to the results of the distance. We have also introduced a new measure of distance, namely the k-minimal-variance distance. Again, this distance is less affected by the presence of small eigenvalues than the Euclidean and Mahalanobis distances (for appropriate choices of the parameter k) but is highly influenced by the choice of the parameter k which needs to be chosen carefully to ensure good performance of the distance.
Overall, we show that the k-minimal-variance distance is more efficient at minimizing the variance of the distances for lower choices of k (and therefore has better computation time) than the k-simplicial distance, but that the k-simplicial distance is less likely to be negatively affected by too-high a choice of k. We have given several examples where the k-minimal-variance and/or the k-simplicial distances outperform the multivariate distances commonly used. Namely, these proposed distances perform well where other classical distances often fail: when data is correlated, degenerate and has small eigenvalues, as well as possibly zero-valued eigenvalues.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: Moments and Distributions of Distances
From Rencher and Schaalje [36], let y be a random vector with mean μ and covariance matrix W , and let A be a symmetric matrix of constants, then E y Ay = trace(AW ) + μ Aμ.
If y is normally distributed with sample mean μ and sample covariance matrix W , then Var y Ay = 2trace (AW ) 2 + 4μ AW Aμ.
Our distances considered in this paper are generalized squared distances of the form where x is normally distributed with sample mean μ and sample covariance matrix W . Therefore we replace y with (x − μ), which is normally distributed with zero mean, giving:

A1: Moments of the k-Simplicial Distance When (ı = 2)
For a point x ∈ X , where X is normally distributed with sample mean μ and sample covariance matrix W , consider ρ 2 k,2 (x, X ) = (x − μ) S k k (x − μ), i.e. the squared k-simplicial distance from the sample mean μ to a point x. We have

A2: Moments of the Squared Euclidean Distance Divided by Trace(W)
Applying the formulae for the moments of quadratic forms to the squared Euclidean distance divided by trace(W ), where we take A = I d trace(W ) , we have Var ρ 2 1,2 (x, X ) = 2trace

Appendix B: Eigenvalues of the Five Real-Life Datasets
Below are the eigenvalues of the real datasets used in the