Unsupervised Representation Learning with Minimax Distance Measures

We investigate the use of Minimax distances to extract, in a nonparametric way, the features that capture the unknown underlying patterns and structures in the data. We develop a general-purpose and computationally efficient framework to employ Minimax distances with many machine learning methods that perform on numerical data. We study both computing the pairwise Minimax distances for all pairs of objects and computing the Minimax distances of all the objects to/from a fixed (test) object. We first efficiently compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on it. Then, we perform an embedding of the pairwise Minimax distances into a new vector space, such that their squared Euclidean distances in the new space equal the pairwise Minimax distances in the original space. We also study the case of having multiple pairwise Minimax matrices, instead of a single one, and propose an embedding via first summing up the centered matrices and then performing an eigenvalue decomposition to obtain the relevant features. Next, we study computing Minimax distances from a fixed (test) object, which can be used for instance in K-nearest neighbor search. Similar to the case of all-pair Minimax distances, we develop an efficient and general-purpose algorithm that is applicable with any arbitrary base distance measure. Moreover, we investigate in detail the edges selected by the Minimax distances and thereby explore the ability of Minimax distances to detect outlier objects. Finally, for each setting, we perform several experiments to demonstrate the effectiveness of our framework.


Introduction
Data is usually described by a set of objects and a corresponding representation. The representation can be, for example, the vectors in a vector space or the pairwise dissimilarities between the objects. In real-world applications, the data is often very complicated and a priori unknown. Thus, the basic representation, e.g., Euclidean distance, Mahalanobis distance, cosine similarity or Pearson correlation, might fail to correctly capture the underlying patterns or classes. Therefore, the raw data needs to be processed further in order to obtain a more sophisticated representation. Kernel methods are a common approach to enrich the basic representation of the data and model the underlying patterns [34,17]. However, the applicability of kernels is confined by several limitations, such as: i) finding the optimal parameter(s) of a kernel function is often very critical and nontrivial [29,27], and ii) as we will see later, kernels assume a global structure which does not distinguish between the different types of classes in the data.
A category of distance measures, called link-based measures [9,3], takes into account all the paths between the objects represented in a graph (where the edge weights indicate the respective pairwise dissimilarities). Minimax distances, a prominent member of this family, offer several attractive properties: • They extract the structures adaptively, i.e., they adapt appropriately whenever the classes differ in shape or type.
• They take into account the transitive relations: if object a is similar to b, b is similar to c, ..., to z, then the Minimax distance between a and z will be small, although their direct distance might be large. This property is particularly useful when dealing with elongated or arbitrarily shaped patterns. Moreover, if a basic pairwise similarity is broken due to noise, then the Minimax distance is able to correct it by taking into account the other paths and relations. For example, if the similarity of i and j is broken, then according to Minimax distances, they might still be similar via i → k → l → . . . → j.
• Many learning methods perform on a vector representation of the objects. However, such a representation might not be available. We might be given only the pairwise distances, which do not necessarily induce a metric. Minimax distances satisfy the metric conditions and enable us to compute an embedding, as we will study in this paper.
Contributions. Our goal is to develop a generic and computationally efficient framework wherein many different machine learning algorithms can be applied to Minimax distances, beyond e.g., K-nearest neighbor classification or the few clustering methods mentioned before. Within this unified framework, we consider computing both the all-pair pairwise Minimax distances and the Minimax distances of all the objects to/from a fixed (test) object.
1. We first efficiently compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on it. This approach reduces the runtime of computing the pairwise Minimax distances from O(N^3) to O(N^2).
4. Next, we study computing Minimax distances from a fixed (test) object, which can be used for instance in K-nearest neighbor search. For this case, we propose an efficient and computationally optimal Minimax K-nearest neighbor algorithm whose runtime is O(N), similar to the standard K-nearest neighbor search, and which can be applied with any arbitrary base dissimilarity measure.
5. Moreover, we investigate in detail the edges selected by the Minimax distances and thereby explore the ability of Minimax distances to detect outlier objects.
6. Finally, we experimentally study our framework in different machine learning problems (classification, clustering and K-nearest neighbor search) on several synthetic and real-world datasets and illustrate its effectiveness and superior performance in different settings.
The rest of the paper is organized as follows. In Section 2, we introduce the notations and definitions. In Section 3, we develop our framework for computing pairwise Minimax distances and extracting the relevant features applicable to general machine learning methods. Here, we also extend our approach for computing and embedding Minimax vectors for multiple pairwise data relations, i.e., to collective Minimax distances. In Section 4, we extend our framework further to Minimax K-nearest neighbor search and outlier detection. In Section 5, we describe the experimental studies, and finally, we conclude the paper in Section 6.

Notations and Definitions
A dataset can be modeled by a graph G(O, D), where O and D respectively indicate the set of N objects (nodes) and the corresponding edge weights, such that D_ij shows the pairwise dissimilarity between objects i and j. The base pairwise dissimilarities D_ij can be computed, for example, according to squared Euclidean distances or cosine (dis)similarities between the vectors that represent i and j. In several applications, D_ij might be given directly. In general, D might not yield a metric, i.e., the triangle inequality may not hold. We have recently studied the application of Minimax distances to correlation clustering, where the triangle inequality does not necessarily hold for the base pairwise dissimilarities [16]. In our study, D needs to satisfy three basic conditions: 1. zero self-distances, i.e., ∀i: D_ii = 0, 2. non-negativity, i.e., ∀i,j: D_ij ≥ 0, and 3. symmetry, i.e., ∀i,j: D_ij = D_ji.
We assume there are no duplicates, i.e., the pairwise dissimilarity between every two distinct objects is positive. For this purpose, we may either remove the duplicate objects or perturb them slightly to make the zero non-diagonal elements of D positive. The goal is then to use the Minimax distances and the respective features to perform the machine learning task of interest.
The Minimax (MM) distance between objects i and j is defined as

D^MM_ij = min_{r ∈ R_ij(O)} max_{1 ≤ l < |r|} D_{r(l), r(l+1)},

where R_ij(O) is the set of all paths between i and j. Each path r is identified by a sequence of object indices, i.e., r(l) shows the l-th object on the path.
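As an illustrative baseline (not the efficient method developed in this paper), the definition above can be computed directly with a Minimax variant of the Floyd-Warshall recurrence: any path through an intermediate node k costs the larger of its two halves, and we keep the smaller option. The function name below is ours; a minimal NumPy sketch:

```python
import numpy as np

def minimax_floyd_warshall(D):
    """O(N^3) baseline: Minimax closure of the dissimilarity matrix D,
    i.e., the min over all paths of the maximal edge weight on the path."""
    MM = D.astype(float).copy()
    n = len(MM)
    for k in range(n):
        # A path routed through k costs the larger of its two halves;
        # keep the cheaper of the current path and the detour via k.
        MM = np.minimum(MM, np.maximum(MM[:, k:k + 1], MM[k:k + 1, :]))
    np.fill_diagonal(MM, 0.0)
    return MM
```

For instance, with D_01 = 2, D_12 = 3 and a direct D_02 = 10, the Minimax distance between 0 and 2 becomes 3 via the path 0 → 1 → 2. This cubic baseline is what the MST-based procedure of Section 3 improves upon.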

Feature Extraction from Pairwise Minimax Distances
We aim at developing a unified framework for performing arbitrary numerical learning methods with Minimax distances. To cover different algorithms, we pursue the following strategy: 1. We compute the pairwise Minimax distances for all pairs of objects i and j in the dataset.
2. We then compute an embedding of the objects into a new vector space such that their pairwise (squared Euclidean) distances in this space equal their Minimax distances in the original space. Notice that vectors are the most basic way for data representation, since they render a bijective mapping between the objects and the measurements. Hence, any machine learning method which performs on numerical data can benefit from our approach. Such methods perform either on the objects in a vector space, e.g., logistic regression, or on a kernel matrix computed from the pairwise relations between the objects. In the latter case, the pairwise Minimax distances can be used for this purpose, for example through an exponential transformation, or the kernel can be computed from the final Minimax vectors.

Computing pairwise Minimax distances
Previous works for computing Minimax distances (e.g., applied to clustering) use a variant of the Floyd-Warshall algorithm whose runtime is O(N^3) [1,4], or combine it with K-means in an agglomerative algorithm whose runtime is O(N^2 |E| + N^3 log N) [7]. To reduce this computational demand, we follow a more efficient procedure established in two steps: i) build a minimum spanning tree (MST) over the graph, and then ii) compute the Minimax distances over the MST.

I. Equivalence of Minimax distances over a graph and over a minimum spanning tree on the graph
We exploit the equivalence of Minimax distances over an arbitrary graph and those obtained from a minimum spanning tree on the graph, as expressed in Theorem 1. Theorem 1. Given a graph G(O, D), a minimum spanning tree constructed on it provides the sufficient and necessary edges to obtain the pairwise Minimax distances.
Proof. The equivalence is concluded in a similar way to the maximum capacity problem [19], where one can show that picking an edge which does not belong to a minimum spanning tree yields a larger Minimax distance (i.e., a contradiction occurs). We here describe a high-level proof. Let e^MM_ij denote the edge representing the Minimax distance between i and j. We investigate two cases: 1. There is only one path between i and j. Then, this path will be part of any minimum spanning tree constructed over the graph, since otherwise, if some edge, for example e^MM_ij, is not selected, the tree loses connectivity, i.e., there will be fewer than N − 1 edges in the tree. The same argument holds when there are several paths between i and j but all share e^MM_ij such that e^MM_ij is the largest edge on all of them (Figure 1(a)). Then e^MM_ij will be selected by any minimum spanning tree, as otherwise the tree would be disconnected.
2. There are several paths between i and j, whose largest edges are different. We need to show that only the path including e^MM_ij is selected by a minimum spanning tree. To prove this, consider two paths, one including e^MM_ij and the other containing e^alt_ij, the largest edge on the alternative path (Figure 1(b)). It is easy to see that the minimum spanning tree chooses e^MM_ij instead of e^alt_ij: the existence of two different paths indicates the presence of at least one cycle. According to the definition of a tree, one edge of each cycle must be eliminated to yield a tree. In the cycle including e^MM_ij, according to the definition of Minimax distances, there is at least one edge not smaller than e^MM_ij (namely e^alt_ij), as otherwise e^MM_ij would not represent the Minimax distance (the minimum of the largest gaps) between i and j. Thereby the minimum spanning tree algorithm keeps e^MM_ij instead of e^alt_ij, since this choice leads to a tree with a smaller total weight. Notice that the algorithm cannot discard both e^MM_ij and e^alt_ij, because then the connectivity would be lost.

This result may lead to several simplifications for computing the base pairwise distance measure D and the respective graph G(O, D), where the graph does not necessarily need to be full. For example, instead of a full graph, we can compute a minimal (connected) graph that includes a minimum spanning tree; for instance, one might use the K-nearest neighbor graph. On the other hand, we might have no knowledge of the pairwise dissimilarities. Then we may query them from a user and infer the rest via Minimax distances. In fact, this result proposes an efficient method for this purpose: due to the equivalence of Minimax distances on a graph and on any minimum spanning tree of it, we may query a significantly smaller number of pairwise dissimilarities that suffice to represent an MST. This can be significantly faster than querying or inferring all the pairwise distances to obtain D.
On the other hand, it is necessary to build a minimum spanning tree, i.e., at least N − 1 edges must be selected given that the graph G is connected. Otherwise, the computed MST loses its connectivity, and for some pairs of objects there is no path at all. Therefore, to compute the Minimax distances D^MM, we only need the edges present in an MST over the graph. Then, the Minimax distances are written as

D^MM_ij = max_{1 ≤ l < |r_ij|} D_{r_ij(l), r_ij(l+1)},

where r_ij indicates the (only) path between i and j in the MST. This result does not depend on the particular choice of the algorithm used for constructing the minimum spanning tree. The graph that we assume is a full graph, i.e., we compute the pairwise dissimilarities between all pairs of objects. For full graphs, the straightforward implementation of Prim's algorithm [31] using an auxiliary vector requires O(N^2) runtime, which is optimal.
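The O(N^2) Prim variant mentioned above might be sketched as follows; the auxiliary vector `dist` holds each node's cheapest attachment cost to the growing tree (function name and return format are ours):

```python
import numpy as np

def prim_mst(D):
    """Prim's algorithm on a full graph given by the dissimilarity
    matrix D. Returns the N-1 MST edges as (i, j, weight) tuples.
    O(N^2) time via an auxiliary distance-to-tree vector."""
    n = len(D)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                      # start the tree at node 0
    dist = D[0].astype(float).copy()       # cheapest attachment cost per node
    parent = np.zeros(n, dtype=int)        # attachment point inside the tree
    edges = []
    for _ in range(n - 1):
        # pick the cheapest node not yet in the tree
        u = int(np.argmin(np.where(in_tree, np.inf, dist)))
        edges.append((int(parent[u]), u, float(dist[u])))
        in_tree[u] = True
        # relax attachment costs through the newly added node
        closer = ~in_tree & (D[u] < dist)
        dist[closer] = D[u][closer]
        parent[closer] = u
    return edges
```

Any other MST algorithm would do equally well here, since by Theorem 1 the Minimax distances depend only on some MST, not on how it was constructed.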

II. Pairwise Minimax distances over a MST
In the next step, after constructing a minimum spanning tree, we compute the pairwise Minimax distances over it.
A naive and straightforward algorithm would perform a Depth First Search (DFS) from each node to compute the Minimax distances, keeping track of the largest distance between the initial node and each of the traversed nodes. A single run of DFS requires O(N) runtime and thus the total time is O(N^2). However, such a method might visit some edges multiple times, which renders unnecessary extra computation. For instance, a maximal edge weight might appear in many DFSs, such that it is processed several times. Hence, a more elegant approach first determines all the objects whose pairwise distances are represented by an edge weight and then assigns the weight as their Minimax distances. According to Theorem 1, every edge in the MST represents some Minimax distances. Thus, we first find the edge(s) which represent only a few Minimax distances, namely only one pairwise distance, such that the respective objects can be immediately identified. Lemma 2 suggests the existence of this kind of edge. Lemma 2. In a minimum spanning tree, for the minimal edge weight we have D^MM_ij = D_ij, where i and j are the two objects that occur exactly at the two sides of the edge.
Proof. Without loss of generality, we assume that the edge weights are distinct. Let e_min denote the edge with minimal weight in the MST. We consider a pair of objects p and q such that at least one of them is not directly connected to e_min and show that e_min does not represent their Minimax distance. In an MST, there is exactly one path between each pair of objects. Thus, on the path between p and q there is at least one other edge whose weight is not smaller than the weight of e_min, which hence represents the Minimax distance between p and q, instead of e_min. Lemma 2 yields a dynamic programming approach to compute the pairwise Minimax distances from a tree. We first sort the edge weights of the MST, via for example merge sort or heap sort [4], which in the worst case requires O(N log N) time. We then consider each object as a separate component. We process the edge weights one by one from the smallest to the largest, and at each step, we set the Minimax distance of the two respective components to this weight. Then, we remove the two components and replace them by a new component constructed from their union. We repeat these two steps for all edges of the minimum spanning tree. A main advantage of this algorithm is that whenever an edge is processed, all the nodes whose Minimax distances it represents are already available in the components connected to its two sides. Algorithm 1 describes the procedure in detail.
Algorithm 1 uses the following data structures.
Finally, a new component is constructed and added to the component list by combining the two base components first_side and scnd_side. The ID of this new component is used as the ID of its members in component_id.
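The component-merging procedure might be sketched as follows. This is a simplified version of the idea behind Algorithm 1, not the paper's exact data structures: components are plain Python lists and merging is done naively, whereas a careful implementation would merge the smaller component into the larger one:

```python
import numpy as np

def pairwise_minimax_from_mst(n, mst_edges):
    """Fill the pairwise Minimax matrix from MST edges processed in
    increasing weight order: each edge's weight becomes the Minimax
    distance between every pair of nodes spanning its two components."""
    MM = np.zeros((n, n))
    components = {i: [i] for i in range(n)}   # component id -> member nodes
    comp_id = list(range(n))                  # node -> its component id
    for i, j, w in sorted(mst_edges, key=lambda e: e[2]):
        a, b = comp_id[i], comp_id[j]
        # the edge weight is the Minimax distance across the two components
        for p in components[a]:
            for q in components[b]:
                MM[p, q] = MM[q, p] = w
        # merge component b into component a
        for q in components[b]:
            comp_id[q] = a
        components[a].extend(components.pop(b))
    return MM
```

When an edge is processed, all pairs whose Minimax distance it represents are exactly the pairs crossing its two components, mirroring the property stated above.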

Embedding of pairwise Minimax distances
In the next step, given the matrix of pairwise Minimax distances D^MM, we obtain an embedding of the objects into a vector space such that their pairwise squared Euclidean distances in this new space are equal to their Minimax distances in the original space. For this purpose, in Theorem 3, we investigate a useful property of Minimax distance measures, called the ultrametric property [25], which can be used to prove the existence of such an embedding. Theorem 3. The pairwise Minimax distances D^MM constitute an ultrametric and therefore induce a squared Euclidean embedding. Proof. First, we show that the pairwise Minimax distances D^MM constitute an ultrametric. According to [25], the conditions to be satisfied are: i) ∀i: D^MM_ii = 0, ii) symmetry: ∀i,j: D^MM_ij = D^MM_ji, and iii) the ultrametric (strong triangle) inequality: ∀i,j,k: D^MM_ij ≤ max(D^MM_ik, D^MM_kj). The first condition follows directly from D_ii = 0. For the second, by assumption D is symmetric; therefore, any path from i to j is also a path from j to i, and vice versa. Thereby, their maximal weights and the minimum among the different paths are identical.
For the third condition, assume to the contrary that D^MM_ij > max(D^MM_ik, D^MM_kj) for some k. Then, according to the definition of Minimax distance, the path from i to k and then from k to j must be used for computing the Minimax distance between i and j, since its maximal edge weight, max(D^MM_ik, D^MM_kj), is smaller than D^MM_ij; thus a contradiction occurs.
On the other hand, ultrametric matrices are positive definite [6,37], and positive definite matrices induce a Euclidean embedding [33].
Notice that we do not require D to induce a metric, i.e., the triangle inequality is not necessarily fulfilled. After satisfying the feasibility condition, there exist several ways to compute a squared Euclidean embedding. We exploit a method motivated in [42] and further analyzed in [36] that is known as multidimensional scaling [28]. This method is based on centering D^MM to compute a Mercer kernel, which is positive semidefinite, and then performing an eigenvalue decomposition:

W^MM = -(1/2) A D^MM A, where the centering matrix A is defined as A = I_N - (1/N) 1 1^T.

An eigenvalue decomposition W^MM = V Λ V^T then yields the Minimax vectors Y_d = V_d (Λ_d)^{1/2}. Here, d shows the dimensionality of the Minimax vectors, i.e., the number of Minimax features. The Minimax dimensions are ordered according to the respective eigenvalues, and thereby we might choose only the first, most representative ones, instead of taking them all.
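The classical multidimensional scaling step might be sketched as follows (a minimal version assuming the input matrix is a valid ultrametric, so that the centered kernel is positive semidefinite; the function name and the eigenvalue threshold 1e-9 are ours):

```python
import numpy as np

def minimax_embedding(MM, d=None):
    """Classical MDS on a Minimax distance matrix: double-center MM to
    obtain a PSD kernel, then eigendecompose; the returned vectors have
    pairwise squared Euclidean distances equal to MM."""
    n = len(MM)
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix A = I - (1/N) 1 1^T
    W = -0.5 * J @ MM @ J                  # centered (Mercer) kernel
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1]         # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    if d is None:
        d = int(np.sum(vals > 1e-9))       # keep the numerically positive part
    return vecs[:, :d] * np.sqrt(np.clip(vals[:d], 0.0, None))
```

Choosing a smaller d keeps only the most representative Minimax features, as described above.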

Embedding of collective Minimax pairwise matrices
We extend our generic framework to the case where multiple pairwise Minimax matrices are available, instead of a single matrix. Then, the goal is to find an embedding of the objects into a new space wherein their pairwise squared Euclidean distances are related to the collective Minimax distances over the different Minimax matrices. Such a scenario might be interesting in several situations: • There might exist different types of relations between the objects, where each renders a separate graph. Then, we compute several pairwise Minimax distances, one for each specific relation.
• Minimax distances might fail when, for example, a few noise objects connect two compact classes. Then, the interclass Minimax distances become very small, even if the objects from the two classes are connected via only a few outliers. To solve this issue, similar to model averaging, one could use the higher-order, i.e., the second, third, ... Minimax distances, e.g., the second smallest maximal gap. Then, there will be multiple pairwise Minimax matrices, each representing the k-th Minimax distance. • In many real-world applications, we encounter high-dimensional data, where the patterns might be hidden in some unknown subspace instead of the whole space, such that they are disturbed in the high-dimensional space due to the curse of dimensionality. Minimax distances rely on the existence of well-connected paths, whereas such paths might be very sparse or fluctuating in high dimensions. Thereby, similar to ensemble methods, it is natural to seek connectivity paths, and thus Minimax distances, in subspaces of the original space. However, investigating all possible subspaces is computationally intractable, as their number scales exponentially with the number of dimensions. Hence, we propose an approximate approach based on computing Minimax distances for each dimension, which leads to multiple Minimax matrices.
In all the aforementioned cases, we need to deal with (say M) different matrices of Minimax distances computed for the same set of objects. Then, the next step is to investigate the existence of an embedding that represents the pairwise collective Minimax distances. To prove the existence of such an embedding for M = 1, we used the ultrametric property of the respective Minimax distances and then concluded positive definiteness. However, as shown in Theorem 4, the sum of M > 1 Minimax matrices does not necessarily satisfy the ultrametric property.
Nevertheless, a valid embedding can still be obtained by working with the centered kernel matrices W^MM(m) instead of the distance matrices themselves. Proof. According to Theorem 3, each D^MM(m) yields an L2^2 embedding, which implies that W^MM(m) is positive definite [36,42]. On the other hand, the sum of multiple positive definite matrices (i.e., W^cMM) is a positive definite matrix too [18]. Thus, the eigenvalues of W^cMM are non-negative and it induces an L2^2 embedding.
Thereby, instead of summing the D^MM(m) matrices, we sum up the W^MM(m) matrices and then perform an eigenvalue decomposition, to compute an embedding wherein the pairwise squared Euclidean distances correspond to the collective pairwise Minimax distances. As will be mentioned in the experiments, we perform the above computations on the squared pairwise base dissimilarities.
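The collective embedding might be sketched as follows (assuming each input matrix is a valid Minimax/ultrametric matrix, so its centered kernel is PSD; function name and threshold are ours):

```python
import numpy as np

def collective_minimax_embedding(MM_list, d=None):
    """Sum the centered kernels of the individual Minimax matrices
    (each PSD by Theorem 3), then eigendecompose the PSD sum to obtain
    a collective L2^2 embedding."""
    n = len(MM_list[0])
    J = np.eye(n) - np.ones((n, n)) / n
    # sum of centered kernels, not of the distance matrices themselves
    W = sum(-0.5 * J @ MM @ J for MM in MM_list)
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    if d is None:
        d = int(np.sum(vals > 1e-9))
    return vecs[:, :d] * np.sqrt(np.clip(vals[:d], 0.0, None))
```

With M identical input matrices, the resulting squared distances are simply M times the original Minimax distances, which illustrates that the collective embedding aggregates the individual ones.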
Efficient calculation of dimension-specific Minimax distances. In this paper, we particularly study the use of the collective Minimax embedding for high-dimensional data (called the dimension-specific variant). The dimension-specific variant may require computing pairwise Minimax distances for each dimension, which can be computationally expensive. However, in this setting, the objects lie in a one-dimensional space. A main property of one-dimensional data is that sorting it immediately gives a minimum spanning tree. We then compute the pairwise distances between each pair of consecutive objects in the sorted list to obtain the edge weights of the minimum spanning tree. Finally, we compute the pairwise Minimax distances from the minimum spanning tree.
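The one-dimensional shortcut described above might be sketched as follows (function name ours): consecutive points in sorted order are exactly the MST edges, so no quadratic MST construction is needed per dimension.

```python
import numpy as np

def mst_edges_1d(x):
    """For one-dimensional data, sorting gives an MST directly:
    the edges connect consecutive points in the sorted order, with
    weights equal to the gaps between them."""
    order = np.argsort(x)
    return [(int(order[t]), int(order[t + 1]),
             float(abs(x[order[t + 1]] - x[order[t]])))
            for t in range(len(x) - 1)]
```

The resulting N−1 edges can then be fed to the component-merging step to obtain the per-dimension Minimax matrix in O(N log N) overall.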

One-To-All Minimax Distance Measures
In this section, we study computing Minimax distances from a fixed (test) object, which can in particular be used in K-nearest neighbor search. In this setup, we are given a graph G(O, D) (as the training dataset) and a new object (node) v. O indicates a set of N (training) objects with the corresponding pairwise dissimilarities D, i.e., D denotes the weights of the edges in G. The goal is to compute the K objects in O whose Minimax distances from v are not larger than those of any other object in O. We assume there is a function d(v, O) which computes the pairwise dissimilarities between v and all the objects in O. Thereby, by adding v to G we obtain the graph G+(O ∪ {v}, D ∪ d(v, O)).

Minimax K-nearest neighbor search
We introduce an incremental approach for computing the Minimax K-nearest neighbors of the new object v, i.e., we obtain the (T+1)-th Minimax neighbor of v given the first T Minimax neighbors. Thereby, first, we define the partial neighborhood N_P^T(v) as the first T Minimax neighbors of v over the graph G+. Theorem 6 provides a way to extend N_P^T(v) step by step until it contains the K nearest neighbors of v according to Minimax distances. Theorem 6. The unselected node with the minimal direct distance to {v ∪ N_P^T(v)} is the (T+1)-th Minimax nearest neighbor of v. Proof. Let us call this new (potential) neighbor u. Let e_u* indicate the edge (with the smallest weight) connecting u to {v ∪ N_P^T(v)}, and let p ∈ {v ∪ N_P^T(v)} denote the node at the other side of e_u*. We consider two cases: (a) The weight of e_u*, i.e., D_{p,u}, is not larger than the Minimax distance between v and p. Then D^MM_{v,u} = D^MM_{v,p}, since the path to u via p attains the same maximal weight as the path to p. Moreover, it can be shown that there is no other path from v to u whose maximum weight is smaller than D_{p,u}: if such a path existed, then there would be a node p' ∈ N_P^T(v) which has a smaller distance to an external node u'. This leads to a contradiction, since u is the closest external neighbor of N_P^T(v).
(b) For any other unselected node u' we have D^MM_{v,u'} ≥ D_{p,u}, because computing the Minimax distance between v and u' requires visiting an edge with one side inside N_P^T(v) and the other side outside it. Among such edges, e_u* has the minimal weight; therefore, D^MM_{v,u'} cannot be smaller than D_{p,u}. Finally, from (a) and (b) we conclude that D^MM_{v,u} ≤ D^MM_{v,u'} for any unselected node u' ≠ u. Hence, u has the smallest Minimax distance to v among all unselected nodes.
Theorem 6 proposes a dynamic programming approach to compute the Minimax K-nearest neighbors of v. Iteratively, at each step, we obtain the next Minimax nearest neighbor by selecting the external (unselected) node u which has the minimum distance to the nodes in {v ∪ N_P^T(v)}. This procedure is repeated K times. Algorithm 2 describes the steps in detail. Since the algorithm is based on finding the nearest unselected node, the vector dist keeps the minimum distance of each unselected node to (one of the members of) {v ∪ N_P^T(v)}. Thereby, the algorithm finds a new neighbor in two steps: i) extension: it picks the minimum of dist and adds the respective node to N_P^T(v), and ii) update: it updates dist by checking whether an unselected node has a smaller distance to the new N_P^T(v). Thus dist is updated as dist = min(dist, D_{min_ind,:}), except for the members of N_P^T(v).
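The extension/update iteration might be sketched as follows (a simplified version of the idea behind Algorithm 2, with names of our choosing; the Minimax distance of each new neighbor is tracked as the running maximum of the attachment weights, which is valid because neighbors are added in nondecreasing Minimax order):

```python
import numpy as np

def minimax_knn(D, d_v, K):
    """Incremental Minimax K-NN of a test object v.
    D: N x N pairwise dissimilarities among the training objects;
    d_v: dissimilarities from v to the N training objects.
    Returns the K neighbors as (index, minimax_distance), in order."""
    n = len(d_v)
    dist = d_v.astype(float).copy()    # min distance to {v} ∪ selected set
    selected = np.zeros(n, dtype=bool)
    neighbors = []
    mm = 0.0                           # Minimax distance of the latest neighbor
    for _ in range(K):
        # extension: pick the nearest unselected node
        u = int(np.argmin(np.where(selected, np.inf, dist)))
        mm = max(mm, float(dist[u]))
        neighbors.append((u, mm))
        selected[u] = True
        # update: relax distances through the newly selected node
        dist = np.minimum(dist, D[u])
    return neighbors
```

Running the loop N times instead of K yields sorted one-to-all Minimax distances, mirroring the Prim-sorting property discussed below.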
Computational complexity. Each step of Algorithm 2, either the extension or the update, requires O(N) running time. Therefore, for a constant K, the total complexity is O(N), which is the same as for the standard K-nearest neighbor method. The standard K-nearest neighbor algorithm computes only the first step; our algorithm only adds a second update step. It is thereby more efficient than the message passing method [22], which requires more visits of the objects and also builds a complete minimum spanning tree in advance on G(O, D). According to Theorem 1, computing pairwise Minimax distances requires first computing a minimum spanning tree over the underlying graph. Thereby, here, we study the connection between Algorithm 2 (i.e., Minimax K-NN search) and minimum spanning trees. For this purpose, we first consider a general framework for constructing minimum spanning trees, called the generalized greedy algorithm [10]. Consider a forest (collection) of trees {T_p}. The distance between two trees T_p and T_q is obtained by

dist(T_p, T_q) = min_{i ∈ T_p, j ∈ T_q} D_ij.

The nearest neighbor tree of T_p, i.e., T*_p, is obtained by

T*_p = argmin_{T_q ≠ T_p} dist(T_p, T_q),

and e*_p denotes the edge realizing this minimal inter-tree distance, i.e., the edge connecting T_p to its nearest tree. It can be shown that e*_p belongs to a minimum spanning tree on the graph, as otherwise a contradiction occurs [10]. This result provides a generic way to compute a minimum spanning tree. A greedy MST algorithm at each step i) picks two candidate (base) trees where one is the nearest neighbor of the other, and ii) combines them via their shortest distance (edge) to build a larger tree. This analysis guarantees that Algorithm 2 yields a minimum spanning subtree on {v ∪ N_P^T(v)} which is additionally a subtree of a larger minimum spanning tree on the whole graph G+(O ∪ {v}, D ∪ d(v, O)). In the context of the generalized greedy algorithm, Algorithm 2 generates the MST by growing only the tree started from v, while the other candidate trees remain singleton nodes.
A complete minimum spanning tree would be constructed if the attachment continued for N steps instead of K. This procedure would then yield a Prim minimum spanning tree [31]. This analysis reveals an interesting property of Prim's algorithm: it 'sorts' the nodes based on their Minimax distances from/to the initial (test) node v.
Computational optimality. This analysis also reveals the 'computational optimality' of our approach (Algorithm 2) compared to alternative Minimax search methods, e.g., the method introduced in [22]. Our algorithm always expands the tree which contains the initial node v, whereas the alternative methods might sometimes combine trees that have no impact on the Minimax K-nearest neighbors of v. In particular, the method in [22] constructs a complete MST, whereas we build a partial MST with only and exactly K edges. Any new node that we add to the partial MST belongs to the K Minimax nearest neighbors of v, i.e., we do not investigate or add any unnecessary node.

One-to-all Minimax distances and minimum spanning trees
A straightforward generalization of Algorithm 2 gives sorted one-to-all Minimax distances from the target (test) node v to all other nodes: we only need to run the steps (extension and update) N times instead of K. In Theorem 1, we observed that computing all-pair Minimax distances is consistent/equivalent with computing a minimum spanning tree, such that an MST provides all the necessary and sufficient edge weights for this purpose. We then proposed an efficient algorithm for the one-to-all problem that also yields a (partial) minimum spanning tree (according to Prim's algorithm). Therefore, here we study the following question: Is it necessary to build a minimum spanning tree to compute the 'one-to-all' Minimax distances, similar to the 'all-pair' Minimax distances?
We reconsider the proof of Theorem 6, which has led us to Algorithm 2; the proof investigates two cases, and we examine the first case in more detail. Let us look at the largest Minimax distance selected up to step T, i.e., the edge with maximal weight whose both sides occur inside {v ∪ N_P^T(v)} and whose weight represents a Minimax distance. We call this edge e^max_T. Among the different external neighbors of {v ∪ N_P^T(v)}, any node u ∈ {O \ N_P^T(v)} whose minimal distance to {v ∪ N_P^T(v)} does not exceed the weight of e^max_T can be selected, even if it does not belong to a minimum spanning tree over G+, because the Minimax distance will still be the weight of e^max_T.
In general, any node u ∈ {O \ N_P^T(v)} whose Minimax distance to a selected member p, i.e., D^MM_{p,u}, is not larger than the weight of e^max_T, can be selected as the next nearest neighbor 7. This is concluded from the ultrametric property of Minimax distances:

D^MM_{v,u} ≤ max(D^MM_{v,p}, D^MM_{p,u}),

where p is an arbitrary object (node). In this setting, we then have D^MM_{v,u} equal to the weight of e^max_T. An example is illustrated in Figure 3, where K is fixed at 2. After computing the first nearest neighbor (i.e., p), the next one can be any of the remaining objects, as their Minimax distance to v is the weight of the edge connecting p to v (Figure 3(a)) or p to u (Figure 3(b)). Thereby, one could choose a node such that the partial MST on {v ∪ N_P(v)} is not a subset of any complete MST on the whole graph. Therefore, computing one-to-all Minimax distances does not necessarily require constructing a minimum spanning tree.
Thereby, such an analysis suggests an even more generic algorithm for computing one-to-all Minimax distances (including Minimax K-nearest neighbor search) which does not necessarily yield an MST on the entire graph: to expand N_P^T(v), we add a new node u whose Minimax distance (not the direct dissimilarity) to {v ∪ N_P^T(v)} is minimal. Algorithm 2 is only one way, but an efficient one, that additionally sorts the one-to-all Minimax distances.

Outlier detection with Minimax K-nearest neighbor search
While performing K-nearest neighbor search, the test objects might not necessarily comply with the structure of the train dataset, i.e., some of the test objects might be outliers or belong to classes other than those in the train dataset. More concretely, when computing the K nearest neighbors of v, we might realize that v is an outlier or irrelevant with respect to the underlying structure in O. We study this case in more detail and propose an efficient algorithm which, while computing the Minimax K nearest neighbors, also detects whether v could be an outlier. Notice that the ability to detect outliers is an additional feature of our algorithm, i.e., we do not aim at proposing the best outlier detection method; rather, the goal is to maximize the benefit of performing Minimax K-NN while still having an O(N) runtime.
^7 We notice that in the second case, i.e. when D_{p,u} > D^{MM}_{v,p}, we need to add an unselected node whose Minimax distance to v (to one of the nodes in NP_T(v)) is equal to D_{p,u}. This situation might also yield adding an edge which does not necessarily belong to an MST over G^+. The argument is similar to the case when D_{p,u} ≤ D^{MM}_{v,p}.
Thereby, we follow our analysis by investigating another special aspect of Algorithm 2: as we extend NP_T(v), the edge representing the Minimax distance between v and the new member u (i.e., the edge with the largest weight on the path realizing the Minimax distance between v and u) is always connected to v. For this purpose, we reconsider the proof of Theorem 6. When u is being added to NP_T(v), two special cases might occur:
1. u is directly connected to v, i.e. p = v.
2. u is not directly connected to v (i.e. p ≠ v), but its Minimax distance to v is not represented by D_{p,u}, i.e., D_{p,u} < D^{MM}_{v,p}.
To distinguish these cases, we keep a binary vector updated, where updated_i = 0 if dist_i has never been modified in an update step, and updated_i = 1 otherwise.
At the beginning, NP_T(v) is empty, so updated is initialized with a vector of zeros. Whenever we modify an element of the vector dist in an update step, we set the corresponding index of updated to 1. At each step of Algorithm 2, a new edge is added to the set of selected edges, whose weight is D_{p,u}. As mentioned before, v is labeled as an outlier if, i) some of the edges are directly connected to v and some others only indirectly, and ii) no indirect edge weight represents a Minimax distance, i.e., the minimum of the edge weights directly connected to v (stored in min_direct) is larger than the maximum of the edge weights not connected to v (stored in max_indrct). The latter corresponds to the condition D^{MM}_{v,p} > D_{p,u}. Thereby, whenever we pick the nearest neighbor of {v ∪ NP_T(v)} in the extension step, we check the updated status of the new member:
1. If updated_{min_ind} = 1, we update max_indrct by max(max_indrct, dist_{min_ind}).
2. If updated_{min_ind} = 0, we update min_direct by min(min_direct, dist_{min_ind}).
Finally, if min_direct > max_indrct and max_indrct ≠ −1 (max_indrct is initialized with −1; this condition ensures that at least one indirect edge has been selected), then v is labeled as an outlier. Algorithm 3 describes the procedure in detail. Compared to Algorithm 2, we have added: i) steps 11-15 for keeping statistics about the direct and indirect edge weights (i.e., updating max_indrct and min_direct), ii) an additional step in line 20 for updating the type of connection of the external nodes to the set {v ∪ NP_T(v)}, and iii) a final check of whether v is an outlier.^8
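A simplified rendering of this bookkeeping (not the paper's exact Algorithm 3; the Prim-style dist array and the names min_direct/max_indrct follow the text, while the function name and interface are our own):

```python
import numpy as np

def minimax_knn_outlier(D_train, d_v, K):
    """Minimax K-NN of a test object v plus the outlier flag.
    D_train: base distances among train objects; d_v: base distances
    from v to each train object."""
    N = len(d_v)
    dist = np.asarray(d_v, dtype=float).copy()   # Prim-style distances to the set
    updated = np.zeros(N, dtype=bool)            # 1 iff dist_i was modified
    selected = np.zeros(N, dtype=bool)
    min_direct, max_indrct = np.inf, -1.0
    neighbors = []
    for _ in range(K):
        cand = np.where(~selected)[0]
        u = cand[np.argmin(dist[cand])]          # extension step
        if updated[u]:                           # edge not connected to v
            max_indrct = max(max_indrct, dist[u])
        else:                                    # edge directly connected to v
            min_direct = min(min_direct, dist[u])
        selected[u] = True
        neighbors.append(int(u))
        for w in cand:                           # update step
            if w != u and D_train[u, w] < dist[w]:
                dist[w] = D_train[u, w]
                updated[w] = True
    is_outlier = (max_indrct != -1.0) and (min_direct > max_indrct)
    return neighbors, bool(is_outlier)
```

A test point far from a tight cluster is reached by a few (large) direct edges while the cluster is traversed via small indirect edges, so min_direct exceeds max_indrct and the flag fires.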

Experiments
We experimentally study the performance of Minimax distance measures on a variety of synthetic and real-world datasets and illustrate how the use of Minimax distances as an intermediate step improves the results. In each dataset, each object is represented by a vector.

Classification with Minimax distances
First, we study classification with Minimax distances. We use Logistic Regression (LogReg) and Support Vector Machines (SVM) as the baseline methods and investigate how performing these methods on the vectors induced from Minimax distances improves the prediction accuracy. With SVM, we examine three different kernels, i) linear (lin), ii) radial basis function (rbf), and iii) sigmoid (sig), and choose the best result. With Minimax distances, we only use the linear kernel, since we expect Minimax distances to capture the correct classes, such that they can then be discriminated via a linear separator.
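The vectors induced from Minimax distances come from the embedding described earlier: double-center the pairwise Minimax matrix and eigendecompose it, classical-scaling style, so that squared Euclidean distances between the resulting rows reproduce the Minimax distances. The following is a minimal sketch (our simplification; the tolerance 1e-9 is an assumed numerical cutoff), whose rows can then be fed to, e.g., LogReg or a linear SVM:

```python
import numpy as np

def minimax_embedding(MM, dim=None):
    """Embed objects so that squared Euclidean distances in the new
    space reproduce the pairwise Minimax distances in MM."""
    N = MM.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    W = -0.5 * J @ MM @ J                 # centered matrix
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1]        # sort dimensions by eigenvalue
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-9 if dim is None else slice(0, dim)
    return vecs[:, keep] * np.sqrt(np.clip(vals[keep], 0.0, None))
```

Because Minimax distances form an ultrametric, the centered matrix is positive semidefinite and the reconstruction is exact when all positive-eigenvalue dimensions are kept; the `dim` argument implements the truncation discussed in the model order selection section.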

Experiments with synthetic data
We first perform our experiments on two synthetic datasets, i) DS1 [2] and ii) DS2 [38], shown in Figure 5. The goal is to demonstrate the superior ability of Minimax distances, compared to kernel methods, to capture the correct class-specific structures, particularly when the classes have different types (DS2). Table 1 shows the accuracy scores for the different methods. The standard SVM is performed with three different kernels (lin, rbf and sig), and the best choice, which is the rbf kernel, is shown. As mentioned, with Minimax distances we only use the linear kernel. We observe that performing classification on Minimax vectors yields the best results, since it enables the methods to better identify the correct classes. The datasets differ in the type and consistency of the classes: DS1 contains very similar classes which are Gaussian, whereas DS2 consists of classes which differ in shape and type. Therefore, for DS1 we are able to find an optimal kernel (rbf, since the classes are Gaussian) with a global form and parametrization which fits the data and thus yields very good results. However, in the case of DS2, since the classes have different shapes, a single kernel is not able to correctly capture all of them. For this dataset, LogReg and SVM with Minimax vectors perform better, since they can adapt to the class-specific structures. Note that in the case of DS1, using Minimax vectors is as good as using the optimal rbf kernel; remember that, on Minimax vectors, we apply only SVM with a linear kernel. Figures 5(c) and 5(d) show the accuracy and the eigenvalues w.r.t. different dimensionalities of the Minimax vectors. As mentioned earlier, the dimensions of the Minimax vectors are sorted according to the respective eigenvalues, since a larger eigenvalue indicates a higher importance. By choosing only a few dimensions, the accuracy already attains its maximal value. We elaborate on this in more detail further on.
We note that after computing the matrix of pairwise Minimax distances, one can in principle apply any of the embedding methods (e.g., PCA, SVD, etc.) either to the original pairwise distance matrix or to the pairwise Minimax distance matrix. Therefore, our contribution is orthogonal to such embedding methods. For example, on these synthetic datasets, when we apply PCA to the original distance matrix, we do not observe a significant improvement in the results (e.g., 0.7838 on DS1 and 0.6671 on DS2 with LogReg).

Classification of UCI data
We then perform real-world experiments on twelve datasets from different domains, selected from the UCI repository [26]:
(1) Balance Scale: contains 625 observations modeling 3 types of psychological experiments.
(2) Banknote Authentication: includes 1372 images taken from genuine and forged banknote-like specimens (the number of classes is 2).
There are in total 560 measurements.
(11) Skin Segmentation: the original dataset contains 245,057 instances generated using skin textures from face images of different people. However, to make the classification task more difficult, we pick only the first 1,000 instances of each class (to decrease the number of objects per class). The target variable is skin or non-skin, i.e., the number of classes is 2.
We call these datasets respectively UCI1, UCI2, ..., UCI12.
Accuracy scores. Table 2 shows the results of the different methods applied to the datasets when 60% of the objects are used for training. We repeat the random split of the data 20 times and report the average results. The scores and the rankings of the different methods are very consistent among the different splits, such that the standard deviations are low (maximally about 0.016). We observe that performing the classification methods on Minimax vectors often improves the classification accuracy. Only in very few cases does the standard setup (slightly) outperform; in the rest, either the Minimax vectors or the dimension-specific variant of Minimax vectors yields a better performance. In particular, the Minimax variant is more appropriate for low-dimensional data, whereas the dimension-specific Minimax variant outperforms on high-dimensional data. We elaborate more on the choice between them in the 'model order selection' section. Table 3 shows the results when only 10% of the objects are used for training. For this setting, we observe a performance consistent with the setting where 60% of the data is used for training, i.e., the use of Minimax vectors (either the standard variant or the dimension-specific one) improves the accuracy scores. In this setting, only on UCI12 does the non-Minimax method perform better.
However, even on this dataset, the dimension-specific Minimax variant performs very close to the best result. Such consistency is observed for other ratios of train and test sets too.
Model order selection. Choosing the appropriate number of dimensions for the Minimax vectors (i.e., their dimensionality) constitutes a model order selection problem. We study in detail how the dimensionality of the Minimax vectors affects the results. Figure 6 shows the accuracy scores for Minimax-LogReg applied to four of the datasets w.r.t. different numbers of dimensions (the other datasets behave similarly). The dimensions are ordered according to their importance (the value of the respective eigenvalue). Choosing a very small number of dimensions might be insufficient, since we might lose some informative dimensions, which yields underfitting. By increasing the dimensionality, the method extracts more of the relevant information from the data, and thus the accuracy improves. We note that this phase occurs for a very wide range of choices of the dimensionality. However, by increasing the number of dimensions even further, we might include non-informative or noisy features (with very small eigenvalues), where the accuracy then stays constant or decreases slightly, due to overfitting. An interesting advantage of this approach, however, is that the overfitting dimensions (if any exist) have very small eigenvalues, thus their impact is negligible. This analysis leads to a simple and effective model order selection criterion: fix a small threshold and pick the dimensions whose respective eigenvalues are larger than the threshold. The exact value of this threshold does not play a critical role, or it can be estimated via a validation set in a supervised learning setting.
Model selection. According to our experimental results, very often either the standard Minimax classification or the dimension-specific variant outperforms the baselines. In a supervised learning setting, the correct choice between these two Minimax variants (i.e., the model selection task) can be made by measuring the prediction ability, e.g., the accuracy score. However, this approach might not be applicable in unsupervised learning, e.g., clustering, where the ground truth is not given. We investigate a simple heuristic which can be employed in any arbitrary setting. When we perform the eigendecomposition of the centered matrix (W^{MM} or W^{cMM}), we normalize the eigenvalues by the largest of them, such that the largest eigenvalue equals 1. Then, the variant that yields a sharper decay of the eigenvalues is supposed to be the better model. A sharper decay of the eigenvalues indicates a tighter confinement to a low-dimensional subspace, i.e., a lower complexity of the data representation. This possibly yields better learning, and thereby a higher accuracy score.
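One hypothetical way to operationalize "sharper decay" is to compare partial sums of the normalized eigenspectra (a smaller sum of the top-k normalized eigenvalues means faster decay); the cut-off k and this particular statistic are assumptions of the sketch, not the paper's definition:

```python
import numpy as np

def sharper_decay(eigs_a, eigs_b, k=10):
    """Return 'A' if spectrum A decays faster than B, judged by the
    smaller partial sum of the top-k normalized eigenvalues."""
    def head_sum(e):
        v = np.sort(np.asarray(e, dtype=float))[::-1]
        v = v / v[0]                      # largest eigenvalue -> 1
        return v[:k].sum()
    return 'A' if head_sum(eigs_a) < head_sum(eigs_b) else 'B'
```
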
To analyze this heuristic, we observe that in our experiments on Cloud, Glass Identification, Haberman Survival, Perfume and Skin Segmentation, either the accuracy scores are not significantly distinguishable or there is no consistent ranking of the two variants (Minimax or dimension-specific Minimax), such that it is difficult to decide which one is better. For these datasets the eigenspectra are very similar, as shown for example for the Cloud dataset in Figure 7(a). Among the remaining datasets, the heuristic works properly on the Balance Scale, Contraceptive Method, Hayes Roth, Lung Cancer and User Knowledge datasets, where two sample eigenspectra are shown in Figures 7(b) and 7(c). Finally, the heuristic fails only on the Banknote Authentication and Ionosphere datasets. However, for these datasets the accuracy scores as well as the decays are rather similar, as shown for example for Banknote Authentication in Figure 7(d).
Efficiency. As a side study, we also investigate the runtimes of computing pairwise Minimax distances either via computing an MST (MST-MM) or via the Floyd-Warshall algorithm (FW-MM). The results (in seconds) are shown in Table 4 for some of the datasets. We observe that MST-MM yields a significantly faster approach to compute the pairwise Minimax distances. The difference is even more pronounced for larger datasets. We note that both methods yield the same Minimax distances. We observe consistent results on the other datasets.

Experiments on clustering of document scannings
We study the impact of performing Minimax distances on clustering the scannings of documents collected by a large document processing corporation. The dataset contains the vectors of 675 documents, each represented in a 4096-dimensional space. The vectors contain metadata, the document layout, tags used in the document, and the images. We use the adjusted Rand score [20] and the adjusted Mutual Information [39] to evaluate the performance of the different methods. The Rand score computes the similarity between the estimated and the true clusterings; Mutual Information measures the mutual information between the two solutions. Note that we compute the adjusted versions of these criteria, such that they yield zero for a random clustering. Table 5 shows the results on the three datasets. The number in front of the different dimension-specific (DimMM) variants indicates b. We repeat the DimMM variant 10 times and report the mean (std) results. We observe: i) computing Minimax distances enables GMM to better capture the underlying structures and thus improves the results; ii) the dimension-specific variant improves the clusters even further by extracting the appropriate structures in different subspaces; iii) however, the first improvement, i.e., whether or not Minimax distances are used at all, has a more significant impact than the choice between the standard and the dimension-specific Minimax variants. Finally, we note that DimMM is equivalent to the standard Minimax if b = 4096.
These datasets are well-known and publicly accessible, e.g., via sklearn.datasets. For the 20 newsgroup datasets, we obtain the vectors after eliminating the stop words. Cosine similarity is known to be a better choice for textual data. Thus, for each pair of document vectors x_i and x_j, we compute their cosine similarity and then their dissimilarity by 1 − cosine(x_i, x_j) to construct the pairwise dissimilarity matrix D. For non-textual datasets, we directly compute the pairwise squared Euclidean distances.
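Constructing the cosine-based dissimilarity matrix D can be sketched directly from the formula above (the small clipping constant, our own addition, guards against zero-norm vectors):

```python
import numpy as np

def cosine_dissimilarity(X):
    """Pairwise dissimilarities 1 - cosine(x_i, x_j) for row vectors X,
    used as the edge weights of the graph G(O, D)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # normalize rows, guard zero norms
    return 1.0 - Xn @ Xn.T                # cosine of normalized rows is a dot product
```
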
Thereby, we construct graph G(O, D). Note that the method in [23] is not applicable to cosine-based dissimilarities. We compare our algorithm (called PTMM) against the following methods.
L+: computes the pseudo-inverse of the Laplacian only for the training dataset and would then extend it to contain the out-of-sample object.
We perform the experiments and compute the accuracy in a leave-one-out manner, i.e., the i-th object is left out and its label is predicted from the other objects. In the final voting, the nearest neighbors are weighted according to the inverse of their dissimilarities to the test object. We calculate the accuracy by averaging over all N objects. M-fold cross validation could be used instead of the leave-one-out approach; however, the respective ranking of the runtimes of the different methods is independent of the way the accuracy is computed.
Figure 8 illustrates the runtimes of the different measures on the 20 newsgroup datasets. Our algorithm for computing the Minimax distances (i.e., PTMM) is significantly faster than MPMM. PTMM is only slightly slower than the standard method and is even slightly faster than Link. The main source of computational inefficiency of MPMM is that it requires an MST, which needs an O(N^2) runtime. PTMM theoretically has the same complexity as STND (i.e., O(N)) and in practice performs only slightly slower, due to the one additional update step. As mentioned, L+ is considerably slower, even compared to MPMM (see Table 6).
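The weighted final voting can be sketched as follows (the eps guard against zero dissimilarities and the tie-breaking by dictionary order are our own choices):

```python
from collections import defaultdict

def weighted_knn_vote(neighbor_labels, neighbor_dists, eps=1e-12):
    """Predict the label by weighting each neighbor with the inverse of
    its dissimilarity to the test object."""
    votes = defaultdict(float)
    for label, d in zip(neighbor_labels, neighbor_dists):
        votes[label] += 1.0 / (d + eps)   # closer neighbors count more
    return max(votes, key=votes.get)
```
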
To demonstrate the generality of our algorithm, we have performed experiments on datasets from other domains. Figure 10 shows the runtime of different methods on non-textual datasets (IRIS, OLVT, and DGTS). We observe a consistent performance behavior: PTMM runs only slightly slower than STND and much faster than MPMM. Link is slightly slower than PTMM.
The usefulness of Minimax distances for K-NN classification has been investigated in previous studies, e.g., in [22,23]. However, as a proof of concept, in Figures 9 and 11 we illustrate the accuracy of the different methods on the textual and non-textual datasets. Note that PTMM and MPMM yield the same accuracy scores; they differ only in the way they compute the Minimax nearest neighbors of v, hence we compare them only w.r.t. runtime. Consistent with the previous studies, we observe that the Minimax measure is always the best choice, i.e., it yields the highest accuracy. We emphasize that this is the first study that demonstrates the effectiveness and performance of Minimax K-nearest neighbor classification beyond Euclidean distances (i.e., with cosine similarity, a measure known to be more appropriate for textual data). As mentioned earlier, the previous studies [22,23] always build the Minimax distances on squared Euclidean measures.
Scalability. Finally, we study the scalability of the different methods on larger datasets generated from the two-moons data. As shown in [23], only the L+ and Minimax measures perform correctly on this data, which has a non-linear geometric structure. Table 7 compares the runtimes of these methods for K = 5. PTMM performs significantly faster than the alternatives, e.g., it is three times faster than MPMM and 250 times faster than L+ for 10,000 objects.

Outlier detection along with Minimax K-NN search
At the end, we experimentally investigate the ability of Algorithm 3 to detect outliers while performing K-nearest neighbor search. We run our experiments on the textual datasets, i.e., on COMP, REC, SCI, and TALK. We add the unused documents of the 20 newsgroup collection to each dataset to play the role of outliers or a mismatching data source. This outlier dataset consists of the 1,664 documents of the 'alt.atheism', 'misc.forsale' and 'soc.religion.christian' categories. For each document, either from the train or the outlier set, we compute its K nearest neighbors only among the documents of the train dataset and check whether it is predicted as an outlier or not.
We compare our algorithm (called here PTMMout), against the two others: STND and Link. These two algorithms do not directly provide outlier prediction. Thereby, in order to perform a fair comparison, we compute a Prim minimum spanning tree over the set containing v and the objects selected by the K-NN search, where v is the initial node. Then we check the same conditions to determine whether v is determined as an outlier. We call these two variants respectively STNDout and LINKout.
We obtain the rate of correct outlier prediction on the outlier dataset, called rate_1, as the ratio of the number of correct outlier reports to the size of this set. However, an algorithm might report an artificially high number of outliers. Thereby, similar to the precision-recall trade-off, we also consider the train dataset and compute the ratio of the number of non-outlier reports to the total size of the train dataset, called rate_2. Finally, similar to the F-measure, we compute the harmonic average of rate_1 and rate_2, called o_rate, i.e.,
o_rate = (2 · rate_1 · rate_2) / (rate_1 + rate_2).    (12)
Figure 12 shows the o_rate scores for the different datasets and for different values of K. For all datasets, we observe that PTMMout yields a significantly higher o_rate than the other algorithms. Moreover, the o_rate of PTMMout is remarkably stable over different K, i.e., the choice of the optimal K is much less critical. For STNDout and LINKout, o_rate is maximal only within a fairly small range of K (e.g., between 30 and 50) and then sharply drops to zero. Figure 13 provides an in-depth analysis of rate_1 and rate_2 for the COMP dataset. Generally, as we increase K, rate_1 decreases but rate_2 increases. The reason is that as we involve more neighbors in NP(v), the chance increases of having an edge whose both endpoints are inside NP(v) and whose weight represents a Minimax distance. This case occurs particularly when NP(v) contains objects from more than one class. Thereby, the probability that an object is labeled as a non-outlier increases. However, this transition is very sharp for STNDout and LINKout, such that for a relatively small K they label every object (whether from the train set or from the outlier set) as a non-outlier. Moreover, as mentioned before, even for small K, STNDout and LINKout yield a significantly lower o_rate than PTMMout.
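The harmonic average of Eq. (12) can be rendered directly (the zero-sum guard is our own addition for the degenerate case where both rates are zero):

```python
def o_rate(rate_1, rate_2):
    """Harmonic average of rate_1 and rate_2 (Eq. 12), as in the F-measure."""
    if rate_1 + rate_2 == 0:
        return 0.0
    return 2.0 * rate_1 * rate_2 / (rate_1 + rate_2)
```

As with the F-measure, o_rate is high only when both rates are high, so neither over-reporting nor under-reporting outliers can inflate the score.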

Conclusion
We developed a framework to apply Minimax distances to any learning algorithm that works on numerical data, taking into account both generality and efficiency. We studied both computing the pairwise Minimax distances for all pairs of objects and computing the Minimax distances of all objects to/from a fixed (test) object. For the all-pair case, we first employed the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on it, and developed an efficient algorithm to compute the pairwise Minimax distances on a tree. Then, we studied computing an embedding of the objects into a new vector space, such that their pairwise Minimax distances in the original data space equal their squared Euclidean distances in the new space. We then extended our approach to the case of multiple pairwise Minimax matrices. Next, we studied computing Minimax distances from a fixed (test) object, which can be used for instance in K-nearest neighbor search. Moreover, we investigated in detail the edges selected by the Minimax distances and thereby augmented Minimax K-nearest neighbor search with the ability to detect outlier objects. For each setting, we demonstrated the effectiveness of our framework via several experimental studies.