# Learning Distance Measures

**DOI:** https://doi.org/10.1007/978-1-4899-7993-3_614-2

## Definition

Learning a distance measure between a query object **q** ∈ *ℜ* ^{ n } (to be classified) and an arbitrary object **x** ∈ *ℜ* ^{ n } means to learn a weighted *p*-norm distance metric on the Euclidean space of the input measurement variables (or features):

\( D\left(\mathbf{q},\mathbf{x}\right)={\left\Vert W\left(\mathbf{q}\right)\left(\mathbf{q}-\mathbf{x}\right)\right\Vert}_p \)

where *p* > 0 and *W*(**q**) ∈ *ℜ* ^{ n × n } is a matrix of weights reflecting the relevance or importance of features at the query **q**. If *W*(**q**) depends on the query point **q**, the resulting distance measure is *local*; otherwise, if *W* is invariant to the query, a *global* distance measure is obtained.
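A minimal NumPy sketch of this weighted *p*-norm distance (the function name and the diagonal-weight example are illustrative, not from the entry):

```python
import numpy as np

def weighted_p_distance(q, x, W, p=2.0):
    """Weighted p-norm distance D(q, x) = ||W (q - x)||_p, where W is the
    (n x n) weight matrix supplied for the query."""
    diff = W @ (q - x)
    return np.sum(np.abs(diff) ** p) ** (1.0 / p)

q = np.array([0.0, 0.0])
x = np.array([3.0, 4.0])
# With W = identity and p = 2 this is the ordinary Euclidean distance.
print(weighted_p_distance(q, x, np.eye(2)))            # 5.0
# A diagonal W down-weights individual features (here the second one).
print(weighted_p_distance(q, x, np.diag([1.0, 0.5])))  # sqrt(9 + 4) ≈ 3.606
```

A diagonal `W` recovers simple per-feature weighting; a full matrix additionally rotates the space before weighting.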

## Historical Background

The problem of learning distance measures from data has recently attracted considerable interest in the data mining and machine learning communities. Different methodologies have been developed for supervised, unsupervised, and semi-supervised problems. One of the earliest works to discuss the problem of simultaneously clustering both points and features is that of Hartigan (1972) [9]. There, a model based on direct clustering of the data matrix and a distance-based model are introduced, both leading to similar results.

## Foundations

The ability to classify patterns is certainly one of the key features of intelligent behavior, whether human or animal. This ability emerged with biogenetic evolution for survival purposes, not only of individuals but of entire species. An individual receives sensory information that must be processed to perceive, and ultimately act, whether for orientation in the environment, distinction between edible and poisonous food, or detection of dangerous enemies.

Machine perception and classification as features of artificial systems serve similar but more constrained purposes. Machine classification aims to provide artificial systems with the ability to react to situations and signals that come from the environment in order to perform specific tasks. As such, pattern classification is a fundamental building block of any cognitive automaton.

Recent developments in data mining have posed new challenges to pattern classification. Data mining is a knowledge discovery process whose aim is to discover unknown relationships and/or patterns from a large set of data, making it possible to predict future outcomes. As such, pattern classification becomes one of the key steps in attempting to uncover the *hidden knowledge* within the data. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model.

The term *pattern* is a common word meaning something that exhibits some form of regularity and can serve as a model representing a concept of what was observed. As a consequence, a pattern is never an isolated observation, but rather a collection of observations connected in time or space (or both). A pattern exhibits, as a whole, a certain structure indicative of the underlying concept. The pattern classification task can then be seen as the task of inferring concepts from observations. Thus, designing a pattern classifier means defining a mapping from a measurement space into the space of possible meanings, which are viewed as finite and discrete target points.

From this perspective, it makes no difference what kind of observations are considered and to what kind of meanings they may be linked. The same approach can be used to recognize written text, spoken language, objects, or any other multidimensional signals as well. The selection of meaningful observations from a specific domain is a *feature extraction* process. From a theoretical viewpoint, the distinction between feature extraction and classification is arbitrary, but nevertheless useful. In general, the problem of feature extraction is much more domain dependent than the problem of classification.

### Statistical Approach

A characteristic of patterns in the context of classification is that every concept (or class) may have multiple representative points in the measurement space. For example, for the task of character recognition from their images, there exists a potentially unlimited plurality of ways to design character images that correspond to the same character. Therefore, the very core of pattern classification is to cope with variability. The difficulty of the task depends on the degree to which the representatives of a class are allowed to vary and how they are distributed in the measurement space. This observation brings together two intrinsic components of the pattern classification task: the *statistical* component and the principle of *learning from examples*.

The problem of classification can be seen as one of partitioning the feature space into regions, one region for each category. Ideally, one would like to arrange this partitioning so that no decision is ever wrong. This objective may not be achievable, for two reasons. First, the distributions of points of different classes in the measurement space may overlap, in which case it is not possible to reliably separate one class from the other. Second, even if a rule that does a good job of separating the training examples can be found, there is no guarantee that it will perform as well on new points; in other words, the rule may not *generalize* well on data never seen before. It would certainly be safer to consider more points and check how many of them are correctly classified by the rule. This suggests looking for a classification procedure that aims at minimizing the *probability of error*. The problem of classification then becomes a problem in statistical decision theory.

### Challenges

While pattern classification has shown promise in many areas of practical significance, it faces difficult challenges posed by real world problems, of which the most pronounced is Bellman's *curse of dimensionality* [1]: the sample size required to perform accurate prediction in problems with high dimensionality grows beyond feasibility. This is because in high dimensional spaces data become extremely sparse and far apart from each other. As a result, with finite samples, severe bias can be introduced into any estimation process carried out in a high dimensional feature space.

Consider, for example, the rule that classifies a new data point with the label of its closest training point in the measurement space (the *1-Nearest Neighbor rule*). Suppose each instance is described by 20 attributes, but only three of them are relevant to classifying a given instance. In this case, two points that have identical values for the three relevant attributes may nevertheless be distant from one another in the 20-dimensional input space. As a result, a similarity metric that uses all 20 attributes will be misleading, since the distance between neighbors will be dominated by the large number of irrelevant features. This illustrates the curse of dimensionality: in high dimensional spaces, distances between points within the same class and between different classes may become comparable, leading to highly biased estimates. Nearest neighbor approaches are especially sensitive to this problem.
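The 20-attribute scenario above is easy to reproduce numerically (the data below are synthetic, generated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two same-class instances: identical on the 3 relevant attributes,
# random values on the 17 irrelevant ones.
relevant = np.array([1.0, 2.0, 3.0])
a = np.concatenate([relevant, rng.uniform(0, 10, 17)])
b = np.concatenate([relevant, rng.uniform(0, 10, 17)])

dist_all = np.linalg.norm(a - b)          # distance over all 20 attributes
dist_rel = np.linalg.norm(a[:3] - b[:3])  # distance over the relevant ones

print(dist_rel)  # 0.0 -- the points agree on every relevant attribute
print(dist_all)  # large -- dominated by the 17 irrelevant attributes
```

The two points are identical where it matters, yet the full 20-dimensional Euclidean distance declares them far apart.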

In many practical applications things are often further complicated. In the previous example, the three relevant attributes for the classification task at hand may be dependent on the location of the query point, i.e., the point to be classified, in the feature space. Some features may be relevant within a specific region, while other features may be more relevant in a different region.

Consider, for instance, queries *a*, *b*, and *c* placed at different locations relative to a two-dimensional decision boundary. For query *a*, dimension *X* is more relevant, because a slight move along the *X* axis may change the class label, while for query *b*, dimension *Y* is more relevant. For query *c*, however, both dimensions are equally relevant. Capturing such information, therefore, is of great importance to any classification procedure in high dimensional settings.

It is important to emphasize that the curse of dimensionality is not confined to classification. It affects any estimation process in a high dimensional feature space with finite examples. Thus, clustering equally suffers from the same problem. The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. It is not meaningful to look for clusters in high dimensional spaces as the average density of points anywhere in input space is likely to be low. As a consequence, distance functions that equally use all input features may not be effective.

### Adaptive Metric Techniques

This section presents an overview of relevant work in the literature on flexible metric computation for classification and clustering problems.

### Adaptive Metric Nearest Neighbor Classification

In a classification problem, one is given *l* observations **x** ∈ *ℜ* ^{ n }, each coupled with the corresponding class label *y*, with *y* = 1 , … , *J*. It is assumed that there exists an unknown probability distribution *P*(**x**, *y*) from which the data are drawn. To predict the class label of a given query **q**, the class posterior probabilities \( {\left\{P\left(j|\mathbf{q}\right)\right\}}_{j=1}^J \) need to be estimated.

*K* nearest neighbor methods are based on the assumption of smoothness of the target functions, which translates to locally constant class posterior probabilities *P*(*j*|**q**), that is, *P*(*j*|(**q** + *δ* **q**)) ≃ *P*(*j*|**q**) for ‖*δ* **q**‖ small enough. Then \( P\left(j|\mathbf{q}\right)\simeq \ \frac{{\displaystyle {\sum}_{\mathbf{x}\in N\left(\mathbf{q}\right)}P\left(j|\mathbf{x}\right)}}{\left|N\left(\mathbf{q}\right)\right|} \), where *N*(**q**) is a neighborhood of **q** that contains points **x** that are "close" to **q**, and |*N*(**q**)| denotes the number of points in *N*(**q**). This motivates the estimates

\( \hat{P}\left(j|\mathbf{q}\right)=\frac{{\sum}_{\mathbf{x}\in N\left(\mathbf{q}\right)}1\left({y}_{\mathbf{x}}=j\right)}{\left|N\left(\mathbf{q}\right)\right|} \)

where 1(⋅) is an indicator function that returns 1 when its argument is true, and 0 otherwise.
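A minimal NumPy sketch of this neighborhood estimate (function name and toy data are illustrative):

```python
import numpy as np

def knn_posterior(q, X, y, k, n_classes):
    """Estimate P(j | q) as the fraction of the k nearest neighbors of q
    (Euclidean distance) that carry label j."""
    idx = np.argsort(np.linalg.norm(X - q, axis=1))[:k]
    return np.bincount(y[idx], minlength=n_classes) / k

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1])
print(knn_posterior(np.array([0.05]), X, y, k=3, n_classes=2))  # [1. 0.]
```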

The assumption of smoothness, however, becomes invalid for any fixed distance metric when the input observation approaches class boundaries. The objective of locally adaptive metric techniques for nearest neighbor classification is then to produce a modified local neighborhood in which the posterior probabilities are approximately constant.

The techniques proposed in the literature [5, 6, 10] are based on different principles and assumptions for estimating feature relevance locally at query points, and for weighting distances in input space accordingly. The idea common to these techniques is that the weight assigned to a feature, locally at a given query point **q**, reflects its estimated relevance for predicting the class label of **q**: larger weights correspond to larger capabilities in predicting class posterior probabilities. As a result, neighborhoods get constricted along the most relevant dimensions and elongated along the less important ones. The class conditional probabilities tend to be constant in the resulting neighborhoods, whereby better classification performance can be obtained.
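The cited techniques estimate local relevance in different, more principled ways; the toy sketch below (all names and the relevance score are illustrative stand-ins, not the schemes of [5, 6, 10]) scores each feature by how well it separates the class-conditional means in a coarse neighborhood of the query:

```python
import numpy as np

def local_weights(q, X, y, k0=40):
    """Toy local relevance estimate: in a coarse neighborhood of q, score
    each feature by the separation of the class-conditional means along it,
    then normalize the scores into weights that sum to one."""
    idx = np.argsort(np.linalg.norm(X - q, axis=1))[:k0]
    Xn, yn = X[idx], y[idx]
    classes = np.unique(yn)
    if classes.size < 2:                        # locally pure: no reweighting
        return np.full(X.shape[1], 1.0 / X.shape[1])
    means = np.array([Xn[yn == c].mean(axis=0) for c in classes])
    score = means.max(axis=0) - means.min(axis=0)
    return score / score.sum()

# Feature 0 separates the two classes; feature 1 is pure noise.
rng = np.random.default_rng(1)
X = np.vstack([np.c_[rng.normal(0, 1, 50), rng.uniform(0, 10, 50)],
               np.c_[rng.normal(6, 1, 50), rng.uniform(0, 10, 50)]])
y = np.array([0] * 50 + [1] * 50)
w = local_weights(np.array([3.0, 5.0]), X, y)
# w[0] > w[1]: the neighborhood is constricted along the relevant
# feature 0 and elongated along the irrelevant feature 1.
```

A weighted distance built from `w` then shapes the final neighborhood, as described above.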

### Large Margin Nearest Neighbor Classifiers

The previously discussed techniques try to minimize bias in high dimensions by using locally adaptive mechanisms. The "lazy learning" approach used by these methods, while appealing in many ways, requires a considerable amount of on-line computation, which makes it difficult for such techniques to scale up to large data sets. More recently, a method called LaMaNNA has been proposed which, although still founded on a query-based weighting mechanism, computes off-line the information relevant to define local weights [3].

The technique uses support vector machines (SVMs) as guidance for the process of defining a locally flexible metric. SVMs have been successfully used as a classification tool in a variety of areas [13], and the maximum margin boundary they provide has been shown to be optimal in a structural risk minimization sense. The decision function constructed by SVMs is used in LaMaNNA to determine the most discriminant direction in a neighborhood around the query. This direction provides a local feature weighting scheme. The process produces highly stretched neighborhoods along boundary directions when the query is close to the boundary. As a result, the class conditional probabilities tend to be constant in the modified neighborhood, whereby better classification performance can be achieved. The amount of elongation-constriction decays as the query moves farther from the vicinity of the decision boundary. This phenomenon is exemplified in Fig. 1 by queries *a*, *a* ^{′}, and *a* ^{′′}.

### Adaptive Metrics for Clustering and Semi-Supervised Clustering

Adaptive metric techniques for data without labels (unsupervised) have also been developed. Typically, these methods perform clustering and feature weighting simultaneously in an unsupervised manner [2 *,* 4 *,* 7 *,* 8 *,* 12]. Weights are assigned to features either globally or locally.

The problem of feature weighting in K-means [11] clustering has been addressed in [12]. Each data point is represented as a collection of vectors, with "homogeneous" features within each measurement space. The objective is to determine one (global) weight value for each feature space. The optimality criterion pursued is the minimization of the (Fisher) ratio between the average within-cluster distortion and the average between-cluster distortion.
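As an illustration of this optimality criterion (not the optimization procedure of [12]), the within/between distortion ratio can be computed feature by feature for a given clustering; the sketch below uses synthetic data:

```python
import numpy as np

def fisher_ratio_per_feature(X, labels):
    """Per-feature ratio of within-cluster to between-cluster distortion,
    the kind of criterion minimized in [12], computed here feature by
    feature purely for illustration."""
    overall = X.mean(axis=0)
    within = np.zeros(X.shape[1])
    between = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
    return within / between

# Feature 0 separates the two clusters; feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([np.c_[rng.normal(0, 1, 30), rng.uniform(0, 10, 30)],
               np.c_[rng.normal(8, 1, 30), rng.uniform(0, 10, 30)]])
labels = np.array([0] * 30 + [1] * 30)
ratios = fisher_ratio_per_feature(X, labels)
# ratios[0] << ratios[1]: a small ratio marks a cluster-separating
# feature, which would therefore receive a large weight.
```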

COSA (Clustering On Subsets of Attributes) [7] is an iterative algorithm that assigns a weight vector (with a component for each dimension) to each data point. COSA starts by assigning equal weight values to each dimension and to all points. It then considers the *k* nearest neighbors of each point, and uses the resulting neighborhoods to compute the dimension weights. Larger weights are credited to those dimensions that have a smaller dispersion within the neighborhood. These weights are then used to compute dimension weights for each pair of points, which in turn are utilized to update the distances for the computation of the *k* nearest neighbors. The process is iterated until the weight values become stable. At each iteration, the neighborhood of each point becomes increasingly populated with data from the same cluster. The final output is a pairwise distance matrix based on a weighted inverse exponential distance that can be used as input to any distance-based clustering method (e.g., hierarchical clustering).

LAC (Locally Adaptive Clustering) [4] develops an exponential weighting scheme, and assigns a weight vector to each cluster, rather than to each data point. The weights reflect local correlations of data within each discovered cluster, and reshape each cluster as a dense spherical cloud. The directional local reshaping of distances better separates clusters, and allows for the discovery of different patterns in different subspaces of the original input space.
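A sketch of an exponential weighting scheme in the spirit of LAC (the bandwidth parameter `h` and the sum-to-one normalization are illustrative choices, not the exact formulation of [4]):

```python
import numpy as np

def lac_weights(Xc, centroid, h=1.0):
    """Exponential weighting in the spirit of LAC [4]: dimensions along
    which the cluster members stay close to the centroid receive large
    weights; high-dispersion dimensions are discounted."""
    dispersion = ((Xc - centroid) ** 2).mean(axis=0)  # per-dimension spread
    w = np.exp(-dispersion / h)
    return w / w.sum()

# A cluster that is tight along dimension 0 and spread along dimension 1.
Xc = np.array([[1.0, 0.0], [1.1, 4.0], [0.9, 8.0]])
w = lac_weights(Xc, Xc.mean(axis=0))
# w[0] > w[1]: the weighted distance reshapes the cluster toward a
# dense spherical cloud by discounting the high-dispersion dimension.
```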

Recently, to aid the process of clustering data according to the user’s preferences, a semi-supervised framework has been introduced. In this scenario, the user provides examples of similar and dissimilar points, and a distance metric is learned over the input space that satisfies the constraints provided by the user [14].
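A crude diagonal-metric heuristic in the spirit of this setting (not the convex program of [14]) stretches dimensions along which the user's dissimilar pairs differ and shrinks those along which similar pairs differ; all names and data below are illustrative:

```python
import numpy as np

def diagonal_metric_from_pairs(similar, dissimilar, eps=1e-9):
    """Toy heuristic inspired by semi-supervised metric learning [14]:
    per-feature weights equal to the mean squared difference over
    dissimilar pairs divided by that over similar pairs."""
    sim = np.mean([(a - b) ** 2 for a, b in similar], axis=0)
    dis = np.mean([(a - b) ** 2 for a, b in dissimilar], axis=0)
    return dis / (sim + eps)

# The user says these points are similar despite differing in feature 1,
# and dissimilar mainly because of feature 0.
sim_pairs = [(np.array([0.0, 0.0]), np.array([0.1, 5.0]))]
dis_pairs = [(np.array([0.0, 0.0]), np.array([4.0, 0.2]))]
w = diagonal_metric_from_pairs(sim_pairs, dis_pairs)
# w[0] >> w[1]: feature 0 drives the learned distance, while feature 1
# is treated as a nuisance dimension.
```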

## Key Applications

Almost all problems of practical interest are high dimensional. Thus, techniques that learn distance measures have significant impact in fields and applications as diverse as bioinformatics, security and intrusion detection, and document and image retrieval. An excellent example, driven by recent technology trends, is the analysis of microarray data. Here one has to face the problem of dealing with more dimensions (genes) than data points (samples). Biologists want to find "marker genes" that are differentially expressed in a particular set of conditions. Thus, methods that simultaneously cluster genes and samples are required to find distinctive "checkerboard" patterns in matrices of gene expression data. In cancer data, these checkerboards correspond to genes that are up- or downregulated in patients with particular types of tumors.

## Recommended Reading

- 1.Bellman R. Adaptive control processes: a guided tour. Princeton: Princeton University Press; 1961.
- 2.Blansché A, Gançarski P, Korczak J. MACLAW: a modular approach for clustering with local attribute weighting. Pattern Recognit Lett. 2006;27(11):1299–306.
- 3.Domeniconi C, Gunopulos D, Peng J. Large margin nearest neighbor classifiers. IEEE Trans Neural Netw. 2005;16:899–909.
- 4.Domeniconi C, Gunopulos D, Yan S, Ma B, Al-Razgan M, Papadopoulos D. Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov. 2007;14:63–97.
- 5.Domeniconi C, Peng J, Gunopulos D. Locally adaptive metric nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell. 2002;24:1281–5.
- 6.Friedman J. Flexible metric nearest neighbor classification. Technical Report, Department of Statistics, Stanford University; 1994.
- 7.Friedman J, Meulman J. Clustering objects on subsets of attributes. Technical Report, Stanford University; 2002.
- 8.Frigui H, Nasraoui O. Unsupervised learning of prototypes and attribute weights. Pattern Recognit. 2004;37(3):943–52.
- 9.Hartigan JA. Direct clustering of a data matrix. J Am Stat Assoc. 1972;67(337):123–9.
- 10.Hastie T, Tibshirani R. Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell. 1996;18:607–15.
- 11.Jain A, Murty M, Flynn P. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
- 12.Modha D, Spangler S. Feature weighting in k-means clustering. Mach Learn. 2003;52(3):217–37.
- 13.Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis. Cambridge: Cambridge University Press; 2004.
- 14.Xing E, Ng A, Jordan M, Russell S. Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems, vol. 15; 2003.