1 Introduction

Processing and analyzing data from different sources has become a problem of capital importance in many application domains, e.g. images, video, multimedia information or biological data. Many of these give rise to data descriptors that keep growing as the capacity of the corresponding sensor systems improves. In this context, distance-based methods are gaining importance over classical feature-based methods, which use data descriptors or features.

Once an appropriate measure is given or has been learned, the well-known and studied family of distance-based methods can be used to tackle either classification, regression, estimation or clustering problems using only pairwise comparisons between objects.

The systematic search for an appropriate distance measure to be used jointly with either general or specific distance-based classification schemes (mainly Nearest Neighbor rules) has been studied almost as early as the corresponding classification methods [1, 2]. Nevertheless, so-called distance metric learning (DML) has undergone a kind of rebirth recently, as optimization methods and formulations coming from neighboring fields have been applied to this particular problem [3–5].

In the most basic case, DML aims at learning an appropriate Mahalanobis-like metric from data, i.e. at obtaining an appropriate positive semidefinite (PSD) matrix that induces a quadratic norm in the original feature space. This basic approach can be generalized in multiple ways, but many methods start by defining an appropriate (usually convex) criterion function and then use a convenient procedure to optimize it [6, 7]. These criterion functions reflect the intention of keeping similar objects close and dissimilar objects apart at the same time. Metric learning implies optimization procedures whose computational burden is at least quadratic in the amount of training data. Moreover, the constraints that need to be imposed on both the relative object similarities and the metric matrix lead to computationally expensive optimization methods, especially when dealing with large-scale problems [8]. The need to overcome such computationally expensive tasks has raised interest in other, indirect ways of achieving the goals of DML [9, 10]. In this context, it is worth mentioning the family of methods that use an online approach [11]. These methods are especially relevant when data become available incrementally, i.e. when they are given sequentially or are obtained from a stream.

Incremental and online methods [5, 10] usually aim at optimizing a convenient criterion over a single instance (usually a pair of objects) which is made available for learning at every time step of the corresponding algorithm. In this case, problems inherent to other metric learning techniques become more complicated: the usual problems related to positive semidefiniteness constraints that are common to batch methods still apply, and new problems arise. First, different ways of sequentially enforcing additional constraints may lead to algorithms that require different amounts of computation. Also, the performance of the final solution may deviate significantly from the ideal (global) goal depending on the particular instances used in the last few iterations. Finally, parameter tuning gets considerably harder than in the batch case, according to recent preliminary results obtained in this context [12].

2 The problem

2.1 Notation

Let us assume there are objects \(x_i\), \(i=1,2,\ldots \) conveniently represented in the \(d\)-dimensional vector feature space, that is \(x_i\in \mathbb R ^{d}\). Let us also assume that there is a similarity relation among these objects such that given any pair, \((x_i,x_j)\), they can be considered either as similar or dissimilar. As we identify each pair of objects with indices \(i\) and \(j\), we use the label \(y_{ij}\) to indicate whether this pair is similar (\(y_{ij}=+1\)) or dissimilar (\(y_{ij}=-1\)).

We could try to learn a predictor for this relation over the whole representation space, provided that some of these labels were available. Instead of this, a closely related problem is considered. This consists of learning a distance function that is compatible with the labeling.

2.2 Foundations

A real symmetric function \(d:\mathbb R ^{d}\times \mathbb R ^{d}\longrightarrow \mathbb R \) that satisfies nonnegativity and triangle inequality is called a pseudometric. It is a distance if it also satisfies the identity property i.e. it gives zero if and only if the two objects are the same. A wide range of different pseudometrics (including the Euclidean distance) can be represented by using a PSD matrix \(M\) as follows:

$$\begin{aligned} d_{ij}^{M}=d^{M}(x_{i},x_{j})=(x_{i}-x_{j})^\top M(x_{i}-x_{j}). \end{aligned}$$
(1)

Note that the pseudometric function is \(\sqrt{d^M}\), but for convenience and without loss of generality only its squared version will be explicitly considered. It must also be emphasized that, in the context of the present paper and as in many other related works, the term distance function refers to squared pseudometric functions.
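As a small illustration of Eq. (1), the following sketch evaluates the squared pseudometric with NumPy. The function name `squared_distance` and the example vectors are hypothetical and chosen only for this illustration.

```python
import numpy as np

def squared_distance(M, x_i, x_j):
    """Squared pseudometric of Eq. (1): (x_i - x_j)^T M (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ M @ diff)

x_i = np.array([1.0, 2.0])
x_j = np.array([0.0, 0.0])

# With M equal to the identity, Eq. (1) is the squared Euclidean distance.
d_euclid = squared_distance(np.eye(2), x_i, x_j)

# A rank-deficient PSD matrix ignores the second coordinate, so two distinct
# points can be at distance zero (a pseudometric rather than a distance).
M_rank1 = np.array([[1.0, 0.0], [0.0, 0.0]])
d_zero = squared_distance(M_rank1, np.array([1.0, 5.0]), np.array([1.0, -3.0]))
```

The second case shows why the text distinguishes pseudometrics from distances: the identity property fails as soon as \(M\) is singular.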

2.3 Problem formulation

A distance function gives us a way of comparing objects. Hence, the function should yield small values for similar objects; and larger ones for dissimilar pairs. Given a set of pairs of objects that are known to be either similar or dissimilar, we can pose the problem of obtaining the best distance function that is compatible with these given pairs. Ideally, the distance function obtained should keep all similar pairs strictly closer than all dissimilar pairs. This situation is illustrated in Fig. 1a. This will be referred to as the separable case. In a more realistic case in which samples cannot be fully separated in this way, it is still possible to aim for a distance function that makes most similar pairs be roughly closer than most dissimilar pairs. We refer to this as the non-separable case (see Fig. 1b).

Fig. 1 Normalized histogram of distance values of similar (red) and dissimilar (blue) pairs of objects corresponding to bivariate Gaussians whose separation between means is 1. a Separable case, \(\sigma =0.02\), b non-separable case, \(\sigma =0.1\)

3 Metric learning

3.1 Metric learning using all labelled instances at once

3.1.1 Separable case

When all labelled pairs are given at once and solutions separating the data exist, one can think of an ideal solution as the one that maximizes the separation margin according to the Structural Risk Minimization principle [3]. As the absolute values of the distance depend on multiplicative constants in the matrix \(M\), this problem can be reformulated in a way similar to that used for support vector machines [4]. In particular, a fixed value for this margin is set and the goal becomes to minimize the Frobenius norm [13, 14] (or any other regularizer [5]) of the metric matrix.

If we denote by \(b \in \mathbb R \) the threshold value that effectively separates the similar and dissimilar samples, and fix the value of the margin to \(2\), the following constraints apply:

$$\begin{aligned}&d_{ij}^{M}\le b-1,\quad \text{if } y_{ij}=1, \\&d_{ij}^{M}\ge b+1,\quad \text{if } y_{ij}=-1, \end{aligned}$$

or, in a more compact way

$$\begin{aligned} y_{ij}\left(b-d_{ij}^{M}\right)\ge 1. \end{aligned}$$
(2)

These linear constraints, together with the regularization criterion and the PSD constraint, lead to a quadratic optimization problem that can be tackled and solved in a number of ways [15, 16]. Its solution consists of the metric matrix \(M\) and the value \(b\) that effectively separates both kinds of distance values.

3.1.2 Non-separable case

Regardless of other considerations, the constraints in Eq. (2) can be modified to take into account the non-separable case by introducing a (nonnegative) slack variable for each labelled pair. The margin constraints then become

$$\begin{aligned} y_{ij}\left(b-d_{ij}^{M}\right)\ge 1-\xi _{ij}. \end{aligned}$$
(3)

This needs to be coupled with the inclusion of a new term in the optimization criterion to also minimize the values of the slack variables. This is usually referred to as establishing a soft margin to separate both kinds of distance values, and it is equivalent to minimizing the hinge loss associated with all pairs that violate the (hard) margin separation. Given a model \((M,b)\) and a pair \((i,j)\), the corresponding hinge loss is given as

$$\begin{aligned} \ell _{ij}^{(M,b)}= \max \left\{ 0,p_{ij}^{(M,b)}\right\} = \max \left\{ 0,1-y_{ij}\left(b-d_{ij}^{M}\right)\right\} \end{aligned}$$

where \(p_{ij}^{(M,b)}\) is the signed loss predicted for the pair \((i,j)\) when using the model \((M,b)\). Consequently, the constraint in Eq. (3) can be then expressed as \(p_{ij}^{(M,b)}\le \xi _{ij}\) or equivalently as \(\ell _{ij}^{(M,b)}\le \xi _{ij}\).
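The signed loss and the hinge loss defined above can be illustrated with a few numbers. This is a minimal sketch; the function names and the threshold value \(b=3\) are assumptions made only for this example.

```python
def signed_loss(d, b, y):
    """Signed loss p = 1 - y * (b - d) for a pair with squared distance d."""
    return 1.0 - y * (b - d)

def hinge_loss(d, b, y):
    """Hinge loss max(0, p): zero when the pair is on the right side of the margin."""
    return max(0.0, signed_loss(d, b, y))

b = 3.0  # hypothetical threshold; the margin spans [b - 1, b + 1] = [2, 4]
examples = [
    (1.5, +1),  # similar pair below b - 1: no loss
    (2.5, +1),  # similar pair inside the margin: loss 0.5
    (4.5, -1),  # dissimilar pair above b + 1: no loss
    (3.5, -1),  # dissimilar pair inside the margin: loss 0.5
]
losses = [hinge_loss(d, b, y) for d, y in examples]
```

A pair only contributes to the loss when its distance value falls inside the margin or on the wrong side of it.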

3.2 Online formulation using margins

Instead of considering constrained optimization using all available instances, the DML problem can be solved in a more convenient way both from the point of view of computation and robustness by using an online learning approach [10, 17].

At each time step, \(k\), a new optimization problem is formulated and solved using only a particular instance (a labeled pair \((i,j)\)) that is made available to the system. The problem makes use of the model \((M^{k},b^{k})\) learned at the previous step to produce a new model that takes also the new instance into account.

3.2.1 Separable case

In the separable case this is done by minimizing a convenient measure of the distance between the previous and the new model, subject to the restriction that the new instance must fall on the correct side of the (hard) margin [10]. In particular, this can be written as

$$\begin{aligned} \min _{M,b}&\frac{1}{2}\bigg \Vert M-M^{k}\bigg \Vert _\mathsf{Fro }^{2} + \frac{1}{2}(b-b^{k})^{2} \end{aligned}$$
(4)
$$\begin{aligned} s.t.&\ell _k^{(M,b)}=0, \end{aligned}$$
(5)

where \(\ell _k\) refers to the hinge loss on the \(k\)th pair, \((x_i,x_j)\) and \(\Vert \cdot \Vert _\mathsf{Fro }\) is the Frobenius norm. An obvious consequence of this formulation is that the model only needs to be updated if the new instance yields a strictly positive loss when using the previous model.

3.2.2 Non-separable case

In the non-separable case, the new instance is allowed to violate the margin condition but it is penalized by using a parameter \(C\) that leads to the following optimization problem.

$$\begin{aligned}&\min _{M,b,\xi } \frac{1}{2}\bigg \Vert M-M^{k}\bigg \Vert _\mathsf{Fro }^{2} + \frac{1}{2}(b-b^{k})^{2} + C\xi , \end{aligned}$$
(6)
$$\begin{aligned}&s.t.\;\; \ell _k^{(M,b)}\le \xi , \end{aligned}$$
(7)
$$\begin{aligned}&\xi \ge 0. \end{aligned}$$
(8)

In this formulation, the (nonnegative) slack variable, \(\xi \), allows a particular instance to violate the constraint. An alternative way of posing the problem is to remove the non-negativity constraint on the slack variable [Eq. (8)] and consider a squared penalty term in Eq. (6).

All the above formulations, including both the separable and non-separable cases, are known as passive–aggressive (PA) learning formulations and were introduced in [11]. Following the taxonomy introduced by the authors, we will refer to these as PA (separable case), PAI (non-separable, linear penalty) or PAII (non-separable, quadratic penalty). These formulations have been previously applied to DML or closely related problems [10, 17, 18].

It has been shown [10] that in all previous cases these problems have a closed-form solution that can be written as the following update rule

$$\begin{aligned} M^{k+1}&= M^{k} - \tau y_{ij}(x_{i}-x_{j})(x_{i}-x_{j})^{\top }, \end{aligned}$$
(9)
$$\begin{aligned} b^{k+1}&= b^{k} + \tau y_{ij} , \end{aligned}$$
(10)

where the value of \(\tau \) is called the step length and depends on the particular formulation. The corresponding values of \(\tau \) for the three formulations considered are as follows.

$$\begin{aligned}&\mathrm{PA}\!:\tau _0 = \frac{\ell _k}{1+\Vert (x_{i}-x_{j})(x_{i}-x_{j})^{\top }\Vert _\mathsf{Fro }^{2}},\end{aligned}$$
(11)
$$\begin{aligned}&\mathrm{PAI}\!:\tau _{1} = \min \left\{ C, \tau _0 \right\} \!,\end{aligned}$$
(12)
$$\begin{aligned}&\mathrm{PAII}\!:\tau _{2} = \frac{\ell _k}{1+\frac{1}{2C}+\Vert (x_{i}-x_{j})(x_{i}-x_{j})^{\top }\Vert _\mathsf{Fro }^{2}}. \end{aligned}$$
(13)

These step lengths are dependent on the hinge loss incurred by the new \(k\)-pair, according to the currently available \(k\)th predictive model \(\ell _k=\ell _k^{(M^k,b^k)}\). If the pair similarity is consistent with the model at the previous iteration, we have \(\ell _k=0\) and then \(\tau _n=0\). This implies no training at this iteration (passiveness). On the other hand, when the prediction violates the (hard or soft) margin, the model gets updated either strictly (\(\tau _0\)) or controlled by the aggressiveness parameter \(C\). Hence, \(C\) controls how strongly the algorithm adapts the model to each example pair.
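The update rule of Eqs. (9) and (10) with the step lengths of Eqs. (11)–(13) can be sketched as follows. This is only an illustration under the formulas stated above, without the PSD and threshold corrections; the function names `pa_step_length` and `pa_update` are hypothetical.

```python
import numpy as np

def pa_step_length(loss, sq_norm, variant="PA", C=1.0):
    """Step lengths of Eqs. (11)-(13); sq_norm is ||X_ij||_Fro^2."""
    if variant == "PA":
        return loss / (1.0 + sq_norm)
    if variant == "PAI":
        return min(C, loss / (1.0 + sq_norm))
    if variant == "PAII":
        return loss / (1.0 + 1.0 / (2.0 * C) + sq_norm)
    raise ValueError(f"unknown variant: {variant}")

def pa_update(M, b, x_i, x_j, y, variant="PA", C=1.0):
    """One passive-aggressive step on the labelled pair (x_i, x_j, y)."""
    diff = x_i - x_j
    X = np.outer(diff, diff)                 # rank-one matrix (x_i-x_j)(x_i-x_j)^T
    d = float(diff @ M @ diff)               # squared distance under the current M
    loss = max(0.0, 1.0 - y * (b - d))       # hinge loss with the current model
    sq_norm = float(np.sum(X * X))           # squared Frobenius norm of X
    tau = pa_step_length(loss, sq_norm, variant, C)
    return M - tau * y * X, b + tau * y, tau  # Eqs. (9) and (10)

# Start from the empty model and feed one similar pair.
x_i, x_j = np.array([1.0, 0.0]), np.array([0.0, 0.0])
M1, b1, tau = pa_update(np.zeros((2, 2)), 0.0, x_i, x_j, +1, "PA")
```

In this example the updated matrix has a negative eigenvalue, which is the situation that the additional constraints discussed next are meant to handle.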

In the particular case of metric learning, two additional constraints are needed in the definition of the problem [10]. First, the matrix \(M\) must be PSD, \(M\succeq 0\). Second, and as a consequence, the value of the threshold \(b\) must be at least 1, \(b\ge 1\).

In all previous works, these constraints are not considered in the formulation of the optimization problem. On the contrary, they are simultaneously enforced in a separate step, once the optimization problem has been solved and both matrix \(M\) and the threshold \(b\) have been computed. The final PSD matrix is the closest PSD approximation; and the final threshold is defined as \(\max (1,b)\). Note that this computation only needs to be performed when \(y_{ij}=1\) (the new training pair is a similar one). Even in this case, the computational burden can be partially alleviated by considering that Eq. (9) is a rank-one update on a PSD matrix and using the method proposed in [10].
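The separate correction step can be sketched with a full eigendecomposition, which gives the closest PSD matrix in Frobenius norm by setting negative eigenvalues to zero. Note that this is a plain illustration and not the rank-one update method of [10]; the function names are hypothetical.

```python
import numpy as np

def project_psd(M):
    """Closest PSD matrix in Frobenius norm: clip negative eigenvalues to zero."""
    M_sym = (M + M.T) / 2.0                      # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M_sym)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T       # V diag(lambda_+) V^T

def enforce_constraints(M, b):
    """Apply both corrections: M PSD and threshold b >= 1."""
    return project_psd(M), max(1.0, b)

M_fixed, b_fixed = enforce_constraints(np.array([[2.0, 0.0], [0.0, -1.0]]), 0.5)
```

A full eigendecomposition costs \(O(d^3)\) per call, which is why exploiting the rank-one structure of Eq. (9) is attractive in practice.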

3.3 Online metric learning using least squares

A slightly different alternative formulation using least squares [19] is also possible for the previous metric learning problem [12]. Instead of forcing a soft margin by penalizing the deviation from the ideal conditions, it is possible to force similar and dissimilar distance values to fall close to the “representative” values \(b-1\) and \(b+1\), respectively. To this end, one can sequentially minimize the corresponding squared error. This corresponds to reformulating the previous (PAII version) optimization problem as:

$$\begin{aligned}&\min _{M,b,\xi _k} \frac{1}{2}\Vert M-M^{k}\Vert _\mathsf{Fro }^{2} + \frac{1}{2}(b-b^{k})^{2} + C\xi _k^2 , \end{aligned}$$
(14)
$$\begin{aligned}&s.t. \quad p_{ij}^{(M,b)}= \xi _k. \end{aligned}$$
(15)

The main change in this formulation is that the inequality constraint in Eq. (7) has been changed to an equality and the signed loss function, \(p_{ij}^{(M,b)}= 1-y_{ij}\left(b-d_{ij}^{M}\right) \), now measures how far the distance value is from its corresponding ideal value (\(b-1\) or \(b+1\)).

The corresponding online optimization problem can now be tackled in a similar way as the previous ones. This also leads to a closed-form solution (see Appendix 5), that consists of the same update rule expressed in Eqs. (9) and (10), but with a different step length given by

$$\begin{aligned} \tau _{3} = \frac{p_k}{1+\frac{1}{2C}+\Vert (x_{i}-x_{j})(x_{i}-x_{j})^{\top }\Vert _\mathsf{Fro }^{2}}, \end{aligned}$$
(16)

where \(p_k=p_{ij}^{(M^k,b^k)}= 1-y_{ij}\left(b^{k}-d_{ij}^{M^{k}}\right)\) is the signed loss predicted using the \(k\)th learned model. Note that \(\tau _{3}\) can now take negative values and it holds that \(\tau _{2}=\max \{0,\tau _{3}\}\). Consequently, the corresponding algorithm can be considered as more aggressive because it implies a larger number of updates. As the approach shares the structure and part of the aims of the PA approaches, it will be referred to as PALS (passive–aggressive least squares) in this work. Only when it holds that \(p_k=0\), a passive step is performed. This only occurs when the distance value takes exactly its desired value. This is in contrast to pure passive-aggressive approaches, which perform a passive step when \(\ell _k=\max (0,p_k)=0\), i.e. \(p_k\le 0\).
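The step length in Eq. (16) can be checked with a short Lagrangian argument. What follows is only a sketch, assuming that the multiplier of the equality constraint in Eq. (15) is identified with \(\tau \); the shorthand \(X_{ij}=(x_{i}-x_{j})(x_{i}-x_{j})^{\top }\) and \(\langle M,X_{ij}\rangle =d_{ij}^{M}\) is used for brevity, and the full derivation is the one referred to in the appendix.

$$\begin{aligned} \mathcal{L}(M,b,\xi _k,\tau )&= \frac{1}{2}\Vert M-M^{k}\Vert _\mathsf{Fro }^{2} + \frac{1}{2}(b-b^{k})^{2} + C\xi _k^{2} + \tau \left(1-y_{ij}\left(b-\langle M,X_{ij}\rangle \right)-\xi _k\right), \\ \partial _{M}\mathcal{L}=0&\;\Rightarrow \; M = M^{k}-\tau y_{ij}X_{ij}, \\ \partial _{b}\mathcal{L}=0&\;\Rightarrow \; b = b^{k}+\tau y_{ij}, \\ \partial _{\xi _k}\mathcal{L}=0&\;\Rightarrow \; \xi _k = \frac{\tau }{2C}. \end{aligned}$$

Substituting these expressions into Eq. (15), and using \(y_{ij}^{2}=1\) and \(\langle X_{ij},X_{ij}\rangle =\Vert X_{ij}\Vert _\mathsf{Fro }^{2}\), gives

$$\begin{aligned} p_k-\tau \left(1+\Vert X_{ij}\Vert _\mathsf{Fro }^{2}\right) = \frac{\tau }{2C} \;\Rightarrow \; \tau = \frac{p_k}{1+\frac{1}{2C}+\Vert X_{ij}\Vert _\mathsf{Fro }^{2}}, \end{aligned}$$

which agrees with Eq. (16) and with the update rule in Eqs. (9) and (10).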

One can see that all previous learning algorithms and in particular the corresponding step lengths, \(\tau _n\), are closely related to each other. In particular, we can write these step lengths as a function of \(C\) and \(p_k\) to put forward their interdependences.

$$\begin{aligned}&\text{ PALS}\!: \tau _3 ( C , p_k) = \frac{ p_k}{1+\frac{1}{2C}+\Vert X_{ij}\Vert _\mathsf{Fro }^{2}} \in \ ]-\infty , +\infty [ , \\&\text{ PAII}\!: \tau _2 (C, p_k) = \max (0, \tau _3 ( C , p_k) ) \in \ [0, +\infty [ , \\&\text{ PA}\!: \tau _0 ( p_k) = \lim _{C\rightarrow \infty } \tau _2 ( C , p_k) \in \ [0, +\infty [ , \\&\text{ PAI}\!: \tau _1 ( C , p_k) = \min ( C, \tau _0 ( p_k) ) \in \ [0, +\infty [ , \end{aligned}$$

where \(X_{ij}=(x_{i}-x_{j})(x_{i}-x_{j})^{\top }\).
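The interdependences listed above can be checked numerically. The sketch below writes the four step lengths as functions of \(C\), \(p_k\) and \(\Vert X_{ij}\Vert _\mathsf{Fro }^{2}\) (passed as `sq_norm`); the function names and the numeric values are hypothetical.

```python
def tau_pals(C, p, sq_norm):
    """PALS step length: may be negative when p < 0."""
    return p / (1.0 + 1.0 / (2.0 * C) + sq_norm)

def tau_paii(C, p, sq_norm):
    """PAII step length: the nonnegative part of the PALS step."""
    return max(0.0, tau_pals(C, p, sq_norm))

def tau_pa(p, sq_norm):
    """PA step length: the limit of the PAII step as C grows without bound."""
    return max(0.0, p) / (1.0 + sq_norm)

def tau_pai(C, p, sq_norm):
    """PAI step length: the PA step saturated at C."""
    return min(C, tau_pa(p, sq_norm))

sq_norm, C = 1.0, 0.5
positive = [f(C, 1.5, sq_norm) for f in (tau_pals, tau_paii, tau_pai)] + [tau_pa(1.5, sq_norm)]
negative = [f(C, -1.5, sq_norm) for f in (tau_pals, tau_paii, tau_pai)] + [tau_pa(-1.5, sq_norm)]
```

For a negative signed loss only the PALS step is nonzero, which is the sense in which PALS performs more updates than the PA variants.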

A graphical illustration clarifies how the different step lengths relate to each other. In Fig. 2a, the values of \(\tau _n\) are plotted as a function of \(p_k\) for a given value of \(C\). It can be clearly seen that \(\tau _1\) is a saturated version of \(\tau _0\) while \(\tau _2\) corresponds to a milder version of the (unsaturated) \(\tau _1\). On the other hand, \(\tau _3\) equals \(\tau _2\) in the positive case but keeps leading to corrections on the model also in the negative case. In Fig. 2b, the different values of \(\tau _n\) are plotted for two different values of \(p_k\) as a function of \(C\). Note that these two values of \(p_k\) lead to two different asymptotic values of the form \(v=\frac{p_k}{1+\Vert (x_{i}-x_{j})(x_{i}-x_{j})^{\top }\Vert _\mathsf{Fro }^{2}}\). It can be observed that \(\tau _3\) is an approximation of \(\tau _0\) for large values of \(C\). In both illustrations it can be observed that \(\tau _1\) and \(\tau _2\) may lead to either stronger or milder corrections depending on the prediction values and the parameter \(C\).

Fig. 2 Different step lengths corresponding to separable PA, PAI, PAII and PALS. a As a function of the signed loss for the current pair, \(p_k\), for a given value of \(C\). The value \(C^{\prime }\) shown corresponds to \(C\cdot (1\!+\!\Vert (x_{i}-x_{j})(x_{i}-x_{j})^{\top }\Vert _\mathsf{Fro }^{2})\). b As a function of \(C\) for two different values of \(p_k\) that lead to two different asymptotic values (\(v_1\) and \(v_2\), see text). Both the positive and negative cases (\(p_k>0\) and \(p_k\le 0\)) are shown for each value

3.4 Constraint satisfaction and tuning

The particular performance of the above algorithms may depend on two very important practical considerations related to PSD constraint satisfaction and tuning. It has been commented in previous sections that PSD is enforced after the optimization step is performed. In particular, the only negative eigenvalue is neglected as soon as it appears at each iteration. This leads to a reasonable approximation to the original problem [10]. Nevertheless, it is also possible to postpone the PSD corrections for a fixed number of steps or even until the end of the iterative process [18]. This strategy has an obvious computational benefit since even with specialized incremental algorithms, recomputing the eigendecomposition of the metric matrix is a computationally expensive operation. Apart from this, allowing indefinite matrices in the iterative optimization process leads in practice to a kind of concentration of information in the negative eigenvectors that are implicitly neglected. We have experimented here with two possibilities for each online metric learning algorithm. We call these the positive (marked with symbol +) and negative (denoted by symbol -) versions. In the former case, PSD corrections are applied at each iteration. In the latter, indefinite matrices are left as such until the end of the whole process. Indeed, much sparser metric matrices are produced in the negative versions, with a prediction power slightly below their positive counterparts. Other options (whose behavior is not reported in this paper) lead to performance results and computational burdens that lie somewhere in between the two previous ones.

Another very important question involves the appropriate tuning of the aggressiveness parameter, \(C\). The optimal value of \(C\) may depend on many different aspects, such as how separable the problem is. In contrast to the same concept in binary classification, this separability in distance values is difficult to analyze and manage. All these considerations get worse as the optimal value of \(C\) may depend on the particular pair because in the online learning case we have a different optimization problem at each iteration. In the present work, the optimal value of parameter \(C\) has been fixed for all iterations but in a different way for each algorithm. To this end, a quick round on a validation set has been performed, and the parameter that leads to the best results has been selected. Two alternative criteria have been considered, namely a) the value that minimizes the accumulated loss and b) the one that maximizes classification performance of the learned metric matrix when used in the \(k\)-nearest neighbor rule using the best \(k\) in the range from \(1\) to \(25\).

It is also worth mentioning the dependence of the result on the initial model given to the online algorithms. In general, one can think of starting from a random model, from a previous one (e.g. obtained using batch learning on a reduced sample) or from the empty (zero) model. Using a previously learned model may accelerate finding a good solution but may seriously bias the behavior of the online algorithm. A random model may force the algorithm to start far away from optimal regions of the solution space and may also condition the sparseness of the model. Despite the obvious disadvantages of using the empty model, this scheme offers the advantage of starting from a neutral, sparse and unbiased model. Moreover, in this case, convenient bounds on the accumulated loss of the model can be proved [10].

Finally, the algorithmic scheme which is common to the three online algorithms considered both in their positive or negative versions is shown in Algorithm 1.
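The common scheme can be sketched as a single loop in which only the step length and the moment at which the corrections are applied change. This is not a reproduction of Algorithm 1: the function names are hypothetical, the PSD correction uses a full eigendecomposition, and applying \(b=\max (1,b)\) together with the PSD correction in the negative version is an assumption of this sketch.

```python
import numpy as np

def project_psd(M):
    """Clip negative eigenvalues to zero (closest PSD matrix in Frobenius norm)."""
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def pa_step(p, sq_norm):
    """Step length of the separable PA formulation, Eq. (11)."""
    return max(0.0, p) / (1.0 + sq_norm)

def online_metric_learning(pairs, dim, step_length, positive=True):
    """Online loop; positive=True corrects at every update, False only at the end."""
    M, b = np.zeros((dim, dim)), 0.0             # empty initial model
    for x_i, x_j, y in pairs:
        diff = x_i - x_j
        X = np.outer(diff, diff)
        p = 1.0 - y * (b - float(diff @ M @ diff))   # signed loss with current model
        tau = step_length(p, float(np.sum(X * X)))
        if tau == 0.0:
            continue                             # passive step: model unchanged
        M, b = M - tau * y * X, b + tau * y      # Eqs. (9) and (10)
        if positive:
            M, b = project_psd(M), max(1.0, b)
    if not positive:
        M, b = project_psd(M), max(1.0, b)       # single correction at the end
    return M, b

a, a2 = np.array([0.0, 0.0]), np.array([0.1, 0.0])
c, c2 = np.array([1.0, 1.0]), np.array([1.1, 1.0])
pairs = [(a, a2, +1), (a, c, -1), (c, c2, +1), (a2, c2, -1)]
M_pos, b_pos = online_metric_learning(pairs, 2, pa_step, positive=True)
M_neg, b_neg = online_metric_learning(pairs, 2, pa_step, positive=False)
```

Passing a different `step_length` callable (for instance one implementing Eq. (16)) switches the algorithm while leaving the loop untouched.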

4 Experiments and results

In order to compare and assess the relative benefits and disadvantages of the online algorithms considered in this paper, an exhaustive experimentation has been carried out. To this end, a number of experiments focused on showing their behavior as online processes have been designed. Also, the learned metrics have been used in combination with \(k\)-nearest neighbor rules to evaluate the goodness of the feature space in the classification task. In addition, the computational load of each method has been recorded in order to show and compare their execution in a fair setup.

4.1 Datasets

To conduct the experimental evaluation, a total of 15 databases have been used: 12 different databases from [20]; 2 databases (spam, balance) from [21]; and one more realistic repository previously used in CBIR tasks [22]. To clarify some abbreviations, the soybean databases are denoted soyS and soyL, which refer to soybean Small and soybean Large, respectively. Also, the morphological features in the digits dataset are called mor. Art100 refers to a commercial collection called “Art Explosion”, distributed by the company Nova Development. Images in this collection were manually classified according to subjective image similarity, based on judgments issued by real users. To perform more meaningful classification experiments, only classes with more than 100 elements have been considered in this collection. In all databases, objects are considered similar only if they share the same class label. A summary of the particular characteristics of each dataset used in the comparative study is shown in Table 1. Repositories have been approximately sorted in order of relative complexity, by considering both the number of samples and the dimensionality. For consistency, this order has been respected in all figures and tables that refer to the same experiment on different datasets.

Table 1 Characteristics of databases: size (\(n\)), dimension (\(d\)), number of classes (\(c\)) and number of training pairs used (\(r\))

In the experiments, all databases have been split randomly into two equally sized disjoint subsets. These are used for training and testing, respectively.

4.2 Experimental settings

The experimental setup in this work has been fixed as suggested in [5] and other previous preliminary studies [12]. Also, the Information theoretic metric learning algorithm (ITML) [5] has been considered in this work as a baseline for comparison purposes. The ITML algorithm has been used as suggested in [5], using the software made available by the authors, which has its own parameter tuning mechanism. This algorithm has been shown to compare competitively with many other recent metric learning algorithms and can be considered a good representative of the state of the art.

To ensure that all the methods considered (including ITML) use the same amount of information, a set \(P\) composed of \(r=40c(c-1)\) non-repeated, random pairs has been selected for training (with \(c\) the number of classes in the dataset). To cope with the diversity in the datasets and guarantee a fair comparison, a simple method to stop the online algorithms has been used. This method simultaneously accounts for the number of classes and the total number of samples of each specific dataset. In particular, the set \(P\) is shuffled and all pairs are sequentially provided to the algorithm. The set is given to the algorithm at least twice and then this process is repeated until a maximum number of iterations has been reached. This maximum number of iterations has been established as the minimum between 20 % of all possible pairs of training samples (as in previous studies [17]) and 50 times the number of pairs in \(P\) (which is far more than enough for larger databases with a small number of classes). This means that at least \(t\) steps are executed, with \(t=\min (\lfloor \frac{n(n-2)}{40}\rfloor ,50r)\). This value has proved to be a good trade-off between computational cost and performance.
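The two quantities above are simple to compute. The following sketch evaluates \(r\) and \(t\) for a hypothetical dataset with \(n=150\) samples and \(c=3\) classes; these values and the function names are chosen only for illustration and do not correspond to any entry in Table 1.

```python
def num_training_pairs(c):
    """Number of training pairs r = 40 c (c - 1) for c classes."""
    return 40 * c * (c - 1)

def iteration_budget(n, c):
    """t = min(floor(n (n - 2) / 40), 50 r) for a dataset with n samples."""
    return min(n * (n - 2) // 40, 50 * num_training_pairs(c))

r = num_training_pairs(3)          # 40 * 3 * 2
t = iteration_budget(150, 3)       # min(150 * 148 // 40, 50 * 240)
```

Here the first term of the minimum is the active one, since 20 % of the possible training pairs is much smaller than \(50r\).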

In order to automatically tune an appropriate value for \(C\), each fixed set of pairs is used to train a model for each version of each online method. This model is trained by fixing \(t=40c(c-1)\) which corresponds to feeding each pair only once. Besides, the final model is validated over the whole training set.

An appropriate range of exponentially spaced values in the range \([10^{-4},10^2]\) has been considered as the parameter space. Both criteria described in Sect. 3.4 have been attempted, and the two criteria led to very similar results. Hence, only results obtained by maximizing classification performance are reported in the present work.
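A grid of this kind can be generated with `numpy.logspace`. The sketch below is illustrative only: the number of grid points (one per decade) is an assumption, since the text does not state it, and `validation_score` is a hypothetical callable standing for whichever of the two criteria of Sect. 3.4 is used.

```python
import numpy as np

def candidate_C_values(num=7):
    """Exponentially spaced candidates in [1e-4, 1e2]; num is an assumed grid size."""
    return np.logspace(-4, 2, num=num)

def select_C(candidates, validation_score):
    """Return the candidate with the highest validation score."""
    scores = [validation_score(C) for C in candidates]
    return float(candidates[int(np.argmax(scores))])

grid = candidate_C_values()
# Toy score peaking at C = 1, used only to exercise the selection logic.
best_C = select_C(grid, lambda C: -abs(np.log10(C)))
```

When the criterion is the accumulated loss, the score can simply be taken as its negative so that the same selection function applies.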

All online models have been initialized with an empty model, that is \(b=0\) and a zero metric matrix as suggested in [11]. All results presented are the average of ten independent runs with different random initializations to obtain different training and test sets, but taking care of using exactly the same data for each one of the algorithms considered.

4.3 Performance evaluation

4.3.1 Online predictive comparison

Apart from using the algorithms to obtain a metric matrix for distance-based classification, their behavior as online learning methods has been assessed in the experiments. To this end, several loss and performance measures have been considered throughout the learning process. First, Fig. 3 shows the averaged predictive 0-1 loss defined as

$$\begin{aligned} \ell =\frac{1}{t} \sum _{k=1}^t \mathrm{sgn}(\ell _k) = \frac{1}{2t}\sum _{k=1}^{t}\left|y_{k}-\hat{y}_{k}\right|, \end{aligned}$$

where \(t\) is the learning sequence length, \(y_k\) is the true label of the pair supplied at the \(k\)th step, and \(\hat{y}_{k}\) is the label predicted by using the \((k-1)\)th model.
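For labels in \(\{-1,+1\}\), the second form of this measure is a plain average of mismatches. A minimal sketch follows, with a hypothetical function name and made-up label sequences.

```python
import numpy as np

def average_zero_one_loss(y_true, y_pred):
    """(1 / 2t) * sum |y_k - yhat_k| for labels in {-1, +1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)) / 2.0)

# One wrong prediction out of four steps gives a loss of 0.25.
loss = average_zero_one_loss([+1, -1, +1, -1], [+1, +1, +1, -1])
```

Each mismatch contributes \(|y_k-\hat{y}_k|/2=1\) and each correct prediction contributes 0, so the value is the fraction of wrongly predicted pairs.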

Fig. 3 Predictive error of the online algorithms

This measure illustrates the behavior of the different online algorithms over time when discriminating between similar and dissimilar objects. Only 6 out of the 15 databases considered are shown, but they are representative of all the different behaviors observed in the whole set of experiments. In particular, the behavior on the databases nist16 and spam is very similar to the one shown for ionosphere. This similarity is also observed for ecoli, malaysia, mor and satellite with regard to Art100, which is shown. Databases chromo and breast show the same behavior as that of wine and glass, respectively. Finally, soybean small (soyS) and soybean large (soyL) also exhibit a very similar behavior.

In summary, all online learning algorithms have led to reasonably good behavior in the experiments, according to the loss measure specified above. In 12 out of 15 databases, the negative versions of the methods have led to better results than their positive counterparts, and in five of them the worst negative version is better than all positive ones (glass, ionosphere, breast, spam and nist16). On the other hand, LS methods have exhibited a significantly worse behavior in the wine, chromo, balance, ecoli, mor, satellite, malaysia, soybean and Art100 databases. This difference in behavior is even more evident in the positive versions of the algorithms.

4.3.2 Dimensionality behavior evolution

To better understand how the online algorithms behave, the averaged effective dimension of the metric matrix as the optimization goes on is shown in Fig. 4 for the particular case of the ionosphere database. It can be observed that all online algorithms (which start from a null matrix) very quickly incorporate new dimensions and rapidly converge to a particular dimensionality.

Fig. 4 Averaged effective dimension obtained using online algorithms at each iteration on the ionosphere database

The difference between positive and negative versions lies in the fact that negative versions converge to a sparser matrix because (hopefully unimportant) information concentrates in the negative eigenvectors of the corresponding metric matrix. Figure 5a and b display how the final metric matrices behave when used in a 2-NN classifier using train and test data, respectively. These two figures illustrate that the behavior on training data is representative of what happens with test data. In both cases, the matrices obtained with the positive and negative versions still contain uninformative dimensions. It can be seen that roughly the same results could be obtained if the matrices were cut to the best ten eigenvectors. The differences among methods are not significant and the absolute minimum seen on the training data does not have an exact correspondence in the test data. Even so, the behavior with training data can be considered a good indicator of the intrinsic dimensionality for this problem.
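The idea of counting dimensions and cutting a matrix to its best eigenvectors can be sketched as follows. Two assumptions are made here and are not taken from the text: the effective dimension is computed as the number of eigenvalues above a small tolerance, and "best" eigenvectors are taken to be those with the largest eigenvalues. Function names are hypothetical.

```python
import numpy as np

def effective_dimension(M, tol=1e-10):
    """Assumed definition: number of eigenvalues of M larger than tol."""
    eigvals = np.linalg.eigvalsh((M + M.T) / 2.0)
    return int(np.sum(eigvals > tol))

def truncate_to_top_k(M, k):
    """Keep only the k largest (nonnegative) eigenvalues and their eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)
    top_vecs = eigvecs[:, order]
    return (top_vecs * top_vals) @ top_vecs.T

M = np.diag([3.0, 2.0, 0.0, -1.0])               # indefinite toy matrix
dim = effective_dimension(M)
M_top1 = truncate_to_top_k(M, 1)
```

Evaluating a classifier with `truncate_to_top_k(M, k)` for increasing `k` would produce curves of the kind described for Fig. 5.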

Fig. 5 Averaged 2-NN classification errors obtained as more eigenvectors from the obtained metric matrices are considered in order of importance for the experiments in Fig. 4

4.3.3 Classification task

To measure the quality of the final outcome of the different algorithms, the corresponding matrices, \(M\), have been used to construct a \(k\)-NN classifier. The classification error using up to the first 25 neighbors has been computed and the best results for each database and method are shown in Table 2.
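A \(k\)-NN classifier built on a learned matrix \(M\) only needs the pairwise values of Eq. (1). The sketch below is a brute-force illustration with NumPy; the function name, the toy data and the majority-vote tie handling (lowest label wins) are assumptions of this example.

```python
import numpy as np

def knn_predict(M, X_train, y_train, X_test, k=1):
    """k-NN prediction using the squared distance of Eq. (1) induced by M."""
    diffs = X_test[:, None, :] - X_train[None, :, :]        # shape (test, train, d)
    dists = np.einsum("tni,ij,tnj->tn", diffs, M, diffs)    # pairwise d^M values
    nearest = np.argsort(dists, axis=1)[:, :k]              # k closest training items
    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])             # majority vote
    return np.array(preds)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.1], [5.1, 5.2]])
preds = knn_predict(np.eye(2), X_train, y_train, X_test, k=3)
```

Repeating the prediction for \(k=1,\ldots ,25\) and keeping the lowest error mirrors the selection of the best number of neighbors described above.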

Table 2 Average classification errors and best number of neighbors (in brackets)

The classification errors obtained with all algorithms, including ITML, indicate good classification performance in the context of the experimentation carried out in this work. A multiple comparison Friedman test [23] has not revealed any significant differences between the classification results obtained with ITML and those of any of the six online alternatives proposed. Nevertheless, it can be concluded that all algorithms lead to very competitive results. It is also worth noting that the combination of the LS approach with indefinite matrices has led to the best results in three out of fifteen databases. These are precisely the three largest ones, in terms of both size and dimensionality. In addition, using the Euclidean distance was the best option for one of the databases, ecoli, which provides an example where DML does not help improve classification results.

4.4 Computational burden measures

All experiments on all databases have been run on the same computer, an AMD Athlon(tm) 64 X2 Dual Core Processor 4200+. All runs have been restricted to a single CPU to obtain more accurate and machine-independent measurements.

Averaged CPU times in seconds for all databases, along with their corresponding standard deviations, are shown in Table 3. The same running times have also been plotted in Fig. 6, using a logarithmic scale on the time axis to graphically illustrate the relative merits of each algorithm on each database. For completeness, and to better evaluate how the CPU times of the different algorithms compare to each other across all databases, a multiple comparison Friedman test has also been performed in this case. The resulting average ranks (from fastest to slowest) are shown in Table 4. The most important pairwise comparisons among the methods and their adjusted \(p\) values according to the Holm post hoc test are given in Table 5.

Table 3 Averaged CPU running times in seconds along with standard deviations (in brackets)
Fig. 6 Averaged CPU running times for each algorithm in all databases

Table 4 Average ranks of the methods according to the Friedman test (\(\alpha =0.05\))
Table 5 Results of the Holm post hoc test (\(\alpha \,=\,0.05\))
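
The two ingredients behind Tables 4 and 5 can be sketched in plain Python: the per-database ranking that the Friedman test averages, and the Holm step-down adjustment of \(p\) values. The function names and any inputs passed to them are illustrative assumptions. The sketch does not compute the Friedman statistic itself, nor does it reproduce the values reported in the tables.

```python
def average_ranks(times):
    """times[d][a] is the running time of algorithm a on database d.

    Returns the mean rank of each algorithm (1 = fastest); tied
    algorithms share the average of the ranks they span.
    """
    n_alg = len(times[0])
    totals = [0.0] * n_alg
    for row in times:
        order = sorted(range(n_alg), key=lambda a: row[a])
        i = 0
        while i < n_alg:
            j = i
            # Extend j over the group of algorithms tied with position i.
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2.0 + 1.0
            for a in order[i:j + 1]:
                totals[a] += shared
            i = j + 1
    return [t / len(times) for t in totals]

def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, etc.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted
```

A comparison would be declared significant at \(\alpha = 0.05\) when its adjusted \(p\) value falls below 0.05.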

From the running times shown in Table 3, one can conclude that all online algorithms are computationally competitive with regard to ITML. The lowest running times are always obtained with the negative versions of the online algorithms. Significant differences have been found between the positive and negative versions of the same method. This holds even for PALS, which needs about twice as many updates as the other online algorithms. Significant differences also exist between all negative versions and ITML, and these differences increase with the size of the database. It is worth noting that, in all cases, the CPU time of the negative online versions is below 40% of that taken by the ITML algorithm. Finally, the negative version of the PALS algorithm is considerably slower than the other negative online algorithms on the two largest databases. In particular, it is about three times slower for nist16 and twice as slow for Art100. In contrast, CPU times for all negative online algorithms were very similar on all other databases.

5 Concluding remarks and further work

Several online learning algorithms adapted to solve the problem of learning a Mahalanobis-like distance matrix have been considered in this work. These algorithms are derived from the well-known passive-aggressive scheme, in which a term that measures closeness to the currently learned model is combined with a term that enforces the particular constraints. All of them have been formulated under a common framework that has been extended to replace the soft margin criterion with a least squares condition.

In addition, both positive and negative versions of the online algorithms have been exhaustively tested. In the first case, the PSD constraint on the learned matrix has been enforced at each iteration; in the second, the PSD constraint has not been considered when learning the matrix. From the classification results obtained, it can be safely stated that all the online algorithms considered in this work have the potential to arrive at very similar and competitive results with regard to the state of the art. In fact, no significant differences in classification performance were found between any of the online algorithms tested. On the other hand, the running times needed by the different algorithms vary depending on the size of the database. For small sizes, all online algorithms are very efficient. For larger databases, the positive versions lead to a relatively high computational effort. Differences in running time between the positive version of an algorithm and its negative counterpart have been shown to be significant in all cases.
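
The extra work carried out by a positive version at each iteration can be illustrated with a minimal sketch. It assumes that the PSD constraint is enforced by the standard projection onto the PSD cone, that is, by clipping negative eigenvalues to zero. The function name `project_psd` is hypothetical, and the exact enforcement step used by the algorithms may differ.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone (Frobenius-nearest PSD
    matrix) by setting its negative eigenvalues to zero."""
    M = np.asarray(M, dtype=float)
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    clipped = np.clip(eigvals, 0.0, None)   # drop the negative spectrum
    return (eigvecs * clipped) @ eigvecs.T  # rebuild from the kept part
```

Under this assumption, a positive version would call `project_psd` after every update, which requires a full eigendecomposition each time, whereas a negative version would simply keep the updated, possibly indefinite, matrix.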

A major contribution of this work is the proposal to use indefinite matrices in distance learning as an efficient alternative to the so-called alternating projection methods. Apart from the computational benefits obtained, results suggest that the use of indefinite matrices yields sparser solutions. This leads to a recommendation to use the negative versions of the algorithms, which offer significantly lower running times while exhibiting roughly the same performance.

A number of improvements to the presented algorithms are still possible. First, the convergence criterion could be further improved to reduce the training times of all online algorithms. Moreover, introducing an adaptive tolerance in the online process could also lead to a significantly lower number of matrix updates. This could have a significant impact, especially in the case of the PALS algorithms. More importantly, we are considering adding specific constraints to control the sparseness of the matrix in order to both reduce computation time and improve the quality of the learned metrics.