1 Introduction

Metric learning has been widely studied in machine learning due to its importance in many machine learning algorithms (Xing et al. 2003; Weinberger and Saul 2009; Davis et al. 2007; Huang et al. 2009, 2011; Ying et al. 2009; Ying and Li 2012). The objective of metric learning is to learn a proper metric function from data, usually a Mahalanobis distance defined as \(d_{A}(\mathbf{x},\mathbf{y})=\sqrt{(\mathbf{x}-\mathbf{y})^{\top}A(\mathbf{x}-\mathbf{y})}\), while satisfying certain extra constraints called side-information, e.g., similar (dissimilar) points should stay closer (further).

In this paper, we consider an extended metric learning problem in which several correlated metric learning tasks must be solved simultaneously. Two traditional solutions could be applied to this problem. The first is to learn a metric for each task individually. Unfortunately, this approach is prone to over-fitting, especially when the training samples of some tasks are insufficient. The second is to learn a single global metric for all the tasks. Since this method neglects the essential discrepancies among the tasks, its performance is limited. To address this problem, multi-task learning (MTL) has recently received considerable attention (Caruana 1997; Evgeniou and Pontil 2004; Argyriou et al. 2008; Argyriou and Evgeniou 2008; Zhang et al. 2008; Zhang and Yeung 2010a). MTL learns an individual model for each task but trains them jointly. Joint training enables information sharing among tasks, which helps improve the performance of each task.

Despite its good performance, MTL has rarely been applied to multiple metric learning problems. To the best of our knowledge, only recently have Parameswaran and Weinberger (2010), Zhang and Yeung (2010b), and Yang et al. (2011) each developed a multi-task metric learning framework. Parameswaran and Weinberger (2010) proposed a multi-task framework called Large Margin Multi-Task Metric Learning (mtLMNN), which combines Large Margin Nearest Neighbor (LMNN) (Weinberger and Saul 2009) with Regularized Multi-task Learning (RegMTL) (Evgeniou and Pontil 2004). It assumes that the Mahalanobis matrix of each task is composed of a common part and a task-specific part. By minimizing the Frobenius norm of the task-specific part, each metric is constrained to be similar to a common one, so that different tasks can share information with each other. Zhang and Yeung (2010b), on the other hand, proposed to combine Multi-task Relationship Learning (MTRL) (Zhang and Yeung 2010a) with Regularized Distance Metric Learning (RDML) (Jin et al. 2009). By introducing a regularization term with a task covariance matrix, the relationship among tasks can be learned, which provides the potential for better information sharing among the tasks. In addition, Yang et al. (2011) addressed this topic by assuming that the metrics of all tasks share a common subspace.

However, all of the above-mentioned methods have certain limitations. For Yang et al. (2011), the formulation is not convex, so a globally optimal solution is not guaranteed. Besides, the common subspace assumption may be too strict in some cases. The other two methods use vector-based divergence measures to describe the task relationship. Specifically, if we concatenate all columns of each matrix into a vector, then in Parameswaran and Weinberger (2010) the Frobenius norm of the difference of two matrices is simply the Euclidean distance between the corresponding vectors, while in Zhang and Yeung (2010a) the regularization applies a matrix-variate normal prior distribution to the vectors. However, we will show that these measures, designed for vector variables, do not apply directly to positive semi-definite Mahalanobis matrices and lead to inaccurate information propagation among tasks. This deficiency in turn limits the performance improvement.

For example, the squared Frobenius norm of the difference of two Mahalanobis matrices is used to measure the discrepancy of two metrics, but Fig. 1 shows that it is not a proper measure for metrics. The figure contains three panels associated with different distance metrics, determined (from left to right) by the Mahalanobis matrices \(A_{1}\), \(B\), and \(A_{2}\) respectively. To visualize a Mahalanobis metric in the Euclidean space (Xing et al. 2003), we transform each point \(\mathbf{x}_{i}\) to \(\hat{\mathbf{x}}_{i}=A^{1/2}\mathbf{x}_{i}\) when plotting, so that the Euclidean distance of any pair of transformed points \(d(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}}_{j})\) is exactly the Mahalanobis distance of the original points \(d_{A}(\mathbf{x}_{i},\mathbf{x}_{j})\). Geometrically, the metric \(B\) is obviously more similar to \(A_{2}\) than to \(A_{1}\). However, when the similarity is calculated using the squared Frobenius norm of the Mahalanobis matrix difference, surprisingly, \(B\) is more similar to \(A_{1}\) than to \(A_{2}\)! This shows that minimizing the Frobenius norm of the matrix difference cannot preserve the geometry and hence may not be appropriate for measuring the divergence of metrics.

Fig. 1

Illustration of the geometry preserving property. The geometry is better preserved between \(d_{A_{2}}\) and \(d_{B}\) than between \(d_{A_{1}}\) and \(d_{B}\). In addition, the relative order of \(d(\mathbf{x}_{1},\mathbf{y}_{1})\) and \(d(\mathbf{x}_{1},\mathbf{y}_{2})\) is preserved from \(B\) to \(A_{2}\) but not from \(B\) to \(A_{1}\)
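The plotting transform described above can be checked numerically. The following is a minimal sketch, assuming NumPy; the matrix and the two points are made up for illustration and are not the data of Fig. 1, and the helper names `psd_sqrt` and `mahalanobis` are hypothetical.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root A^{1/2} of an SPSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return (V * np.sqrt(w)) @ V.T

def mahalanobis(A, x, y):
    """d_A(x, y) = sqrt((x - y)^T A (x - y))."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

# Made-up matrix and points for illustration (assumption, not from the paper).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x_i = np.array([1.0, 2.0])
x_j = np.array([-0.5, 0.3])

R = psd_sqrt(A)
d_mahalanobis = mahalanobis(A, x_i, x_j)
d_euclidean = float(np.linalg.norm(R @ x_i - R @ x_j))
print(d_mahalanobis, d_euclidean)  # the two values agree up to round-off
```

Because \(A^{1/2}\) is symmetric and squares to \(A\), the Euclidean distance of the transformed points equals the Mahalanobis distance of the originals.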

In contrast to the above methods, in this paper we employ the Bregman matrix divergence (Dhillon and Tropp 2008) to design a more general regularized framework for multi-task metric learning. On one hand, this framework exploits a more general family of matrix divergences. We show that it naturally incorporates mtLMNN (using the Frobenius norm) as a special case. On the other hand, and more importantly, by using a special Bregman divergence called the von Neumann divergence (Dhillon and Tropp 2008) (denoted by \(D_{\mathrm{vN}}(A,B)\)) as the regularization, the new multi-task metric learning method is able to preserve the geometry well when transferring information from one metric to another. The geometry preserving property is important because (1) data usually live in a geometric vector space in traditional learning tasks and (2) metric learning is also usually conducted in a geometric vector space, e.g., Euclidean space. In this sense, to guarantee better performance, it is necessary to preserve the data geometry when transferring information among tasks, e.g., relative constraints such as that sample \(\mathbf{x}_{i}\) should be more similar to sample \(\mathbf{x}_{j}\) than to sample \(\mathbf{x}_{k}\).

We define the geometry preserving probability to measure the geometry preserving property of two metrics mathematically from a statistical point of view. A series of theoretical analyses is then provided to show that our new multi-task metric learning method usually leads to a higher geometry preserving probability and is better able to preserve geometry. This enables more appropriate information propagation among tasks and hence provides potential for further improving the performance. In addition to the geometry preserving property, the new multi-task framework with the von Neumann divergence remains jointly convex, provided that a convex metric learning method is used. This is one of its major advantages over non-convex formulations, e.g., the model proposed in Yang et al. (2011). Extensive experimental results on one synthetic data set and five real data sets (across very different disciplines) also verify that our proposed algorithm consistently outperforms the current methods. In particular, a toy example in Fig. 4 of Sect. 6.1 shows the advantage of our method more intuitively.

This paper is an extension of our earlier conference paper (Yang et al. 2012), which first proposed the concept of the geometry preserving property and used it to improve multi-task metric learning. This journal version significantly extends the previous paper both theoretically and empirically. It reviews the related methods and summarizes their strengths and weaknesses, explains the motivation in more detail, makes the theoretical analysis more rigorous with complete proofs of all theorems, and expands the experimental results by comparing with more methods on more datasets.

The rest of this paper is organized as follows. In Sect. 2, we provide the notations used in the paper. In Sect. 3, we review the related work. In Sect. 4, we then present the novel multi-task metric learning framework with Bregman matrix divergence, the concept of geometry preserving probability, the proposed learning method and optimization algorithm. We present theoretical analysis in Sect. 5 and experimental evaluation in Sect. 6. Finally, we give concluding remarks in Sect. 7.

2 Notations and problem definition

In this section, we first present the notations used in the paper and then introduce the problem definition of multi-task metric learning.

A metric defined on a set \(\mathbb{X}\) is a function \(d:\mathbb{X}\times\mathbb{X}\rightarrow\mathbb{R}_{+}\triangleq[0,+\infty )\) satisfying positiveness, symmetry, and the triangle inequality (Burago et al. 2001). In this paper, we focus on the Mahalanobis metric defined on \(\mathbb{R}^{m}\) by a symmetric positive semi-definite (SPSD) matrix \(A\) as \(d_{A}(\mathbf{x},\mathbf{y})=\sqrt {(\mathbf{x}-\mathbf{y})^{\top}A(\mathbf{x}-\mathbf{y})}\), where \(A\) is called the Mahalanobis matrix. Denoting the set of all metrics on \(\mathbb{X}\) by \(\mathcal{F}_{\mathbb{X}}\) and given any pair of metrics \(d_{A},d_{B}\in\mathcal{F}_{\mathbb{X}}\), a divergence function \(D:\mathcal {F}_{\mathbb{X}}\times\mathcal{F}_{\mathbb{X}}\rightarrow\mathbb{R}_{+}\) is defined to measure the discrepancy of \(d_{A}\) and \(d_{B}\). Since the Mahalanobis metric \(d_{A}\) is parameterized by the Mahalanobis matrix \(A\), we write \(D(d_{A},d_{B})\triangleq D(A,B)\) for short.

Assume that there are \(T\) related metric learning tasks. For the \(t\)-th task, the training data set \(\mathcal{X}_{t}=\{\mathbf{x}_{tk}\in \mathbb{R}^{m}\}_{k=1}^{N_{t}}\) contains \(N_{t}\) data points, where \(m\) is the dimension. The side-information defining a set of constraints on the learned metric \(d_{t}\) can be generally formulated as \(d_{t}\in\mathcal {C}_{t}(\mathcal{X}_{t})\). For instance, in Generalized Sparse Metric Learning (GSML) (Huang et al. 2009) and LMNN, \(\mathcal{C}_{t}\) is defined by a triplet set \(\mathcal{T}_{t}=\{(i,j,k)\}\), which provides side-information as relative constraints such that \(\mathbf{x}_{ti}\) is more similar to \(\mathbf{x}_{tj}\) than to \(\mathbf{x}_{tk}\) under the new metric, and thus

$$ \mathcal{C}_t(\mathcal{X}_t)= \bigl\{d\in \mathcal{F}_\mathbb{X}\mid d(\mathbf{x}_{ti},\mathbf {x}_{tj})\leq d(\mathbf{x}_{ti},\mathbf{x}_{tk}),\ \forall(i,j,k)\in\mathcal {T}_t \bigr\}. $$

In Information-Theoretic Metric Learning (ITML) (Davis et al. 2007), \(\mathcal{C}_{t}\) is defined by a set of similar pairs \(\mathbb {S}_{t}\) and a set of dissimilar pairs \(\mathbb {D}_{t}\), which provide side-information requiring that the distance of each similar pair stay below an upper bound \(u\) and the distance of each dissimilar pair stay above a lower bound \(l\), and thus

$$ \mathcal{C}_t(\mathcal{X}_t)= \left\{d\in \mathcal{F}_\mathbb{X}\left \vert \begin{aligned} &d(\mathbf{x}_{ti}, \mathbf{x}_{tj})\leq u,\ \forall(i,j)\in \mathbb {S}_t; \\ &d(\mathbf{x}_{ti},\mathbf{x}_{tj})\geq l,\ \forall(i,j) \in \mathbb {D}_t. \end{aligned}\right. \right\}. $$

The objective of multi-task metric learning is to learn T proper Mahalanobis matrices \(\{A_{t}\}_{t=1}^{T}\) jointly, which is significantly different from single-task metric learning where each Mahalanobis matrix is learned independently.

3 Related work

There have been some attempts to combine multi-task learning with metric learning. Based on different assumptions about the relationship among tasks, the researchers proposed some interesting models of multi-task metric learning.

3.1 Multi-task large margin metric learning

The first multi-task metric learning method is the mtLMNN model proposed by Parameswaran and Weinberger (2010). Motivated by RegMTL (Evgeniou and Pontil 2004), mtLMNN assumes that the Mahalanobis matrix of the \(t\)-th task is composed of a common part and a task-specific part as \(A_{t}=A_{0}+\hat {A}_{t}\). Using the squared Frobenius norm (Horn and Johnson 1985) of the task-specific part \(\|\hat{A}_{t}\| _{\mathrm{F}}^{2}\) as the regularization term, mtLMNN encourages the similarity between each task and a common one. This approach indeed shows better performance on several real data sets. However, it suffers from two drawbacks, explained at the end of Sect. 4.1, which limit its performance in real applications.

3.2 Zhang and Yeung’s method

Zhang and Yeung (2010b) proposed a multi-task metric learning approach by assuming that the matrix composed of the vectorized Mahalanobis matrices of all tasks follows a matrix-variate normal distribution (Zhang and Yeung 2010a; Gupta and Nagar 2000). It first concatenates all columns of each \(A_{t}\) to form a vector \(\tilde{A}_{t}=\textup{vec} (A_{t} )\) and then uses the MTRL (Zhang and Yeung 2010a) regularization \(\tilde{A}\varOmega^{-1}\tilde{A}^{\top}\) to couple different tasks, where \(\tilde{A}= [\textup{vec} (A_{1} ),\ldots,\textup{vec} (A_{T} ) ]\). It applies a matrix-variate normal prior distribution

$$ q(\tilde{A})=\mathcal{MN}_{m^2\times T}(\tilde{A}|\mathbf {0}_{m^2\times T}, \mathbf{I}_{m^2}\otimes\varOmega) $$

to the \(\tilde{A}_{t}\)’s (Zhang and Yeung 2010a), and the task relationship \(\varOmega\) can finally be obtained together with all the metrics. This approach has some desirable properties, since the task relationship can be learned automatically, but the prior distribution applied to \(\tilde{A}\) has two shortcomings:

  • The expectation of each \(\tilde{A}_{t}\) is a zero vector, which is apparently designed for vector-based variables rather than for symmetric positive semi-definite Mahalanobis matrices. For example, \(A\) and \(-A\) are assigned equal prior probability, which is improper since at most one of them can be a feasible Mahalanobis matrix.

  • Vectorization of a matrix discards some structure information.

Moreover, the authors surprisingly failed to validate it empirically.
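The first shortcoming can be made concrete with a small numerical sketch. It assumes that the coupling term enters the objective through the trace \(\mathrm{tr}(\tilde{A}\varOmega^{-1}\tilde{A}^{\top})\), which is the form implied by a zero-mean matrix-variate normal prior; the matrices, the task covariance `Omega`, and the helper names are made up for illustration.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector."""
    return A.flatten(order="F")

def mtrl_regularizer(As, Omega):
    """tr(A_tilde Omega^{-1} A_tilde^T) with A_tilde = [vec(A_1), ..., vec(A_T)]."""
    A_tilde = np.column_stack([vec(A) for A in As])  # shape (m*m, T)
    return float(np.trace(A_tilde @ np.linalg.solve(Omega, A_tilde.T)))

# Hypothetical Mahalanobis matrices of two tasks and a task covariance matrix.
As = [np.array([[2.0, 0.3], [0.3, 1.0]]),
      np.array([[1.5, -0.2], [-0.2, 0.8]])]
Omega = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

reg_pos = mtrl_regularizer(As, Omega)
reg_neg = mtrl_regularizer([-A for A in As], Omega)
min_eig_neg = float(np.min(np.linalg.eigvalsh(-As[0])))
print(reg_pos, reg_neg, min_eig_neg)
```

The regularizer is an even function of \(\tilde{A}\), so the sign-flipped matrices receive the same penalty even though they have negative eigenvalues and cannot define a metric.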

Actually, mtLMNN also applies a multi-variate normal distribution to the vectorized Mahalanobis matrices. In contrast to the method of Zhang and Yeung (2010b), which predefines the mean and learns the task relationship, mtLMNN predefines the task relationship \(\varOmega\) as the Laplacian matrix of a fully connected graph (Chung 1997).

3.3 Multi-task metric learning based on common subspace

Yang et al. (2011) proposed a multi-task metric learning method based on the assumption that all the metrics share a common low-dimensional subspace. Supposing \(A_{t}=L_{t}^{\top}L_{t}\) and that the transformation matrix \(L_{t}\) has the decomposition \(L_{t}=R_{t}L_{0}\), all tasks are coupled by the common matrix \(L_{0}\), which has fewer rows than columns. Hence \(L_{0}\) defines a common subspace for all the tasks, while \(R_{t}\) defines the metric in this subspace for each task. With alternating optimization, the subspace \(L_{0}\) and all the metrics \(R_{t}\) can be solved simultaneously. However, this assumption is sometimes too strict. In addition, this model involves a non-convex optimization and hence cannot guarantee a globally optimal solution.
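To see why the common subspace assumption is restrictive, the following sketch builds metrics of the form \(A_{t}=L_{t}^{\top}L_{t}\) with \(L_{t}=R_{t}L_{0}\) from random factors. The sizes, the random factors, and the square shape of \(R_{t}\) are assumptions made only for illustration; this is not the training procedure of Yang et al. (2011).

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, T = 5, 2, 3  # hypothetical sizes: input dimension, subspace dimension, tasks

L0 = rng.standard_normal((r, m))                       # shared map, fewer rows than columns
Rs = [rng.standard_normal((r, r)) for _ in range(T)]   # task-specific maps (assumed r x r)

As = []
for R_t in Rs:
    L_t = R_t @ L0
    As.append(L_t.T @ L_t)                             # A_t = L_t^T L_t is SPSD

ranks = [int(np.linalg.matrix_rank(A)) for A in As]

# A direction in the null space of L0 is invisible to every task metric.
_, _, Vt = np.linalg.svd(L0)
v = Vt[-1]
null_sq_dists = [float(v @ A @ v) for A in As]
print(ranks, null_sq_dists)
```

Every \(A_{t}\) has rank at most \(r\), and any displacement in the null space of \(L_{0}\) has zero length under all task metrics at once, which is the sense in which the tasks are forced into one subspace.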

4 Geometry preserving multi-task metric learning

In this section, we first detail our proposed novel framework, and then show the importance of preserving geometry among samples when sharing the side-information among tasks. The concept of geometry preserving probability is then proposed to provide a mathematical criterion that measures the capability to preserve the geometry relationship, i.e., the relative distance of samples between two metrics. Following that, we introduce our method that exploits von Neumann divergence to regularize the relationship among multiple tasks. Finally, we present a practical algorithm to solve the involved optimization problem.

4.1 General framework

In this section, we propose a general framework for multi-task metric learning including mtLMNN as a special case.

Assume that a common metric \(d_{c}\) is defined and that the metric \(d_{t}\) of each (the \(t\)-th) task is correlated with \(d_{c}\) through a regularization \(D(d_{t},d_{c})\). All the metrics are coupled by this common metric. In the case of the Mahalanobis metric, the regularization can also be written as \(D(A_{t},B)\), where \(A_{t}\) and \(B\) correspond to the \(t\)-th task and the common metric respectively. Then the novel framework can be formulated as

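$$ \min_{\{A_t\},B}\ \sum_t{ \bigl( L(A_t,\mathcal{X}_t)+\gamma D(A_t,B) \bigr)}+\gamma_0 D(A_0,B)\quad \mbox{s.t.} \;A_t\in\mathcal{C}_t(\mathcal{X}_t),\ A_t\succeq\mathbf{0}, $$
(1)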

where \(L\) is a loss function of the training samples depending on the side-information and the metric learning method, \(\mathcal{X}_{t}\) represents the set of training samples of the \(t\)-th task, \(D(\cdot,\cdot)\) is the divergence function that correlates two metrics, and \(\mathcal{C}_{t}(\mathcal{X}_{t})\) is the set of feasible \(A_{t}\) determined by the side-information. The predefined metric \(A_{0}\) provides a prior for the common metric, and we can usually use the Euclidean distance, i.e., \(A_{0}=\mathbf{I}_{m}\). In many cases, there may be no feasible solution that strictly satisfies all the constraints defined by \(\mathcal {C}_{t}(\mathcal{X}_{t})\), and soft constraints are used instead by reformulating the inequality constraints as loss functions. For example, the constraint \(d_{A}^{2}(\mathbf{x}_{i},\mathbf {x}_{k})-d_{A}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})\geq1\) is reformulated as a loss \([1+d_{A}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})-d_{A}^{2}(\mathbf {x}_{i},\mathbf{x}_{k}) ]_{+}\), where \([z]_{+}=\max(z,0)\). Then, denoting the loss function including the soft constraints by \(\tilde {L}(A_{t},\mathcal{X}_{t})\), the framework becomes

$$ \min_{\{A_t\},B}\ \sum_t{ \bigl( \tilde{L}(A_t,\mathcal{X}_t)+\gamma D(A_t,B) \bigr)}+\gamma_0 D(A_0,B)\quad \mbox{s.t.} \;A_t\succeq\mathbf{0}. $$
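The soft-constraint reformulation can be sketched as follows. This is a minimal illustration of the hinge term \([1+d_{A}^{2}(\mathbf{x}_{i},\mathbf{x}_{j})-d_{A}^{2}(\mathbf{x}_{i},\mathbf{x}_{k})]_{+}\) summed over a triplet set, assuming NumPy; it is not the complete loss of LMNN or of any other specific method, and the points, triplets, and function names are made up.

```python
import numpy as np

def sq_dist(A, x, y):
    """Squared Mahalanobis distance d_A^2(x, y)."""
    d = x - y
    return float(d @ A @ d)

def triplet_hinge_loss(A, X, triplets):
    """Sum of [1 + d_A^2(x_i, x_j) - d_A^2(x_i, x_k)]_+ over triplets (i, j, k)."""
    total = 0.0
    for i, j, k in triplets:
        margin = 1.0 + sq_dist(A, X[i], X[j]) - sq_dist(A, X[i], X[k])
        total += max(margin, 0.0)
    return total

# Hypothetical data: x_0 should be closer to x_j than to x_k in each triplet.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 3.0],
              [0.5, 0.5]])
triplets = [(0, 1, 2), (0, 3, 1)]

loss_euclid = triplet_hinge_loss(np.eye(2), X, triplets)
print(loss_euclid)  # 0.5
```

Under the Euclidean metric the first triplet satisfies its constraint with a margin larger than 1 and contributes nothing, while the second falls short of the unit margin by 0.5 and is penalized by that amount.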

In our framework, all metrics are correlated with each other because each \(d_{A_{t}}\) is encouraged to be similar to a common metric \(d_{B}\) by minimizing \(D(A_{t},B)\). The divergence thus measures the discrepancy of two metrics, so that the smaller \(D(A_{t},B)\) is, the more closely \(A_{t}\) and \(B\) are correlated. Therefore, minimizing \(D(A_{t},B)\) forces information to be shared between \(A_{t}\) and \(B\), and the definition of \(D(\cdot,\cdot)\) determines the type of shared information. Since multi-task learning improves the performance of each task by utilizing the information propagated from the others, the choice of \(D(\cdot,\cdot)\) is critical to the performance of this framework.

There is a family of discrepancy measures for two Hermitian matrices called Bregman matrix divergence (Dhillon and Tropp 2008), which is defined as

$$ D_\phi(A,B)=\phi(A)-\phi(B)-\mathrm{tr} \bigl( \bigl(\nabla\phi(B)\bigr)^\top (A-B) \bigr), $$
(2)

where \(\phi:\mathcal{H}\rightarrow\mathbb{R}\) is a strictly convex, differentiable generating function of a Hermitian matrix variable, and \(\mathrm{tr}(A)\) is the trace of \(A\). Furthermore, if \(\phi(A)\) depends only on the eigenvalues of \(A\), it is called a spectral function (Lewis 1996), and \(D_{\phi}(A,B)\) is the spectral Bregman matrix divergence (Kulis et al. 2009). In this case, \(\phi\) can be written as a composition \(\phi(A)=(\varphi\circ\lambda)(A)\), where \(\lambda(A)\) is the function that lists the eigenvalues in algebraically decreasing order and \(\varphi\) is a strictly convex function on \(\mathbb{R}^{m}\).

By choosing different \(\varphi\), we obtain some well-known types of matrix divergences (Kulis et al. 2009). If the squared 2-norm \(\varphi(\mathbf{x})=\mathbf{x}^{\top}\mathbf{x}=\|\mathbf{x}\|_{2}^{2}\) is used, we have \(\phi (A)=\|A\|_{\mathrm{F}}^{2}\) and \(D_{\phi}(A,B)=\|A-B\|_{\mathrm{F}}^{2}\) is the squared Frobenius norm of their difference; if the entropy \(\varphi(\mathbf{x})=\sum_{i}(x_{i}\log x_{i}-x_{i})\) is used, we have \(\phi(A)=\mathrm{tr}(A\log A-A)\) and \(D_{\phi}(A,B)\) is the von Neumann divergence, which we will discuss in detail later.
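Both special cases can be recovered numerically from the general definition (2). The sketch below assumes NumPy and symmetric positive definite inputs, and uses the gradients \(\nabla\phi(A)=2A\) for the squared Frobenius norm and \(\nabla\phi(A)=\log A\) for \(\phi(A)=\mathrm{tr}(A\log A-A)\); the matrices and helper names are made up for illustration.

```python
import numpy as np

def logm_spd(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def bregman(phi, grad_phi, A, B):
    """D_phi(A, B) = phi(A) - phi(B) - tr(grad_phi(B)^T (A - B))."""
    return float(phi(A) - phi(B) - np.trace(grad_phi(B).T @ (A - B)))

# Generating functions and their gradients.
phi_fro = lambda A: np.sum(A * A)                    # ||A||_F^2
grad_fro = lambda A: 2.0 * A
phi_vn = lambda A: np.trace(A @ logm_spd(A) - A)     # tr(A log A - A)
grad_vn = logm_spd

# Hypothetical SPD matrices.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.5, -0.2], [-0.2, 0.8]])

d_fro = bregman(phi_fro, grad_fro, A, B)
d_vn = bregman(phi_vn, grad_vn, A, B)
closed_vn = float(np.trace(A @ logm_spd(A) - A @ logm_spd(B) - A + B))
print(d_fro, d_vn, closed_vn)
```

The first value matches \(\|A-B\|_{\mathrm{F}}^{2}\) and the second matches the closed form \(\mathrm{tr}(A\log A-A\log B-A+B)\).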

Note that mtLMNN is a special case of our framework with \(D_{\phi}(A_{t},B)=\|A_{t}-B\|_{\mathrm{F}}^{2}\) and with the constraint \(A_{t}\succeq0\) replaced by \(A_{t}\succeq B\succeq0\). Rewriting it in this form makes its main drawbacks much clearer:

  1. The constraints \(A_{t}\succeq B\) are stronger than what is needed for \(A_{t}\) to be a Mahalanobis matrix. They imply that the distance between any two points under each task metric can be no smaller than their distance under the common metric.

  2. The Frobenius norm of the Mahalanobis matrix difference is inadequate for measuring the discrepancy of two metrics, and thus minimizing it cannot preserve the geometric relation of the data defined by the metrics. We have illustrated this with an example and will explain it theoretically in Sect. 5.
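The first drawback follows from \(d_{A_{t}}^{2}(\mathbf{x},\mathbf{y})-d_{B}^{2}(\mathbf{x},\mathbf{y})=(\mathbf{x}-\mathbf{y})^{\top}(A_{t}-B)(\mathbf{x}-\mathbf{y})\geq0\) whenever \(A_{t}-B\) is positive semi-definite. A minimal numerical sketch, assuming NumPy and randomly generated (hypothetical) matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3

G0 = rng.standard_normal((m, m))
B = G0 @ G0.T                      # common SPSD part
Gt = rng.standard_normal((m, m))
A_hat = Gt @ Gt.T                  # task-specific SPSD part
A_t = B + A_hat                    # hence A_t - B is SPSD

D = rng.standard_normal((1000, m))                 # random difference vectors x - y
sq_B = np.einsum("ni,ij,nj->n", D, B, D)           # d_B^2 for every difference
sq_At = np.einsum("ni,ij,nj->n", D, A_t, D)        # d_{A_t}^2 for every difference
gap = float(np.min(sq_At - sq_B))
print(gap)  # non-negative up to round-off
```

No sampled pair is closer under the task metric than under the common metric, so the common part acts as a lower bound on every task's distances.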

4.2 Regularization and geometry preserving probability

In this section, we present our motivation for defining a proper regularization \(D(A_{t},B)\) that enables side-information to propagate among tasks more appropriately. The concept of geometry preserving probability is then proposed to measure whether the side-information is well propagated.

On one hand, in this framework, a smaller \(D(A_{t},B)\) implies that more side-information is shared. Since the metric is learned by satisfying a set of constraints from side-information, metric learning can also be regarded as a process of embedding the side-information into the learned metric. Thus closely correlated metrics should contain similar side-information, and minimizing \(D(A_{t},B)\) should encourage side-information to propagate between \(A_{t}\) and \(B\).

On the other hand, the side-information is usually formulated as a set of constraints on the relative distances of the samples (Ying and Li 2012). For example, GSML and LMNN define the side-information directly by constraints on the relative distances of samples in a triplet set. ITML defines an upper bound for the distances of similar pairs and a lower bound for the distances of dissimilar pairs, but a set of constraints on their relative distances is also implicitly defined by the relation of the two bounds.

From the above observations, a proper \(D(\cdot,\cdot)\) for multi-task metric learning should have the following property: the smaller \(D(A_{t},B)\) is, the more constraints on relative distances are satisfied by both \(A_{t}\) and \(B\). Focusing on the \(t\)-th task and fixing the common metric \(B\), we obtain the subproblem

$$ \min_{A_t}\ \tilde{L}(A_t, \mathcal{X}_t)+\gamma D(A_t,B)\quad \mbox{s.t.} \;A_t\succeq0, $$
(3)

which aims to find an \(A_{t}\) that is correlated with \(B\) while satisfying the side-information of its own task. According to the above discussion, this is equivalent to finding a metric \(A_{t}\) that, on one side, satisfies the constraints from the side-information of the \(t\)-th task and, on the other side, preserves the geometric relationship (relative distances) of the samples measured by \(B\) as well as possible. We call the latter the “geometry preserving property”.

To illustrate the geometry preserving property, recall the example shown in Fig. 1. There are two pairs of randomly selected points \((\mathbf{x}_{1},\mathbf{y}_{1})\) and \((\mathbf{x}_{1},\mathbf{y}_{2})\). Since \(d_{B}(\mathbf{x}_{1},\mathbf{y}_{1})<d_{B}(\mathbf{x}_{1},\mathbf{y}_{2})\), if we want a metric \(d_{A}\) that is similar to \(d_{B}\), it is desirable that \(d_{A}\) makes the same judgement on the relative distance, i.e., \(d_{A}(\mathbf{x}_{1},\mathbf{y}_{1})<d_{A}(\mathbf{x}_{1},\mathbf{y}_{2})\). Obviously, this relative distance relationship for \((\mathbf{x}_{1},\mathbf{y}_{1})\), \((\mathbf{x}_{1},\mathbf{y}_{2})\) is preserved between \(A_{2}\) and \(B\) but not between \(A_{1}\) and \(B\). Analogously, there are also two pairs \((\mathbf{x}_{1},\mathbf{y}_{1})\), \((\mathbf{x}_{1},\mathbf{y}_{3})\), whose relative distance relationship is preserved both between \(A_{1}\) and \(B\) and between \(A_{2}\) and \(B\). Since more relationships are preserved for \(A_{2}\) and \(B\), we say that their geometry preserving property is better, which is also consistent with our intuition that \(B\) is more similar to \(A_{2}\) than to \(A_{1}\).

Based on this idea, we propose the concept of geometry preserving probability to measure the geometry preserving property mathematically. It is defined as the probability that the relative distance of two arbitrary pairs of random points is preserved between the two metrics.

Definition 1

(Geometry Preserving Probability)

Suppose \(\mathbf{x}_{1},\mathbf{y}_{1}\in\mathbb{X}\) and \(\mathbf {x}_{2},\mathbf{y}_{2}\in\mathbb{X}\) are two pairs of random points following a certain distribution defined by the probability density \(f(\mathbf{x}_{1},\mathbf{y}_{1},\mathbf{x}_{2},\mathbf{y}_{2})\).

If two metrics \(d_{A}\) and \(d_{B}\) defined on \(\mathbb{X}\) are used to compare the distances \(d(\mathbf{x}_{1},\mathbf{y}_{1})\) and \(d(\mathbf{x}_{2},\mathbf{y}_{2})\) of the two pairs, the probability that \(d_{A}\) and \(d_{B}\) make the same judgement about their relative distance is called the geometry preserving probability of the metrics \(d_{A}\) and \(d_{B}\) with distribution \(f\). It is denoted by \(\mathrm{PG}_{f}(d_{A},d_{B})\) (Probability of Geometry Preserving), with the mathematical formulation shown in (4).

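$$ \mathrm{PG}_f(d_A,d_B)= P \bigl[d_A(\mathbf{x}_1,\mathbf{y}_1)>d_A(\mathbf{x}_2,\mathbf{y}_2) \wedge d_B(\mathbf{x}_1,\mathbf{y}_1)>d_B(\mathbf{x}_2,\mathbf{y}_2) \bigr] +P \bigl[d_A(\mathbf{x}_1,\mathbf{y}_1)<d_A(\mathbf{x}_2,\mathbf{y}_2) \wedge d_B(\mathbf{x}_1,\mathbf{y}_1)<d_B(\mathbf{x}_2,\mathbf{y}_2) \bigr], $$
(4)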

where \((\mathbf{x}_{1},\mathbf{y}_{1},\mathbf{x}_{2},\mathbf{y}_{2})\sim f\) and \(\wedge\) denotes the logical “and” operator.

By this definition, the higher \(\mathrm{PG}_{f}(d_{A},d_{B})\) is, the better the geometric relation is preserved between \(d_{B}\) and \(d_{A}\). In the example of Fig. 1, using randomly generated samples, the PG can be estimated as \(\mathrm{PG}_{f}(d_{A_{1}},d_{B})\approx0.805<\mathrm {PG}_{f}(d_{A_{2}},d_{B})\approx1.000\) for some distribution \(f\), which shows that the geometry is better preserved between \(B\) and \(A_{2}\) than between \(B\) and \(A_{1}\).
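A Monte Carlo estimate of the geometry preserving probability can be sketched as follows. The distribution \(f\) is taken here, purely as an assumption, to be i.i.d. standard normal over all four points, and the three matrices are made up for illustration; they are not the matrices of Fig. 1, so the estimates differ from the values quoted above.

```python
import numpy as np

def sq_dists(A, X, Y):
    """Row-wise squared Mahalanobis distances d_A^2(x_n, y_n)."""
    D = X - Y
    return np.einsum("ni,ij,nj->n", D, A, D)

def estimate_pg(A, B, n_samples=20_000, seed=0):
    """Monte Carlo estimate of PG_f(d_A, d_B) for standard normal points (assumed f)."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    x1, y1, x2, y2 = rng.standard_normal((4, n_samples, m))
    a = sq_dists(A, x1, y1) - sq_dists(A, x2, y2)
    b = sq_dists(B, x1, y1) - sq_dists(B, x2, y2)
    # Squared distances order pairs exactly like distances; equal signs mean
    # that both metrics make the same judgement.
    return float(np.mean(a * b > 0))

# Hypothetical metrics: A2 is a rescaled copy of B, A1 has a different shape.
B = np.diag([4.0, 1.0])
A1 = np.diag([2.0, 3.0])
A2 = 2.0 * B

fro1 = np.linalg.norm(A1 - B, "fro") ** 2   # 8 up to round-off
fro2 = np.linalg.norm(A2 - B, "fro") ** 2   # 17 up to round-off
pg1 = estimate_pg(A1, B)
pg2 = estimate_pg(A2, B)
print(fro1, fro2, pg1, pg2)
```

In this toy setting the squared Frobenius norm ranks \(A_{1}\) as closer to \(B\), although \(A_{2}\) orders every pair of distances exactly as \(B\) does and therefore attains the larger estimated PG.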

In the following parts, we discuss the proposed method with the von Neumann divergence. Theoretical analysis is given in Sect. 5, which will show that our method is more likely to make \(\mathrm{PG}_{f}(d_{A},d_{B})\) higher and thus can better preserve geometry.

4.3 Multi-task metric learning with von Neumann divergence

We propose to use the von Neumann divergence (Dhillon and Tropp 2008; Kulis et al. 2009) as the regularization in framework (1) and obtain our multi-task metric learning method.

Assuming the spectral decomposition of \(A\) is \(A=V\varLambda V^{\top}\), the matrix logarithm of \(A\) is defined as \(\log A=V\log\varLambda V^{\top}=\sum_{i}{\log\lambda_{i}(\mathbf{v}_{i}\mathbf{v}_{i}^{\top})}\), where \(\log\varLambda\) is the diagonal matrix containing the logarithms of the eigenvalues.

Then the von Neumann divergence is defined as

$$ D_\mathrm{vN}(A,B)=\mathrm{tr} (A\log A-A\log B-A+B ). $$
(5)

If either \(A\) or \(B\) is low-rank, the von Neumann divergence in (5) is undefined because of the zero eigenvalues. In this case, the von Neumann divergence is defined as

$$ D_\mathrm{vN}(A,B)=D_{\mathrm{vN},U}(A,B)=D_\mathrm{vN} \bigl(U^\top AU,U^\top BU\bigr), $$

where U is an m×r column orthogonal matrix such that range(B)⊆range(U), and this definition is independent of the choice of U (Kulis et al. 2009). Please refer to Kulis et al. (2009) for more detail about the treatment of the low-rank case.

The von Neumann divergence has been widely used in quantum information theory (Nielsen and Chuang 2010). It plays the role of relative entropy between two quantum density operators, which are mathematically represented as SPSD matrices just like the Mahalanobis matrices. When the von Neumann divergence is used as the regularization between Mahalanobis matrices \(A\) and \(B\), the geometric relationship of samples measured by \(B\) is more likely to be preserved when measured by \(A\). More precisely, it encourages a higher geometry preserving probability \(\mathrm{PG}_{f}(A,B)\). We detail the theoretical analysis in Sect. 5.

The von Neumann divergence has the nice property that it is jointly convex in its two arguments (Tropp 2012; Bauschke and Borwein 2001), as shown in Theorem 1.

Theorem 1

(Joint convexity of von Neumann divergence)

The von Neumann divergence (5) is jointly convex, which means that for SPD matrices \(\{A_{i},B_{i}\}_{i=1}^{n}\) and \(\{ p_{i}\in[0,1]\}_{i=1}^{n}\) satisfying \(\sum_{i}p_{i}=1\), the following inequality holds.

$$ D_{\mathrm{vN}} \biggl(\sum_i{p_iA_i}, \sum_i{p_iB_i} \biggr) \leq\sum_i{p_iD_\mathrm{vN}(A_i,B_i)}. $$

This theorem can be derived from Lindblad’s theorem (Lindblad 1973). A detailed proof can be found in Tropp (2012) and Bauschke and Borwein (2001).
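The inequality of Theorem 1 can be checked numerically on random inputs. The sketch below assumes NumPy and randomly generated SPD matrices; it is a sanity check on one sample, not a proof, and the helper names are hypothetical.

```python
import numpy as np

def logm_spd(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def von_neumann(A, B):
    """D_vN(A, B) = tr(A log A - A log B - A + B) for SPD A and B."""
    return float(np.trace(A @ logm_spd(A) - A @ logm_spd(B) - A + B))

def random_spd(m, rng):
    G = rng.standard_normal((m, m))
    return G @ G.T + 0.1 * np.eye(m)   # shift keeps eigenvalues away from zero

rng = np.random.default_rng(0)
n, m = 4, 3
As = [random_spd(m, rng) for _ in range(n)]
Bs = [random_spd(m, rng) for _ in range(n)]
p = rng.dirichlet(np.ones(n))          # random convex weights

A_bar = sum(p_i * A_i for p_i, A_i in zip(p, As))
B_bar = sum(p_i * B_i for p_i, B_i in zip(p, Bs))
lhs = von_neumann(A_bar, B_bar)
rhs = sum(p_i * von_neumann(A_i, B_i) for p_i, A_i, B_i in zip(p, As, Bs))
print(lhs, rhs)  # lhs does not exceed rhs
```

The divergence of the averaged pair never exceeds the average of the pairwise divergences, which is the joint convexity used later to argue that the multi-task problem stays convex.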

Therefore, any convex metric learning algorithm can be extended to a jointly convex multi-task metric learning problem using our method. We solve it by alternating optimization. Initially, \(B\) is set to \(A_{0}\). Then each \(A_{t}\) and \(B\) are solved alternately with the other variables fixed. A globally optimal solution is guaranteed by the convexity of the problem. We elaborate on the optimization in the next subsection.

4.4 Optimization

4.4.1 Fix \(B\) and optimize over the \(A_{t}\)’s

When \(B\) is fixed, the optimization problem over the \(A_{t}\)’s decouples into \(T\) individual single-task metric learning subproblems (3), each with an additional regularization \(D_{\mathrm{vN}}(A_{t},B)\). Given that the original metric learning problem is convex, each subproblem is also convex, and they can be solved separately.

If the problem is solved with a gradient descent or subgradient method, the gradient \(\frac{\partial D_{\mathrm {vN}}}{\partial A_{t}}=\log A_{t}-\log B\) is needed in each step. In this paper, we apply our multi-task framework to the LMNN (Weinberger and Saul 2009) metric learning algorithm, which has proved effective in many applications. In each gradient descent step of our algorithm, the additional computation is just the matrix logarithm of the \(A_{t}\)’s and \(B\), for which a spectral decomposition is needed. However, in order to project the obtained solution onto the SPSD cone, the LMNN algorithm itself includes a spectral decomposition in each updating step, so the matrix logarithm can reuse this result directly. The additional computation is then only the logarithm of the eigenvalues and a matrix multiplication.
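The gradient formula can be verified against a central finite difference along a symmetric direction. This is a minimal sketch assuming NumPy and made-up SPD matrices; in an actual solver the eigendecomposition inside `logm_spd` would be the one already computed for the SPSD projection, which is not modelled here.

```python
import numpy as np

def logm_spd(A):
    """Matrix logarithm via the spectral decomposition A = V diag(w) V^T."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def von_neumann(A, B):
    """D_vN(A, B) = tr(A log A - A log B - A + B) for SPD A and B."""
    return float(np.trace(A @ logm_spd(A) - A @ logm_spd(B) - A + B))

def grad_von_neumann(A, B):
    """Gradient of D_vN(A, B) with respect to A: log A - log B."""
    return logm_spd(A) - logm_spd(B)

# Hypothetical SPD matrices and a symmetric perturbation direction.
A = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
B = np.array([[1.0, 0.1, 0.0], [0.1, 2.0, -0.3], [0.0, -0.3, 1.2]])
E = np.array([[0.5, -0.2, 0.1], [-0.2, 0.3, 0.4], [0.1, 0.4, -0.6]])

eps = 1e-6
finite_diff = (von_neumann(A + eps * E, B) - von_neumann(A - eps * E, B)) / (2 * eps)
analytic = float(np.trace(grad_von_neumann(A, B) @ E))
print(finite_diff, analytic)  # the two directional derivatives agree closely
```

The directional derivative \(\mathrm{tr}((\log A-\log B)E)\) matches the finite-difference estimate, consistent with the gradient stated above.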

Care is again needed when \(A_{t}\) is low-rank, which means that the current solution has moved to the boundary of the domain of \(D_{\mathrm{vN}}(A,B)\). The gradient cannot be calculated directly at such points, and we resort to the subspace spanned by the eigenvectors corresponding to the positive eigenvalues. Please refer to Sect. 4 of Kulis et al. (2009) for more details.

4.4.2 Fix the \(A_{t}\)’s and optimize over \(B\)

The optimization problem over \(B\) with all \(A_{t}\)’s fixed is

$$ \min_B\ {\sum_t\gamma {D_\mathrm{vN}(A_t,B)}+ \gamma_0D_\mathrm{vN}(A_0,B)}\quad \textup{s.t. }B\succeq0. $$

This problem is the matrix-variable counterpart of Proposition 1 in Banerjee et al. (2005), where the optimal solution is called the Bregman representative. Here we generalize this result to the case of symmetric matrices, where the optimal solution is again the weighted average of the \(A_{t}\)’s, as shown in Theorem 2.

Theorem 2

(Bregman matrix representative)

Let \(\{X_{i}\}_{i=1}^{n}\) be a set of symmetric matrices and let \(\{p_{i}\} _{i=1}^{n}\) form a probability distribution, i.e., \(\sum_{i}p_{i}=1\). Then for any Bregman divergence, the problem

$$ \min_Y{\sum_i{p_iD_\phi(X_i,Y)}} $$

has a unique minimizer given by Y =∑ i p i X i

Proof

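The function to be minimized is \(J_{\phi}(Y)=\sum_{i}p_{i}D_{\phi}(X_{i},Y)\). Let \(\bar{X}=\sum_{i}{p_{i}X_{i}}\); then for all \(Y\),

$$ \begin{aligned} J_\phi(Y)-J_\phi(\bar{X}) &=\sum_i{p_iD_\phi(X_i,Y)}-\sum_i{p_iD_\phi(X_i,\bar{X})} \\ &=\phi(\bar{X})-\phi(Y)-\mathrm{tr} \biggl( \bigl(\nabla\phi(Y)\bigr)^\top \biggl(\sum_i{p_iX_i}-Y \biggr) \biggr) +\mathrm{tr} \biggl( \bigl(\nabla\phi(\bar{X})\bigr)^\top \biggl(\sum_i{p_iX_i}-\bar{X} \biggr) \biggr) \\ &=\phi(\bar{X})-\phi(Y)-\mathrm{tr} \bigl( \bigl(\nabla\phi(Y)\bigr)^\top (\bar{X}-Y) \bigr) \\ &=D_\phi(\bar{X},Y)\geq0. \end{aligned} $$

Since \(\phi\) is strictly convex, equality holds only when \(Y^{*}=\bar {X}=\sum_{i}{p_{i}X_{i}}\). □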

It is very interesting that this result does not depend on the choice of ϕ. Then, in our problem, the optimal solution of the common metric is

$$ B=\frac{\gamma\sum_t{A_t}+\gamma_0A_0}{\gamma T+\gamma_0}. $$

When the von Neumann divergence is used, since \(A_{t}\succeq0\) for all \(t\) and \(\gamma,\gamma_{0}>0\), the constraint \(B\succeq0\) is automatically satisfied.
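The closed-form update of \(B\) can be sketched and sanity-checked as follows. The code assumes NumPy, randomly generated SPD task matrices, and hypothetical values \(\gamma=\gamma_{0}=1\); it compares the objective at the weighted average with its value at a few nearby symmetric perturbations.

```python
import numpy as np

def logm_spd(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def von_neumann(A, B):
    """D_vN(A, B) = tr(A log A - A log B - A + B) for SPD A and B."""
    return float(np.trace(A @ logm_spd(A) - A @ logm_spd(B) - A + B))

def common_metric(As, A0, gamma, gamma0):
    """B = (gamma * sum_t A_t + gamma0 * A0) / (gamma * T + gamma0)."""
    return (gamma * sum(As) + gamma0 * A0) / (gamma * len(As) + gamma0)

def objective_B(B, As, A0, gamma, gamma0):
    """sum_t gamma * D_vN(A_t, B) + gamma0 * D_vN(A0, B)."""
    return gamma * sum(von_neumann(A, B) for A in As) + gamma0 * von_neumann(A0, B)

rng = np.random.default_rng(0)
m, T = 3, 3
As = []
for _ in range(T):
    G = rng.standard_normal((m, m))
    As.append(G @ G.T + 0.1 * np.eye(m))   # hypothetical SPD task metrics
A0 = np.eye(m)                             # Euclidean prior
gamma, gamma0 = 1.0, 1.0                   # hypothetical trade-off parameters

B_star = common_metric(As, A0, gamma, gamma0)
f_star = objective_B(B_star, As, A0, gamma, gamma0)

f_perturbed = []
for _ in range(5):
    G = rng.standard_normal((m, m))
    E = 0.02 * (G + G.T) / 2               # small symmetric perturbation
    f_perturbed.append(objective_B(B_star + E, As, A0, gamma, gamma0))
print(f_star, min(f_perturbed))
```

The weighted average attains a lower objective than each perturbed candidate and is positive definite without any explicit projection.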

4.4.3 Convergence of the alternating optimization

Our alternating optimization approach is in fact a block coordinate descent method (Tseng 1988, 2001; Friedman et al. 2007). In this section, we show that it converges to the optimal solution if the von Neumann divergence or the squared Frobenius norm is used and the prior is chosen as \(A_{0}=\mathbf{I}_{m}\).

Tseng (2001) studied the block coordinate descent method in depth and presented a condition that guarantees the convergence of this algorithm. The objective function to be optimized in this paper has the following special form:

$$ f(\mathbf{x}_1,\ldots,\mathbf{x}_N)=f_0( \mathbf{x}_1,\ldots,\mathbf {x}_N)+\sum _{k=1}^N{f_k(\mathbf{x}_k)} $$

for some \(f_{0}:\mathbb{R}^{n_{1}+\cdots+n_{N}}\rightarrow\mathbb{R}\cup\{ \infty\}\) and some \(f_{k}:\mathbb{R}^{n_{k}}\rightarrow\mathbb{R}\cup\{ \infty\}\).

The condition to guarantee convergence of the coordinate descent method is proposed in Proposition 5.1 of Tseng (2001) with a series of assumptions:

  (B1) \(f_{0}\) is continuous on \(\operatorname{dom}f_{0}\).

  (B2) For each k∈{1,…,N} and \((\mathbf{x}_{j})_{j\neq k}\), the function \(\mathbf{x}_{k}\mapsto f(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})\) is quasiconvex and hemivariate (Tseng 2001).

  (B3) \(f_{0},f_{1},\ldots,f_{N}\) are lsc (lower semicontinuous).

  (C1) \(\operatorname{dom}f_{0}\) is open and \(f_{0}\) tends to ∞ at every boundary point of \(\operatorname{dom}f_{0}\).

  (C2) \(\operatorname{dom}f_{0}=Y_{1}\times\cdots\times Y_{N}\), for some \(Y_{k}\subseteq\mathbb{R}^{n_{k}},k=1,\ldots,N\).

There are N=T+1 coordinate blocks in our problem: \(\mathbf{x}_{i}=A_{i}\), i=1,…,T, and \(\mathbf{x}_{T+1}=B\). The function coupling all the variables is \(f_{0}(\mathbf{x}_{1},\ldots,\mathbf{x}_{T+1})=\gamma\sum_{i=1}^{T}{D_{\phi}(\mathbf{x}_{i},\mathbf{x}_{T+1})}\), and the separable functions are \(f_{i}(\mathbf{x}_{i})=\tilde{L}(\mathbf{x}_{i},\mathcal{X}_{i})+\delta_{X\succeq0}(A_{i})\), i=1,…,T, and \(f_{T+1}(\mathbf{x}_{T+1})=\gamma_{0}D_{\phi}(A_{0},\mathbf{x}_{T+1})\), where \(\delta_{X\succeq0}(A)\) is the characteristic function (Wikipedia 2012) of the positive semi-definite cone: \(\delta_{X\succeq0}(A)=0\) if \(A\succeq0\) and \(\delta_{X\succeq0}(A)=+\infty\) otherwise.

Then we can check the conditions above:

  (B1) Both \(\|A-B\|_{\mathrm{F}}^{2}\) and \(D_{\mathrm{vN}}(A,B)\) are continuous and thus (B1) is satisfied.

  (B2) \(\tilde{L}(\mathbf{x}_{i},\mathcal{X}_{i})\) is convex because a convex metric learning algorithm is used, and \(\delta_{X\succeq0}(A)\) is also convex due to the convexity of the positive semi-definite cone (Boyd and Vandenberghe 2004). Thus \(f_{i}(\mathbf{x}_{i})\) is convex for i=1,…,T. On the other hand, \(f_{0}\) is convex due to the strict convexity of \(\|A-B\|_{\mathrm{F}}^{2}\) and \(D_{\mathrm{vN}}(A,B)\). It follows directly that f is quasiconvex. It is also not difficult to check that f is hemivariate (Tseng 2001), and thus (B2) is satisfied.

  (B3) The metric learning algorithms chosen in this paper have continuous objective functions \(\tilde{L}\), and both \(\|A-B\|_{\mathrm{F}}^{2}\) and \(D_{\mathrm{vN}}(A,B)\) are continuous. Because \(\{X\mid X\succeq0\}\) is a closed set, the indicator function \(\delta_{X\succeq0}\) is lsc (Wikipedia 2013b). Thus \(f_{0},f_{1},\ldots,f_{N}\) are all lsc and (B3) is satisfied.

  (C2) If \(\|A-B\|_{\mathrm{F}}^{2}\) is used, the domain of each coordinate block is \(\mathbb{R}^{m\times m}\), and \(\operatorname{dom}f_{0}=Y_{1}\times\cdots\times Y_{N}\) where \(Y_{i}=\mathbb{R}^{m\times m}\), ∀i=1,…,T+1. If \(D_{\mathrm{vN}}(A,B)\) is used, each variable should satisfy \(A_{i}\succeq0\) and \(C(A_{i})\subseteq C(B)\), where C(X) is the column space of X (Kulis et al. 2009). This seems to make the domains of the coordinate blocks depend on each other. However, since we choose \(A_{0}\) as the identity matrix in our algorithm, B is guaranteed to be a full-rank matrix and thus \(C(A_{i})\subseteq C(B)\) always holds for any i. The variables are therefore decoupled and \(\operatorname{dom}f_{0}=Y_{1}\times\cdots\times Y_{N}\) where \(Y_{i}=\{X\in \mathbb{R}^{m\times m}|X\succeq0\}\), ∀i=1,…,T+1. This proves that (C2) is satisfied.

We have shown above that \(f,f_{0},f_{1},\ldots,f_{N}\) satisfy Assumptions B1–B3 and that \(f_{0}\) satisfies Assumption C2. Our alternating optimization algorithm uses the cyclic rule, which is a special case of the essentially cyclic rule (Tseng 2001). Moreover, both the loss \(\tilde{L}\) and the Bregman matrix divergence are always non-negative, so the objective is bounded below. Then by Proposition 5.1 of Tseng (2001), the algorithm is guaranteed to converge to a minimum point of f.

5 Theoretical analysis of geometry preserving property

In this section, we present a series of theoretical analyses to justify that our proposed multi-task metric learning approach is able to better preserve data geometry. Before the analysis, we define the scale vector, which characterizes the scale property of a metric, and the scale extractor, an operator that transforms a metric into a scale vector. These provide a tool to analyze the relationship between the geometry preserving probability and the Bregman matrix divergence.

In general, the relationship between the geometry preserving probability and Bregman matrix divergence is established in three steps.

  1. \(\mathrm{PG}_{f}(d_{A},d_{B})\) and \(\mathcal{E}(A,B)\) are linked: a higher geometry preserving probability \(\mathrm{PG}_{f}(d_{A},d_{B})\) usually accompanies a smaller \(\mathcal{E}(A,B)\), which is an integral defined with scale vectors in all directions.

  2. \(D_{\phi}(A,B)\) and \(\mathcal{D}_{\varphi}(A,B)\) are linked: the Bregman matrix divergence \(D_{\phi}(A,B)\) provides an upper bound for the corresponding Bregman divergence of scales \(D_{\varphi}(\rho_{W}^{A},\rho_{W}^{B})\). Therefore, minimizing \(D_{\phi}(A,B)\) has the effect of minimizing \(\mathcal{D}_{\varphi}(A,B)\), which is an integral of the Bregman divergence of scales.

  3. \(\mathrm{PG}_{f}(d_{A},d_{B})\) and \(D_{\mathrm{vN}}(A,B)\) are linked by \(\mathcal{E}(A,B)\) and \(\mathcal{D}_{\mathrm{KL}}(A,B)\): when the difference between \(\rho_{W}^{A}\) and \(\rho_{W}^{B}\) is small, which is usually the case in multi-task problems, \(\mathcal{E}(A,B)\) and \(\mathcal{D}_{\mathrm{KL}}(A,B)\) are more consistent, meaning that a smaller (greater) \(\mathcal{E}(A,B)\) usually accompanies a smaller (greater) \(\mathcal{D}_{\mathrm{KL}}(A,B)\). Therefore, minimizing \(D_{\mathrm{vN}}(A,B)\) minimizes \(\mathcal{D}_{\mathrm{KL}}(A,B)\), which in turn leads to a smaller \(\mathcal{E}(A,B)\) and ultimately a higher \(\mathrm{PG}_{f}(d_{A},d_{B})\).

5.1 Scale vector and scale extractor

The concept of scale is used to capture the scale (amplified or squashed) property or give an approximate representation of a metric. It translates a metric defined on the complicated functional space \(\mathcal{F}_{\mathbb{X}}\) into a simple real vector which contains the most important information of the metric.

Our motivation comes from the following fact. The essential role of a metric is to define the distance for any pair of points in the space. Given any pair of points \(\forall\mathbf{x},\mathbf{y}\in\mathbb{X}\), if two metrics d A and d B are similar, the distances they give, i.e. d A (x,y) and d B (x,y), are expected to be similar. This motivates us to measure the similarity between two metrics d A ,d B in such a way:

  1. Select a set of pairs of points \(\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n}\) and measure their distances with the two metrics respectively, \(\{d_{A}(\mathbf{x}_{i},\mathbf{y}_{i}),d_{B}(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n}\);

  2. For each metric \(d_{M}\) (M=A or B), use a vector \(\rho^{M}=[f(d_{M}(\mathbf{x}_{1},\mathbf{y}_{1}))\ \ldots\ f(d_{M}(\mathbf{x}_{n},\mathbf{y}_{n}))]\) as its representation;

  3. Since \(\rho^{A}\) and \(\rho^{B}\) are both vectors, it is much easier to define their similarity, which can then be used to estimate the similarity of \(d_{A}\) and \(d_{B}\).

If the metric d is translation invariant, i.e., \(d(\mathbf{x}+\mathbf{w},\mathbf{y}+\mathbf{w})=d(\mathbf{x},\mathbf{y}),\, \forall\mathbf{x},\mathbf{y},\mathbf{w}\in\mathbb{X}\), as the Mahalanobis metric is, we can always translate one point of each pair to the origin and briefly denote \(d(\mathbf{x}_{i},\mathbf{y}_{i})=d(\mathbf{z}_{i},\mathbf{0})\triangleq d(\mathbf{z}_{i})\), where \(\mathbf{z}_{i}=\mathbf{x}_{i}-\mathbf{y}_{i}\). We can then imagine that d defines a ruler in each direction \(\mathbf{z}_{i}\), and the most important properties are “the scales of these rulers”. Based on this idea, we propose the concept of scale as a representation of a metric.

Definition 2

(Scale)

Given any translation invariant metric \(d:\mathbb{X}\times\mathbb{X}\rightarrow\mathbb{R}_{+}\) and a unit vector \(\mathbf{w}\in\mathbb{X}\) where ∥w∥=1, the squared distance \(d^{2}(\mathbf{w},\mathbf{0})\triangleq d^{2}(\mathbf{w})\) is defined as the scale of d in direction w.

Definition 3

(Scale extractor & scale vector)

Define the operator \(\rho_{W}:\mathcal{F}_{\mathbb{X}}\rightarrow\mathbb{R}^{n}\), which transforms a metric d to a vector consisting of the scales of d on a group of n vectors \(W_{m\times n}=[\mathbf{w}_{1}\ \mathbf{w}_{2}\ \ldots\ \mathbf{w}_{n}]\), as the scale extractor:

$$ \rho_W(d)= \bigl[d^2(\mathbf{w}_1)\ \ d^2(\mathbf{w}_2)\ \ \ldots\ \ d^2(\mathbf{w}_n) \bigr]^\top. $$

The vector \(\rho_{W}(d)\) is called the scale vector of d on W. For a Mahalanobis metric \(d_{A}\), it simply equals \(\rho_{W}(d_{A})= [\mathbf{w}_{1}^{\top}A\mathbf{w}_{1} \ \mathbf{w}_{2}^{\top}A\mathbf{w}_{2} \ \ldots\ \mathbf{w}_{n}^{\top}A\mathbf{w}_{n}] ^{\top}\), and we denote it as \(\rho_{W}(A)\) or \(\rho_{W}^{A}\) for brevity.

Imagine that a set of unit vectors \(\{\mathbf{w}_{i}\}_{i=1}^{n}\) is measured by the “rulers” defined by \(d_{A}\); all these squared distances then compose the scale vector \(\rho_{W}(d_{A})\). With any fixed W, \(\rho_{W}(d_{A})\) is determined by the metric \(d_{A}\) and reflects how the information in these directions is amplified or squashed.
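For a Mahalanobis metric, the scale extractor of Definition 3 reduces to reading off the quadratic forms of A along the columns of W. A minimal sketch follows; the function name `scale_vector` and the example matrices are hypothetical and serve only as illustration.

```python
import numpy as np

def scale_vector(A, W):
    """Scale vector rho_W(d_A): entry i is w_i^T A w_i for the i-th column w_i of W."""
    A = np.asarray(A, dtype=float)
    W = np.asarray(W, dtype=float)
    return np.einsum("ij,jk,ki->i", W.T, A, W)

W = np.eye(2)                      # standard basis w_1 = [1, 0], w_2 = [0, 1]
A_euclid = np.eye(2)               # Euclidean metric: every scale equals 1
A_diag = np.diag([4.0, 0.25])      # hypothetical diagonal Mahalanobis matrix

rho_euclid = scale_vector(A_euclid, W)
rho_diag = scale_vector(A_diag, W)   # amplified along w_1, squashed along w_2
```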

This idea is illustrated in Fig. 2, where the scales in two directions \(\mathbf{w}_{1}=[1\ 0]\) and \(\mathbf{w}_{2}=[0\ 1]\) are extracted. We always present the two unit vectors \(\mathbf{w}_{1},\mathbf{w}_{2}\) (starting from the origin and ending with a pentagram ★) in the original space, and use an ellipse to show the metric (Footnote 5). The ellipse contains all the points with unit distance to the origin measured by \(d_{A}\), i.e., \(\{\mathbf{x}\mid d_{A}(\mathbf{x},\mathbf{0})=1\}\). Two rulers corresponding to \(\mathbf{w}_{1}\) and \(\mathbf{w}_{2}\) are presented to show the scale properties of \(d_{A}\) in these two directions. If the distance is amplified in one direction, the scale of the ruler becomes denser, such as \(\mathbf{w}_{1}\) in Fig. 2(b) and \(\mathbf{w}_{2}\) in Fig. 2(c). In contrast, the scale of the ruler in the squashed direction becomes sparser. The distances of \(\mathbf{w}_{1}\) and \(\mathbf{w}_{2}\) can then be read directly on the rulers, and they compose the scale vector \(\rho_{W}(d_{A})\).

Fig. 2 Examples showing the scale of three different metrics with the same basis \(W=[\mathbf{w}_{1}\ \mathbf{w}_{2}]\), where \(\mathbf{w}_{1}=[1\ 0]\) and \(\mathbf{w}_{2}=[0\ 1]\)

In this example, the standard basis of \(\mathbb{X}\) is chosen for W. In Fig. 2(a), the distances are measured by the Euclidean metric and thus the points with unit distance to the origin simply compose a circle. The scale of a unit vector in any direction is 1. In Fig. 2(b), the Mahalanobis matrix is diagonal. Therefore, its eigenvectors are just the standard basis \(\mathbf{w}_{1},\mathbf{w}_{2}\), and the ellipse with unit distance is symmetric with respect to the coordinate axes. Furthermore, \(\mathbf{w}_{1}\) and \(\mathbf{w}_{2}\) correspond to the most amplified direction and the most squashed direction, respectively. In Fig. 2(c), the metric has no special property and we just show the scales in the two directions.

Since the scale vector characterizes the most important properties of a metric, it can help us study the metric. The idea is straightforward. Suppose that we want to measure the shape of a rapidly spinning object: it is neither possible nor necessary to measure directly on its body, but we can take photos of it and measure on the photos instead. In our problem, the metric is like the spinning object, which we focus on but which is difficult to measure directly. The scale extractor plays the role of a camera that takes photos of it. Each scale is one photo characterizing the metric from a specific view, and the scale vector is the album consisting of all these photos. Furthermore, if we want to compare two spinning objects that are difficult to measure directly, we can resort to their photos instead. If the photos of two objects are similar from various views, we can consider the objects to be similar. Thus, the similarity between \(d_{A}\) and \(d_{B}\) can be measured by \(\rho_{W}(d_{A})\) and \(\rho_{W}(d_{B})\). In the following subsections, utilizing the scale extractor, we will show that a higher geometry preserving probability \(\mathrm{PG}_{f}(d_{A},d_{B})\) is encouraged by minimizing the von Neumann divergence \(D_{\mathrm{vN}}(A,B)\).

5.2 Preserving the geometry by minimizing a function of scale vectors

We have shown in Sect. 4.2 that the geometry preserving property is measured mathematically by the geometry preserving probability (PG). In this subsection, we show that a higher PG usually accompanies a smaller \(\mathcal{E}(A,B)\), which is an integral defined with the scale vectors. This result transforms the optimization of the intricately defined geometry preserving probability into a tractable optimization of a function of the scale vectors.

Theorem 3

(Geometry Preserving Theorem)

Suppose that there are two pairs of random points \(\mathbf{x}_{1},\mathbf{y}_{1}\in\mathbb{R}^{m}\) and \(\mathbf{x}_{2},\mathbf{y}_{2}\in\mathbb{R}^{m}\) following some distribution \(f(\mathbf{x}_{1},\mathbf{y}_{1},\mathbf{x}_{2},\mathbf{y}_{2})\). Then the geometry preserving probability \(\mathrm{PG}_{f}(d_{A},d_{B})\) can be formulated as an integral of a function \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) as

$$ \mathrm{PG}_f(d_A,d_B)= \iint_{\mathbb{S}^{m-1}\times\mathbb {S}^{m-1}}{R_{\mathbf{q}_1,\mathbf{q}_2}(A,B) \mathrm{d}\varOmega(\mathbf{q}_1) \mathrm{d}\varOmega(\mathbf{q}_2)} $$
(6)

where \(\mathrm{d}\varOmega(\mathbf{q}_{i})\) is the solid angle element (Footnote 6) corresponding to the direction of \(\mathbf{q}_{i}\), which contains all the angular factors, and \(\mathbb{S}^{m-1}= \{\mathbf{x}\in\mathbb{R}^{m}\mid\|\mathbf{x}\| =1 \}\) is the (m−1)-dimensional unit sphere in \(\mathbb{R}^{m}\). The integration is carried out over \(\mathbb{S}^{m-1}\) for both \(\mathbf{q}_{1}\) and \(\mathbf{q}_{2}\).

Particularly, define

$$ \omega_{\mathbf{q}_1,\mathbf{q}_2}(A,B)=\arctan\sqrt{ \frac{\rho _{\mathbf{q}_2}^A}{\rho_{\mathbf{q}_1}^A}} -\arctan\sqrt{\frac{\rho_{\mathbf{q}_2}^B}{\rho_{\mathbf{q}_1}^B}} $$
(7)

and assume \(d_{B}\) (or \(d_{A}\)) is given. Then for \(\forall\mathbf{q}_{1},\mathbf{q}_{2}\), both \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) and \(|\omega_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)|\) are functions of A (or B), and \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) always decreases with \(|\omega_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)|\).

The proof of Theorem 3 is presented in Appendix A.1 for clarity.

By Theorem 3, \(\mathrm{PG}_{f}(d_{A},d_{B})\) equals an integral of \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\mathrm{d}\varOmega(\mathbf{q}_{1})\mathrm{d}\varOmega(\mathbf{q}_{2})\), which reflects the geometry preserving property for all pairs \((\mathbf{x}_{1},\mathbf{y}_{1})\) and \((\mathbf{x}_{2},\mathbf{y}_{2})\) satisfying \(\mathbf{x}_{1}-\mathbf{y}_{1}=r_{1}\mathbf{q}_{1}\) and \(\mathbf{x}_{2}-\mathbf{y}_{2}=r_{2}\mathbf{q}_{2}\). To obtain a higher \(\mathrm{PG}_{f}(d_{A},d_{B})\), we have to solve for the A and B that maximize the integral of \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) shown in (6) while satisfying the constraints from side-information. A precise analysis is difficult in the general case because \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) is influenced by the distribution f, which is indeterminate. However, whatever the distribution f is, \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) always decreases monotonically with \(|\omega_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)|\). Thus, since we only want to maximize (6) rather than obtain its exact value, we can approximately achieve this goal by replacing \(R_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\) in (6) with \(|\omega_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)|\) and minimizing the resulting integral (8) instead.

$$ \mathcal{E}(A,B)=\iint_{\mathbb{S}^{m-1}\times\mathbb{S}^{m-1}}{\bigl \vert \omega_{\mathbf{q}_1,\mathbf{q}_2}(A,B)\bigr \vert \mathrm{d}\varOmega(\mathbf{q}_1) \mathrm{d}\varOmega(\mathbf{q}_2)}. $$
(8)

Due to the nonnegativity of \(\vert \omega_{\mathbf{q}_{1},\mathbf {q}_{2}}(A,B)\vert \), if \(\mathcal{E}(A,B)=0\), it is obvious that \(\vert \omega_{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\vert =0\), ∀q 1,q 2 and PG f (d A ,d B ) reaches the maximum 1. Along with the increase of \(\mathcal{E}(A,B)\), the PG f (d A ,d B ) begins to decrease. Therefore, \(\mathcal{E}(A,B)\) is a generic approximation of the deterioration of PG f (d A ,d B ) without knowledge of f, and minimizing \(\mathcal{E}(A,B)\) has the effect to increase PG f (d A ,d B ).
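To make (7) and (8) concrete, the sketch below approximates \(\mathcal{E}(A,B)\) for m=2, where the unit sphere is the unit circle and the solid angle element reduces to dθ. The Riemann-sum discretization, the function names and the example matrices are assumptions made for illustration only.

```python
import numpy as np

def scales_on_circle(A, thetas):
    """Scales q^T A q for unit vectors q = (cos t, sin t), one per angle t."""
    Q = np.stack([np.cos(thetas), np.sin(thetas)])       # shape (2, n)
    return np.einsum("in,ij,jn->n", Q, A, Q)

def E_2d(A, B, n=360):
    """Riemann-sum approximation of E(A, B) in (8) for m = 2."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    rA = scales_on_circle(A, thetas)
    rB = scales_on_circle(B, thetas)
    # omega[i, j] follows (7) with q_1 = thetas[i] and q_2 = thetas[j].
    omega = (np.arctan(np.sqrt(rA[None, :] / rA[:, None]))
             - np.arctan(np.sqrt(rB[None, :] / rB[:, None])))
    dtheta = 2.0 * np.pi / n
    return np.abs(omega).sum() * dtheta ** 2

B = np.diag([2.0, 1.0])            # hypothetical reference metric
A = np.diag([2.0, 0.5])            # hypothetical metric differing in one direction

E_same = E_2d(B, B)                # identical metrics
E_scaled = E_2d(2.0 * B, B)        # a global rescaling leaves all scale ratios unchanged
E_diff = E_2d(A, B)
```

Because ω depends only on ratios of scales, a global rescaling of a metric gives zero \(\mathcal{E}\), which matches the intuition that relative distances are unaffected by such a rescaling.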

5.3 Bounding the Bregman divergence of scales by Bregman matrix divergence

In this section, we show how two Mahalanobis metrics d A and d B are enforced to be correlated with each other by minimizing the Bregman matrix divergence D ϕ (A,B). As two functions, the relationship between d A and d B is not explicit from D ϕ (A,B). Since the scale vector reveals the important scale properties of one Mahalanobis metric and is much simpler to deal with, we resort to \(\rho_{W}^{A}\) and \(\rho_{W}^{B}\) to investigate the relationship between d A and d B .

Bregman divergence (Dhillon and Tropp 2008) is a class of widely-used diversity measures for vectors in machine learning, including the squared Euclidean distance, generalized KL-divergence, Itakura-Saito distance, etc. For \(\forall\mathbf{x},\mathbf{y}\in \mathbb{R}^{m}\), it is defined as

$$ D_\varphi(\mathbf{x},\mathbf{y}) =\varphi(\mathbf{x})-\varphi( \mathbf{y})-\nabla\varphi(\mathbf{y})^\top (\mathbf{x}-\mathbf{y}), $$

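where φ(⋅) is a convex generating function defined on \(\mathbb{R}^{m}\). If \(\varphi(\mathbf{x})=\mathbf{x}^{\top}\mathbf{x}\), we obtain the squared Euclidean distance \(D_{\varphi}(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|^{2}\); if \(\varphi(\mathbf{x})=\sum_{i}(x_{i}\log x_{i}-x_{i})\), we obtain the KL-divergence \(D_{\mathrm{KL}}(\mathbf{x},\mathbf{y})=\sum_{i}\bigl(x_{i}(\log x_{i}-\log y_{i})-x_{i}+y_{i}\bigr)\).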
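The definition can be checked numerically with a generic Bregman divergence built from a generating function and its gradient. The sketch below is illustrative; the helper name `bregman` and the test vectors are assumptions, not notation from the paper.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Generating functions and gradients for the two cases in the text.
sq_norm = (lambda v: v @ v, lambda v: 2.0 * v)
neg_entropy = (lambda v: np.sum(v * np.log(v) - v), lambda v: np.log(v))

x = np.array([1.0, 2.0])           # arbitrary positive test vectors
y = np.array([2.0, 1.0])

d_eu = bregman(*sq_norm, x, y)       # squared Euclidean distance
d_kl = bregman(*neg_entropy, x, y)   # generalized KL-divergence

# Closed forms stated in the text, for comparison.
d_eu_closed = np.sum((x - y) ** 2)
d_kl_closed = np.sum(x * (np.log(x) - np.log(y)) - x + y)
```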

Compared with the definition of the Bregman matrix divergence (2), the Bregman divergence has an identical form except that it takes real vectors as variables instead of Hermitian matrices. By Corollary 2 of Kulis et al. (2009), the two are related by

$$ D_{\varphi\circ\lambda}(X,Y) =\sum _{i,j}{\bigl(\mathbf{v}_i^\top \mathbf{u}_j\bigr)^2D_\varphi( \lambda_i,\theta_j)}, $$
(9)

where \(X=\sum_{i}{\lambda_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{\top}}\) and \(Y=\sum_{i}{\theta_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{\top}}\) are spectral decompositions, and thus \(\{\mathbf{v}_{i}\}_{i=1}^{m}\) and \(\{\mathbf{u}_{i}\}_{i=1}^{m}\) are both orthonormal bases. Since \(V^{\top}U= [\mathbf{v}_{i}^{\top}\mathbf{u}_{j} ]_{m\times m}\) is orthogonal, the matrix \([(\mathbf{v}_{i}^{\top}\mathbf{u}_{j})^{2} ]_{m\times m}\) is orthostochastic and thus doubly stochastic, i.e., its row and column sums are 1 (Horn and Johnson 1991). Therefore, the matrix divergence can be regarded as the sum of scalar divergences between pairs of eigenvalues, weighted by the squared cosine of the angle between the corresponding eigenvectors (Dhillon and Tropp 2008).
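Relation (9) can be verified numerically for the von Neumann divergence, whose scalar counterpart is the generalized KL-divergence between eigenvalues. The following sketch uses random positive definite matrices; the helper names and the test data are illustrative assumptions.

```python
import numpy as np

def sym_logm(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def von_neumann_div(X, Y):
    """D_vN(X, Y) = tr(X log X - X log Y - X + Y)."""
    return np.trace(X @ sym_logm(X) - X @ sym_logm(Y) - X + Y)

def kl_scalar(a, b):
    """Scalar generalized KL-divergence a (log a - log b) - a + b."""
    return a * (np.log(a) - np.log(b)) - a + b

rng = np.random.default_rng(1)
m = 4
G, H = rng.standard_normal((m, m)), rng.standard_normal((m, m))
X = G @ G.T + np.eye(m)            # random positive definite test matrices
Y = H @ H.T + np.eye(m)

lam, V = np.linalg.eigh(X)         # X = sum_i lam_i v_i v_i^T
theta, U = np.linalg.eigh(Y)       # Y = sum_j theta_j u_j u_j^T
weights = (V.T @ U) ** 2           # entries (v_i^T u_j)^2, doubly stochastic

lhs = von_neumann_div(X, Y)
rhs = np.sum(weights * kl_scalar(lam[:, None], theta[None, :]))
```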

Among the Bregman matrix divergences, \(\|A-B\|_{\mathrm{F}}^{2}\) and \(D_{\mathrm{vN}}(A,B)\) are particularly appropriate for the framework (1) for the following reasons:

  1. Both of them are jointly convex with respect to the two arguments, which guarantees a globally optimal solution as long as the loss function determined by the metric learning algorithm is convex. The joint convexity of \(\|A-B\|_{\mathrm{F}}^{2}\) is straightforward, while the joint convexity of \(D_{\mathrm{vN}}(A,B)\) is presented in Theorem 1.

  2. Both of them provide a bound for the corresponding Bregman divergence of the scale vectors of the metrics, which links the Bregman matrix divergence of two Mahalanobis matrices with the Bregman divergence of their scales (Footnote 7). Specifically, if \(\varphi(\mathbf{x})=\|\mathbf{x}\|^{2}\) or \(\varphi(\mathbf{x})=\sum_{i}(x_{i}\log x_{i}-x_{i})\), then for any orthonormal basis \(W=[\mathbf{w}_{1}\ \ldots\ \mathbf{w}_{m}]\), we have

    $$ D_\varphi\bigl(\rho_W^A, \rho_W^B\bigr)\leq D_\phi(A,B), $$
    (10)

    where \(\phi=\varphi\circ\lambda\) and λ(A) is the function that lists the eigenvalues of A. This result is formally presented in Theorem 4.

Theorem 4

Suppose \(d_{A},d_{B}\in\mathcal{F}_{\mathbb{R}^{m}}\) are two Mahalanobis metrics defined on \(\mathbb{R}^{m}\). Then for any orthonormal basis \(W=[\mathbf{w}_{1}\ \ldots\ \mathbf{w}_{m}]\) in \(\mathbb{R}^{m}\), the squared Frobenius norm of the difference and the von Neumann divergence of their Mahalanobis matrices A and B provide upper bounds for the squared Euclidean distance and the KL-divergence of their scale vectors \(\rho_{W}^{A}\) and \(\rho_{W}^{B}\), respectively:

$$ \bigl\|\rho_W^A-\rho_W^B\bigr\|^2\leq\|A-B\|_{\mathrm{F}}^2, $$
(11)

$$ D_{\mathrm{KL}}\bigl(\rho_W^A,\rho_W^B\bigr)\leq D_{\mathrm{vN}}(A,B). $$
(12)

In spite of their uniform formulation (10), the two cases resort to very different proofs, and we leave them in Appendix A.2.
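The two bounds of Theorem 4 can be probed empirically. The sketch below draws random positive definite matrices and random orthonormal bases and records whether each inequality holds; the sampling scheme, tolerances and helper names are assumptions for illustration and do not replace the proofs in Appendix A.2.

```python
import numpy as np

def sym_logm(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def von_neumann_div(A, B):
    """D_vN(A, B) = tr(A log A - A log B - A + B)."""
    return np.trace(A @ sym_logm(A) - A @ sym_logm(B) - A + B)

def kl_div(x, y):
    """Generalized KL-divergence between positive vectors."""
    return np.sum(x * (np.log(x) - np.log(y)) - x + y)

def random_spd(m, rng):
    """Random symmetric positive definite matrix (illustrative data)."""
    G = rng.standard_normal((m, m))
    return G @ G.T + 0.1 * np.eye(m)

rng = np.random.default_rng(0)
m, tol = 5, 1e-9
holds_eu, holds_kl = [], []
for _ in range(100):
    A, B = random_spd(m, rng), random_spd(m, rng)
    W = np.linalg.qr(rng.standard_normal((m, m)))[0]    # random orthonormal basis
    rho_A = np.diag(W.T @ A @ W)                        # scale vector of d_A on W
    rho_B = np.diag(W.T @ B @ W)                        # scale vector of d_B on W
    holds_eu.append(np.sum((rho_A - rho_B) ** 2)
                    <= np.linalg.norm(A - B, "fro") ** 2 + tol)
    holds_kl.append(kl_div(rho_A, rho_B) <= von_neumann_div(A, B) + tol)

all_eu, all_kl = all(holds_eu), all(holds_kl)
```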

Recall the example at the end of Sect. 5.1, where the discrepancy between the shapes of two spinning objects (metrics) is measured by comparing their photos (scale vectors). In this picture, \(\rho_{W}\) with an orthogonal W determines a minimal set of cameras that covers the complete views. Each camera \(\rho_{\mathbf{w}_{i}}\) takes a photo of \(d_{A}\) and \(d_{B}\) respectively, and their discrepancy from this view is measured by \(D_{\varphi}(\rho_{\mathbf{w}_{i}}^{A},\rho_{\mathbf{w}_{i}}^{B})\). Then \(D_{\varphi}(\rho_{W}^{A},\rho_{W}^{B})\) gives the discrepancy of \(d_{A}\) and \(d_{B}\) by summing up the results from all cameras. Theorem 4 provides an upper bound for such a total measure over various views, which captures the overall discrepancy of the two objects (metrics).

Furthermore, define the continuous version of \(D_{\varphi}(\rho_{W}^{A},\rho _{W}^{B})=\sum_{i}{D_{\varphi}(\rho_{\mathbf{w}_{i}}^{A},\rho_{\mathbf{w}_{i}}^{B})}\) as

$$ \mathcal{D}_\varphi(A,B)=\int_{\mathbb{S}^{m-1}} {D_\varphi\bigl(\rho_\mathbf{q}^A, \rho_\mathbf{q}^B\bigr)\mathrm{d}\varOmega(\mathbf{q})}, $$
(13)

which summarizes the discrepancy of two metrics measured in all directions. Denote the \(\mathcal{D}_{\varphi}\) corresponding to \(\|\rho_{\mathbf{q}}^{A}-\rho_{\mathbf{q}}^{B}\|^{2}\) and \(D_{\mathrm{KL}}(\rho_{\mathbf{q}}^{A},\rho_{\mathbf{q}}^{B})\) as \(\mathcal{D}_{\mathrm{Eu}}\) and \(\mathcal{D}_{\mathrm{KL}}\), respectively. Since \(D_{\varphi}(\rho_{W}^{A},\rho_{W}^{B})\leq D_{\phi}(A,B)\) holds for any orthogonal W, minimizing \(D_{\phi}(A,B)\) has the effect of minimizing \(\mathcal{D}_{\varphi}(A,B)\). This subsection relates \(D_{\phi}(A,B)\) to a discrepancy measure defined by scale vectors, which enables us to further establish the relationship between \(\mathrm{PG}_{f}(d_{A},d_{B})\) and \(D_{\phi}(A,B)\), as shown in the next subsection.

5.4 Preserving geometry property by minimizing von Neumann divergence

In this subsection, based on the results shown in the above subsections, we present our conclusion that a high geometry preserving probability PG f (d A ,d B ) is usually better encouraged by minimizing D vN(A,B) than by minimizing \(\|A-B\|_{\mathrm{F}}^{2}\) for multi-task problems.

For convenience of explanation, we first informally define the concept of consistency. Given two functions f(x), g(x), if for randomly selected \(x_{1},x_{2}\in\operatorname{dom}f\cap\operatorname{dom}g\) the assertion \(f(x_{1})>f(x_{2})\Leftrightarrow g(x_{1})>g(x_{2})\) holds with high probability, we say that the two functions are consistent. If two functions are consistent, each of them is likely to decrease or increase with the other, and thus minimizing either one tends to minimize the other. We then establish the relationship between \(D_{\phi}(A,B)\) and \(\mathrm{PG}_{f}(d_{A},d_{B})\) in the following steps:

  1. Section 5.2 explains that a higher \(\mathrm{PG}_{f}(d_{A},d_{B})\) usually accompanies a smaller \(\mathcal{E}(A,B)\), and thus a better geometry preserving property can be encouraged by minimizing \(\mathcal{E}(A,B)\). We denote this fact as

    $$ \mathrm{PG}_f(d_A,d_B){\uparrow}{} \quad {\Leftrightarrow}\quad \mathcal{E}(A,B){\downarrow}. $$
    (14)

  2. Section 5.3 proves that the regularization to be minimized, \(D_{\phi}(A,B)\), provides an upper bound for \(D_{\varphi}(\rho_{W}^{A},\rho_{W}^{B})\), which furthermore implies that \(\mathcal{D}_{\varphi}(A,B)\) is minimized in our framework. We denote this fact as

    $$ D_\phi(A,B){\downarrow}\ {\Rightarrow}\ \mathcal{D}_\varphi(A,B){\downarrow}. $$
    (15)

  3. Our objective is to bridge the left sides of (14) and (15), while their right sides, i.e. \(\mathcal{E}(A,B)\) and \(\mathcal{D}_{\varphi}(A,B)\), are both integrals over scale vectors. This provides a way to bridge \(D_{\phi}(A,B)\) and \(\mathrm{PG}_{f}(d_{A},d_{B})\). Specifically, if there is a type of \(\mathcal{D}_{\varphi}(A,B)\) consistent with \(\mathcal{E}(A,B)\), i.e., \(\mathcal{D}_{\varphi}(A,B){\downarrow}{}\Leftrightarrow\mathcal{E}(A,B){\downarrow}\), we obtain

    $$ D_\phi(A,B){\downarrow}{}\Rightarrow\mathcal{D}_\varphi(A,B) {\downarrow}{} \quad {\Leftrightarrow}\quad \mathcal{E}(A,B){\downarrow}{}\quad {\Leftrightarrow}\quad \mathrm {PG}_f(d_A,d_B){\uparrow}, $$

    which means that a good geometry preserving property can be obtained by minimizing the corresponding Bregman matrix divergence.

    In this subsection, we will show that \(\mathcal{D}_{\mathrm{KL}}(A,B)\) is more consistent with \(\mathcal{E}(A,B)\) than \(\mathcal{D}_{\mathrm{Eu}}(A,B)\) is, which makes \(D_{\mathrm{vN}}(A,B)\) a better candidate for the regularization to preserve geometry between metrics.

In multi-task problems, the difference between any two metrics is usually relatively small, and we investigate the consistency of \(\mathcal{E}(A,B)\) and \(\mathcal{D}_{\varphi}(A,B)\) based on this assumption. When d A =d B , all scales are equal as \(\rho_{\mathbf{q}}^{A}\equiv\rho_{\mathbf{q}}^{B},\forall\mathbf{q}\) and thus \(\vert \omega _{\mathbf{q}_{1},\mathbf{q}_{2}}(A,B)\vert \equiv0 \Rightarrow\mathcal {E}(A,B)=\mathcal{D}_{\varphi}(A,B)=0\). As d A becomes different from d B so that there is a difference of scales in some direction \(\rho _{\mathbf{q}}^{A}-\rho_{\mathbf{q}}^{B}\), both \(\mathcal{E}(A,B)\) and \(\mathcal {D}_{\varphi}(A,B)\) will thus increase.

If \(\mathcal{E}(A,B)\) and \(\mathcal{D}_{\varphi}(A,B)\) are consistent, a scale difference that brings about a greater increment of \(\mathcal {E}(A,B)\) is also expected to produce a greater increment of \(\mathcal {D}_{\varphi}(A,B)\), and vice versa. The increments are determined by both the value of scale difference \(\rho_{\mathbf{q}}^{A}-\rho_{\mathbf{q}}^{B}\) and the direction q. In the same direction q, it’s easy to see that both the two functions increase with the absolute difference of scales \(|\rho_{\mathbf{q}}^{A}-\rho_{\mathbf{q}}^{B}|\) and keep consistent. In the following, we investigate the increments of the two functions for scale differences in different directions.

First we study how \(\mathcal{E}(A,B)\) is influenced by the difference in each direction and the result is presented in Proposition 1.

Proposition 1

Assume that there are three Mahalanobis metrics \(d_{A_{1}},d_{A_{2}},d_{B}\in \mathcal{F}_{\mathbb{R}^{m}}\), and the unit vectors w 1,w 2 define two directions. Extracting the scales of the metrics by ρ W , we obtain that \(d_{A_{1}}\) and \(d_{A_{2}}\) differ from d B on w 1 and w 2 respectively as \(\rho_{\mathbf {w}_{1}}^{A_{1}}-\rho_{\mathbf{w}_{1}}^{B}=\rho_{\mathbf{w}_{2}}^{A_{2}}- \rho _{\mathbf{w}_{2}}^{B}=\Delta\rho\). If the difference Δρ is relatively small compared to \(\rho_{\mathbf{w}_{i}}\), we have

$$ \frac{\mathcal{E}(A_1,B)}{\mathcal{E}(A_2,B)} \approx \biggl(\frac{\rho_{\mathbf{w}_2}^B}{\rho_{\mathbf{w}_1}^B} \biggr)^\alpha, $$
(16)

where 0.5<α<1.5.

The proof of Proposition 1 is presented in Appendix A.3 for clarity.

From Proposition 1, the increase of \(\mathcal{E}(A,B)\) brought about by \(\rho_{\mathbf{w}_{i}}^{A}-\rho_{\mathbf{w}_{i}}^{B}\) is approximately inversely proportional to \(\rho_{\mathbf{w}_{i}}^{B}\), and thus the deterioration of the geometry preserving property is determined by the relative variation of the scale. This coincides with our intuition in that the geometrical property is more sensitive to the variation of the smaller scale, which is also a result of the fact that the scale essentially defines a squared magnification factor for the distance. Assuming that \(\rho_{\mathbf {w}_{1}}^{B}=10,\rho_{\mathbf{w}_{2}}^{B}=0.1\), then if \(\rho_{\mathbf{w}_{1}}^{B}\) is increased by 1, the distance of any pair of points in direction w 1 is magnified to \(\sqrt{11}/\sqrt{10}=1.05\) times, while if \(\rho_{\mathbf{w}_{2}}^{B}\) is increased by 1, the distance of any pair of points in direction w 2 is magnified to \(\sqrt {1.1}/\sqrt{0.1}=3.32\) times.

To be a discrepancy measure consistent with \(\mathcal{E}(A,B)\), a proper \(\mathcal{D}_{\varphi}(A,B)\) should also be more sensitive to the difference of the smaller scale. Then we analyze how the two types of \(\mathcal{D}_{\varphi}(A,B)\) increase with the difference of scales and present the results in Proposition 2.

Proposition 2

Assume that all the variables are identically defined as in Proposition 1, we have

$$ \frac{\mathcal{D}_\mathrm{Eu}(A_1,B)}{\mathcal{D}_\mathrm {Eu}(A_2,B)}=1,\qquad \frac{\mathcal{D}_\mathrm{KL}(A_1,B)}{\mathcal{D}_\mathrm{KL}(A_2,B)} \approx \frac{\rho_{\mathbf{w}_2}^B}{\rho_{\mathbf{w}_1}^B}. $$
(17)

We also present the proof of Proposition 2 in Appendix A.3.

If we compare (17) with (16), both \(\mathcal{D}_{\mathrm{KL}}(A,B)\) and \(\mathcal{E}(A,B)\) are more sensitive to the difference of the smaller scale: the increments of both brought about by \(\rho_{\mathbf{w}_{i}}^{A}-\rho_{\mathbf{w}_{i}}^{B}\) are approximately inversely proportional to \(\rho_{\mathbf{w}_{i}}^{B}\). In contrast, the increment of \(\mathcal{D}_{\mathrm{Eu}}(A,B)\) is independent of the direction or scale and is determined only by the value of the scale difference. Thus the increment of \(\mathcal{D}_{\mathrm{KL}}(A,B)\) is more consistent with \(\mathcal{E}(A,B)\) than that of \(\mathcal{D}_{\mathrm{Eu}}(A,B)\), which implies that if \(\mathcal{D}_{\mathrm{KL}}(A_{1},B)>\mathcal{D}_{\mathrm{KL}}(A_{2},B)\), it is more likely that \(\mathcal{E}(A_{1},B)>\mathcal{E}(A_{2},B)\), while \(\mathcal{D}_{\mathrm{Eu}}\) does not have such a property.
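The different sensitivities can be seen already at the level of a single scale. For a small difference Δρ, a second-order Taylor expansion gives \(D_{\mathrm{KL}}(\rho+\Delta\rho,\rho)\approx\Delta\rho^{2}/(2\rho)\), whereas the squared Euclidean distance is \(\Delta\rho^{2}\) regardless of ρ. The sketch below checks this scalar approximation numerically; it illustrates the pointwise behavior only and is not the integral argument of Appendix A.3, and the chosen values of ρ and Δρ are arbitrary.

```python
import numpy as np

def kl_scalar(a, b):
    """Scalar generalized KL-divergence a (log a - log b) - a + b."""
    return a * (np.log(a) - np.log(b)) - a + b

delta = 0.01                       # small scale difference (illustrative)
rho_1, rho_2 = 3.0, 0.5            # a large and a small scale (illustrative)

kl_1 = kl_scalar(rho_1 + delta, rho_1)
kl_2 = kl_scalar(rho_2 + delta, rho_2)
eu_1 = eu_2 = delta ** 2           # squared Euclidean distance ignores the scale

ratio_kl = kl_1 / kl_2             # close to rho_2 / rho_1 for small delta
ratio_eu = eu_1 / eu_2             # exactly 1
taylor_1 = delta ** 2 / (2.0 * rho_1)
```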

It’s notable that the result of Proposition 1 is obtained by considering only the variation of one scale and does not precisely hold when several scales change simultaneously because the partial derivative in (16) becomes more complex. However, in most cases, the conclusion that \(\mathcal{E}(A,B)\) is more sensitive to the variation of the smaller scale still holds and thus \(\mathcal {D}_{\mathrm{KL}}(A,B)\) reflects the deterioration of PG more accurately. Then from the discussion of the beginning of this subsection, D vN(A,B) is a better choice of the regularization to preserve the geometry.

This phenomenon is illustrated by the example in Fig. 3. The scale vectors of the three metrics measured in the directions of the standard basis are \(\rho_{W}^{B}=[3.00\ 0.50]^{\top}\), \(\rho_{W}^{A_{1}}=[2.65\ 0.50]^{\top}\), and \(\rho_{W}^{A_{2}}=[3.00\ 0.20]^{\top}\), where \(d_{A_{1}}\) and \(d_{A_{2}}\) differ from \(d_{B}\) on \(\mathbf{w}_{1}\) and \(\mathbf{w}_{2}\) respectively. Since \(\rho_{\mathbf{w}_{1}}^{B}>\rho_{\mathbf{w}_{2}}^{B}\), the metric is more sensitive to the difference in direction \(\mathbf{w}_{2}\) and PG decreases more rapidly in this direction. Therefore, even though the scale difference between \(d_{A_{1}}\) and \(d_{B}\) is greater, as \(|\rho_{\mathbf{w}_{1}}^{A_{1}}-\rho_{\mathbf{w}_{1}}^{B}|=0.35>|\rho_{\mathbf{w}_{2}}^{A_{2}}-\rho_{\mathbf{w}_{2}}^{B}|=0.30\), the geometry of the samples measured by \(d_{A_{1}}\) looks more similar to that measured by \(d_{B}\) than the geometry measured by \(d_{A_{2}}\) does. In other words, \(d_{A_{1}}\) preserves the geometry (relative distances) of \(d_{B}\) better than \(d_{A_{2}}\) does, which can be verified by comparing the geometry preserving probabilities. Estimating the PG using randomly generated samples (Footnote 8), we obtain \(\mathrm{PG}_{f}(A_{1},B)\approx0.990>\mathrm{PG}_{f}(A_{2},B)\approx0.936\), which confirms our conjecture above.
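A Monte Carlo estimate of this kind can be sketched as follows. The sketch rests on several assumptions: PG is taken to be the probability that two metrics agree on which of two random pairs of points is farther apart (the formal definition is in Sect. 4.2 and is not reproduced here), the points are drawn i.i.d. from a standard normal distribution, and the three matrices are taken to be diagonal with the scale vectors given above. Since the sampling distribution used in the paper is not specified in this section, the estimates need not match the reported values of 0.990 and 0.936.

```python
import numpy as np

def sq_dist(A, Z):
    """Squared Mahalanobis distances z^T A z for each row z of Z."""
    return np.einsum("ni,ij,nj->n", Z, A, Z)

def estimate_pg(A, B, n_samples=200_000, seed=0):
    """Fraction of random pair-of-pairs whose distance ordering agrees under d_A and d_B."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    # Points x1, y1, x2, y2 drawn i.i.d. from N(0, I): an assumed distribution f.
    x1, y1, x2, y2 = rng.standard_normal((4, n_samples, m))
    z1, z2 = x1 - y1, x2 - y2
    order_A = sq_dist(A, z1) > sq_dist(A, z2)
    order_B = sq_dist(B, z1) > sq_dist(B, z2)
    return float(np.mean(order_A == order_B))

# Diagonal matrices consistent with the scale vectors in the text (an assumption).
B = np.diag([3.00, 0.50])
A1 = np.diag([2.65, 0.50])
A2 = np.diag([3.00, 0.20])

pg_same = estimate_pg(B, B)
pg_1 = estimate_pg(A1, B)
pg_2 = estimate_pg(A2, B)
```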

Fig. 3 The geometry property of \(d_{A}\) and \(d_{B}\) is better preserved by minimizing \(D_{\mathrm{vN}}(A,B)\) rather than \(\|A-B\|_{\mathrm{F}}^{2}\). On one hand, with the randomly generated samples, the geometry preserving probabilities are estimated to be \(\mathrm{PG}_{f}(A_{1},B)\approx0.990>\mathrm{PG}_{f}(A_{2},B)\approx0.936\), and thus \(A_{1}\) preserves the geometry property of B better than \(A_{2}\) does. On the other hand, it is straightforward to calculate that \(\|A_{1}-B\|_{\mathrm{F}}^{2}=0.1225>\|A_{2}-B\|_{\mathrm{F}}^{2}=0.0900\) and \(D_{\mathrm{vN}}(A_{1},B)=0.0213<D_{\mathrm{vN}}(A_{2},B)=0.1167\). Therefore, the von Neumann divergence correctly assigns a lower discrepancy measure to \(A_{1}\), which preserves the geometry properties of B better, and thus minimizing \(D_{\mathrm{vN}}(A,B)\) prefers \(A_{1}\) to \(A_{2}\) as the metric similar to B. Furthermore, from the contours of \(D_{\varphi}(\rho_{W}^{A},\rho_{W}^{B})\) with respect to \(\rho_{W}^{A}\), we see that the \(D_{\mathrm{KL}}(\rho_{W}^{A},\rho_{W}^{B})\) corresponding to \(D_{\mathrm{vN}}(A,B)\) increases more rapidly in the direction of the smaller scale.

On the other hand, calculating the Bregman matrix divergences, we obtain \(\|A_{1}-B\|_{\mathrm{F}}^{2}=0.1225>\|A_{2}-B\|_{\mathrm{F}}^{2}=0.0900\) and \(D_{\mathrm{vN}}(A_{1},B)=0.0213<D_{\mathrm{vN}}(A_{2},B)=0.1167\). The von Neumann divergence thus provides a discrepancy measure that is more consistent with the deterioration of the geometry preserving property. Suppose that we want to encourage \(d_{A}\) to be similar to \(d_{B}\) by minimizing \(D_{\phi}(A,B)\). Then \(d_{A_{1}}\) is preferred to \(d_{A_{2}}\) if \(D_{\phi}(A,B)=D_{\mathrm{vN}}(A,B)\), while \(d_{A_{2}}\) is preferred to \(d_{A_{1}}\) if \(D_{\phi}(A,B)=\|A-B\|_{\mathrm{F}}^{2}\). Therefore, the von Neumann divergence selects the metric with the better geometry preserving property.
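The reported divergence values can be reproduced under the assumption that the three Mahalanobis matrices are diagonal with the scale vectors given above; this assumption is consistent with the reported Frobenius values, but the exact matrices are not stated in this section. For diagonal matrices the von Neumann divergence reduces to the generalized KL-divergence of the diagonals.

```python
import numpy as np

def sq_frobenius(A, B):
    """Squared Frobenius norm of A - B."""
    return float(np.sum((A - B) ** 2))

def von_neumann_diag(A, B):
    """D_vN for diagonal positive definite A, B: generalized KL of the diagonals."""
    a, b = np.diag(A), np.diag(B)
    return float(np.sum(a * (np.log(a) - np.log(b)) - a + b))

# Assumed diagonal matrices matching the scale vectors in the text.
B = np.diag([3.00, 0.50])
A1 = np.diag([2.65, 0.50])
A2 = np.diag([3.00, 0.20])

fro_1, fro_2 = sq_frobenius(A1, B), sq_frobenius(A2, B)
vn_1, vn_2 = von_neumann_diag(A1, B), von_neumann_diag(A2, B)

print(round(fro_1, 4), round(fro_2, 4))   # 0.1225 0.09
print(round(vn_1, 4), round(vn_2, 4))     # 0.0213 0.1167
```

The two regularizers therefore rank A 1 and A 2 in opposite orders, as stated in the text.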

Besides, when the squared Frobenius norm is used, the obtained solution may have negative eigenvalues, and we have to project it onto the positive semi-definite cone. In contrast, the von Neumann divergence automatically keeps the solution positive semi-definite.
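For illustration, a common way to carry out such a projection is to clip the negative eigenvalues of the symmetrized matrix, which gives the nearest positive semi-definite matrix in the Frobenius norm. This is a generic sketch and not necessarily the procedure used in the authors' code; `project_psd` and the example matrix are hypothetical.

```python
import numpy as np

def project_psd(M):
    """Project onto the PSD cone: symmetrize, then clip negative eigenvalues to zero."""
    S = (M + M.T) / 2.0
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

# Hypothetical iterate with one negative eigenvalue.
M = np.array([[2.0, 0.0], [0.0, -0.5]])
P = project_psd(M)
print(P)
```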

In summary, minimizing the von Neumann divergence usually encourages a higher geometry preserving probability in multi-task problems, and thus it is a more appropriate regularization for coupling different metrics. Using \(D_{\mathrm{vN}}(A,B)\) as the regularization in (1), we expect to obtain a better geometry preserving property.

6 Experiments

6.1 A toy example

In this section, we use a toy example in Fig. 4 to show the advantage of the von Neumann divergence in preserving the geometry relationship between samples. There are two related classification tasks, each of which has 3 classes; the samples are shown in Figs. 4(a) and 4(e) respectively. The color of each point indicates its label, and the shape represents its role in metric learning, which we will explain later. A point with a black border represents a training sample, while a point without a border represents a test sample. Unfortunately, the training set contains only one green point for task-1 and one red point for task-2, which cannot represent the distribution of the corresponding class accurately. Observing that the samples of the two tasks have very similar distributions, we are motivated to utilize the information from the training samples of the other task to improve the performance of both tasks.

Fig. 4 An illustration of multi-task metric learning. (a), (e) The original data of task 1/2. (b), (f) The data of task 1/2 after single task metric learning. (c), (g) The data of task 1/2 after joint metric learning using von Neumann divergence as regularization. (d), (h) The data of task 1/2 after joint metric learning using squared Frobenius norm of difference as regularization. Joint learning of multiple tasks (given by our proposed geometry preserving framework) can lead to ideal metrics for both task-1 in (c) & task-2 in (g) (Color figure online)

Here we use the idea of LMNN to learn a better metric. Focusing on the yellow point in the center of the figure, LMNN aims to learn a metric so that the nearest neighbor of this point belongs to the same class. To obtain such a metric, the nearest yellow point is chosen as the target point (represented with ▵) and a circle through this point is drawn. Then any point belonging to a different class is expected to be farther away than any target by a large margin and thus to stand outside the dashed perimeter. Any point with a different label lying inside the dashed perimeter is called an imposter (represented with ■), and the objective of LMNN is to pull the target closer while pushing all imposters outside the perimeter. This encourages similar samples to be closer to each other. In Fig. 4, we show the learned metric by drawing an ellipse formed by the locus of points equidistant from the central point (in the same way as in Fig. 2). The distance from any point on the dashed ellipse to the central point then equals the distance from the target to the center plus a margin. Thus any red or green point lying inside the perimeter is an imposter and should be pushed out.

Figure 4(b) shows the learned metric of task-1, where all the red imposters (both training and test points) are pushed outside, while Fig. 4(f) shows task-2 with all green imposters outside. Unfortunately, for task-1, since the green points in the training set are too few to represent the distribution, the learned metric cannot accurately separate the green class from the yellow one, and some test samples invade the perimeter. The same situation occurs for the red class in task-2.

Since the distributions of the two tasks are very similar, we hope to let task-1 borrow information about the green class from task-2, and task-2 borrow information about the red class from task-1. Denote the Mahalanobis matrices with respect to Figs. 4(b) and 4(f) as \(A_{1}\) and \(A_{2}\). We have \(A_{1}\in\mathcal{C}_{1}\) and \(A_{2}\in\mathcal{C}_{2}\) since they satisfy the constraints from the side-information of task-1 and task-2 respectively. Recall that we propagate information from \(A\) to \(B\) by minimizing \(D(A,B)\); we therefore solve for a better metric for task-1 by

$$ \min_{A\in\mathcal{C}_1}\ D(A,A_2). $$
(18)

Since \(A_{2}\in\mathcal{C}_{2}\), this is equivalent to finding a metric which satisfies all constraints from \(\mathcal{C}_{1}\) and as many constraints from \(\mathcal{C}_{2}\) as possible. In this example, it should push the red imposter in Fig. 4(f) out of the perimeter while trying its best to keep the green points outside. The problem is defined in the same way for task-2.

The solutions of (18) using \(D(A,B)=D_{\mathrm{vN}}(A,B)\) and \(D(A,B)=\|A-B\|_{\mathrm{F}}^{2}\) are shown in Figs. 4(c) and 4(d) respectively. From the figures, we see that if the von Neumann divergence is used, the constraints of both tasks are satisfied by the learned metric, i.e., \(A\in\mathcal{C}_{1}\cap\mathcal{C}_{2}\). There is no imposter in either the training set or the test set, and both the red and green classes are separated well. This shows that the geometry property of samples is preserved from task-2 to task-1 and that the side-information of task-2 is well propagated to task-1. In contrast, if the Frobenius norm of the difference is used, pushing the red imposter outside the perimeter lets some green points invade the perimeter again, which produces more imposters. This is because the geometry property of \(A_{2}\) that discriminates the green class from the yellow one is not preserved. For the case of improving task-2 with task-1, the results are shown in Figs. 4(g) and 4(h), and the von Neumann divergence again performs better than the Frobenius norm.

6.2 Experiments on real data

To validate our proposed approach, we apply our multi-task framework to the well-known LMNN (Weinberger and Saul 2009) metric learning method and conduct experiments on several real data sets. Its basic idea was introduced in Sect. 6.1; the loss function is simply the sum of all squared distances between samples and their target neighbors, i.e., \(\sum_{i,j\leadsto i}{d_{A_{t}}^{2}(\mathbf{x}_{ti},\mathbf{x}_{tj})}\), where \(j\leadsto i\) means that \(\mathbf{x}_{tj}\) is a target neighbor of \(\mathbf{x}_{ti}\). The constraints require all imposters to stand farther away than the target neighbors by a margin, i.e., \(d_{A_{t}}^{2}(\mathbf{x}_{ti},\mathbf{x}_{tl})-d_{A_{t}}^{2}(\mathbf{x}_{ti},\mathbf{x}_{tj})\geq1\) for all \(j\leadsto i\) and \(y_{tl}\neq y_{ti}\), where \(y_{ti}\) is the label of the i-th sample of task-t. Since there may be no metric satisfying all constraints, a relaxed version of the constraints is used by introducing slack variables. The resulting loss function for task-t is then

$$ (1-\mu)\sum_{i,j\leadsto i}{d_{A_t}^2( \mathbf{x}_{ti},\mathbf{x}_{tj})} +\mu\sum _{i,j\leadsto i}\sum_l{(1-y_{til}) \bigl[1+d_{A_t}^2(\mathbf{x}_{ti}, \mathbf{x}_{tj})-d_{A_t}^2(\mathbf {x}_{ti},\mathbf{x}_{tl}) \bigr]_+}, $$

where \(y_{til}=1\) if and only if \(y_{ti}=y_{tl}\), \(y_{til}=0\) otherwise, and \([z]_{+}=\max(z,0)\). We use the fast solver (Weinberger and Saul 2008) to solve the LMNN problem, and our code is based on the original LMNN code (see footnote 9).
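To make the loss concrete, the following sketch evaluates it for a single task by direct enumeration. It is a naive illustration and not the fast solver referenced above; the function names, the `targets` dictionary that stores the target neighbors of each sample, and the toy data are assumptions introduced for this example.

```python
import numpy as np

def sq_dist(A, xi, xj):
    """Squared Mahalanobis distance (xi - xj)^T A (xi - xj)."""
    d = xi - xj
    return float(d @ A @ d)

def lmnn_loss(A, X, y, targets, mu=0.5):
    """Relaxed LMNN loss for a single task.

    X: (n, d) samples, y: (n,) labels,
    targets: dict mapping a sample index i to the list of its target neighbors j.
    """
    pull, push = 0.0, 0.0
    for i, neighbors in targets.items():
        for j in neighbors:
            d_ij = sq_dist(A, X[i], X[j])
            pull += d_ij
            for l in range(len(X)):
                if y[l] != y[i]:  # (1 - y_til) is 1 only for differently labeled samples
                    push += max(0.0, 1.0 + d_ij - sq_dist(A, X[i], X[l]))
    return (1.0 - mu) * pull + mu * push

# Hypothetical toy task: two samples of class 0 and one sample of class 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.2]])
y = np.array([0, 0, 1])
targets = {0: [1], 1: [0]}
print(lmnn_loss(np.eye(2), X, y, targets, mu=0.5))
```

In this toy task, scaling the metric up removes the margin violation of the class-1 sample but increases the pull term, which is the trade-off that \(\mu\) balances.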

Every data set contains several related classification tasks, each of which is to predict the labels of the test samples using their features. For each task, a Mahalanobis metric is learned from the training samples, and then the label of each test sample is predicted by the nearest neighbor classifier, where the distance is calculated using the learned metric.

The multi-task learning setting is categorized into the label-compatible and label-incompatible scenarios, according to the label sets of the tasks (Parameswaran and Weinberger 2010). For the label-incompatible scenario, where the label sets of the tasks are different, we compare our method (MT von Neumann) with Euclidean, Single Task, mtLMNN, and MT Frobenius. The training samples of each task are used as the prototypes of the nearest neighbor classifier. For the label-compatible scenario, where all tasks share the same label set, we also combine the samples of all tasks and learn a unique metric (Unique Task). Besides, for Unique Task, mtLMNN, MT Frobenius, and MT von Neumann, we also implement a “pooling” version of testing (with a suffix “-pool” after the name), where the training samples of all tasks are used as the prototypes. The details of all the compared methods are given below, with a summary in Table 1.

  1. Euclidean: The nearest neighbor of each test sample is searched in the training samples of this task, where the distance is determined by the Euclidean metric directly.

  2. Single Task: A metric is learned individually for each task. Then the classifier finds the nearest neighbor in the training samples of this task using the learned metric.

  3. Unique Task: The training samples of all tasks are mixed into one sample set and a unique metric is learned from it. Then the nearest neighbor is found in the training samples of this task using the learned metric.

  4. Unique Task-pool: The same metric as Unique Task is used, while the nearest neighbor is searched in the training samples of all the tasks using the learned metric.

  5. mtLMNN: The method proposed by Parameswaran and Weinberger (2010), which was introduced in Sect. 3.1. It is the same as the MT Frobenius approach with an additional constraint (see footnote 10) \(A_{t}\succeq B\succeq0\). The nearest neighbor is searched in the training samples of this task using the learned metric.

  6. mtLMNN-pool: The same metric as mtLMNN is used, while the nearest neighbor is searched in the training samples of all the tasks using the learned metric.

  7. MT Frobenius: The framework (1) with \(D(A,B)=\|A-B\|_{\mathrm{F}}^{2}\). As we indicated in Sect. 4.1, the constraint \(A_{t}\succeq B\) in mtLMNN is too strong. By replacing it with \(A_{t}\succeq0\), the relation between \(A_{t}\) and \(B\) is more flexible and is expected to perform better. The nearest neighbor is searched in the training samples of this task using the learned metric.

  8. MT Frobenius-pool: The same metric as MT Frobenius is used, while the nearest neighbor is searched in the training samples of all the tasks using the learned metric.

  9. MT von Neumann: Our proposed geometry preserving multi-task metric learning. It is the framework (1) with \(D(A,B)=D_{\mathrm{vN}}(A,B)\). The nearest neighbor is searched in the training samples of this task using the learned metric.

  10. MT von Neumann-pool: The same metric as MT von Neumann is used, while the nearest neighbor is searched in the training samples of all the tasks using the learned metric.
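The difference between the plain and the “-pool” variants listed above lies only in the prototype set of the nearest neighbor classifier. A minimal sketch of this testing step is given below; the names `nn_predict` and `prototypes`, the toy data, and the use of the identity matrix as the learned metric are assumptions made for illustration.

```python
import numpy as np

def nn_predict(A, X_proto, y_proto, X_test):
    """Label each test sample by its nearest prototype under the Mahalanobis metric A."""
    diff = X_test[:, None, :] - X_proto[None, :, :]    # shape (n_test, n_proto, d)
    d2 = np.einsum("tpi,ij,tpj->tp", diff, A, diff)    # squared distances
    return y_proto[np.argmin(d2, axis=1)]

def prototypes(tasks, t, pool=False):
    """tasks: list of (X_train, y_train); pool=True mixes the training samples of all tasks."""
    if not pool:
        return tasks[t]
    X = np.vstack([X for X, _ in tasks])
    y = np.concatenate([y for _, y in tasks])
    return X, y

# Hypothetical label-compatible toy problem with two tasks.
tasks = [
    (np.array([[0.0, 0.0], [4.0, 0.0]]), np.array([0, 1])),
    (np.array([[1.0, 0.0]]), np.array([1])),
]
A = np.eye(2)                      # stands in for the learned metric of task 0
x_test = np.array([[0.8, 0.0]])

own = nn_predict(A, *prototypes(tasks, 0), x_test)
pooled = nn_predict(A, *prototypes(tasks, 0, pool=True), x_test)
print(own, pooled)
```

In this toy case the pooled prototype set changes the prediction, which shows why pooling is only meaningful when the tasks share a label set.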

Table 1 Summary of our compared methods

In the model of LMNN, there are two hyper-parameters: (1) a coefficient to balance the loss function and the soft constraints; (2) the number of targets. To determine them, we perform 5-fold cross-validation on the single-task LMNN using the training samples, and the optimal parameters are selected according to the average error of all tasks. We do not adjust these LMNN hyper-parameters for each method but use the values selected for the single-task approach in all methods. There are also two hyper-parameters \(\gamma\) and \(\gamma_{0}\) in each of the models mtLMNN, MT Frobenius, and MT von Neumann. We show how to select them for each data set in the following.

6.2.1 Multi-speaker vowel classification

The vowel classification data set consists of 11 vowels uttered by 15 speakers of British English, with each vowel said six times. The speakers are divided into two subgroups according to their gender, since men pronounce in a different style from women, and each subgroup is regarded as a task. We can then utilize multi-task learning to learn a metric for each task.

Considering that multi-task learning aims to deal with the situation where training samples are insufficient, we randomly select only a portion of samples from the vowels of speakers 1–8 and use them to learn a metric, which is tested on the vowels of speakers 9–15. Since the two tasks share the common label set, all the 10 methods are evaluated on this data set. The optimal hyper-parameters for each method are selected by a 5-fold cross-validation using the training samples. Considering that the training data are randomly selected, each experiment is repeated 10 times and the average error rates of the two tasks are reported in Fig. 5. The horizontal axis shows the ratio of training samples that are randomly selected to train the metric, and the vertical axis indicates the average test error rate of all tasks.

Fig. 5 Test results on multi-speaker vowel classification

From the results, we have the following observations:

  1. When only 10 % of the training samples are used, single task learning is incapable of learning a reliable metric and tends to over-fit. Its performance is even worse than that of the Euclidean distance. When more than 20 % of the training samples are used, its performance is better than Euclidean.

  2. Multi-task methods improve the performance, especially when the training samples are insufficient. In these experiments, all the multi-task methods demonstrate lower error rates on the test samples than single-task learning.

  3. The pooling version of testing performs better when the samples are very few. For example, MT von Neumann-pool has a lower error rate than MT von Neumann when only 10 % of the samples are used. The more training samples are used, the worse its performance becomes compared to the no-pooling version. This phenomenon shows that men and women pronounce in an essentially different way, and thus mixing them usually deteriorates the classification accuracy.

  4. The MT Frobenius approach usually performs better than mtLMNN, which is probably due to the overly strict constraint \(A_{t}\succeq B\), as we have explained.

  5. The MT von Neumann approach performs the best (including the pooling version) due to its capability to propagate the information among tasks properly.

6.2.2 Handwritten letter classification

The Handwritten Letter Classification data set (see footnote 11) was collected by Rob Kassel at MIT Spoken Language System Group. It consists of 8 binary handwritten letter classification problems, where the corresponding letters for each task are: c/e, g/y, m/n, a/g, i/j, a/o, f/t, h/n. The features are the bitmaps of the images of the written letters, and each classification problem is regarded as one task.

The binary labels in different tasks represent different letters, and thus this is a label-incompatible problem. Since there is no predefined split into a training set and a test set, we randomly select a proportion of the samples to train a metric and use the remainder for testing. Because such a split is different for each evaluation, we first repeat the experiment 10 times and select the optimal hyper-parameters for each method. Then the hyper-parameters are fixed and the evaluation is repeated 10 more times. The results on the newly split samples are reported in Fig. 6.

Fig. 6 Test results on handwritten letter classification

On this data set, our algorithm performs the best except when 3 % of the training samples are used. Even in this case, its accuracy is very close to the best. We also observe that the single task method produces a high error rate when rather few training samples are used. However, on this data set, the results of mtLMNN and MT Frobenius are very similar. It is possible that the constraint \(A_{t}\succeq B\) is satisfied for this data and thus mtLMNN is more suitable than MT Frobenius. However, both of them perform worse than our method.

6.2.3 USPS digit classification

The USPS digit data set (see footnote 12) consists of 7,291 samples of the digits 0–9, each of which is a 16×16 grayscale image. Following Zhang and Yeung (2010b), we construct 5 binary classification problems to separate the digits 0/1, 2/3, 4/5, 6/7, and 8/9 respectively. Then we learn a metric for each of them jointly. Since each task is to separate a different pair of digits, it is a label-incompatible problem. The experimental setting is the same as for handwritten letter classification in Sect. 6.2.2, and the results are shown in Fig. 7.
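The construction of the five binary tasks can be sketched as follows. The function name `make_binary_tasks`, the 0/1 relabeling within each task, and the stand-in data are assumptions made for illustration; loading the actual USPS images is not shown.

```python
def make_binary_tasks(samples, labels, pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Split a 10-digit data set into binary tasks, one per digit pair."""
    tasks = []
    for a, b in pairs:
        idx = [k for k, lab in enumerate(labels) if lab in (a, b)]
        X = [samples[k] for k in idx]
        y = [0 if labels[k] == a else 1 for k in idx]  # binary label within the task
        tasks.append((X, y))
    return tasks

# Hypothetical stand-in data: ten "images" named by index, one per digit.
samples = [f"img{k}" for k in range(10)]
labels = list(range(10))
tasks = make_binary_tasks(samples, labels)
print(len(tasks), tasks[1])
```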

Fig. 7 Test results on USPS digit classification

For the USPS data set, single-task learning performs very poorly, even worse than the Euclidean metric. This may be due to over-fitting, and thus multi-task learning is needed. mtLMNN also gives a high error rate on this data set, which is sometimes even worse than that of single-task learning. Since MT Frobenius exploits the same regularization as mtLMNN but gives a much higher accuracy, a possible reason is that the constraint \(A_{t}\succeq B\) is not satisfied in this data. Finally, our method leads to the best results on all the tests, which again demonstrates its advantage.

6.2.4 Insurance Company Benchmark data

The Insurance Company Benchmark (COIL 2000) data set (see footnote 13), used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data were collected to predict what kind of people would be interested in buying a caravan insurance policy and consist of 86 variables, including product usage data and socio-demographic data. The data set contains 5,822 training samples and 4,000 test samples.

Since each variable is discrete and can be regarded as a label to predict, we consider the problem of predicting some of the variables using the others as features (Parameswaran and Weinberger 2010). To be specific, we select a set of related variables from the 86 variables as the targets to predict, and use the other variables as features to predict the selected targets. The prediction of each selected variable is regarded as one task, and together they constitute a multi-task learning problem. It is a label-incompatible problem due to the different label sets. This data set has a specified training and test set. We randomly select a certain portion (10 %, 20 %, 30 %, 40 %, 50 %) of samples from the training set to learn a metric and predict the labels of the test samples. We construct 5 multi-task learning problems by selecting 5 different target sets of related variables, which are listed in Table 2. Each experiment is repeated 10 times and the average error rates are shown in Figs. 8–12.
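As an illustration of this construction, the sketch below turns a table of discrete records into one task per selected target variable, with all non-target variables as shared features. The function name `make_coil_tasks`, the 1-based variable numbering, and the toy records are assumptions; reading the actual data files is not shown.

```python
def make_coil_tasks(rows, target_vars):
    """rows: list of records (sequences of discrete values), variables numbered from 1.

    Each selected variable becomes the label of one task; the remaining
    (non-target) variables are the features shared by all tasks.
    """
    target_cols = [v - 1 for v in target_vars]  # 1-based -> 0-based
    feature_cols = [c for c in range(len(rows[0])) if c not in target_cols]
    features = [[row[c] for c in feature_cols] for row in rows]
    return [(features, [row[c] for row in rows]) for c in target_cols]

# Hypothetical toy records with 5 variables; variables 2 and 3 are the targets.
rows = [[1, 0, 2, 5, 7], [3, 1, 0, 6, 8]]
tasks = make_coil_tasks(rows, target_vars=[2, 3])
print(tasks)
```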

Fig. 8 Results using variables 14–15 as targets

Fig. 9 Results using variables 16–18 as targets

Fig. 10 Results using variables 32–34 as targets

Fig. 11 Results using variables 35–36 as targets

Fig. 12 Results using variables 73–74 as targets

Table 2 Description of 5 sets of selected targets on CoIL data set

In these experiments, we observe that for target sets 14–15, 16–18, 32–34, and 35–36, the test accuracies of all the methods increase with the number of training samples used. This shows the benefit of utilizing more training samples. When only 10 % of the training samples are used, single-task learning does not give an ideal result due to the lack of information. However, the multi-task metric learning methods usually decrease the error rates slightly because they utilize the information from other tasks. In most of these experiments, our method gives a better result than the others, but the improvement is very limited. The reason may be that the correlation among these targets is weak.

For the target set 73–74, the results are very different from those of the former 4 target sets. In this case, all metric learning methods perform significantly better than the Euclidean distance, which demonstrates the advantage of metric learning. However, the multi-task learning methods do not improve on the result of single-task learning. We offer three possible explanations for this phenomenon. (1) Since the error rates are almost the same for different numbers of training samples, information from more training samples cannot improve the performance. Therefore, multi-task learning cannot benefit from propagating more information from other tasks, and the accuracy does not increase with multi-task methods. (2) These methods cannot propagate the information among tasks properly. (3) The selected targets are not correlated with the others at all. However, we see that even when the problem is not appropriate for multi-task learning, our method does not deteriorate the performance of the single-task method.

6.2.5 Isolet spoken alphabet recognition

In the Isolet data set (see footnote 14), 150 speakers spoke the name of each letter of the alphabet twice. The task is to classify which letter is uttered. Since the speakers are grouped into 5 groups, the problem is naturally suitable for multi-task learning, where each group is treated as a task. We directly use the data from the website (see footnote 15), which has been preprocessed with PCA and split into the training set, validation set, and test set randomly (Parameswaran and Weinberger 2010). To determine the hyper-parameters, we select a specific proportion of the training samples to learn a metric with various combinations of hyper-parameters and then test on the validation set. Such an evaluation is repeated 5 times, and the hyper-parameters producing the lowest average error rate are chosen as the optimal hyper-parameters. Then the metrics are learned using different proportions of the training samples and used for predicting the labels of the test samples. Each experiment is repeated 10 times and the average error rates are shown in Fig. 13.

Fig. 13 Test results on Isolet spoken alphabet recognition

For this data set, we observe that Unique Task usually performs better than Single Task, which suggests that the tasks may be very similar to each other. However, multi-task learning can further improve the accuracies. Moreover, MT Frobenius performs better than mtLMNN, and MT von Neumann performs even better than MT Frobenius. Our methods (MT von Neumann and MT von Neumann-pool) again demonstrate the best results in most cases. Finally, we find that the pooling versions of the methods lead to better results when training samples are fewer, which also indicates that these tasks are very similar. Conversely, when we have more training samples, it is better to utilize only the samples of the task itself as prototypes.

7 Conclusion

In this paper, we propose a novel multi-task metric learning framework using Bregman matrix divergence. On one hand, the novel regularized approach extends previous methods from vector regularization to a general matrix regularization framework; on the other hand, and more importantly, by exploiting the von Neumann divergence as the regularization, the new multi-task metric learning method is able to preserve the data geometry well. This leads to more appropriate propagation of side-information among tasks and proves very important for further improving the performance. We propose the concept of geometry preserving probability (PG) and justify our framework with a series of theoretical analyses. Furthermore, our formulation is jointly convex, and the globally optimal solution can be guaranteed. A series of experiments verifies that our proposed approach can significantly outperform the current methods.