1 Introduction

Evaluating the distances between samples is critical in pattern analysis and machine learning applications. If an appropriate distance metric can be obtained, even a simple k-nearest neighbor (k-NN) classifier or k-means clustering can perform well [1, 2]. In addition, in large-scale and efficient information retrieval, results are usually obtained directly according to the distances to the query [3], and a good distance metric is also key to many other important applications, such as face verification [4] and person re-identification [5].

To learn a reliable distance metric, we usually need a large amount of label information, which can be the class labels or target values used in typical machine learning approaches (such as classification or regression); it is even more common to utilize pair- or triplet-based constraints [6]. Such constraints are weakly supervised since the exact label of an individual sample is unknown. However, in real-world applications, label information is often scarce: manual labeling is labor-intensive, and it is exhausting or even impossible to collect abundant side information for a new learning problem.

Transfer learning [7], which aims to mitigate the label deficiency issue in model training, is thus introduced to improve the performance of distance metric learning (DML) when the label information in a target domain is insufficient. This leads to the so-called transfer metric learning (TML), which has been found to be very useful in many applications. For example, in face verification [8], the main step is to estimate the similarities/distances between face images. The data distributions of images captured under different scenarios vary due to varied backgrounds, illumination, etc. Therefore, a metric learned in one scenario may not be effective in a new scenario, and TML would be helpful. In person re-identification [5, 9], the key is to estimate the similarities/distances between images of persons appearing in different cameras. The data distributions of images captured using different cameras vary due to the varied camera settings and scenarios. In addition, the distribution for the same camera may change over time. Hence, calibration is needed to achieve satisfactory performance, and TML is able to reduce such effort. A more general example is image retrieval, where the data distributions of images in different datasets vary [10]. It is also very useful to utilize expensive or semantic features to help learn a metric for cheap features or features that are hard to interpret [11, 12].

In the past decade, dozens of works have been proposed in this area, and we provide in this survey a comprehensive overview of these methods. We aim to help machine learning researchers quickly grasp the TML research area and to facilitate the choice of appropriate methods for machine learning practitioners. Besides, there remain many issues to be tackled in TML, and we hope that this survey can inspire some new ideas.

The rest of this survey is organized as follows. We first present the background and overview of TML in Section 2, which includes a brief history of TML, the main notations used throughout the paper, and a categorization of the TML approaches. In the subsequent two sections, we give a detailed description of the approaches in the two main categories, i.e., homogeneous and heterogeneous TML, respectively. Section 5 summarizes the different applications of TML, Section 6 presents datasets and an empirical evaluation, and finally, we conclude this survey and identify some possible future directions in Section 7.

2 Background and overview

2.1 A brief history of transfer metric learning

Transfer metric learning (TML) is a relatively new research field. The works that explicitly apply transfer learning to improve DML started around 2009. For example, multiple auxiliary (source) datasets are utilized in [13] to help the metric learning on the target set. The main idea is to enforce the target metric to be close to the different source metrics. An adaptive weight is learned to reflect the contribution of each source metric to the target metric. In [14], such contributions are determined by learning a covariance matrix between the different metrics. Instead of directly learning the target metric, the decomposition-based method [15] assumes that the target metric can be represented as a linear combination of multiple base metrics, which can be derived from the source metrics. Hence, metric learning is cast as learning combination coefficients, where the number of parameters to be learned can be much smaller.

Fig. 1 Evolution of transfer metric learning, which has been studied for more than ten years

We can not only use source metrics to help the target metric learning, but also make different DML tasks help each other. The latter is often called multi-task metric learning (MTML). One representative work is the multi-task extension [16] of the well-known DML algorithm LMNN [2]. Some other related works include GPMTML [17], MtMCML [5] and CP-mtML [10]. In addition, there are a few domain adaptation metric learning approaches [18, 19]. Most of the above methods can only learn a linear metric for the target domain. The domain adaptation metric learning (DAML) approach presented in [18] is able to learn a nonlinear target metric based on the kernel method. Recently, neural networks have also been employed to conduct nonlinear metric transfer [8] by taking advantage of deep learning techniques [20].

The study of heterogeneous TML started a bit later than that of homogeneous TML, and there are far fewer works than in the homogeneous setting. To the best of our knowledge, the first work explicitly designed for heterogeneous TML is the one presented in [21], but it is limited in that only two domains (one source and one target domain) can be handled. There exist a few tensor-based approaches [22, 23] for heterogeneous MTML, where the high-order correlations between all domains are exploited. A main disadvantage of these approaches is their high computational complexity. Dai et al. [11] propose an unsupervised heterogeneous TML algorithm, which aims to use some “expensive” (sophisticated, off-the-shelf) features to help learn a metric for relatively “cheap” features. This is also termed metric imitation. Recently, a general heterogeneous TML framework was proposed in [12, 24]. The framework first extracts some knowledge fragments (linear or nonlinear mappings) from a pre-trained source metric, and then uses these fragments to help the target domain learn either a linear or nonlinear distance metric. The framework is flexible and easy to use. An illustration of the evolution of TML is shown in Fig. 1.

Fig. 2 An illustration of traditional distance metric learning (DML) and transfer metric learning (TML). Given abundant labeled data, DML aims to learn a distance function between samples so that their distance is small if semantically similar and large otherwise. TML improves DML when the labeled data are insufficient in the target domain by utilizing information from related source domains, which have better distance estimations between samples. For example, it may be hard to distinguish “zebra” from “tiger” by observing only a few labeled samples due to the very similar stripe texture. But this task can be much easier if we have enough labeled samples to well distinguish “horse” from “cat”. The sample images are from the NUS-WIDE [25] dataset

2.2 Notations and definitions

In this survey, we assume there are M different domains, and the m’th domain is associated with a feature space \(\mathcal {X}_m\) and marginal distribution \(P_m(X_m)\). Without loss of generality, we assume the M’th (the last) domain is the target domain, and all the remaining ones are source domains. If there is only one source domain, we denote it using the subscript “S”. In distance metric learning (DML), the goal is to learn a distance function for any two instances, i.e., \(d_\phi (\textbf{x}_i, \textbf{x}_j)\), which must satisfy several properties including nonnegativity, identity, symmetry and the triangle inequality [6]. Here, \(\phi\) is the parameter of the distance function, and we call it the distance metric in this survey. For a nonlinear distance metric, \(\phi\) is often given by a nonlinear feature mapping. A linear metric is denoted as A, which is a positive semi-definite (PSD) matrix and is adopted in the popular Mahalanobis metric learning [1].
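
As a minimal illustration of the linear case, the sketch below shows how a PSD matrix A induces a (Mahalanobis) distance. The low-rank factorization A = U U^T used to construct the toy metric is simply one convenient way to guarantee positive semi-definiteness and is not tied to any particular method discussed in this survey.

```python
import numpy as np

def mahalanobis_distance(x_i, x_j, A):
    """Distance induced by a PSD matrix A: sqrt((x_i - x_j)^T A (x_i - x_j))."""
    diff = x_i - x_j
    return np.sqrt(diff @ A @ diff)

# A toy PSD metric built as A = U U^T (PSD by construction).
rng = np.random.default_rng(0)
d = 5
U = rng.normal(size=(d, 3))   # low-rank factor
A = U @ U.T

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis_distance(x_i, x_j, A))          # distance under the learned metric
print(mahalanobis_distance(x_i, x_j, np.eye(d)))  # Euclidean distance as the special case A = I
```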

To learn the metric in the m’th domain, we assume there is a training set \(\mathcal {D}_m\), which contains \(N_m\) samples with \(\textbf{x}_{mi} \in \mathbb {R}^{d_m}\) being the feature representation of the i’th sample. In a fully-supervised scenario, the corresponding label \(y_{mi}\) is also given. However, DML is usually conducted in a weakly-supervised manner, where only some similar/dissimilar constraints on training sample pairs \((\textbf{x}_{mi}, \textbf{x}_{mj})\) are provided. Alternatively, the constraint can be a relative comparison for a training triplet \((\textbf{x}_{mi}, \textbf{x}_{mj}, \textbf{x}_{mk})\), e.g., \(\textbf{x}_{mi}\) is more similar to \(\textbf{x}_{mj}\) than to \(\textbf{x}_{mk}\) [6].

In traditional DML, we are often provided with abundant labeled data (such as samples with similar/dissimilar constraints) so that the learned metric \(\phi ^*\) can well separate semantically similar data from dissimilar ones, such as “zebra” and “tiger” shown in Fig. 2. In real-world applications, however, the learned target metric \(\phi _M\) may not be satisfactory since the labeled data in the target domain are insufficient. For example, it may be hard to distinguish “zebra” from “tiger” given only a few labeled samples since the two types of animals have very similar stripe texture. To mitigate the label deficiency issue in target metric learning, we may utilize information from other related source domains, where the distance metric \(\phi _S^*\) is good enough or a good metric can be learned using large amounts of labeled data. For example, if we have enough labeled samples to well distinguish “horse” from “cat”, then it becomes much easier to recognize “zebra” and “tiger” from only a few labeled samples. The source metric cannot be directly used in the target domain due to the different data distributions [13] or representations [21] between the source and target domains. Therefore, (homogeneous or heterogeneous) transfer metric learning (TML) is developed to learn an improved target distance function \(d_{\phi _M^*} (\cdot , \cdot )\) parameterized by the metric \(\phi _M^*\) by transferring knowledge (particularly, metric information) from the source domain. A summarization and discussion of the various TML methods is given as follows.

Fig. 3 A categorization of the TML approaches according to different principles. TML can be categorized according to the feature setting, label setting or utilized transfer strategy. Terms on each path from left to right make up a certain TML category, e.g., “distribution approximation based transductive homogeneous TML”

2.3 A categorization of transfer metric learning techniques

As shown in Fig. 3, we can classify TML into different categories according to various principles. Firstly, TML can be broadly grouped into homogeneous TML and heterogeneous TML according to the feature setting. In the former group, the samples of different domains lie in the same feature space \((\mathcal {X}_1 = \mathcal {X}_2 = \cdots = \mathcal {X}_M)\), and only the data distributions vary \((P_1(X_1) \ne P_2(X_2) \ne \cdots \ne P_M(X_M))\). Whereas in heterogeneous TML, the feature spaces differ \((\mathcal {X}_1 \ne \mathcal {X}_2 \ne \cdots \ne \mathcal {X}_M)\) and there may be a semantic gap between the source and target domains. For example, in the problem of image matching, we may have only a few labeled images in a new scenario due to the high labeling cost, but there are large amounts of labeled images in some other scenarios. The data distributions of different scenarios vary due to different backgrounds, illuminations, etc. Besides, web images are usually associated with text descriptions, and it is useful to utilize the semantic textual features to help learn a better distance metric for visual features [21]. The data representations are quite different for the textual and visual domains.

Table 1 Different categories of TML according to the label setting
Table 2 Different approaches to TML

We can also categorize the different TML approaches into inductive TML, transductive TML, and unsupervised TML according to whether label information is available in the source or target domains. The relationships among the three learning settings are summarized in Table 1. This is similar to the categorization of transfer learning presented in [7].

Furthermore, we summarize the TML approaches into four different cases according to the utilized transfer strategy. Some early works on TML directly enforce the target metric to be close to the source metric, and we thus refer to this as TML via metric approximation. Since the main difference between the source and target domains in homogeneous TML is the distribution divergence, some approaches enable metric transfer by minimizing the distribution difference. We refer to this case as TML via distribution approximation. A large number of TML approaches enable knowledge transfer by finding a common subspace for the source and target domains, especially in heterogeneous TML. This is referred to as TML via subspace approximation. Finally, a few works let the distance functions of different domains share some common parts or enforce the distances of corresponding sample pairs to agree with each other across domains, and we refer to this as TML via distance approximation. The former two cases are usually used in homogeneous TML, and the latter two can be adopted for heterogeneous TML. Table 2 gives a brief description of these cases.

Table 3 Different transfer strategies used in different TML settings

In Table 3, we show which strategies are currently employed in different settings. In homogeneous TML, most of the current algorithms are inductive, and the transductive ones are usually conducted via distribution approximation. There is still no unsupervised method, and a possible solution is to extend some unsupervised DML (e.g., [41]) or transfer learning (e.g., [42]) algorithms for unsupervised TML. One challenge is how to ensure that the metric learned in the source domain is better, since there are no labeled data in either the source or target domain. In the heterogeneous setting [43], since the feature dimensions of different domains do not have correspondences, it is inappropriate to conduct TML via direct metric approximation. Most of the current heterogeneous TML approaches first find a common subspace for the different domains, and then conduct knowledge transfer in the subspace. Unsupervised heterogeneous TML can be easily extended to the transductive heterogeneous setting by further utilizing source labels, and it is possible to adopt the distribution approximation strategy in the heterogeneous setting by first finding a common representation for the different domains.

Fig. 4 An example of homogeneous transfer metric learning. In sentiment classification, a distance metric learned for target (such as electronics) reviews may not be satisfactory due to insufficient labeled data. Homogeneous TML improves the metric by using abundant labeled source (such as book) reviews, whose data distribution differs from that of the target reviews

3 Homogeneous transfer metric learning

In homogeneous TML, the utilized features (data representations) are the same, but the data distributions vary across domains. For example, in sentiment classification as shown in Fig. 4, we would like to determine the sentiment polarity (positive, negative or neutral) of an electronics review. The performance of a sentiment classifier depends heavily on the distance estimation between reviews. To obtain reliable distance estimations, we usually need large amounts of labeled reviews to learn a good distance metric. However, we may only have a few labeled electronics reviews due to the high labeling cost, and thus the obtained metric is not satisfactory. Fortunately, we may have abundant labeled book reviews, which are often easier to collect. Directly applying the metric learned using the labeled book reviews to the sentiment classification of electronics reviews is not appropriate due to the distribution difference between electronics and book reviews. Transfer metric learning is able to deal with this issue and learn an improved distance metric for the target sentiment classification of electronics reviews by using labeled book reviews.

3.1 Inductive TML

Under the inductive setting, we are provided with a few labeled data in the target domain. The number of labeled data in the source domain is large enough so that a good distance metric can be obtained, i.e., \(N_S \gg N_M > 0\). In inductive transfer learning [7], there may be no labeled source data (\(N_S = 0\)), but we have not seen such works in homogeneous TML.

3.1.1 TML via metric approximation

An intuitive idea for homogeneous TML is to first use the source domain data \(\{\mathcal {D}_m\}\) to learn the source distance metrics \(\{\phi _m\}\) beforehand, and then enforce the target metric to be close to the pre-trained source metrics. Therefore, the general formulation for learning the target metric \(\phi _M\) is given by

$$\mathop{\arg\min}_{\phi_M} \epsilon(\phi_M) = L(\phi_M; \mathcal{D}_M) + \gamma R(\phi_M; \phi_1, \cdots, \phi_{M-1}),$$
(1)

where \(L(\phi _M; \mathcal {D}_M)\) is the empirical loss w.r.t. the metric, \(R(\phi _M; \phi _1, \cdots , \phi _{M-1})\) is a regularization term that exploits the relationship between the source and target metrics, and \(\gamma \ge 0\) is a trade-off hyper-parameter. Any loss function used in standard DML can be adopted, and the key is how to design an appropriate regularization term, which prevents the target metric from over-fitting to the limited labeled data in the target domain. In [13], two different regularization terms are developed. The first one minimizes the LogDet divergence [44] between the source and target Mahalanobis metrics, i.e.,

$$R(A_M; A_1, \cdots, A_{M-1}) = \sum_{m=1}^{M-1} \alpha_m D_{LD}(A_M, A_m) = \sum_{m=1}^{M-1} \alpha_m \left( \textrm{tr}(A_m^{-1} A_M) - \textrm{logdet}(A_M) \right).$$
(2)

Here, \(\{A_m \succeq 0\}_{m=1}^M\) are constrained to be PSD matrices and \(D_{LD}(\cdot ,\cdot )\) denotes the LogDet divergence of two matrices. This is more appropriate than the Frobenius norm of the matrix difference due to the desirable properties of the LogDet divergence, such as scale invariance [44]. The coefficients \(\{\alpha _m\}\), which satisfy \(\alpha _m \ge 0\) and \(\sum _{m=1}^{M-1} \alpha _m = 1\), are learned to reflect the contributions of the different source metrics to the target metric. Secondly, to exploit the geometric structure of the data distribution, Zha et al. [13] propose a regularization term based on manifold regularization [45]:

$$R(A_M; A_1, \cdots, A_{M-1}) = \sum_{m=1}^{M-1} \alpha_m \textrm{tr}\left( X^U L_m (X^U)^T A_M \right),$$
(3)

where \(X^U\) is the feature matrix of the unlabeled data, and \(L_m\) is the Laplacian matrix of the data adjacency graph computed under the metric \(A_m\). In [14], the importance of the source metrics to the target metric is exploited by learning a task covariance matrix over the metrics, which can model the correlations between different tasks. This approach allows negative and zero transfer.
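
To make the regularizer in (2) concrete, the following sketch evaluates the LogDet-based term for a target metric and a set of source metrics with fixed weights. It only illustrates the regularization term itself; the actual optimization of [13] (which also learns the weights and minimizes an empirical loss) is not shown, and all matrices here are toy values.

```python
import numpy as np

def logdet_divergence(A_target, A_source):
    """D_LD(A_M, A_m) = tr(A_m^{-1} A_M) - logdet(A_M), as in Eq. (2) (constants in A_m dropped)."""
    sign, logdet = np.linalg.slogdet(A_target)
    return np.trace(np.linalg.solve(A_source, A_target)) - logdet

def metric_approx_regularizer(A_target, source_metrics, alphas):
    """R(A_M; A_1, ..., A_{M-1}) = sum_m alpha_m * D_LD(A_M, A_m)."""
    return sum(a * logdet_divergence(A_target, A_m)
               for a, A_m in zip(alphas, source_metrics))

rng = np.random.default_rng(1)
d = 4
make_pd = lambda: (lambda B: B @ B.T + 1e-3 * np.eye(d))(rng.normal(size=(d, d)))
A_M = make_pd()                  # toy target metric
sources = [make_pd(), make_pd()] # toy source metrics
alphas = [0.6, 0.4]              # nonnegative weights summing to one
print(metric_approx_regularizer(A_M, sources, alphas))
```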

Both of the above approaches incorporate the source metrics into a regularization term to penalize the target metric learning. Different from them, a novel decomposition-based TML method is proposed in [15], which constructs the target metric from base metrics derived from the source metrics, that is,

$$A_M = U_M \textrm{diag}(\theta) U_M^T = \sum_{r=1}^{N_B} \theta_{Mr} \textbf{u}_{Mr} \textbf{u}_{Mr}^T = \sum_{r=1}^{N_B} \theta_{Mr} B_{Mr},$$
(4)

where \(\{\textbf{u}_{Mr}\}\) are eigenvectors of the source metrics (which are PSD matrices), \(\{\theta _{Mr} \ge 0\}\) are the combination coefficients of the different base metrics, and \(N_B\) is the number of bases. This transforms metric learning into coefficient learning. Hence, the number of parameters to be learned is reduced significantly, and the performance can be improved since the labeled samples in the target domain are scarce. Another advantage of the model is that the PSD constraint on the target metric is automatically satisfied, and thus the computational cost is low. A semi-supervised extension was presented in [26] by combining it with manifold regularization.
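
A minimal sketch of the decomposition idea in (4), assuming for simplicity a single pre-trained source metric: its eigenvectors serve as the bases, and only the nonnegative coefficients \(\theta\) remain to be learned (they are random placeholders below rather than the output of the optimization in [15]).

```python
import numpy as np

def build_target_metric(source_metric, theta):
    """A_M = sum_r theta_r * u_r u_r^T, with u_r the eigenvectors of a source metric."""
    eigvals, U = np.linalg.eigh(source_metric)   # source metric is symmetric PSD
    bases = [np.outer(U[:, r], U[:, r]) for r in range(U.shape[1])]
    return sum(t * B for t, B in zip(theta, bases))

rng = np.random.default_rng(2)
d = 4
B = rng.normal(size=(d, d))
A_source = B @ B.T                               # toy PSD source metric
theta = rng.uniform(size=d)                      # nonnegative coefficients -> A_M is PSD by construction
A_target = build_target_metric(A_source, theta)
print(np.all(np.linalg.eigvalsh(A_target) >= -1e-10))  # PSD check
```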

In addition to utilizing source metrics to help the target metric learning, there exist some multi-task metric learning (MTML) approaches that enable different metrics to help each other during learning. A representative work is large margin multi-task metric learning (mtLMNN) [16], which is a multi-task extension of a well-known DML algorithm, i.e., large margin nearest neighbor (LMNN) [2]. In mtLMNN, all the metrics are learned simultaneously by assuming that each metric consists of a common metric \(A_0\) and a task-specific metric \(\widehat{A}_m\), i.e., \(A_m = A_0 + \widehat{A}_m\). Based on the same idea, a semi-supervised MTML method is developed in [46], where the unlabeled data are utilized by designing a loss that preserves neighborhood relationships. A regularization term is then designed to control the amount of information to be shared among all tasks. In [14], an MTML approach is presented by first vectorizing the Mahalanobis metrics and then using a task covariance matrix to exploit the task relationships. Similarly, the metrics are vectorized in [5], but the different metrics are enforced to be close under a graph-based regularization scheme [47]. In [27], a generic metric is first learned by utilizing data from all tasks, and then the metric is transferred to a specific metric for each candidate set (task) by reweighting the training samples. In addition, a general MTML framework is proposed in [17], which enables knowledge transfer by enforcing the different metrics \(\{A_m\}\) to be close to a common metric \(A_0\). The general Bregman matrix divergence [48] is introduced to measure the difference between two metrics. The framework incorporates mtLMNN as a special case, and geometry is preserved in the transfer by adopting a special Bregman divergence, i.e., the von Neumann divergence [48]. A more comprehensive review of MTML approaches is presented in [49]. Meta learning [50] has also been introduced for metric learning [51], where the training set is split into multiple subsets and a meta metric is learned across the different subsets (tasks). Recently, a few-shot metric learning approach [29] was designed to adapt the metric space by rectifying channels of intermediate layers, and a universal metric learning method was proposed in [30] by utilizing prompt learning.

3.1.2 TML via subspace approximation

Most of the TML approaches via direct metric approximation share a main drawback: when the feature dimension is high, the model is prone to overfitting due to the large number of parameters to be learned. This also leads to high computational cost in both training and prediction. To tackle this issue, some low-rank TML methods have been proposed. They usually decompose the metric as \(A_m = U_m U_m^T\), where \(U_m \in \mathbb {R}^{d_m \times r}\) is a low-rank transformation matrix. This leads to a common subspace for different domains, and knowledge transfer is conducted in the subspace. For example, a low-rank multi-task metric learning framework is proposed in [37, 52], which assumes that each transformation is a product of a common transformation and a task-specific one, i.e., \(U_m = \widehat{U}_m U_0\). As a special case, large margin component analysis (LMCA) [53] is extended to multi-task LMCA (mtLMCA), which is shown to be superior to mtLMNN.

3.1.3 TML via distance approximation

Both mtLMNN and mtLMCA are trained on labeled sample triplets. Different from them, CP-mtML [10] learns the metrics using labeled pairs, which are often easier to collect. Similar to mtLMCA, CP-mtML decomposes the metric as \(A_m = U_m U_m^T\), but the different projections \(\{U_m\}\) are coupled by assuming that the distance function consists of a common part and a task-specific one, i.e.,

$$\begin{aligned} d_{U_m}^2(\textbf{x}_i, \textbf{x}_j) = d_{\widehat{U}_m}^2(\textbf{x}_i, \textbf{x}_j) + d_{U_0}^2(\textbf{x}_i, \textbf{x}_j). \end{aligned}$$
(5)

A main advantage of CP-mtML is that the optimization problem can be solved efficiently using stochastic gradient descent (SGD), and hence the model is scalable to high-dimensional features and large amounts of training data. Besides, the learned transformation can be used to derive low-dimensional features, which are desirable in large-scale information retrieval. However, the optimization problem in [10] is nonconvex, and hence only a local optimum may be obtained. This issue is addressed in [39], where an alternative convex formulation is presented and an efficient online MTML algorithm is developed.
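
The coupled distance in (5) can be written in a few lines: each task’s squared distance is the sum of a squared distance under a shared projection and one under a task-specific projection. The projections below are random toy matrices; learning them with SGD from labeled pairs, as in CP-mtML, is not shown.

```python
import numpy as np

def proj_sq_dist(U, x_i, x_j):
    """Squared distance after projecting with U: ||U^T x_i - U^T x_j||^2."""
    diff = U.T @ (x_i - x_j)
    return float(diff @ diff)

def cp_mtml_sq_dist(U_task, U_common, x_i, x_j):
    """d_{U_m}^2 = d_{U_m-specific}^2 + d_{U_0}^2 (task-specific part + shared part), as in Eq. (5)."""
    return proj_sq_dist(U_task, x_i, x_j) + proj_sq_dist(U_common, x_i, x_j)

rng = np.random.default_rng(3)
d, r = 10, 3
U_common = rng.normal(size=(d, r))   # shared across all tasks
U_task = rng.normal(size=(d, r))     # specific to one task
x_i, x_j = rng.normal(size=d), rng.normal(size=d)
print(cp_mtml_sq_dist(U_task, U_common, x_i, x_j))
```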

3.2 Transductive TML

Under the transductive setting, there are no labeled data in the target domain and we only have large amounts of labeled source data, i.e., \(N_S \gg N_M = 0\).

3.2.1 TML via distribution approximation

In homogeneous TML, the data distributions vary across domains. Therefore, we can minimize the distribution difference between the source and target domains, so that source domain samples can be reused in the target metric learning. In [18], a domain adaptation metric learning (DAML) approach is proposed. In DAML, the distance metric is parameterized by a feature mapping \(\phi _M\). The mapping is learned by first transforming the samples in the source and target domains using the mapping, and then minimizing the distribution difference between the source and target domains in the transformed space. At the same time, \(\phi _M\) is learned so that the transformed samples satisfy the similar/dissimilar constraints in the source domain. The general formulation for learning \(\phi _M\) is given by

$$\mathop{\arg\min}_{\phi_M} \epsilon(\phi_M) = L(\phi_M; \mathcal{D}_S) + \gamma D_{PD}\left( P_M(X_M), P_S(X_S) \right),$$
(6)

where \(D_{PD}(\cdot ,\cdot )\) is a measure of the difference between two probability distributions. Maximum mean discrepancy (MMD) [54] is adopted as the measure in DAML. The nonlinear mapping \(\phi _M\) is learned in a reproducing kernel Hilbert space (RKHS), and the solution is found using the kernel method. Since the source and target samples in the transformed space follow similar distributions, the mapping learned using the source label information is also discriminative in the target domain. The same idea is adopted in deep TML (DTML) [8]; the main difference is that the nonlinear mapping is assumed to be a multi-layer neural network. The knowledge transfer is conducted at the output layer and each hidden layer, and weight hyper-parameters are set to balance the importance of the losses in different layers. A major limitation of these works is that they only consider the marginal distribution difference. This limitation is overcome in [31], where a novel TML method is developed by simultaneously reducing the marginal and conditional distribution divergences between the source and target domains. The conditional distribution divergence is reduced by first assigning pseudo labels to target domain data using classifiers trained on source domain data, and then applying the class-wise MMD [55]. Besides, class distribution is incorporated in [33] to improve the discriminative power of the learned metric in DTML. In addition to reducing the distribution difference via MMD, the geometry of the source and target domains is also exploited in [34], where TML is formulated as a geometric mean metric learning problem.
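
Since \(D_{PD}\) in (6) is instantiated with MMD, the sketch below computes a standard biased empirical estimate of the squared MMD with a Gaussian kernel between projected source and target samples. The random linear projection merely stands in for the learned mapping \(\phi _M\) and is purely illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X_src, X_tgt, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    k_ss = gaussian_kernel(X_src, X_src, sigma).mean()
    k_tt = gaussian_kernel(X_tgt, X_tgt, sigma).mean()
    k_st = gaussian_kernel(X_src, X_tgt, sigma).mean()
    return k_ss + k_tt - 2 * k_st

rng = np.random.default_rng(4)
d, r = 8, 3
W = rng.normal(size=(d, r))                      # toy linear mapping standing in for phi_M
X_src = rng.normal(loc=0.0, size=(100, d)) @ W   # projected source samples
X_tgt = rng.normal(loc=0.5, size=(80, d)) @ W    # projected target samples (shifted distribution)
print(mmd2(X_src, X_tgt))                        # larger value -> larger distribution discrepancy
```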

Recently, an increasing number of transfer learning methods utilize adversarial learning [56] to align the source and target domains. For example, an adversarial-based TML method is presented in [35] for the scenario where the source and target domains do not share the same label space. The original source and target representations are learned to be separated so as to retain their respective discriminative power, while the transformed source representation is enforced to be close to the target representation for metric adaptation.

Different from these methods, which reduce the distribution difference in a new space, importance sampling [57] is introduced in [19] to handle DML under covariate shift. The formulation is given as follows,

$$\begin{aligned} \mathop {\arg \min }_{A_M \succeq 0} \epsilon (A_M) = \sum \limits _{i,j} w_{ij} l(A_M; \textbf{x}_{Si}, \textbf{x}_{Sj}, y_{Sij}), \end{aligned}$$
(7)

where \(l(\cdot )\) is a pre-defined loss function over a training pair \((\textbf{x}_{Si}, \textbf{x}_{Sj})\), with \(y_{Sij} = \pm 1\) indicating whether the two samples are similar or not. The weight \(w_{ij} = \frac{P_M(\textbf{x}_{Si}) P_M(\textbf{x}_{Sj})}{P_S(\textbf{x}_{Si}) P_S(\textbf{x}_{Sj})}\) indicates the importance of the source-domain pair for learning the target metric. Intuitively, if a pair of source samples has a high probability of occurring in the target domain, it should contribute strongly to the target metric learning. In particular, for a distance (such as the popular Mahalanobis distance) that is induced by a norm, i.e., \(d(\textbf{x}_i, \textbf{x}_j) = \varphi (\textbf{x}_i - \textbf{x}_j)\), we can calculate the weight as \(w_{ij} = \frac{P_M(\delta _{Sij})}{P_S(\delta _{Sij})}\), where \(\delta _{Sij} = \textbf{x}_{Si} - \textbf{x}_{Sj}\). In [19], the weights and target metric are learned separately, which may lead to error propagation between them. This issue is tackled in [32], where the weights and target metric are learned simultaneously in a unified framework. Considering that labeled target data are rare, some landmark source samples that are geometrically close to target samples are selected to learn the target distance metric in [36].
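
For the covariate-shift formulation in (7), the weights can be estimated from the density ratio of the pair differences \(\delta _{Sij}\). The sketch below uses Gaussian kernel density estimates and a toy margin-based pair loss as simple stand-ins; it only illustrates how the weights rescale the pairwise losses and is not the estimator actually used in [19].

```python
import numpy as np
from scipy.stats import gaussian_kde

def pair_weights(deltas_src, deltas_tgt, query_deltas):
    """w_ij ~ P_target(delta_ij) / P_source(delta_ij), estimated with KDEs (illustration only)."""
    p_tgt = gaussian_kde(deltas_tgt.T)   # gaussian_kde expects shape (d, n)
    p_src = gaussian_kde(deltas_src.T)
    return p_tgt(query_deltas.T) / np.maximum(p_src(query_deltas.T), 1e-12)

def weighted_pair_loss(A, pairs, labels, weights):
    """Sum_ij w_ij * l(A; x_i, x_j, y_ij) with a toy hinge-style pairwise loss."""
    total = 0.0
    for (x_i, x_j), y, w in zip(pairs, labels, weights):
        diff = x_i - x_j
        d2 = diff @ A @ diff
        total += w * max(0.0, y * (d2 - 1.0) + 1.0)   # y = +1 similar, -1 dissimilar
    return total

rng = np.random.default_rng(5)
d, n = 4, 200
deltas_src = rng.normal(scale=1.0, size=(n, d))   # pair differences observed in the source domain
deltas_tgt = rng.normal(scale=1.5, size=(n, d))   # pair differences observed in the target domain
pairs = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(10)]
labels = rng.choice([1, -1], size=10)
query = np.array([x_i - x_j for x_i, x_j in pairs])
w = pair_weights(deltas_src, deltas_tgt, query)
print(weighted_pair_loss(np.eye(d), pairs, labels, w))
```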

3.3 Discussion

TML via metric approximation is straightforward in that the divergence between the source and target metrics (parameterized by PSD matrices) is directly minimized. A major difference among the various metric approximation based approaches is the way in which the source and target metrics are enforced to be close, e.g., by adopting different types of divergence. These approaches are often limited in that the training complexity is high due to the PSD constraint, and the distance calculation in the inference stage is not efficient for high-dimensional data. Subspace approximation based TML compensates for these shortcomings by reformulating metric learning as learning a transformation or mapping. The PSD constraint is automatically satisfied, and the learned transformation can be used to derive compressed representations, which facilitate efficient distance estimation or sample matching, where hashing techniques [58] can be involved. This is critical in many applications, such as information retrieval. The main disadvantage of the subspace approximation based methods is that their optimization problems are often non-convex, and hence only local optima can be obtained. The work [10] based on distance approximation also learns a projection instead of the metric, but the optimization is more efficient. None of these approaches explicitly deals with the distribution difference, which is the main issue that transfer learning aims to tackle. Distribution approximation based methods focus on this point, usually by minimizing the MMD measure or utilizing the importance sampling strategy.

3.4 Related work

TML is closely related to transfer subspace learning (TSL) [59, 60] and transfer feature learning (TFL) [61]. An early work on TSL is presented in [59], which finds a low-dimensional latent space where the distribution difference between the source and target domains is minimized. This algorithm operates in a transductive manner and is not convenient for deriving representations of new samples. This issue is tackled by Si et al. [60], where a generic regularization framework is proposed for TSL based on Bregman divergence [62]. A low-rank TSL (LTSL) framework is proposed in [63, 64], where the subspace is found by reconstructing the projected target data using the projected source data under the low-rank representation [65, 66] scheme. The main advantage of the framework is that only relevant source data are utilized to find the subspace and noisy information can be filtered out. That is, it can avoid negative transfer. The framework is further extended in [67] to help recover missing modalities in the target domain and improved in [68] by exploiting both low-rank and sparse structures in the reconstruction matrix.

TFL is very similar to TSL, and a representative method is presented in [61], where the typical MMD is modified to take both the marginal and class-conditional distributions into consideration. More recent works on TFL are built upon powerful deep feature learning. For example, considering that the features in deep neural networks are usually general in the first layers and task-specific in higher layers, Long et al. [69] propose the deep adaptation networks (DAN), which freeze the general layers in convolutional neural networks (CNN) [70] and only conduct adaptation in the task-specific layers. Besides, multi-kernel MMD (MK-MMD) [71] is employed to improve kernel selection in MMD. In DAN, only the marginal distribution difference between the source and target domains is exploited. This is improved by the joint adaptation networks (JAN) [72], which reduce the joint distribution divergence using a proposed joint MMD (JMMD). The JMMD can involve both the input features and output labels in domain adaptation. The constrained deep TSL [73] method can also exploit the joint distribution, and the target domain knowledge is incorporated gradually during a progressive transfer procedure.

All of these TSL or TFL approaches have very close relationships to the subspace and distribution approximation based TML. Although they do not aim to learn metrics, it is not hard to adapt them for TML by adopting some metric learning loss in these models.

Fig. 5 An example of heterogeneous transfer metric learning. In multi-lingual sentiment classification, a distance metric learned for target reviews (such as the ones written in Spanish) may not be satisfactory due to insufficient labeled data. Heterogeneous TML improves the metric by using abundant labeled source reviews (such as the ones written in English), whose data representation differs from that of the target reviews (e.g., due to the different vocabularies)

4 Heterogeneous transfer metric learning

In heterogeneous TML, the different domains have different features (data representations), and sometimes there is a semantic gap, as between the textual and visual domains. A typical example is multi-lingual sentiment classification, as shown in Fig. 5, where we would like to determine the sentiment polarity of a review written in Spanish. The labeled Spanish reviews may be scarce, but it is much easier to collect abundant labeled reviews written in English. Directly applying the metric learned using the labeled English reviews to the sentiment classification of Spanish reviews is infeasible since the representations of Spanish and English reviews are different due to the varied vocabularies. This issue can be tackled by heterogeneous TML, which improves the distance metric for the target sentiment classification of Spanish reviews using labeled English reviews.

4.1 Inductive heterogeneous TML

Different from the inductive homogeneous setting, the number of labeled data in the source domain can be zero under the inductive heterogeneous setting. This is because the source features may have much stronger representation power than the target features, and thus no labeled data are required to obtain a good distance function in the source domain.

4.1.1 Heterogeneous TML via subspace approximation

To the best of our knowledge, heterogeneous TML under the inductive setting has only been studied in recent years. For example, a heterogeneous multi-task metric learning (HMTML) method is proposed in [22]. HMTML assumes that the similar/dissimilar constraints are limited in multiple heterogeneous domains, but there are large amounts of unlabeled data that have representations in all domains, i.e., \(\mathcal {D}^U = \{(\textbf{x}_{1n}^U, \cdots , \textbf{x}_{Mn}^U)\}_{n=1}^{N^U}\). To build a connection between the different domains, the linear metrics \(\{A_m\}\) are decomposed as \(\{A_m = U_m U_m^T\}\), and then the different representations of each unlabeled sample are transformed into a common subspace using \(\{U_m\}\). The general formulation is given by

$$\mathop{\arg\min}_{\{A_m \succeq 0\}} \epsilon(\{A_m\}) = \sum_{m=1}^M L(A_m; \mathcal{D}_m) + \gamma R(U_1, \cdots, U_M; \mathcal{D}^U).$$
(8)

Since the different representations correspond to the same (unlabeled) sample, the transformed representations should be close to each other in the subspace. By minimizing the divergence of the transformed representations (or equivalently maximizing their correlations), each transformation is learned using information from all domains. This results in an improved transformation, and thus a better metric than learning them separately. In [22], a tensor-based regularization term is designed to exploit the high-order correlations between different domains. A variant of the model is presented in [23], which uses the class labels to build the domain connection.

In [24], a general heterogeneous TML approach is proposed based on the knowledge fragments transfer [74] strategy. The optimization problem is given by

$$\mathop{\arg\min}_{\phi_M} \epsilon(\phi_M) = L(\phi_M; \mathcal{D}_M) + \gamma R(\{\phi_{Mc}(\cdot)\}, \{\varphi_{Sc}(\cdot)\}; \mathcal{D}^U),$$
(9)

where \(\phi _{Mc}(\cdot )\) is the c’th coordinate of the mapping \(\phi _M\), and \(\varphi _{Sc}(\cdot )\) is the c’th fragment of the knowledge in the source domain. The source knowledge fragments are represented by some mapping functions, which are learned beforehand by applying existing DML algorithms in the source domain. Then the target metric (which also consists of multiple mapping functions) is enforced to agree with the source fragments on the unlabeled corresponding data. This helps learn an improved metric in the target domain since the pre-trained source distance function is assumed to be superior to the target distance function learned without knowledge transfer. Intuitively, the target subspace is enforced to approach a better source subspace. An improvement of the model is presented in [12], where the locality of the geometric structure of the data distribution is preserved via manifold regularization [45]. Recently, an asymmetric metric learning [38] method was proposed to learn a metric for examples in the student (target) space by utilizing labels across the teacher (source) and student (target) domains.
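
A rough sketch of the fragment-transfer idea behind (9): the pre-trained source mappings are evaluated on the unlabeled corresponding data, and a linear target mapping is fit to agree with them. Plain ridge regression is used here purely for brevity; the supervised loss \(L(\phi _M; \mathcal {D}_M)\) and the specific regularizers of [12, 24] are omitted.

```python
import numpy as np

def fit_target_mapping(X_tgt_unlab, source_fragment_outputs, lam=1e-2):
    """Ridge-regress the target mapping onto the source fragment outputs.

    X_tgt_unlab:             (n, d_target) target features of the unlabeled corresponding data
    source_fragment_outputs: (n, c) outputs of the pre-trained source mappings on the same data
    Returns V of shape (d_target, c) so that phi_M(x) = V^T x approximates the source fragments.
    """
    d = X_tgt_unlab.shape[1]
    G = X_tgt_unlab.T @ X_tgt_unlab + lam * np.eye(d)
    return np.linalg.solve(G, X_tgt_unlab.T @ source_fragment_outputs)

rng = np.random.default_rng(6)
n, d_src, d_tgt, c = 300, 20, 12, 5
X_src = rng.normal(size=(n, d_src))              # source view of the unlabeled corresponding data
X_tgt = rng.normal(size=(n, d_tgt))              # target view of the same samples
W_src = rng.normal(size=(d_src, c))              # stands in for pre-trained source fragments
frag_out = X_src @ W_src                         # fragment outputs on the corresponding data
V = fit_target_mapping(X_tgt, frag_out)
phi_M = lambda x: V.T @ x                        # learned target mapping
print(phi_M(rng.normal(size=d_tgt)).shape)       # -> (5,)
```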

4.1.2 Heterogeneous TML via distance approximation

We can not only enforce the subspace representations of corresponding samples in different domains to be close, but also let the distances of corresponding sample pairs agree with each other across domains. For example, an online heterogeneous TML approach is proposed in [40], which also assumes that there are abundant unlabeled corresponding data, but the labeled target sample pairs are provided in a sequential manner (one by one). Given a new labeled training pair, the target metric is updated as:

$$A_M^{k+1} = \mathop{\arg\min}_{A_M \succeq 0} \epsilon(A_M) = L(A_M) + \gamma_A D_{LD}(A_M, A_M^k) + \gamma_I R(d_{A_M}, d_{A_S}; \mathcal{D}^U),$$
(10)

where \(L(A_M)\) is the empirical loss w.r.t. the current labeled pair, \(D_{LD}(\cdot , \cdot )\) is the LogDet divergence [44], and \(R(d_{A_M}, d_{A_S}; \mathcal {D}^U)\) is a regularization term that enforces agreement between the source and target distances (of corresponding pairs). Here, \(A_M^k\) is the target metric obtained previously, initialized as an identity matrix. The source metric \(A_S\) can be an identity matrix if the source features are much more powerful than the target features. By pre-calculating \(A_S\) and formulating the term \(R(\cdot )\) under the manifold regularization scheme [45], an online algorithm is developed to update the target metric \(A_M\) efficiently.

4.2 Unsupervised heterogeneous TML

There exist a few unsupervised heterogeneous TML approaches that utilize unlabeled corresponding data for metric transfer, where no label information is provided in either the source or target domain (\(N_S = N_M = 0\)). Under this unsupervised paradigm, the utilized source features should be more expressive or interpretable than the target features, so that the estimated distances in the source domain are better than those in the target domain.

4.2.1 Heterogeneous TML via subspace approximation

An early work is done in [21], where the main idea is to maximize the similarity of any unlabeled corresponding pairs in a common subspace, i.e.,

$$\begin{aligned} \mathop {\arg \min }_{A_M \succeq 0} \epsilon (A_M) = \sum \limits _{n=1}^{N^U} l\left( \varphi (\theta ) \right) , \end{aligned}$$
(11)

where \(\varphi (\theta ) = \frac{1}{1 + \exp (-\theta )}\) with \(\theta = (\textbf{x}_{Mn}^U)^T G \textbf{x}_{Sn}^U\) and \(G = U_M^T U_S\). Here, \(l(\cdot )\) is chosen to be the negative logistic loss, and the proximal gradient method is adopted for optimization. A main disadvantage of this TML approach is its high computational complexity, since the costly singular value decomposition (SVD) is involved in each iteration of the optimization.

4.2.2 Heterogeneous TML via distance approximation

Instead of directly maximizing the likelihood of unlabeled sample pairs, Dai et al. [11] propose to use the target samples to approximate the source manifold. The method is inspired by locally linear embedding (LLE) [75], and metric transfer is conducted by enforcing the embeddings of target samples to preserve local properties of the source domain. The optimization problem is given by

$$\mathop{\arg\min}_{U_M} \epsilon(U_M) = \sum_{i=1}^{N^U} \left\| U_M^T \textbf{x}_{Mi}^U - \sum_{j=1}^{N^U} w_{Sij}(U_M^T \textbf{x}_{Mj}^U) \right\|,$$
(12)

where \(w_{Sij}\) is the weight in the adjacency graph calculated using the source domain features. This enforces the distances between samples in the source and target domains to agree with each other on the manifold. The optimization is much more efficient than in [21] since only a generalized eigenvector problem needs to be solved.
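
Once the source-domain reconstruction weights are fixed, the objective in (12) (with a suitable scale constraint, assumed here) reduces to a generalized eigenvalue problem. The sketch below builds simple k-NN reconstruction weights in the source view and solves for the target projection \(U_M\); the weight construction is a simplified stand-in for the LLE weights used in [11].

```python
import numpy as np
from scipy.linalg import eigh

def lle_weights(X_src, k=5):
    """Reconstruct each source sample from its k nearest neighbors (simplified LLE weights)."""
    n = X_src.shape[0]
    W = np.zeros((n, n))
    d2 = np.sum((X_src[:, None, :] - X_src[None, :, :])**2, axis=2)
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                      # skip the sample itself
        C = (X_src[nbrs] - X_src[i]) @ (X_src[nbrs] - X_src[i]).T
        w = np.linalg.solve(C + 1e-3 * np.eye(k), np.ones(k))  # local reconstruction weights
        W[i, nbrs] = w / w.sum()
    return W

def metric_imitation(X_tgt, W, r=3):
    """Minimize sum_i ||U^T x_i - sum_j w_ij U^T x_j||^2 subject to U^T X X^T U = I."""
    X = X_tgt.T                                                # columns are samples, shape (d, n)
    M = (np.eye(W.shape[0]) - W).T @ (np.eye(W.shape[0]) - W)
    A, B = X @ M @ X.T, X @ X.T + 1e-6 * np.eye(X.shape[0])
    eigvals, eigvecs = eigh(A, B)                              # generalized symmetric eigenproblem
    return eigvecs[:, :r]                                      # directions with smallest objective

rng = np.random.default_rng(7)
n, d_src, d_tgt = 100, 16, 8
X_src = rng.normal(size=(n, d_src))   # "expensive" source features of unlabeled data
X_tgt = rng.normal(size=(n, d_tgt))   # "cheap" target features of the same samples
U_M = metric_imitation(X_tgt, lle_weights(X_src))
print(U_M.shape)                      # -> (8, 3)
```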

In the unsupervised setting, the asymmetric metric learning [38] approach is able to perform heterogeneous TML via both subspace approximation (regression) and distance approximation (relational distillation).

4.3 Discussion

It is natural to conduct heterogeneous TML via subspace approximation since the representations of different domains vary and finding a common representation can facilitate knowledge transfer. As in the homogeneous setting, the main drawback is that the optimization problem is usually non-convex. Although this drawback can be remedied by directly learning a PSD matrix, such as via the distance approximation strategy, it is nontrivial to perform efficient distance inference for high-dimensional data and to extend the algorithm to learn a nonlinear metric. Due to the strong ability and rapid development of deep learning, it may be more promising in TML to learn a transformation or mapping rather than a PSD matrix, based on either subspace or distance approximation.

Table 4 A summarization of the different applications in which TML is utilized

4.4 Related work

Some early heterogeneous transfer learning approaches are not specially designed for DML, but the learned feature transformation or mapping for each domain can be used to derive a metric. For example, in the work on heterogeneous domain adaptation via manifold alignment (DAMA) [76], the class labels are utilized to align different domains. A mapping function is learned for each domain, and all functions are learned together. After being projected into a common subspace, samples should be close to each other if they belong to the same class and separated otherwise. This is enforced for all samples, whether from the same domain or from different domains. The label information of all domains can thus be utilized to learn the shared subspace, and better embeddings (representations) can be learned for the different domains than by learning them separately. In [77], a multi-task discriminant analysis (MTDA) approach is proposed to deal with heterogeneous feature spaces in different domains. MTDA assumes the linear transformation of the m’th domain is given by \(U_m = W_m H\), which consists of a task-specific part \(W_m\) and a common part H for all tasks. All the transformations are then learned in a single optimization problem, which is similar to that of the well-known linear discriminant analysis (LDA) [78]. In [79], a multi-task nonnegative matrix factorization (MTNMF) approach is proposed to learn the different mappings for all domains by simultaneously factorizing their data representation and feature-class correlation matrices. The factorized class representation matrix is assumed to be shared by all tasks. This leads to a common subspace for the different domains.

All of these approaches have very close relationships to subspace approximation based heterogeneous TML, but they mainly utilize fully-supervised class labels to learn feature mappings for the different domains. As mentioned previously, it is common in DML to utilize weakly-supervised pair/triplet constraints, and it is not hard to adapt these approaches for heterogeneous TML by adopting a metric learning loss w.r.t. pair/triplet constraints in these models.

5 Applications

In general, for any application where DML is appropriate, TML is a good candidate when the label information is scarce or hard to collect. In Table 4, we summarize the different applications in which TML has been utilized.

5.1 Homogeneous TML

5.1.1 Computer vision

Similar to DML [80], most TML approaches are applied in computer vision. For example, the effectiveness of many homogeneous TML methods is verified on the common image classification application, which includes handwritten letter/digit classification [14, 15, 17, 52], face recognition [13, 18], natural scene categorization and object recognition [15, 18, 31].

DML is particularly suitable and crucial for some applications, such as face verification [8], person re-identification [5] and image retrieval [10], because in these applications results can be inferred directly from the distances between samples. Face verification aims to decide whether two face images belong to the same person or not. In [8], TML is applied to face verification across different datasets, where the distributions vary. The goal of person re-identification is to decide whether the people appearing in multiple cameras are the same person or not, where the cameras often do not have overlapping views. The data distributions of the images captured by different cameras vary due to the varying illumination, background, etc. Besides, the distribution may change over time for the same camera. Hence, TML can be very useful in person re-identification [5, 8, 27]. An efficient MTML approach is proposed in [10] to make use of auxiliary datasets for face retrieval, where the tasks vary for different datasets. Stochastic gradient descent (SGD) is adopted for optimization, and the algorithm is scalable to large amounts of training data and high-dimensional features.

5.1.2 Speech recognition

Different groups of speakers have different ways of uttering an English alphabet letter. In [16, 17, 46], alphabet recognition in each group is regarded as a task, and MTML is employed to learn the metrics of the different groups together. Similarly, since men and women have different pronunciation styles, vowel classification is performed for two different groups according to gender, and MTML is adopted to learn their metrics simultaneously by making use of all available labeled data [17].

5.1.3 Other applications

In [81], MTML is used for predictions in social networks. For example, citation prediction is to predict the referencing between articles given their contents. The citation patterns of different areas (such as computer science and engineering) are different but related, and thus MTML is adopted to learn the prediction models of multiple areas simultaneously. Social circle prediction is to assign a person to appropriate social circles given his/her profile. Different types of social circles (such as family members and colleagues) are different but related to each other, and hence MTML is applied to improve the performance. In [16, 17, 37], MTML is applied to customer information prediction in an insurance company. There are multiple variables that can be used to predict the interest of a person in buying a certain insurance policy. Each variable takes a discrete value and can be predicted using the other variables. The predictions of the different variables can be conducted together since they are correlated with each other. Furthermore, MTML is also adopted to learn metrics of multiple traffic types simultaneously for device fingerprinting [39].

In [82], a protein is regarded as a bag, and different source bags are weighted to learn a target distance metric for protein function prediction. Since the distribution of time series data often changes over time, a multi-source active metric transfer learning approach is proposed for time series prediction [83], which plays an important role in diverse applications, such as air pollution forecasting and stock prediction. Besides, it is exhausting and expensive to collect data under new operating conditions, and thus a deep metric transfer learning method is proposed in [84] to estimate the remaining useful life of bearings under different operating conditions. Some other applications of TML include galaxy morphology characterisation [85], EEG (electroencephalogram) emotion recognition [86] and drug discovery [87], and a potential application is drug-target interaction prediction [88].

5.2 Heterogeneous TML

5.2.1 Computer vision

Similar to homogeneous TML, heterogeneous TML is also mainly applied in the computer vision community, for tasks such as image classification including face recognition [77], natural scene categorization [12, 23, 24] and object recognition [12, 22, 24], image clustering [11], image retrieval [11, 12, 38, 89] and face verification [12]. In these applications, either the feature dimensions vary or different types of features are extracted for the source and target domains. In particular, expensive features (which have strong representation power but high computational cost, such as CNN features [90]) can be used to guide learning an improved metric for relatively cheap features (such as LBP [91]), and interpretable text features can help the metric learning of visual features, which are often hard to interpret [21, 79].

In [11], heterogeneous TML is adopted to improve image super-resolution, which aims to generate a high-resolution (HR) image from its low-resolution (LR) counterpart. The method is based on JOR [92], which is an example-based super-resolution approach. JOR needs to find the nearest neighbors for the LR images, and a metric is learned in [11] to replace the Euclidean metric in the k-NN search by leveraging information from the HR domain.

5.2.2 Text analysis

In the text analysis area, heterogeneous TML is mainly applied by using labeled documents written in one language (such as English) to help analyze documents in another language (such as Spanish). The utilized vocabularies vary across languages, and thus the data representations are heterogeneous across domains. Some typical examples include text categorization [22, 76], sentiment classification [79] and document retrieval [76]. In [79], heterogeneous MTML is applied to email spam detection since the vocabularies of different persons' emails vary.

6 Datasets and evaluation

There are some datasets that are widely used in the literature to demonstrate the effectiveness of transfer learning algorithms. For example, in the visual recognition application, a widely adopted dataset is "Office-Caltech" [93], which is a combination of the "Office" object recognition dataset [94] and the Caltech256 dataset, obtained by choosing 10 overlapping categories. There are four different domains in the dataset: Amazon (A), Webcam (W), Dslr (D) and Caltech (C). In this paper, we choose Caltech as the target domain, and the other three are used as the source domain in turn. Statistics of the dataset are shown in Table 5. There are also some domain adaptation datasets for text classification, such as the Multi-Domain Sentiment (MDS) dataset introduced in [95], where reviews of four different product types (domains): Kitchen, Books, DVDs, and Electronics are collected from Amazon.com.

Table 5 Statistics of the "Office-Caltech" dataset
Table 6 A comparison of the homogeneous TML methods on the "Office-Caltech" dataset, where "Induc" and "Trans" signify "Inductive" and "Transductive", respectively

In this paper, we mainly compare the methods that have code publicly available, such as MTLF [32], GCA [34] and FTN [35]. We also re-implement some representative approaches, such as LDML [13] and DTML [8]. In the homogeneous inductive setting, we compare the following approaches:

  • EU: directly employing the simple Euclidean distance metric.

  • LEGO [3]: an efficient single-domain/single-task DML algorithm for fast information retrieval.

  • RDML [96]: a robust single-domain/single-task DML algorithm with theoretical guarantee.

  • LDML [13]: an early TML work that makes use of auxiliary source metrics based on log-determinant regularization.

  • DTDML [15]: a decomposition-based TML method that casts TML as a combination coefficients learning problem.

  • MTLF [32]: a unified TML approach that jointly learns the target metric and weights of source samples to enable effective knowledge transfer.

In the homogeneous transductive setting, we compare the following methods with the EU baseline and single domain metric learning algorithms:

  • DTML [8]: an early deep TML work that conducts nonlinear metric transfer using deep neural networks.

  • GCA [34]: a unified TML approach that can exploit both the statistical and geometrical information of the source and target domains.

  • FTN [35]: a recently proposed adversarial-based TML method that can simultaneously enable metric adaptation between the source and target domains and retain their discriminative power. It is mainly designed for the scenario where the source and target domains do not share the same label space.

Results of the homogeneous inductive and transductive TML methods, as well as the EU baseline and single-domain metric learning algorithms, are reported in Table 6. From the results, we mainly observe that: 1) learning the distance metric is always beneficial even though the labeled data are insufficient, which indicates that DML is useful in this application; 2) the TML methods outperform the single-domain DML algorithms in most cases, which demonstrates that knowledge is successfully transferred from the source to the target domain, and an appropriately designed TML method can be quite helpful; 3) although no (target) label information is provided in transductive TML, the performance can be better than inductive TML. This may be because the test data are accessible in the transductive setting and the designed methods are more sophisticated; 4) the deep approach DTML is superior to the non-deep GCA only in the "A-C" setting, but worse than GCA in the other two settings. This is because the numbers of labeled samples in the "Webcam" and "Dslr" datasets are not large enough to train the large number of parameters in DTML; 5) since FTN mainly focuses on tackling the problem of non-overlapping source and target class labels, it only achieves the best performance in the "W-C" setting. A comparison of the multi-task metric learning approaches can be found in [49].

In the heterogeneous setting, a popular application is to conduct transfer between images of different resolutions. For example, heterogeneous TML is applied to image super-resolution in [11]. The “Office” object recognition dataset can also be used for heterogeneous transfer [97] by transferring between images captured with a high-resolution digital single-lens reflex (DSLR) camera and a low-resolution web camera. Another popular application is cross-lingual text classification, and a widely adopted dataset is the cross-lingual sentiment (CLS) dataset [98], where the transfer is conducted between Amazon product reviews written in different languages (domains): English (E), French (F), German (G) and Japanese (J). In this survey, we choose English as the source domain, and the other three are regarded as the target domain in turn. Statistics of this dataset are reported in Table 7. Specifically, we compare the following heterogeneous TML methods with the EU baseline and the single-domain metric learning algorithms (a minimal cross-space transfer sketch is given after the list):

Table 7 Statistics of the “CLS” dataset

  • MI [11]: a heterogeneous TML algorithm via manifold transfer.

  • GB-HTDML [12, 24]: a general heterogeneous TML method via knowledge fragments transfer.

  • OHTML [40]: an efficient online heterogeneous TML approach that can handle streaming data.
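
A shared ingredient of these heterogeneous methods is the use of unlabeled corresponding pairs to connect the two feature spaces. The following is a minimal sketch in that spirit, assuming a simple ridge-regression map from the target to the source feature space and synthetic data; it is not the actual MI, GB-HTDML or OHTML algorithm.

```python
import numpy as np

def fit_cross_space_map(Zt, Zs, lam=1e-2):
    """Ridge-regression map W from target features Zt (n x dt) to source
    features Zs (n x ds), fitted on unlabeled *corresponding* pairs."""
    dt = Zt.shape[1]
    return np.linalg.solve(Zt.T @ Zt + lam * np.eye(dt), Zt.T @ Zs)

def target_distance(x, y, W):
    """Distance between two target-domain samples measured in the source space."""
    return float(np.linalg.norm(x @ W - y @ W))

rng = np.random.default_rng(0)
Zt = rng.normal(size=(100, 20))                                         # cheap target features
Zs = Zt @ rng.normal(size=(20, 50)) + 0.1 * rng.normal(size=(100, 50))  # richer source features
W = fit_cross_space_map(Zt, Zs)
print(target_distance(Zt[0], Zt[1], W))
```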

Table 8 A comparison of the heterogeneous TML approaches on the “CLS” dataset

Results of the heterogeneous TML approaches and the EU baseline, as well as the single-domain metric learning algorithms, are reported in Table 8. From the results, we observe that: 1) the performance of the EU baseline is quite poor, which indicates that the simple EU metric is inappropriate in this application; 2) the transfer approach MI is superior only to the EU baseline, since it is an unsupervised TML approach for which no label information is available; 3) the performance can be improved significantly by learning the metric, and further improved by transferring information from the source domain.

7 Conclusion and discussion

7.1 Summary

In this survey, we provide a comprehensive and structured overview of transfer metric learning (TML) methods and their applications. We group TML into homogeneous and heterogeneous TML according to the feature setting. Similar to [7], the TML approaches can also be classified into inductive, transductive and unsupervised TML according to the label setting. According to the transfer strategy, we further categorize the TML approaches into four contexts, i.e., TML via metric approximation, TML via distribution approximation, TML via subspace approximation and TML via distance approximation.

Homogeneous TML has been studied extensively under the inductive setting, where various transfer strategies can be adopted. In the transductive setting, TML is mainly conducted by distribution approximation, and there are still no unsupervised methods for homogeneous TML. Unsupervised TML can, however, be carried out under the heterogeneous setting: if more powerful features are utilized in the source domain, the distance estimates there can be better than those in the target domain [11]. Since the data representations differ across domains in heterogeneous TML, most of these approaches find a common subspace for knowledge transfer.

A major limitation of TML is that it is less flexible than general transfer learning, since it focuses only on transferring the distance metric, whereas in general transfer learning many other types of knowledge (such as parameters, relations and policies) can be transferred. Besides, a distance-based loss has to be adopted in TML, whereas in the deep learning era we usually aim to learn a good embedding of the input, no matter which type of loss is utilized.
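
To make the notion of a distance-based loss concrete, the sketch below shows a standard triplet loss on embeddings; it is generic (deep) metric learning rather than the objective of any specific TML method.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Distance-based loss: the positive should be closer to the anchor than
    the negative by at least `margin` (a standard deep metric learning loss)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 8))   # toy embeddings of an anchor/positive/negative
print(triplet_loss(a, p, n))
```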

7.2 Challenges and future directions

We finally identify some challenges in TML and speculate on several possible future directions.

7.2.1 Selective transfer in TML

Current transfer learning and TML algorithms usually assume that the source tasks or domain samples are positively related to the target ones. However, this assumption may not hold in real-world applications [14, 99]. The TML algorithm presented in [14] can leverage negatively correlated tasks by learning a task correlation matrix. In [100], the relations of 26 popular visual learning tasks are learned using a large image dataset, where each image has annotations in all tasks. This leads to a task taxonomy map, which can be used to guide the choice of appropriate supervision policies in transfer learning. Different from these approaches, which consider selective transfer [101] at the task level, a heterogeneous transfer learning method based on the attention mechanism is proposed in [99], which can avoid negative transfer at the instance level. The low-rank TML model presented in [64] can also avoid negative transfer to some extent by filtering noisy information in the source domain.

Task correlations have been exploited for metric approximation based TML [14], and the attention scheme can be used for subspace approximation based TML following [99]. It is still unclear how to conduct selective transfer in distribution and distance approximation based TML. Adopting the attention scheme is one possible choice, but it cannot make use of negative transfer. Therefore, a promising future direction may be to design appropriate transferability metrics [102] or conduct selective transfer at the hypothesis-space level so that both positive and negative transfer can be effectively utilized.

7.2.2 More theoretical understanding of TML

A theoretical analysis of metric approximation based homogeneous TML is presented in [28], where the generalization error of the target metric is bounded in terms of the complexity of the auxiliary source metric and its performance on the target training set. Besides, a weight is learned for the source metric to minimize the bound. There is also a theoretical study for heterogeneous TML in [12], which shows that the generalization ability of the target metric can be improved by directly enforcing the source feature mappings to agree with the target mappings. However, there is still a lack of general analysis schemes (such as [103,104,105]) and theoretical results for TML. In particular, more theoretical studies should be conducted to understand when and how source-domain knowledge can help target metric learning.

7.2.3 TML for handling complex data

Most current TML approaches only learn linear metrics (such as the Mahalanobis metric). However, there may be nonlinear structure in the data, e.g., in most visual feature representations. A linear metric may fail to capture such structure, and hence it is desirable to learn a nonlinear metric for the target domain in TML. There have been several works on nonlinear homogeneous TML based on neural networks [8, 69, 72], but all of them are mainly designed for continuous real-valued data and learn real-valued metrics. More studies can be conducted on histogram data or on learning binary target metrics: histogram data are common in visual analytics applications, and binary metrics are efficient in distance calculation. To the best of our knowledge, there is only one nonlinear TML work under the heterogeneous setting [24] (with an extension presented in [12]), where the gradient boosting regression tree (GBRT) [106, 107] is adopted to learn a nonlinear metric in the target domain. Other nonlinear learning techniques can be investigated, and binary metrics can also be learned to accelerate prediction. In addition, when the structure of the target data distribution is very complex (such as for graph data), it could be a good choice to learn a Riemannian metric [108] or multiple local metrics [109] to approximate the geodesic distance in the target domain.
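
As an illustration of the form such nonlinear metrics take, the sketch below measures distance between the outputs of a small embedding network, d(x, y) = ||f(x) - f(y)||; the weights here are random placeholders, whereas a nonlinear TML method would learn them from the target (and source) data.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 16)), np.zeros(16)   # placeholder weights; in practice
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)     # these would be learned

def embed(x):
    """A small nonlinear embedding f(x); deep TML methods learn such an f."""
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

def nonlinear_dist(x, y):
    """Nonlinear target metric of the form d(x, y) = ||f(x) - f(y)||."""
    return float(np.linalg.norm(embed(x) - embed(y)))

x, y = rng.normal(size=(2, 64))
print(nonlinear_dist(x, y))
```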

7.2.4 TML for handling changing and big data

In TML, all the training data in the source and target domains are usually assumed to be provided at once, and a fixed target metric is learned. However, in real-world applications the data usually arrive sequentially and the data distribution may change over time. For example, tremendous amounts of data are uploaded to the web every day, and for a robot the environment changes over time while feedback is provided continuously. Therefore, it is desirable to develop TML algorithms that adapt the metric to such changes. Closely related topics include online learning [110, 111] and lifelong/continual learning [112, 113]. A recent attempt is made in [40], where an online heterogeneous TML approach is developed. However, this approach needs abundant unlabeled corresponding data in the source and target domains for knowledge transfer, and hence it may not be efficient when vast amounts of unlabeled data are needed to achieve satisfactory accuracy.
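
The sketch below illustrates the generic online DML loop that such streaming approaches build on: one gradient-style update of the Mahalanobis matrix per incoming pair constraint, followed by a projection back onto the positive semi-definite cone. It is a toy under simulated data, not the OHTML algorithm.

```python
import numpy as np

def psd_project(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.maximum(w, 0.0)) @ V.T

def online_metric_update(M, x, y, similar, eta=0.05, margin=1.0):
    """One streaming update of a Mahalanobis metric M from a single pair:
    shrink the distance for similar pairs, enlarge it for dissimilar ones."""
    diff = x - y
    d = diff @ M @ diff
    if similar and d > margin:            # too far apart: pull together
        M = M - eta * np.outer(diff, diff)
    elif not similar and d < margin:      # too close: push apart
        M = M + eta * np.outer(diff, diff)
    return psd_project(M)

rng = np.random.default_rng(0)
M = np.eye(5)
for _ in range(100):                      # simulated stream of pair constraints
    x, y = rng.normal(size=(2, 5))
    M = online_metric_update(M, x, y, similar=bool(rng.integers(0, 2)))
print(np.round(M, 2))
```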

Although the amount of training data in the target domain is often assumed to be small, continuously changing data become “big” in the long term. In addition, when the feature dimension is high, the cost of computing distances between vast numbers of samples with a learned Mahalanobis metric is intolerable; information retrieval is a typical example. Therefore, it is desirable to learn a target metric that is efficient in distance calculation, e.g., a Hamming distance metric [114, 115] or feature hashing [58, 116,117,118] in the target domain.
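
As a rough illustration of why binary metrics are attractive here, the sketch below hashes features into binary codes with a projection matrix (random for simplicity; a TML method would learn it) and computes Hamming distances to a query in one vectorized pass.

```python
import numpy as np

def binary_codes(X, P):
    """Hash real-valued features into binary codes via the sign of a projection P."""
    return (X @ P > 0).astype(np.uint8)

def hamming_dist(a, b):
    """Hamming distance between two binary codes: the number of differing bits."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))          # target-domain features
P = rng.normal(size=(128, 64))            # 64-bit codes; ideally P is learned, not random
codes = binary_codes(X, P)
query = codes[0]
print(hamming_dist(codes[1], query))
dists = np.count_nonzero(codes != query, axis=1)   # distances to the whole database at once
print(dists[:10])
```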

7.2.5 TML for handling extreme cases

One-shot learning [119] and zero-shot learning [120] are two extreme cases of transfer learning. In these cases, the number of labeled samples in the target domain is very small (possibly only one) or even zero. The main goal is to recognize rare or unseen classes [121, 122], where some additional knowledge (such as descriptions of the relations between existing and unseen classes) may be provided. This is closer to human learning and very useful in practice. These settings are closely related to the concept of domain generalization [123,124,125].

DML has been found to be useful in learning unknown classifiers [126] (with an extension in [127]), but it does not aim to learn a metric in the target domain. In [128], an unbiased metric is learned across different domains, but no specific information about the target domain is leveraged. Although some existing TML algorithms allow no labeled data in the target domain [8, 11], they need large amounts of unlabeled target data, which can be regarded as additional knowledge. If we do not have unlabeled data, is it possible to utilize other semantic information to help the target metric learning? One attempt is made in [129], where the ColorChecker Chart is utilized as additional information for person re-identification under the one-shot setting. However, such information is not easy to obtain and does not generalize across applications. Hence, more common and easily accessible knowledge should be identified and explored for general TML under the one/zero-shot [130] and open-world [131,132,133] settings.

7.2.6 TML under large pre-trained models

Recently, CLIP (contrastive language–image pre-training) [134], a large pre-trained model, has been proposed to learn highly effective and generalizable image representations by conducting a simple pre-training task on vast amounts of image-text pairs. This is very similar to the idea of unsupervised heterogeneous TML, where a large number of unlabeled corresponding pairs from different feature spaces are utilized to learn an improved representation for the target domain. Therefore, a possible future direction is to utilize multi-modal pre-trained models to improve the performance of heterogeneous TML.
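
As a speculative illustration of this direction, the sketch below treats CLIP image embeddings as the richer “source” feature space and aligns cheap target-domain features with it through unlabeled corresponding pairs, re-using the ridge-regression map from the heterogeneous sketch above. The embeddings are random placeholders; in practice they would be extracted with a pre-trained CLIP model (e.g., via the transformers or open_clip packages).

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholders standing in for CLIP embeddings of n unlabeled target images;
# in practice these would come from a pre-trained CLIP image encoder.
clip_feats = rng.normal(size=(200, 512))
cheap_feats = rng.normal(size=(200, 32))     # cheap target-domain features of the same images

# Align the cheap features with the CLIP space using the unlabeled pairs.
lam = 1e-2
W = np.linalg.solve(cheap_feats.T @ cheap_feats + lam * np.eye(32),
                    cheap_feats.T @ clip_feats)

def dist_in_clip_space(x, y):
    """Distance between two cheap-feature samples measured in the CLIP space."""
    return float(np.linalg.norm(x @ W - y @ W))

print(dist_in_clip_space(cheap_feats[0], cheap_feats[1]))
```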