Abstract
Recently, metric learning and similarity learning have attracted a large amount of interest. Many models and optimization algorithms have been proposed. However, there is relatively little work on the generalization analysis of such methods. In this paper, we derive novel generalization bounds for metric and similarity learning. In particular, we first show that the generalization analysis reduces to the estimation of the Rademacher average over “sums-of-i.i.d.” sample-blocks related to the specific matrix norm. Then, we derive generalization bounds for metric/similarity learning with different matrix-norm regularizers by estimating their specific Rademacher complexities. Our analysis indicates that sparse metric/similarity learning with \(L^1\)-norm regularization could lead to significantly better bounds than those with Frobenius-norm regularization. Our novel generalization analysis develops and refines the techniques of U-statistics and Rademacher complexity analysis.
1 Introduction
The success of many machine learning algorithms (e.g. nearest neighbor classification and k-means clustering) depends on the concepts of distance metric and similarity. For instance, the k-nearest-neighbor (kNN) classifier depends on a distance function to identify the nearest neighbors for classification, and the k-means algorithm depends on pairwise distances between examples for clustering. Kernel methods and information retrieval methods rely on a similarity measure between samples. Many existing studies have been devoted to learning a metric or similarity automatically from data, tasks usually referred to as metric learning and similarity learning, respectively.
Most work in metric learning focuses on learning a (squared) Mahalanobis distance defined, for any \(x,t\in \mathbb {R}^d\), by \(d_M(x,t)= (x-t)^\top M(x-t)\), where M is a positive semi-definite matrix; see e.g. (Bar-Hillel et al. 2005; Davis et al. 2007; Globerson and Roweis 2005; Goldberger et al. 2004; Shen et al. 2009; Weinberger and Saul 2008; Xing et al. 2002; Yang and Jin 2007; Ying et al. 2009). Concurrently, the pairwise similarity defined by \(s_M(x,t)=x^\top Mt\) was studied in Chechik et al. (2010), Kar and Jain (2011), Maurer (2008), Shalit et al. (2010). These methods have been successfully applied to various real-world problems including information retrieval and face verification (Chechik et al. 2010; Guillaumin et al. 2009; Hoi et al. 2006; Ying and Li 2012). Although there are a large number of studies devoted to supervised metric/similarity learning based on different objective functions, few studies address the generalization analysis of such methods. The recent work of Jin et al. (2009) pioneered the generalization analysis for metric learning using the concept of uniform stability (Bousquet and Elisseeff 2002). However, that approach only works for strongly convex norms such as the Frobenius norm, and its bias term is fixed, which makes the generalization analysis essentially different from ours.
In this paper, we develop a novel approach for the generalization analysis of metric learning and similarity learning which can deal with general matrix regularization terms, including the Frobenius norm (Jin et al. 2009), the sparse \(L^1\)-norm (Rosales and Fung 2006), the mixed (2, 1)-norm (Ying et al. 2009) and the trace-norm (Ying et al. 2009; Shen et al. 2009). In particular, we first show that the generalization analysis for metric/similarity learning reduces to the estimation of the Rademacher average over “sums-of-i.i.d.” sample-blocks related to the specific matrix norm, which we refer to as the Rademacher complexity for metric (similarity) learning. Then, we show how to estimate the Rademacher complexities with different matrix regularizers. Our analysis indicates that sparse metric/similarity learning with \(L^1\)-norm regularization could lead to significantly better generalization bounds than those with Frobenius-norm regularization, especially when the dimensionality of the input data is high. This is nicely consistent with the rationale that sparse methods are more effective for high-dimensional data analysis. Our novel generalization analysis develops and extends Rademacher complexity analysis (Bartlett and Mendelson 2002; Koltchinskii and Panchenko 2002) to the setting of metric/similarity learning by using techniques of U-statistics (Clémencon et al. 2008; Peña and Giné 1999).
The paper is organized as follows. The next section reviews the models of metric/similarity learning. Section 3 establishes the main theorems. In Sect. 4, we derive and discuss generalization bounds for metric/similarity learning with various matrix-norm regularization terms. Section 5 concludes the paper.
Notation Let \(\mathbb {N}_n = \{1,2,\ldots , n\}\) for any \(n\in \mathbb {N}\). For any \(X,Y\in \mathbb {R}^{d\times n}\), \(\langle X, Y\rangle = \mathbf{Tr}(X^\top Y)\), where \(\mathbf{Tr}(\cdot )\) denotes the trace of a matrix. The space of symmetric \(d\times d\) matrices will be denoted by \(\mathbb {S}^d.\) We equip \(\mathbb {S}^d\) with a general matrix norm \(\Vert \cdot \Vert \); it can be, for example, the Frobenius norm, the trace-norm or a mixed norm. Its associated dual norm is defined, for any \(M\in \mathbb {S}^d\), by \(\Vert M\Vert _*= \sup \{ \langle X, M \rangle : X\in \mathbb {S}^d, \Vert X\Vert \le 1 \}.\) The Frobenius norm on matrices or vectors is always denoted by \(\Vert \cdot \Vert _F.\) The cone of positive semi-definite matrices is denoted by \(\mathbb {S}_+^d\). Later on we use the notation \(X_{ij}=(x_i-x_j)(x_i-x_j)^\top \) and \(\widetilde{X}_{ij} = x_ix_j^\top .\)
2 Metric/similarity learning formulation
In our learning setting, we have an input space \({\mathcal {X}}\subseteq \mathbb {R}^d\) and an output (label) space \({\mathcal {Y}}\). Denote \({\mathcal {Z}}= {\mathcal {X}}\times {\mathcal {Y}}\) and let \(\mathbf{z}:= \{z_i=(x_i,y_i)\in {\mathcal {Z}}: i\in \mathbb {N}_n\}\) be a training set drawn i.i.d. according to an unknown distribution \(\rho \) on \({\mathcal {Z}}.\) Denote the \(d \times n\) input data matrix by \(\mathbf{X} = (x_i: i\in \mathbb {N}_n)\) and the \(d \times d\) distance matrix by \(M= (M_{\ell k})_{\ell ,k\in \mathbb {N}_d}\). Then, the (pseudo-) distance between \(x_i\) and \(x_j\) is measured by \(d_M(x_i,x_j)= (x_i-x_j)^\top M(x_i-x_j).\)
The goal of metric learning is to identify a distance function \(d_M(x_i,x_j)\) that yields a small value for a similar pair and a large value for a dissimilar pair. The bilinear similarity function is defined by \(s_M(x_i,x_j)= x_i^\top M x_j.\)
Similarly, the goal of similarity learning is to learn \(M\in \mathbb {S}^d\) such that it reports a large similarity value for a similar pair and a small similarity value for a dissimilar pair. It is worth pointing out that we do not require the matrix M to be positive semi-definite throughout this paper. However, we do assume M to be symmetric, since this guarantees that the distance (similarity) between \(x_i\) and \(x_j\) equals that between \(x_j\) and \(x_i\).
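As a concrete illustration of these two measures, here is a minimal NumPy sketch (an illustration only, not part of the original formulation; the matrix M below is an arbitrary symmetric matrix chosen for the example):

```python
import numpy as np

def mahalanobis_distance(M, x, t):
    """(Pseudo-) distance d_M(x, t) = (x - t)^T M (x - t)."""
    diff = x - t
    return float(diff @ M @ diff)

def bilinear_similarity(M, x, t):
    """Bilinear similarity s_M(x, t) = x^T M t."""
    return float(x @ M @ t)

# A random symmetric (not necessarily positive semi-definite) matrix M.
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
M = (A + A.T) / 2                     # symmetrize, as assumed throughout the paper
x, t = rng.standard_normal(d), rng.standard_normal(d)

# With M symmetric, both measures are symmetric in their two arguments.
assert np.isclose(mahalanobis_distance(M, x, t), mahalanobis_distance(M, t, x))
assert np.isclose(bilinear_similarity(M, x, t), bilinear_similarity(M, t, x))
```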
There are two main terms in the metric/similarity learning model: the empirical error and the matrix regularization term. The empirical error term exploits the similarity and dissimilarity information provided by the labels, while the matrix regularization term prevents overfitting and improves the generalization performance.
For any pair of samples \((x_i,x_j)\), let \(r(y_i,y_j) = 1\) if \(y_i = y_j\) and \(r(y_i,y_{j})=-1\) otherwise. It is expected that there exists a bias term \(b\in \mathbb {R}\) such that \(d_M(x_i,x_j)\le b\) when \(r(y_i,y_j) = 1\) and \(d_M(x_i,x_j)>b\) otherwise. This naturally leads to the empirical error (Jin et al. 2009) defined by
where the indicator function I[x] equals to 1 if x is true and zero otherwise.
Due to the indicator function, the above empirical error is non-differentiable and non-convex, which makes it difficult to optimize. A usual way to overcome this shortcoming is to upper-bound the indicator function by a differentiable and convex loss. For instance, using the hinge loss leads to the following empirical error: \({\mathcal {E}}_\mathbf{z}(M,b) = {1\over n(n-1)}\sum _{i\ne j}\bigl (1+ r(y_i,y_j)(d_M(x_i,x_j)-b)\bigr )_+.\)
In order to avoid overfitting, we enforce a regularization term \(\Vert M\Vert \), which restricts the complexity of the distance matrix. We emphasize that \(\Vert \cdot \Vert \) denotes a general matrix norm on the linear space \(\mathbb {S}^d\). Putting the regularization term and the empirical error term together yields the following metric learning model: \(\min _{M\in \mathbb {S}^d,\, b\in \mathbb {R}} \bigl \{{\mathcal {E}}_\mathbf{z}(M,b) + \lambda \Vert M\Vert ^2\bigr \}, \qquad (2)\)
where \(\lambda >0\) is a trade-off parameter.
Different regularization terms lead to different metric learning formulations. For instance, the Frobenius norm \(\Vert M\Vert _F\) is used in Jin et al. (2009). To promote element-wise sparsity, Rosales and Fung (2006) introduced the \(L^1\)-norm regularization \(\Vert M\Vert = \sum _{\ell ,k\in \mathbb {N}_d}|M_{\ell k}|.\) Ying et al. (2009) proposed the mixed (2, 1)-norm \( \Vert M\Vert = \sum _{\ell \in \mathbb {N}_d}\bigl (\sum _{k\in \mathbb {N}_d}|M_{\ell k}|^2\bigr )^{1\over 2}\) to encourage column-wise sparsity of the distance matrix. The trace-norm regularization \(\Vert M\Vert = \sum _{\ell }\sigma _\ell (M)\) was also considered by Ying et al. (2009) and Shen et al. (2009). Here, \(\{\sigma _\ell (M): \ell \in \mathbb {N}_d\}\) denotes the singular values of the matrix \(M\in \mathbb {S}^d.\) Since M is symmetric, the singular values of M are identical to the absolute values of its eigenvalues.
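To make the model concrete, the following sketch (an illustration only, not the authors' implementation; the function names are ours) evaluates the regularized empirical objective in (2) for a candidate pair \((M, b)\) with any of the four regularizers just listed:

```python
import numpy as np

def matrix_norm(M, kind="fro"):
    """The matrix regularizers discussed in the text (M is assumed symmetric)."""
    if kind == "fro":      # Frobenius norm
        return np.linalg.norm(M, "fro")
    if kind == "l1":       # element-wise L^1-norm
        return np.abs(M).sum()
    if kind == "l21":      # mixed (2,1)-norm: sum of the row-wise L^2-norms
        return np.linalg.norm(M, axis=1).sum()
    if kind == "trace":    # trace-norm: sum of singular values (= sum |eigenvalues|)
        return np.abs(np.linalg.eigvalsh(M)).sum()
    raise ValueError(f"unknown norm: {kind}")

def metric_objective(M, b, X, y, lam, kind="fro"):
    """Empirical hinge loss over all pairs i != j plus lambda * ||M||^2."""
    n = len(y)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = 1.0 if y[i] == y[j] else -1.0
            diff = X[i] - X[j]
            d_M = diff @ M @ diff
            loss += max(0.0, 1.0 + r * (d_M - b))
    return loss / (n * (n - 1)) + lam * matrix_norm(M, kind) ** 2
```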
In analogy to the formulation of metric learning, we consider the following empirical error for similarity learning (Maurer 2008; Chechik et al. 2010): \(\widetilde{{\mathcal {E}}}_\mathbf{z}(M,b) = {1\over n(n-1)}\sum _{i\ne j}\bigl (1- r(y_i,y_j)(s_M(x_i,x_j)-b)\bigr )_+.\)
This leads to the following regularized formulation for similarity learning: \(\min _{M\in \mathbb {S}^d,\, b\in \mathbb {R}} \bigl \{\widetilde{{\mathcal {E}}}_\mathbf{z}(M,b) + \lambda \Vert M\Vert ^2\bigr \}. \qquad (4)\)
The work of Maurer (2008) used Frobenius-norm regularization for similarity learning, while trace-norm regularization was used by Shalit et al. (2010) to encourage a low-rank similarity matrix M.
3 Statistical generalization analysis
In this section, we give a detailed proof of the generalization bounds for metric and similarity learning. In particular, we develop a novel line of generalization analysis for metric and similarity learning with general matrix regularization terms. The key observation is that the empirical data term \({\mathcal {E}}_\mathbf{z}(M,b)\) for metric learning is a modification of a U-statistic, and it is expected to converge to its expected counterpart defined by
The empirical term \(\widetilde{{\mathcal {E}}}_\mathbf{z}(M,b)\) for similarity learning is expected to converge to
The target of the generalization analysis is to bound the true error \({\mathcal {E}}(M_\mathbf{z},b_\mathbf{z})\) by the empirical error \({\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z})\) for metric learning, and \(\widetilde{{\mathcal {E}}}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z})\) by the empirical error \(\widetilde{{\mathcal {E}}}_\mathbf{z}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z})\) for similarity learning.
In the sequel, we provide a detailed proof of the generalization bounds for metric learning. Since the proof for similarity learning is exactly the same as that for metric learning, we only state the corresponding results, followed by some brief comments.
3.1 Bounding the solutions
By the definition of \((M_\mathbf{z}, b_\mathbf{z})\) as a minimizer of formulation (2), comparing its objective value with that of \((\mathbf {0},0)\), we know that \({\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z}) + \lambda \Vert M_\mathbf{z}\Vert ^2 \le {\mathcal {E}}_\mathbf{z}(\mathbf {0},0) + \lambda \Vert \mathbf {0}\Vert ^2 = 1,\) which implies that \(\Vert M_\mathbf{z}\Vert \le 1/\sqrt{\lambda }.\)
Now we turn our attention to deriving a bound on the bias term \(b_\mathbf{z}\) by modifying the techniques in Chen et al. (2004), which were originally developed to estimate the offset term of the soft-margin SVM.
Lemma 1
For any samples \(\mathbf{z}\) and \(\lambda >0\), there exists a minimizer \((M_\mathbf{z},b_\mathbf{z})\) of formulation (2) such that \(\min _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\le 1 \quad \hbox {and}\quad \max _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\ge -1. \qquad (8)\)
Proof
We first prove the inequality \(\min _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\le 1.\) To this end, we begin by considering the special case where the training set \(\mathbf{z}\) contains only two examples with distinct labels, i.e. \(\mathbf{z}=\{z_i=(x_i,y_i): i=1,2\}\) with \(x_1\ne x_2\) and \(y_1\ne y_2.\) For any \(\lambda >0,\) let \((M_\mathbf{z}, b_\mathbf{z})=(\mathbf {0},-1) \), and observe that \({\mathcal {E}}_\mathbf{z}(\mathbf {0},-1) + \lambda \Vert \mathbf {0}\Vert ^2 = 0\). This observation implies that \((M_\mathbf{z}, b_\mathbf{z})\) is a minimizer of problem (2). Consequently, we have the desired result in this extreme case, since \(\min _{i \ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]=d_{M_\mathbf{z}}(x_1,x_2) -b_\mathbf{z}=1.\)
Now let us consider the general case where the training set \(\mathbf{z}\) has at least two examples with the same label, i.e.
In this general case, we prove the inequality \(\min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\le 1\) by contradiction. Suppose that \(s :=\min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)-b_\mathbf{z}]>1\) which equivalently implies that \(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s-1)\ge 1\) for any \(i\ne j.\) Hence, for any pair of examples \((x_i,x_j)\) with distinct labels, i.e. \(y_i\ne y_j\) (equivalently \(r(y_i,y_j)=-1\)), there holds
Consequently,
The above estimation implies that \({\mathcal {E}}_{\mathbf {z}}(M_\mathbf {z}, b_\mathbf {z}+s-1) + \lambda \Vert M_\mathbf {z}\Vert ^ 2 < {\mathcal {E}}_{\mathbf {z}}(M_\mathbf {z}, b_\mathbf {z})+ \lambda \Vert M_\mathbf {z}\Vert ^ 2\) which contradicts the definition of the minimizer \((M_\mathbf{z},b_\mathbf{z})\). Hence, \(s = \min _{i \ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\,{\le }\, 1\).
Secondly, we prove the inequality \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\ge -1\) in analogy to the above argument. Consider the special case where the training set \(\mathbf{z}\) contains only two examples with the same label, i.e. \(\mathbf{z}=\{z_i=(x_i,y_i): i=1,2\}\) with \(x_1\ne x_2\) and \(y_1 = y_2.\) For any given \(\lambda >0,\) let \((M_\mathbf{z},b_\mathbf{z})= (\mathbf {0},1).\) Since \( {\mathcal {E}}_\mathbf{z}(\mathbf {0},1) + \lambda \Vert \mathbf {0}\Vert ^2 = 0\), \((\mathbf {0},1)\) is a minimizer of problem (2). The desired estimation follows from the fact that \(\max _{i\ne j} [d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}]=0-1= -1.\)
Now let us consider the general case where the training set \(\mathbf{z}\) has at least two examples with distinct labels, i.e.
We prove the estimation \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\ge -1\) by contradiction. Assume \(s:=\max _{i\ne j} [d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}] < -1;\) then \(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s+1)\le -1\) holds for any \(i\ne j.\) This implies, for any pair of examples \((x_i,x_j)\) with the same label, i.e. \(r(y_i,y_j)=1\), that \(\bigl (1+r(y_i,y_j)(d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}-s-1)\bigr )_+=0.\) Hence,
The above estimation yields that \({\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}+s+1)+\lambda \Vert M_\mathbf{z}\Vert ^2<{\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z})+\lambda \Vert M_\mathbf{z}\Vert ^2\) which contradicts the definition of the minimizer \((M_\mathbf{z}, b_\mathbf{z})\). Hence, we have the desired inequality \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}] \ge -1\) which completes the proof of the lemma. \(\square \)
Corollary 2
For any samples \(\mathbf{z}\) and \(\lambda >0\), there exists a minimizer \((M_\mathbf{z},b_\mathbf{z})\) of formulation (2) such that \(|b_\mathbf{z}| \le 1 + \bigl (\max _{i\ne j } \Vert X_{ij}\Vert _*\bigr )\Vert M_\mathbf{z}\Vert . \qquad (9)\)
Proof
From inequality (8) in Lemma 1, we see that \(-b_\mathbf{z}+ \min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)]\le 1\) and \(\max _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)] \ge b_\mathbf{z}-1.\) Equivalently, this implies that \(-b_\mathbf{z}\le 1- \min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)]\) and \(b_\mathbf{z}\le 1+ \max _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)].\) Recall that \(X_{ij} = (x_i-x_j)(x_i- x_j)^\top \) and observe, by the definition of the dual norm \(\Vert \cdot \Vert _*\), that \(d_{M_\mathbf{z}}(x_i,x_j) = \langle X_{ij}, M_\mathbf{z}\rangle \le \Vert X_{ij}\Vert _*\,\Vert M_\mathbf{z}\Vert \) for any \(i\ne j.\)
Combining this observation with the above estimates, we have that \(-b_\mathbf{z}\le 1+ \bigl (\max _{i\ne j } \Vert X_{ij}\Vert _*\bigr )\Vert M_\mathbf{z}\Vert \) and \(b_\mathbf{z}\le 1+ \bigl (\max _{i\ne j } \Vert X_{ij}\Vert _*\bigr )\Vert M_\mathbf{z}\Vert ,\) which yields the desired result. \(\square \)
Denote \({\mathcal {F}}= \bigl \{(M,b)\in \mathbb {S}^d\times \mathbb {R}: \Vert M\Vert \le 1/\sqrt{\lambda },\ |b| \le 1+ X_*\Vert M\Vert \bigr \},\) where \(X_*= \sup _{x,x'\in {\mathcal {X}}}\Vert (x-x')(x-x')^\top \Vert _*.\)
From the above corollary, for any samples \(\mathbf{z}\) we can easily see that at least one optimal solution \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) belongs to the bounded set \({\mathcal {F}}\subseteq \mathbb {S}^d \times \mathbb {R}.\)
We end this subsection with two remarks. Firstly, from the proof of Lemma 1 and Corollary 2, we can easily see that, if the set of training samples contains at least two examples with distinct labels and two examples with the same label, then all minimizers of formulation (2) satisfy inequalities (8) and (9). Hence, in this case all minimizers \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) belong to the bounded set \({\mathcal {F}}\). Consequently, we assume, without loss of generality, that any minimizer \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) satisfies inequality (9) and belongs to the set \({\mathcal {F}}\). Secondly, our formulation (2) for metric learning focuses on the hinge loss, which is widely used in the metric learning community; see e.g. Jin et al. (2009), Weinberger and Saul (2008), Ying and Li (2012). Similar results to those in the above corollary can easily be obtained for the q-norm loss given, for any \(x\in \mathbb {R}\), by \((1-x)_+^q\) with \(q>1\). However, it remains an open question how to estimate the term b for general loss functions.
3.2 Generalization bounds
Before stating the generalization bounds, we introduce some notation. For any \(z=(x,y), z'=(x',y')\in {\mathcal {Z}}\), let \(\Phi _{M,b} (z, z') = (1+ r(y,y')(d_{M}(x,x')-b))_+.\) Hence, for any \((M,b)\in {\mathcal {F}}\),
Let \( \lfloor {n\over 2}\rfloor \) denote the integer part of \({n\over 2}\) and recall the definition \(X_{ij} = (x_i-x_j)(x_i-x_j)^\top \). We now define the Rademacher average over sums-of-i.i.d. sample-blocks related to the dual matrix norm \(\Vert \cdot \Vert _*\) by \(\widehat{R}_n = {1\over \lfloor {n\over 2}\rfloor }\,{\mathbb {E}}_\sigma \bigl [\bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i X_{i(\lfloor {n\over 2}\rfloor +i)}\bigr \Vert _*\bigr ], \qquad (12)\) where \(\sigma _1,\ldots ,\sigma _{\lfloor {n\over 2}\rfloor }\) are i.i.d. Rademacher variables taking values in \(\{\pm 1\}\),
and its expectation is denoted by \( R_n = {\mathbb {E}}_\mathbf{z}\bigl [\widehat{R}_n\bigr ]\). Our main theorem below shows that the generalization bounds for metric learning critically depend on the quantity \(R_n\). For this reason, we refer to \(R_n\) as the Rademacher complexity for metric learning. It is worth mentioning that the metric learning formulation (2) depends on the norm \(\Vert \cdot \Vert \) of the linear space \(\mathbb {S}^d\), while the Rademacher complexity \(R_n\) is related to its dual norm \(\Vert \cdot \Vert _*\).
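As a numerical illustration of the quantity \(\widehat{R}_n\) in (12), the sketch below estimates it by Monte Carlo over the Rademacher signs on toy data, with the Frobenius norm standing in for the dual norm \(\Vert \cdot \Vert _*\) (an illustration only; the sampling scheme is not part of the analysis):

```python
import numpy as np

def empirical_rademacher_metric(X, dual_norm=lambda A: np.linalg.norm(A, "fro"),
                                n_mc=200, seed=0):
    """Monte Carlo estimate of R_hat_n = (1/m) E_sigma || sum_i sigma_i X_{i,m+i} ||_*,
    where m = floor(n/2) and X_{ij} = (x_i - x_j)(x_i - x_j)^T."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n // 2
    # The m "sums-of-i.i.d." sample-blocks X_{i, m+i}.
    blocks = [np.outer(X[i] - X[m + i], X[i] - X[m + i]) for i in range(m)]
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=m)
        S = sum(s * B for s, B in zip(sigma, blocks))
        total += dual_norm(S)
    return total / (n_mc * m)

# Toy data on [0,1]^d; compare with the bound 2*d/sqrt(n) on R_n derived in Sect. 4.
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.random((n, d))
print(empirical_rademacher_metric(X), 2 * d / np.sqrt(n))
```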
Theorem 3
Let \((M_\mathbf{z}, b_\mathbf{z})\) be the solution of formulation (2). Then, for any \(0<\delta <1\), with probability \(1-\delta \) we have that
Proof
The proof of the theorem can be divided into three steps as follows.
Step 1: Let \({\mathbb {E}}_\mathbf{z}\) denote the expectation with respect to samples \(\mathbf{z}\). Observe that \({\mathcal {E}}(M_\mathbf{z},b_\mathbf{z}) - {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z}) \le \displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ].\) For any \(\mathbf{z}= (z_1,\ldots ,z_{k-1},z_k,z_{k+1},\ldots , z_n)\) and \( \mathbf{z}'=(z_1,\ldots , z_{k-1},z'_k, z_{k+1},\ldots , z_n)\) we know from inequality (11) that
Applying McDiarmid’s inequality (McDiarmid 1989) (see Lemma 6 in the “Appendix”) to the term \(\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\), with probability \(1-{\delta }\) there holds
Now we only need to estimate the first term in the expectation form on the right-hand side of the above equation by symmetrization techniques.
Step 2: To estimate \({\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\), applying Lemma 7 with \(q_{(M,b)} (z_i,z_j) ={\mathcal {E}}(M,b) - (1+ r(y_i,y_j)(d_M(x_i,x_j)-b))_+\) implies that
where \(\overline{{\mathcal {E}}}_\mathbf{z}(M,b)={1\over \lfloor {n\over 2}\rfloor } \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\Phi _{M,b}(z_{i},z_{\lfloor {n\over 2}\rfloor +i}).\) Now let \(\bar{\mathbf{z}} = \{\bar{z}_1,\bar{z}_2,\ldots , \bar{z}_{n}\}\) be i.i.d. samples which are independent of \(\mathbf{z}\), then
By standard symmetrization techniques (see e.g. Bartlett and Mendelson 2002), for i.i.d. Rademacher variables \(\{\sigma _i \in \{\pm 1\}: i\in \mathbb {N}_{\lfloor {n\over 2}\rfloor }\}\), we have that
Applying the contraction property of Rademacher averages (see Lemma 8 in the “Appendix”) with \(\Psi _i(t) = \bigl (1+ r(y_i, y_{\lfloor {n\over 2}\rfloor +i}) t \bigr )_+ - 1\), we have the following estimation for the last term on the right-hand side of the above inequality:
Step 3: It remains to estimate the terms on the right-hand side of inequality (18). To this end, observe that
Moreover,
Putting the above estimations and inequalities (17), (18) together yields that
Consequently, combining this with inequalities (15), (16) implies that
Combining this estimation with (14) completes the proof of the theorem. \(\square \)
In the setting of similarity learning, \(X_*\) and \(R_n\) are replaced by \(\widetilde{X}_*= \sup _{x,x'\in {\mathcal {X}}}\Vert x (x')^\top \Vert _*\) and \(\widetilde{R}_n = {1\over \lfloor {n\over 2}\rfloor }\,{\mathbb {E}}\,{\mathbb {E}}_\sigma \bigl [\bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i \widetilde{X}_{i(\lfloor {n\over 2}\rfloor +i)}\bigr \Vert _*\bigr ],\)
where \(\widetilde{X}_{i ({\lfloor {n\over 2}\rfloor +i})} = x_i x^\top _{\lfloor {n\over 2}\rfloor +i}.\) Let \(\widetilde{{\mathcal {F}}} = \Bigl \{(M,b): \Vert M\Vert \le {1/\sqrt{\lambda }}\), \(|b| \le 1+ \widetilde{X}_*\Vert M\Vert \Bigr \}\). Using exactly the same argument as above, we can prove the following bound for the similarity learning formulation (4).
Theorem 4
Let \((\widetilde{M}_\mathbf{z}, \widetilde{b}_\mathbf{z})\) be the solution of formulation (4). Then, for any \(0<\delta <1\), with probability \(1-\delta \) we have that
4 Estimation of \(R_n\) and discussion
From Theorem 3, we need to estimate the Rademacher average for metric learning, i.e. \(R_n \), and the quantity \(X_*\) for different matrix regularization terms. We focus on the popular matrix norms: the Frobenius norm (Jin et al. 2009), the \(L^1\)-norm (Rosales and Fung 2006), the trace-norm (Ying et al. 2009; Shen et al. 2009) and the mixed (2, 1)-norm (Ying et al. 2009).
Example 1
(Frobenius norm) Let the matrix norm be the Frobenius norm, i.e. \(\Vert M \Vert = \Vert M\Vert _F\). Then the quantity \(X_*= \sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert ^2_F\) and the Rademacher complexity \(R_n\) are estimated as follows:
Let \((M_\mathbf{z},b_\mathbf{z})\) be a solution of formulation (2) with Frobenius norm regularization. For any \(0<\delta <1\), with probability \(1-\delta \) there holds
Proof
Note that the dual norm of the Frobenius norm is itself. The estimation of \(X_*\) is straightforward. The Rademacher complexity \(R_n\) is estimated as follows:
Putting the above estimation back into Eq. (13) completes the proof of Example 1. \(\square \)
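For the reader's convenience, the estimate used in this proof can be sketched as follows. Writing \(m=\lfloor {n\over 2}\rfloor \) and using the definition (12), Jensen's inequality and \({\mathbb {E}}_\sigma [\sigma _i\sigma _j]=\delta _{ij}\),

\( R_n = {1\over m}\,{\mathbb {E}}\,{\mathbb {E}}_\sigma \bigl \Vert \sum _{i=1}^{m}\sigma _i X_{i(m+i)}\bigr \Vert _F \le {1\over m}\,{\mathbb {E}}\Bigl (\sum _{i=1}^{m}\Vert X_{i(m+i)}\Vert _F^2\Bigr )^{1/2} \le {\sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert _F^2 \over \sqrt{m}} \le {2\sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert _F^2 \over \sqrt{n}}, \)

where the last step uses \(\lfloor {n\over 2}\rfloor \ge n/4\) and the identity \(\Vert X_{i(m+i)}\Vert _F=\Vert x_i-x_{m+i}\Vert _F^2.\)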
Other popular matrix norms for metric learning are the \(L^1\)-norm, the trace-norm and the mixed (2, 1)-norm. Their dual norms are, respectively, the \(L^\infty \)-norm, the spectral norm (i.e. the maximum of the singular values) and the mixed \((2,\infty )\)-norm. All of these dual norms are bounded above by the Frobenius norm. Hence, the following estimation always holds true for all the norms mentioned above:
Consequently, the generalization bound (21) holds true for metric learning formulation (2) with \(L^1\)-norm, trace-norm or mixed (2, 1)-norm regularization. However, in some cases the above upper bounds are too conservative. For instance, in the following examples we show that more refined estimations of \(R_n\) can be obtained by applying the Khinchin–Kahane inequality for Rademacher averages (Peña and Giné 1999).
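The domination by the Frobenius norm can be verified by the following elementary chain, valid for any \(X\in \mathbb {S}^d\):

\( \max _{\ell ,k\in \mathbb {N}_d}|X_{\ell k}| \;\le \; \max _{\ell \in \mathbb {N}_d}\Bigl (\sum _{k\in \mathbb {N}_d}|X_{\ell k}|^2\Bigr )^{1/2} \;\le \; \sigma _{\max }(X) \;\le \; \Vert X\Vert _F, \)

where the first three quantities are, respectively, the \(L^\infty \)-norm, the mixed \((2,\infty )\)-norm and the spectral norm of X, i.e. the duals of the \(L^1\)-norm, the mixed (2, 1)-norm and the trace-norm.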
Example 2
(Sparse \(L^1\)-norm) Let the matrix norm be the \(L^1\)-norm, i.e. \(\Vert M \Vert = \sum _{\ell ,k\in \mathbb {N}_d} |M_{\ell k}|\). Then, \(X_*= \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^2\) and
Let \((M_\mathbf{z}, b_\mathrm{z})\) be a solution of formulation (2) with \(L^1\)-norm regularization. For any \(0<\delta <1\), with probability \(1-\delta \) there holds
Proof
The dual norm of the \(L^1\)-norm is the \(L^\infty \)-norm. Hence, \( X_*= \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^2.\) To estimate \(R_n\), we observe, for any \(1<q<\infty \), that
where \(x_i^k\) denotes the k-th coordinate of the vector \(x_i\in \mathbb {R}^d.\) To estimate the term on the right-hand side of inequality (23), we apply the Khinchin–Kahane inequality (see Lemma 9 in the “Appendix”) with \(p=2<q<\infty \), which yields that
Putting the above estimation back into (23) and letting \(q= 4\log d\) implies that
Putting the estimations for \(X_*\) and \(R_n\) into Theorem 3 yields inequality (22). This completes the proof of Example 2. \(\square \)
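A brief remark on where the \(\sqrt{e\log d}\) factor comes from (a sketch of the standard computation, with the precise constants as in the statements above): the Khinchin–Kahane step yields a factor of order \(d^{2/q}\sqrt{q}\), the \(d^{2/q}\) coming from the maximum over the \(d^2\) entries of the matrix, and

\( d^{2/q}\big |_{q=4\log d}= e^{1/2} \qquad \hbox {and}\qquad \sqrt{q}\,\big |_{q=4\log d}=2\sqrt{\log d}, \)

so the dimension enters the bound (22) only through a factor proportional to \(\sqrt{e\log d}\), in contrast with the factor d in the Frobenius-norm case.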
Example 3
(Mixed (2, 1)-norm) Consider \(\Vert M \Vert = \sum _{\ell \in \mathbb {N}_d} \sqrt{\sum _{k\in \mathbb {N}_d}|M_{\ell k}|^2}.\) Then, we have \(X_*= \bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\bigr ]\bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \bigr ],\) and
Let \((M_\mathbf{z}, b_\mathbf{z})\) be a solution of formulation (2) with mixed (2, 1)-norm. For any \(0<\delta <1\), with probability \(1-\delta \) there holds
Proof
The estimation of \(X_*\) is straightforward and we estimate \(R_n\) as follows. For any \(q>1\), there holds
It remains to estimate the terms inside the parentheses on the right-hand side of the above inequality. To this end, we observe, for any \(q'>1\), that
Applying the Khinchin–Kahane inequality (Lemma 9 in the “Appendix”) with \(q = 2q' =4 \log d\) and \(p=2\) to the above inequality yields that
Putting the above estimation back into (26) implies that
Combining this with Theorem 3 implies the inequality (25). This completes the proof of the example. \(\square \)
In the Frobenius-norm case, the main term of the bound (21) is \({\mathcal {O}}\bigl ( {\sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert _F^{2} \over \sqrt{n \lambda }}\bigr )\). This bound is consistent with that given by Jin et al. (2009), where \(\sup _{x\in {\mathcal {X}}} \Vert x\Vert _F\) is assumed to be bounded by some constant B. Comparing the generalization bounds in the above examples, we see that the key terms \(X_*\) and \(R_n\) mainly differ in two quantities, i.e. \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\) and \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty .\) We argue that \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \) can be much less than \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F.\) For instance, consider the input space \({\mathcal {X}}= [0,1]^d.\) It is easy to see that \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F = \sqrt{d}\) while \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \equiv 1.\) Consequently, we can summarize the estimations as follows:
- Frobenius-norm: \(X_*=d\) and \(R_n \le {2 d \over \sqrt{n}}\).
- Sparse \(L^1\)-norm: \(X_*= 1\) and \(R_n \le {4 \sqrt{e \log d}\over \sqrt{n}}.\)
- Mixed (2, 1)-norm: \(X_*= \sqrt{d}\) and \(R_n \le {4 \sqrt{e d\log d}\over \sqrt{n}}.\)
Therefore, when d is large, the generalization bound with sparse \(L^1\)-norm regularization is much better than that with Frobenius-norm regularization, while the bound with the mixed (2, 1)-norm lies between the two. These theoretical results are nicely consistent with the rationale that sparse methods are more effective in dealing with high-dimensional data.
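To make this comparison tangible, here is a quick numerical illustration that simply plugs values into the three bounds summarized above (an illustration only):

```python
import numpy as np

def rn_bounds(d, n):
    """Upper bounds on R_n over [0,1]^d, per the summary above."""
    return {
        "frobenius": 2 * d / np.sqrt(n),
        "sparse_l1": 4 * np.sqrt(np.e * np.log(d)) / np.sqrt(n),
        "mixed_21":  4 * np.sqrt(np.e * d * np.log(d)) / np.sqrt(n),
    }

for d in (10, 100, 1000):
    print(d, rn_bounds(d, n=10_000))
```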
We end this section with two remarks. Firstly, in the setting of trace-norm regularization, it remains a question to us how to establish a more accurate estimation of \(R_n\) using the Khinchin–Kahane inequality. Secondly, bounds analogous to those in the above examples hold for similarity learning with the different matrix-norm regularizers. Indeed, the generalization bound for similarity learning in Theorem 4 tells us that it suffices to estimate \(\widetilde{X}_*\) and \(\widetilde{R}_n\). In analogy to the arguments in the above examples, we can obtain the following results. For similarity learning formulation (4) with Frobenius-norm regularization, there holds
For \(L^1\)-norm regularization, we have
In the setting of (2, 1)-norm, we obtain
Putting these estimations back into Theorem 4 yields generalization bounds for similarity learning with different matrix norms. For simplicity, we omit the details here.
5 Conclusion and discussion
In this paper we were mainly concerned with the theoretical generalization analysis of regularized metric and similarity learning. In particular, we first showed that the generalization analysis for metric/similarity learning reduces to the estimation of the Rademacher average over “sums-of-i.i.d.” sample-blocks. Then, we derived their generalization bounds with different matrix regularization terms. Our analysis indicates that sparse metric/similarity learning with \(L^1\)-norm regularization could lead to significantly better bounds than those with Frobenius-norm regularization, especially when the dimensionality of the input data is high. Our novel generalization analysis develops the techniques of U-statistics (Peña and Giné 1999; Clémencon et al. 2008) and Rademacher complexity analysis (Bartlett and Mendelson 2002; Koltchinskii and Panchenko 2002). Below we mention several questions that remain to be further studied.
Firstly, in Sect. 4, the derived bounds for metric and similarity learning with trace-norm regularization were the same as those with Frobenius-norm regularization. It would be very interesting to derive bounds similar to those with sparse \(L^1\)-norm regularization. The key issue is to estimate the Rademacher complexity term (12) related to the spectral norm using the Khinchin–Kahane inequality. However, we are not aware of such Khinchin–Kahane inequalities for general matrix spectral norms.
Secondly, this study only investigated the generalization bounds for metric and similarity learning. We can obtain a consistency estimation for \(\Vert M_\mathbf{z} - M_*\Vert _F^2\) under very strong assumptions on the loss function and the underlying distribution. In particular, assume that the loss function is the least square loss, that the bias term b is fixed (e.g. \(b \equiv 0\)), and let \(M_*= \arg \min _{M\in \mathbb {S}^d}{\mathcal {E}}(M,0)\); then we have
Here, \({{\mathcal {C}}}\) is a \(d^2\times d^2\) matrix representing a linear mapping from \(\mathbb {S}^d\) to \(\mathbb {S}^d\):
Here, the notation \(\otimes \) represents the tensor product of matrices. Equation (27) implies that \({\mathcal {E}}(M_\mathbf{z},0)- {\mathcal {E}}(M_*,0)=\iint \langle M_\mathbf{z}-M_*, x (x')^\top \rangle ^2 d\rho (x)d\rho (x')\ge \lambda _{\min } ({\mathcal {C}}) \Vert M_\mathbf{z} - M_*\Vert _F^2,\) where \(\lambda _{\min }({\mathcal {C}})\) is the minimum eigenvalue of the \(d^2\times d^2\) matrix \({\mathcal {C}}\). Consequently, under the assumption that \({\mathcal {C}}\) is non-singular, we can obtain the consistency estimation of \(\Vert M_\mathbf{z} - M_*\Vert _F^2\) for the least square loss. For the hinge loss, equality (27) no longer holds. Hence, it remains a question how to obtain consistency estimations for metric and similarity learning under general loss functions.
Thirdly, in many applications involving multi-media data, different aspects of the data may lead to several different, and apparently equally valid notions of similarity. This leads to a natural question on how to combine multiple similarities and metrics for a unified data representation. An extension of multiple kernel learning approach was proposed in McFee and Lanckriet (2011) to address this issue. It would be very interesting to investigate the theoretical generalization analysis for this multi-modal similarity learning framework. A possible starting point would be the techniques established for learning the kernel problem (Ying and Campbell 2009, 2010).
Finally, the target of supervised metric learning is to improve the generalization performance of kNN classifiers. It remains a challenging question to investigate how the generalization performance of kNN classifiers relates to the generalization bounds of metric learning given here.
References
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2005). Learning a mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6, 937–965.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2010). Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11, 1109–1135.
Chen, D. R., Wu, Q., Ying, Y., & Zhou, D. X. (2004). Support vector machine soft margin classifiers: Error analysis. Journal of Machine Learning Research, 5, 1143–1175.
Clémencon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36, 844–874.
Davis, J., Kulis, B., Jain, P., Sra, S., & Dhillon, I. (2007). Information-theoretic metric learning. In ICML.
De La Peña, V. H., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer.
Globerson, A., & Roweis, S. (2005). Metric learning by collapsing classes. In NIPS.
Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2004). Neighbourhood components analysis. In NIPS.
Guillaumin, M., Verbeek, J., & Schmid, C. (2009). Is that you? Metric learning approaches for face identification. In ICCV.
Hoi, S. C. H., Liu, W. , Lyu, M. R. , & Ma, W.-Y. (2006). Learning distance metrics with contextual constraints for image retrieval. In CVPR.
Jin, R., Wang, S., & Zhou, Y. (2009). Regularized distance metric learning: Theory and algorithm. In NIPS.
Kar, P., & Jain, P. (2011). Similarity-based learning via data-driven embeddings. In NIPS.
Koltchinskii, V., & Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30, 1–50.
Ledoux, M., & Talagrand, M. (1991). Probability in Banach spaces: Isoperimetry and processes. New York: Springer Press.
Maurer, A. (2008). Learning similarity with operator-valued large-margin classifiers. Journal of Machine Learning Research, 9, 1049–1082.
McDiarmid, C. (1989). On the method of bounded differences. In Surveys in combinatorics. Cambridge, UK: Cambridge University Press.
McFee, B., & Lanckriet, G. (2011). Learning multi-modal similarity. Journal of Machine Learning Research, 12, 491–523.
Rosales, R., & Fung, G. (2006). Learning sparse metrics via linear programming. In KDD.
Shalit, O., Weinshall, D., & Chechik, G. (2010). Online learning in the manifold of low-rank matrices. In NIPS.
Shen, C., Kim, J., Wang, L., & van den Hengel, A. (2009). Positive semidefinite metric learning with boosting. In NIPS.
Weinberger, K. Q., & Saul, L. K. (2008). Fast solvers and efficient implementations for distance metric learning. In ICML.
Xing, E., Ng, A., Jordan, M., & Russell, S. (2002). Distance metric learning with application to clustering with side information. In NIPS.
Yang, L., & Jin, R. (2007). Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University.
Ying, Y., & Campbell, C. (2009). Generalization bounds for learning the kernel. In COLT.
Ying, Y., & Campbell, C. (2010). Rademacher chaos complexity for learning the kernel problem. Neural Computation, 22, 2858–2886.
Ying, Y., Huang, K., & Campbell, C. (2009). Sparse metric learning via smooth optimization. In NIPS.
Ying, Y., & Li, P. (2012). Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research, 13, 1–26.
Acknowledgments
We are grateful to the anonymous referees for their constructive comments and suggestions. This work was supported by the EPSRC under Grant EP/J001384/1. Zheng-Chu Guo is also supported by NSFC under Grant No. 11401524 and the Fundamental Research Funds for the Central Universities (Program No. 2014QNA3002). The corresponding author is Yiming Ying.
Appendix
In this appendix we assemble some facts that were used to establish the generalization bounds for metric/similarity learning.
Definition 5
We say that a function \(f:\prod _{k=1}^n \Omega _k \rightarrow \mathbb {R}\) has bounded differences \(\{ c_k \}_{k=1}^n\) if, for all \(1\le k\le n\), \(\sup _{x_1,\ldots ,x_n,\, x'_k}\bigl |f(x_1,\ldots ,x_k,\ldots ,x_n) - f(x_1,\ldots ,x_{k-1}, x'_k, x_{k+1},\ldots ,x_n)\bigr |\le c_k.\)
Lemma 6
(McDiarmid’s inequality McDiarmid 1989) Suppose \(f:\prod _{k=1}^n \Omega _k \rightarrow \mathbb {R}\) has bounded differences \(\{ c_k \}_{k=1}^n\); then, for all \(\epsilon >0\), there holds \(\Pr \bigl \{f(x_1,\ldots ,x_n) - {\mathbb {E}}[f(x_1,\ldots ,x_n)]\ge \epsilon \bigr \}\le \exp \Bigl (-{2\epsilon ^2 \over \sum _{k=1}^n c_k^2}\Bigr ).\)
Next we list a useful property of U-statistics. Given i.i.d. random variables \(z_1,z_2,\ldots ,z_n\in {\mathcal {Z}}\), let \(q: {\mathcal {Z}}\times {\mathcal {Z}} \rightarrow \mathbb {R}\) be a symmetric real-valued function. Denote a U-statistic of order two by \(U_n = {1\over n(n-1)} \sum _{i\ne j} q(z_i,z_j).\) Then, the U-statistic \(U_n\) can be expressed as \(U_n = {1\over n!}\sum _{\pi }{1\over \lfloor {n\over 2}\rfloor }\sum _{i=1}^{\lfloor {n\over 2}\rfloor } q\bigl (z_{\pi (i)}, z_{\pi (\lfloor {n\over 2}\rfloor +i)}\bigr ), \qquad (28)\)
where the sum is taken over all permutations \(\pi \) of \(\{1,2,\ldots ,n\}.\) The main idea underlying this representation is to reduce the analysis to the ordinary case of i.i.d. random variable blocks. Based on the above representation, we can prove the following lemma which plays a critical role in deriving generalization bounds for metric learning. For completeness, we include a proof here. For more details on U-statistics, one is referred to Clémencon et al. (2008), Peña and Giné (1999).
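The representation (28) can also be checked numerically. The sketch below (an illustration with a toy symmetric kernel q) averages the block sums over random permutations and compares the result with the direct U-statistic:

```python
import numpy as np

def u_statistic(q, z):
    """U_n = (1/(n(n-1))) * sum_{i != j} q(z_i, z_j)."""
    n = len(z)
    return sum(q(z[i], z[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def block_average(q, z, n_perms=20_000, seed=0):
    """Monte Carlo average over permutations of the sums-of-i.i.d.-blocks term in (28);
    it approaches U_n as the number of sampled permutations grows."""
    rng = np.random.default_rng(seed)
    n, m = len(z), len(z) // 2
    total = 0.0
    for _ in range(n_perms):
        pi = rng.permutation(n)
        total += sum(q(z[pi[i]], z[pi[m + i]]) for i in range(m)) / m
    return total / n_perms

q = lambda a, b: (a - b) ** 2                   # a symmetric kernel of order two
z = np.random.default_rng(1).standard_normal(9)
print(u_statistic(q, z), block_average(q, z))   # the two values should be close
```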
Lemma 7
Let \(q_\tau :{\mathcal {Z}}\times {\mathcal {Z}}\rightarrow \mathbb {R}\) be real-valued functions indexed by \(\tau \in {\mathcal {T}}\), where \({\mathcal {T}}\) is some index set. If \(z_1,\ldots ,z_n\) are i.i.d., then we have that \({\mathbb {E}}\Bigl [\sup _{\tau \in {\mathcal {T}}}{1\over n(n-1)}\sum _{i\ne j} q_\tau (z_i,z_j)\Bigr ]\le {\mathbb {E}}\Bigl [\sup _{\tau \in {\mathcal {T}}}{1\over \lfloor {n\over 2}\rfloor }\sum _{i=1}^{\lfloor {n\over 2}\rfloor } q_\tau \bigl (z_i,z_{\lfloor {n\over 2}\rfloor +i}\bigr )\Bigr ].\)
Proof
From the representation of U-statistics (28), we observe that
This completes the proof of the lemma. \(\square \)
We need the following contraction property of the Rademacher averages which is essentially implied by Theorem 4.12 in Ledoux and Talagrand (1991), see also Bartlett and Mendelson (2002), Koltchinskii and Panchenko (2002).
Lemma 8
Let F be a class of uniformly bounded real-valued functions on \((\Omega ,\mu )\) and \(m\in \mathbb {N}\). If for each \(i\in \{1,\ldots ,m\}\), \(\Psi _i: \mathbb {R}\rightarrow \mathbb {R}\) is a function with \(\Psi _i(0)=0\) having a Lipschitz constant \(c_i\), then for any \(\{x_i\}_{i=1}^m\),
The last property of Rademacher averages is the Khinchin–Kahne inequality [see e.g. Peña and Giné (1999, Theorem 1.3.1)].
Lemma 9
For \(n\in \mathbb {N}\), let \(\{f_i \in \mathbb {R}: i\in \mathbb {N}_n\} \) be real numbers and \(\{\sigma _i: i\in \mathbb {N}_n\}\) be a family of i.i.d. Rademacher variables. Then, for any \(1<p<q<\infty \) we have \(\Bigl ({\mathbb {E}}\bigl |\sum _{i\in \mathbb {N}_n}\sigma _i f_i\bigr |^q\Bigr )^{1/q}\le \Bigl ({q-1\over p-1}\Bigr )^{1/2}\Bigl ({\mathbb {E}}\bigl |\sum _{i\in \mathbb {N}_n}\sigma _i f_i\bigr |^p\Bigr )^{1/p}.\)