1 Introduction

The success of many machine learning algorithms (e.g. nearest neighbor classification and k-means clustering) depends on the concepts of distance metric and similarity. For instance, the k-nearest-neighbor (kNN) classifier depends on a distance function to identify the nearest neighbors for classification; the k-means algorithm depends on pairwise distance measurements between examples for clustering. Kernel methods and information retrieval methods rely on a similarity measure between samples. Many existing studies have been devoted to learning a metric or similarity automatically from data, which is usually referred to as metric learning and similarity learning, respectively.

Most work in metric learning focuses on learning a (squared) Mahalanobis distance defined, for any \(x,t\in \mathbb {R}^d\), by \(d_M(x,t)= (x-t)^\top M(x-t)\) where M is a positive semi-definite matrix, see e.g. (Bar-Hillel et al. 2005; Davis et al. 2007; Globerson and Roweis 2005; Goldberger et al. 2004; Shen et al. 2009; Weinberger and Saul 2008; Xing et al. 2002; Yang and Jin 2007; Ying et al. 2009). Concurrently, the pairwise similarity defined by \(s_M(x,t)=x^\top Mt\) was studied in Chechik et al. (2010), Kar and Jain (2011), Maurer (2008), Shalit et al. (2010). These methods have been successfully applied to various real-world problems including information retrieval and face verification (Chechik et al. 2010; Guillaumin et al. 2009; Hoi et al. 2006; Ying and Li 2012). Although there are a large number of studies devoted to supervised metric/similarity learning based on different objective functions, few studies address the generalization analysis of such methods. The recent work (Jin et al. 2009) pioneered the generalization analysis for metric learning using the concept of uniform stability (Bousquet and Elisseeff 2002). However, that approach only works for strongly convex norms such as the Frobenius norm, and it keeps the bias term fixed, which makes its generalization analysis essentially different from ours.

In this paper, we develop a novel approach for generalization analysis of metric learning and similarity learning which can deal with general matrix regularization terms including the Frobenius norm (Jin et al. 2009), sparse \(L^1\)-norm (Rosales and Fung 2006), mixed (2, 1)-norm (Ying et al. 2009) and trace-norm (Ying et al. 2009; Shen et al. 2009). In particular, we first show that the generalization analysis for metric/similarity learning reduces to the estimation of the Rademacher average over “sums-of-i.i.d.” sample-blocks related to the specific matrix norm, which we refer to as the Rademacher complexity for metric (similarity) learning. Then, we show how to estimate the Rademacher complexities with different matrix regularizers. Our analysis indicates that sparse metric/similarity learning with \(L^1\)-norm regularization could lead to significantly better generalization bounds than those with Frobenius-norm regularization, especially when the dimensionality of the input data is high. This is nicely consistent with the rationale that sparse methods are more effective for high-dimensional data analysis. Our novel generalization analysis develops and extends Rademacher complexity analysis (Bartlett and Mendelson 2002; Koltchinskii and Panchenko 2002) to the setting of metric/similarity learning by using techniques of U-statistics (Clémencon et al. 2008; Peña and Giné 1999).

The paper is organized as follows. The next section reviews the models of metric/similarity learning. Section 3 establishes the main theorems. In Sect. 4, we derive and discuss generalization bounds for metric/similarity learning with various matrix-norm regularization terms. Section 5 concludes the paper.

Notation Let \(\mathbb {N}_n = \{1,2,\ldots , n\}\) for any \(n\in \mathbb {N}\). For any \(X,Y\in \mathbb {R}^{d\times n}\), \(\langle X, Y\rangle = \mathbf{Tr}(X^\top Y)\) where \(\mathbf{Tr}(\cdot )\) denotes the trace of a matrix. The space of symmetric \(d\times d\) matrices will be denoted by \(\mathbb {S}^d.\) We equip \(\mathbb {S}^d\) with a general matrix norm \(\Vert \cdot \Vert \); it can be the Frobenius norm, the trace-norm or a mixed norm. Its associated dual norm is denoted, for any \(M\in \mathbb {S}^d\), by \(\Vert M\Vert _*= \sup \{ \langle X, M \rangle : X\in \mathbb {S}^d, \Vert X\Vert \le 1 \}.\) The Frobenius norm on matrices or vectors is always denoted by \(\Vert \cdot \Vert _F.\) The cone of positive semi-definite matrices is denoted by \(\mathbb {S}_+^d\). Later on we use the conventional notation that \(X_{ij}=(x_i-x_j)(x_i-x_j)^\top \) and \(\widetilde{X}_{ij} = x_ix_j^\top .\)

2 Metric/similarity learning formulation

In our learning setting, we have an input space \({\mathcal {X}}\subseteq \mathbb {R}^d\) and an output (label) space \({\mathcal {Y}}\). Denote \({\mathcal {Z}}= {\mathcal {X}}\times {\mathcal {Y}}\) and suppose \(\mathbf{z}:= \{z_i=(x_i,y_i)\in {\mathcal {Z}}: i\in \mathbb {N}_n\}\) is an i.i.d. training set drawn according to an unknown distribution \(\rho \) on \({\mathcal {Z}}.\) Denote the \(d \times n\) input data matrix by \(\mathbf{X} = (x_i: i\in \mathbb {N}_n)\) and the \(d \times d\) distance matrix by \(M= (M_{\ell k})_{\ell ,k\in \mathbb {N}_d}\). Then, the (pseudo-) distance between \(x_i\) and \(x_j\) is measured by

$$\begin{aligned} d_M(x_i,x_j)=(x_i-x_j)^\top M(x_i-x_j). \end{aligned}$$

The goal of metric learning is to identify a distance function \(d_M(x_i,x_j)\) such that it yields a small value for a similar pair and a large value for a dissimilar pair. The bilinear similarity function is defined by

$$\begin{aligned} s_M(x_i,x_j)=x_i^\top M x_j. \end{aligned}$$

Similarly, the target of similarity learning is to learn \(M\in \mathbb {S}^d\) such that it reports a large similarity value for a similar pair and a small similarity value for a dissimilar pair. It is worth pointing out that we do not require the matrix M to be positive semi-definite throughout this paper. However, we do assume M to be symmetric, since this guarantees that the distance (similarity) between \(x_i\) and \(x_j\) equals that between \(x_j\) and \(x_i\), i.e. \(d_M(x_i,x_j)=d_M(x_j,x_i)\) and \(s_M(x_i,x_j)=s_M(x_j,x_i)\).
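
For concreteness, the two scores above can be evaluated directly; the short Python sketch below (the function names are ours, introduced only for illustration) computes \(d_M\) and \(s_M\) for a symmetric matrix M that need not be positive semi-definite.

```python
import numpy as np

def pairwise_distance(M, x, t):
    """(Pseudo-)distance d_M(x, t) = (x - t)^T M (x - t)."""
    diff = x - t
    return float(diff @ M @ diff)

def bilinear_similarity(M, x, t):
    """Bilinear similarity s_M(x, t) = x^T M t."""
    return float(x @ M @ t)

# Toy usage with a random symmetric (not necessarily PSD) matrix M.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M = (A + A.T) / 2   # symmetrizing guarantees d_M(x, t) = d_M(t, x) and s_M(x, t) = s_M(t, x)
x, t = rng.standard_normal(3), rng.standard_normal(3)
print(pairwise_distance(M, x, t), bilinear_similarity(M, x, t))
```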

There are two main terms in the metric/similarity learning model: the empirical error and the matrix regularization term. The empirical error term exploits the similarity and dissimilarity information provided by the labels, while an appropriate matrix regularization term prevents overfitting and improves the generalization performance.

For any pair of samples \((x_i,x_j)\), let \(r(y_i,y_j) = 1\) if \(y_i = y_j\) and \(r(y_i,y_{j})=-1\) otherwise. It is expected that there exists a bias term \(b\in \mathbb {R}\) such that \(d_M(x_i,x_j)\le b\) for \(r(y_i,y_j) = 1\) and \(d_M(x_i,x_j)>b\) otherwise. This naturally leads to the empirical error (Jin et al. 2009) defined by

$$\begin{aligned} {{\mathcal {E}}}_\mathbf{z}(M,b):= {1\over n(n-1)} \sum _{i,j\in \mathbb {N}_n,i\ne j}I[r(y_i,y_j)(d_M(x_i,x_j)-b)>0] \end{aligned}$$

where the indicator function I[x] equals 1 if x is true and zero otherwise.

Due to the indicator function, the above empirical error is non-differentiable and non-convex, which makes optimization difficult. A usual way to overcome this shortcoming is to upper-bound the indicator function with a convex surrogate loss. For instance, using the hinge loss leads to the following empirical error:

$$\begin{aligned} {{\mathcal {E}}}_\mathbf{z}(M,b):= {1\over n(n-1)} \sum _{i,j\in \mathbb {N}_n,i\ne j}[1+r(y_i,y_j)(d_M(x_i,x_j)-b)]_+ \end{aligned}$$
(1)

In order to avoid overfitting, we need to enforce a regularization term, denoted by \(\Vert M\Vert \), which restricts the complexity of the distance matrix. We emphasize here that \(\Vert \cdot \Vert \) denotes a general matrix norm on the linear space \(\mathbb {S}^d\). Putting the regularization term and the empirical error term together yields the following metric learning model:

$$\begin{aligned} (M_\mathbf{z},b_\mathbf{z}) = \arg \min _{M\in \mathbb {S}^d,b\in \mathbb {R}} \bigl \{{{\mathcal {E}}}_\mathbf{z}(M,b)+\lambda \Vert M\Vert ^2\bigr \}, \end{aligned}$$
(2)

where \(\lambda >0\) is a trade-off parameter.
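
As an illustration of formulation (2) (not part of the original analysis), the empirical error (1) and the regularized objective can be evaluated directly; in the sketch below the function and argument names are ours, and the Frobenius norm is used as the default regularizer, with any other matrix norm passed in through the norm argument.

```python
import numpy as np

def empirical_error(M, b, X, y):
    """Empirical hinge-loss error E_z(M, b) of Eq. (1), averaged over ordered pairs i != j."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = 1.0 if y[i] == y[j] else -1.0
            diff = X[i] - X[j]
            d = diff @ M @ diff                   # d_M(x_i, x_j)
            total += max(0.0, 1.0 + r * (d - b))  # hinge loss
    return total / (n * (n - 1))

def objective(M, b, X, y, lam, norm=np.linalg.norm):
    """Regularized metric learning objective of Eq. (2); `norm` defaults to the Frobenius norm."""
    return empirical_error(M, b, X, y) + lam * norm(M) ** 2
```

Minimizing this objective jointly over the symmetric matrix M and the bias b (for example with a generic convex solver) would then give a candidate \((M_\mathbf{z},b_\mathbf{z})\).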

Different regularization terms lead to different metric learning formulations. For instance, the Frobenius norm \(\Vert M\Vert _F\) is used in Jin et al. (2009). To favor the element-sparsity, Rosales and Fung (2006) introduced the \(L^1\)-norm regularization \(\Vert M\Vert = \sum _{\ell ,k\in \mathbb {N}_d}|M_{\ell k}|.\) Ying et al. (2009) proposed the mixed (2, 1)-norm \( \Vert M\Vert = \sum _{\ell \in \mathbb {N}_d}\bigl (\sum _{k\in \mathbb {N}_d}|M_{\ell k}|^2\bigr )^{1\over 2}\) to encourage the column-wise sparsity of the distance matrix. The trace-norm regularization \(\Vert M\Vert = \sum _{\ell }\sigma _\ell (M)\) was also considered by Ying et al. (2009), Shen et al. (2009). Here, \(\{\sigma _\ell : \ell \in \mathbb {N}_d\}\) denotes the singular values of a matrix \(M\in \mathbb {S}^d.\) Since M is symmetric, the singular values of M are identical to the absolute values of its eigenvalues.
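
The regularizers listed above are straightforward to compute; the following sketch (ours, for illustration only) collects them, using the fact that for a symmetric matrix the trace norm equals the sum of the absolute values of its eigenvalues.

```python
import numpy as np

def frobenius_norm(M):
    return np.sqrt(np.sum(M ** 2))

def l1_norm(M):
    # element-wise L^1 norm, sum_{l,k} |M_{lk}|, which favors element-sparsity
    return np.sum(np.abs(M))

def mixed_21_norm(M):
    # mixed (2,1)-norm: sum of the Euclidean norms of the rows of M
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

def trace_norm(M):
    # sum of singular values; for symmetric M these are the absolute eigenvalues
    return np.sum(np.abs(np.linalg.eigvalsh(M)))
```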

In analogy to the formulation of metric learning, we consider the following empirical error for similarity learning (Maurer 2008; Chechik et al. 2010):

$$\begin{aligned} \widetilde{{{\mathcal {E}}}}_\mathbf{z}(M,b) :={1\over n(n-1)} \sum _{i,j \in \mathbb {N}_n,i\ne j}[1-r(y_i,y_j)(s_M(x_i,x_j)-b)]_+. \end{aligned}$$
(3)

This leads to the regularized formulation for similarity learning defined as follows:

$$\begin{aligned} (\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z}) = \arg \min _{M\in \mathbb {S}^d,b\in \mathbb {R}} \bigl \{\widetilde{{{\mathcal {E}}}}_\mathbf{z}(M,b) +\lambda \Vert M\Vert ^2\bigr \}. \end{aligned}$$
(4)

The work of Maurer (2008) used Frobenius-norm regularization for similarity learning. Trace-norm regularization has been used by Shalit et al. (2010) to encourage a low-rank similarity matrix M.
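
For completeness, a sketch of the analogous empirical error (3) for similarity learning (again with our own, illustrative function name) differs from the metric learning case only in the score \(s_M\) and the sign inside the hinge.

```python
import numpy as np

def similarity_empirical_error(M, b, X, y):
    """Empirical hinge-loss error of Eq. (3) based on the bilinear similarity s_M(x, t) = x^T M t."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = 1.0 if y[i] == y[j] else -1.0
            s = X[i] @ M @ X[j]                   # s_M(x_i, x_j)
            total += max(0.0, 1.0 - r * (s - b))  # hinge loss
    return total / (n * (n - 1))
```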

3 Statistical generalization analysis

In this section, we mainly give a detailed proof of generalization bounds for metric and similarity learning. In particular, we develop a novel line of generalization analysis for metric and similarity learning with general matrix regularization terms. The key observation is that the empirical data term \({\mathcal {E}}_\mathbf{z}(M,b)\) for metric learning is a modification of a U-statistic and is expected to converge to its expected form defined by

$$\begin{aligned} {\mathcal {E}}(M,b)= \displaystyle \iint (1+r(y,y')(d_M(x,x')-b))_+d\rho (x,y)d\rho (x',y'). \end{aligned}$$
(5)

The empirical term \(\widetilde{{\mathcal {E}}}_\mathbf{z}(M,b)\) for similarity learning is expected to converge to

$$\begin{aligned} \widetilde{{\mathcal {E}}}(M,b) = \displaystyle \iint (1-r(y,y')(s_M(x,x')-b))_+d\rho (x,y)d\rho (x',y'). \end{aligned}$$
(6)

The target of the generalization analysis is to bound the true error \({\mathcal {E}}(M_\mathbf{z},b_\mathbf{z})\) by the empirical error \({\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z})\) for metric learning and \(\widetilde{{\mathcal {E}}}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z})\) by the empirical error \(\widetilde{{\mathcal {E}}}_\mathbf{z}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z})\) for similarity learning.

In the sequel, we provide a detailed proof of the generalization bounds for metric learning. Since the proof for similarity learning is exactly the same as that for metric learning, we only state the results, followed by some brief comments.

3.1 Bounding the solutions

By the definition of \((M_\mathbf{z}, b_\mathbf{z})\), we know that

$$\begin{aligned} {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z})+\lambda \Vert M_\mathbf{z}\Vert ^2 \le {\mathcal {E}}_\mathbf{z}(0,0) + \lambda \Vert 0\Vert ^2 = 1 \end{aligned}$$

which implies that

$$\begin{aligned} \Vert M_\mathbf{z}\Vert \le {1\over \sqrt{\lambda }}. \end{aligned}$$
(7)

Now we turn our attention to deriving a bound on the bias term \(b_\mathbf{z}\) by modifying the techniques in Chen et al. (2004), which were originally developed to estimate the offset term of the soft-margin SVM.

Lemma 1

For any samples \(\mathbf{z}\) and \(\lambda >0\), there exists a minimizer \((M_\mathbf{z},b_\mathbf{z})\) of formulation (2) such that

$$\begin{aligned} \min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)-b_\mathbf{z}]\le 1, \quad \max _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}] \ge -1. \end{aligned}$$
(8)

Proof

We first prove the inequality \(\min _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\le 1.\) To this end, we first consider the special case where the training set \(\mathbf{z}\) only contains two examples with distinct labels, i.e. \(\mathbf{z}=\{z_i=(x_i,y_i): i=1,2,\ x_1\ne x_2,\ y_1\ne y_2\}.\) For any \(\lambda >0,\) let \((M_\mathbf{z}, b_\mathbf{z})=(\mathbf {0},-1) \), and observe that \({\mathcal {E}}_\mathbf{z}(\mathbf {0},-1) + \lambda \Vert \mathbf {0}\Vert ^2 = 0\). This observation implies that \((M_\mathbf{z}, b_\mathbf{z})\) is a minimizer of problem (2). Consequently, we have the desired result in this extreme case, since \(\min _{i \ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]=d_{M_\mathbf{z}}(x_1,x_2) -b_\mathbf{z}=1.\)

Now let us consider the general case where the training set \(\mathbf{z}\) has at least two examples with the same label, i.e.

$$\begin{aligned} \{z_i=(x_i,y_i): i=1,2,\ x_1\ne x_2,\ y_1 = y_2\} \subseteq \mathbf{z}. \end{aligned}$$

In this general case, we prove the inequality \(\min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\le 1\) by contradiction. Suppose that \(s :=\min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)-b_\mathbf{z}]>1\) which equivalently implies that \(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s-1)\ge 1\) for any \(i\ne j.\) Hence, for any pair of examples \((x_i,x_j)\) with distinct labels, i.e. \(y_i\ne y_j\) (equivalently \(r(y_i,y_j)=-1\)), there holds

$$\begin{aligned} \bigl (1+ r(y_i,y_j)(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s-1))\bigr )_+ =\bigl (1-(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s-1))\bigr )_+ = 0. \end{aligned}$$

Consequently,

$$\begin{aligned} \begin{array}{ll} {\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}+s-1) &{} = {1\over n(n-1)}\displaystyle \sum _{i\ne j} \Big (1+r(i,j)(d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}-(s-1))\Big )_+\\ &{} ={1\over n(n-1)}\displaystyle \sum _{i\ne j,y_i= y_j} (1+d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}-(s-1))_+\\ &{} < {1\over n(n-1)} \displaystyle \sum _{i\ne j,y_i= y_j} (1+d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z})_+\le {\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}). \end{array} \end{aligned}$$

The above estimation implies that \({\mathcal {E}}_{\mathbf {z}}(M_\mathbf {z}, b_\mathbf {z}+s-1) + \lambda \Vert M_\mathbf {z}\Vert ^ 2 < {\mathcal {E}}_{\mathbf {z}}(M_\mathbf {z}, b_\mathbf {z})+ \lambda \Vert M_\mathbf {z}\Vert ^ 2\) which contradicts the definition of the minimizer \((M_\mathbf{z},b_\mathbf{z})\). Hence, \(s = \min _{i \ne j} [d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\,{\le }\, 1\).

Secondly, we prove the inequality \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\ge -1\) in analogy to the above argument. Consider the special case where the training set \(\mathbf{z}\) contains only two examples with the same label, i.e. \(\{z_i=(x_i,y_i): i=1,2,\ x_1\ne x_2,\ y_1 = y_2\}.\) For any given \(\lambda >0,\) let \((M_\mathbf{z},b_\mathbf{z})= (\mathbf {0},1).\) Since \( {\mathcal {E}}_\mathbf{z}(\mathbf {0},1) + \lambda \Vert \mathbf {0}\Vert ^2 = 0\), \((\mathbf {0},1)\) is a minimizer of problem (2). The desired estimation follows from the fact that \(\max _{i\ne j} d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}=0-1= -1.\)

Now let us consider the general case where the training set \(\mathbf{z}\) has at least two examples with distinct labels, i.e.

$$\begin{aligned} \{z_i=(x_i,y_i): i=1,2, ~ x_1\ne x_2,\ y_1 \ne y_2\} \subseteq \mathbf{z}. \end{aligned}$$

We prove the estimation \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}]\ge -1\) by contradiction. Assume \(s:=\max _{i\ne j} [d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}] < -1,\) then \(d_{M_\mathbf{z}}(x_i,x_j) -(b_\mathbf{z}+s+1)\le -1\) holds for any \(i\ne j.\) This implies, for any pair of examples \((x_i,x_j)\) with the same label, i.e. \(r(i,j)=1\), that \(\bigl (1+r(i,j)(d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}-s-1)\bigr )_+=0.\) Hence,

$$\begin{aligned} \begin{array}{ll}{\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}+s+1) &{}={1 \over n(n-1)} \displaystyle \sum _{i \ne j}\Big (1+r(i,j)(d_{M_\mathbf{z}}(x_i,x_j) -b_\mathbf{z}-s-1)\Big )_+ \\ &{}={1\over n(n-1)} \displaystyle \sum _{i\ne j,y_i\ne y_j} \Big (1-d_{M_\mathbf{z}}(x_i,x_j) +b_\mathbf{z}+(s+1)\Big )_+ \\ &{} < {1\over n(n-1)} \displaystyle \sum _{i\ne j,y_i \ne y_j} (1-d_{M_\mathbf{z}}(x_i,x_j) +b_\mathbf{z})_+\le {\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}). \end{array} \end{aligned}$$

The above estimation yields that \({\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z}+s+1)+\lambda \Vert M_\mathbf{z}\Vert ^2<{\mathcal {E}}_{\mathbf{z}}(M_\mathbf{z}, b_\mathbf{z})+\lambda \Vert M_\mathbf{z}\Vert ^2\) which contradicts the definition of the minimizer \((M_\mathbf{z}, b_\mathbf{z})\). Hence, we have the desired inequality \(\max _{i\ne j}[d_{M_\mathbf{z}}(x_i, x_j)-b_\mathbf{z}] \ge -1\) which completes the proof of the lemma. \(\square \)

Corollary 2

For any samples \(\mathbf{z}\) and \(\lambda >0\), there exists a minimizer \((M_\mathbf{z},b_\mathbf{z})\) of formulation (2) such that

$$\begin{aligned} |b_\mathbf{z}|\le 1+ \bigl (\max _{i\ne j} \Vert X_{ij}\Vert _*\bigr ) \Vert M_\mathbf{z}\Vert . \end{aligned}$$
(9)

Proof

From inequality (8) in Lemma 1, we see that \(-b_\mathbf{z}+ \min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)]\le 1\) and \(\max _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)] \ge b_\mathbf{z}-1.\) Equivalently, this implies that \(-b_\mathbf{z}\le 1- \min _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)]\) and \(b_\mathbf{z}\le 1+ \max _{i\ne j} [d_{M_\mathbf{z}}(x_i,x_j)].\) Recall that \(X_{ij} = (x_i-x_j)(x_i- x_j)^\top \) and observe, by the definition of the dual norm \(\Vert \cdot \Vert _*\), that

$$\begin{aligned} d_M(x_i,x_j) = \langle X_{ij}, M \rangle \le \Vert X_{ij}\Vert _*\Vert M\Vert . \end{aligned}$$

Combining this observation with the above estimates, we have that \(-b_\mathbf{z}\le 1+ \bigl (\max _{i\ne j } \Vert X_{ij}\Vert _*\bigr )\Vert M_\mathbf{z}\Vert \) and \(b_\mathbf{z}\le 1+ \bigl (\max _{i\ne j } \Vert X_{ij}\Vert _*\bigr )\Vert M_\mathbf{z}\Vert ,\) which yields the desired result. \(\square \)

Denote

$$\begin{aligned} {\mathcal {F}}= \Bigl \{(M,b): \Vert M\Vert \le {1/\sqrt{\lambda }}, ~~ |b| \le 1+ X_*\Vert M\Vert \Bigr \}, \end{aligned}$$
(10)

where

$$\begin{aligned} X_*= \max _{x,x'\in {\mathcal {X}}} \Vert (x-x')(x-x')^\top \Vert _*. \end{aligned}$$

From the above corollary, for any samples \(\mathbf{z}\) we can easily see that at least one optimal solution \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) belongs to the bounded set \({\mathcal {F}}\subseteq \mathbb {S}^d \times \mathbb {R}.\)

We end this subsection with two remarks. Firstly, from the proof of Lemma 1 and Corollary 2, we can easily see that, if the set of training samples contains at least two examples with distinct labels and two examples with the same label, all minimizers of formulation (2) satisfy inequality (8) and inequality (9). Hence, in this case all minimizers \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) belong to the bounded set \({\mathcal {F}}\). Consequently, we assume, without loss of generality, that any minimizer \((M_\mathbf{z}, b_\mathbf{z})\) of formulation (2) satisfies inequality (9) and belongs to the set \({\mathcal {F}}\). Secondly, our formulation (2) for metric learning focuses on the hinge loss, which is widely used in the community of metric learning, see e.g. Jin et al. (2009), Weinberger and Saul (2008), Ying and Li (2012). Results similar to those in the above corollary can easily be obtained for the q-norm loss given, for any \(x\in \mathbb {R}\), by \((1-x)_+^q\) with \(q>1\). However, it remains an open question to us how to estimate the bias term b for general loss functions.

3.2 Generalization bounds

Before stating the generalization bounds, we introduce some notations. For any \(z=(x,y), z'=(x',y')\in {\mathcal {Z}}\), let \(\Phi _{M,b} (z, z') = (1+ r(y,y')(d_{M}(x,x')-b))_+.\) Hence, for any \((M,b)\in {\mathcal {F}}\),

$$\begin{aligned} \sup _{z,z'}\sup _{(M,b)\in {\mathcal {F}}}\Phi _{M,b}(z,z') \le B_\lambda := 2\bigl (1+ X_*/\sqrt{\lambda }\bigr ). \end{aligned}$$
(11)

Let \( \lfloor {n\over 2}\rfloor \) denote the largest integer not exceeding \({n\over 2}\) and recall the definition \(X_{ij} = (x_i-x_j)(x_i-x_j)^\top \). We now define the Rademacher average over sums-of-i.i.d. sample-blocks related to the dual matrix norm \(\Vert \cdot \Vert _*\) by

$$\begin{aligned} \widehat{R}_n ={1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor } \sigma _i X_{i ({\lfloor {n\over 2}\rfloor +i})} \Bigr \Vert _*, \end{aligned}$$
(12)

and its expectation is denoted by \( R_n = {\mathbb {E}}_\mathbf{z}\bigl [\widehat{R}_n\bigr ]\). Our main theorem below shows that the generalization bounds for metric learning critically depend on the quantity \(R_n\). For this reason, we refer to \(R_n\) as the Rademacher complexity for metric learning. It is worth mentioning that the metric learning formulation (2) depends on the norm \(\Vert \cdot \Vert \) of the linear space \(\mathbb {S}^d\), while the Rademacher complexity \(R_n\) is related to its dual norm \(\Vert \cdot \Vert _*\).
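
Since \(\widehat{R}_n\) is an expectation over the Rademacher variables only, it can be approximated numerically for a given data set and a given dual norm; the sketch below (ours, with hypothetical names, using a Monte Carlo approximation of \({\mathbb {E}}_\sigma \)) is one way to do so.

```python
import numpy as np

def rademacher_average(X, dual_norm, n_draws=200, rng=None):
    """Monte Carlo approximation of the Rademacher average in Eq. (12).

    X is the n x d data matrix and dual_norm maps a d x d matrix to its dual norm ||.||_*.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    m = n // 2  # floor(n/2)
    # sample blocks X_{i,(m+i)} = (x_i - x_{m+i})(x_i - x_{m+i})^T
    blocks = np.array([np.outer(X[i] - X[m + i], X[i] - X[m + i]) for i in range(m)])
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        S = np.tensordot(sigma, blocks, axes=1)   # sum_i sigma_i X_{i,(m+i)}
        vals.append(dual_norm(S))
    return float(np.mean(vals)) / m

# For the sparse L^1-norm regularizer the dual norm is the element-wise L^infinity norm, e.g.
# rademacher_average(X, dual_norm=lambda S: np.max(np.abs(S)))
```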

Theorem 3

Let \((M_\mathbf{z}, b_\mathbf{z})\) be the solution of formulation (2). Then, for any \(0<\delta <1\), with probability \(1-\delta \) we have that

$$\begin{aligned}&{\mathcal {E}}(M_\mathbf{z},b_\mathbf{z}) - {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z}) \le \displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ] \nonumber \\&\quad \le {4R_n \over \sqrt{\lambda }} + {4(3+ 2X_*/\sqrt{\lambda }) \over \sqrt{n}}+2\bigl (1+ X_*/\sqrt{\lambda }\bigr )\left( {2\ln \bigl ({1 \over \delta }\bigr )\over {n}}\right) ^{1 \over 2}. \end{aligned}$$
(13)

Proof

The proof of the theorem can be divided into three steps as follows.

Step 1: Let \({\mathbb {E}}_\mathbf{z}\) denote the expectation with respect to samples \(\mathbf{z}\). Observe that \({\mathcal {E}}(M_\mathbf{z},b_\mathbf{z}) - {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z}) \le \displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ].\) For any \(\mathbf{z}= (z_1,\ldots ,z_{k-1},z_k,z_{k+1},\ldots , z_n)\) and \( \mathbf{z}'=(z_1,\ldots , z_{k-1},z'_k, z_{k+1},\ldots , z_n)\) we know from inequality (11) that

$$\begin{aligned} \begin{array}{ll} &{} \Bigl |\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ] -\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_{\mathbf{z}'}(M,b)\Bigr ] \Bigr |\\ &{}\quad \le \displaystyle \sup _{(M,b)\in {\mathcal {F}}}|{\mathcal {E}}_\mathbf{z}(M,b) - {\mathcal {E}}_{\mathbf{z}'}(M,b)| \\ &{}\quad ={1\over n(n-1)}\displaystyle \sup _{(M,b)\in {\mathcal {F}}}\displaystyle \sum \limits _{j\in \mathbb {N}_n,j\ne k}|\Phi _{M,b}(z_k,z_j) - \Phi _{M,b}(z'_k,z_j)|\\ &{}\quad \le {1\over n(n-1)}\displaystyle \sup _{(M,b)\in {\mathcal {F}}}\displaystyle \sum \limits _{j\in \mathbb {N}_n,j\ne k}|\Phi _{M,b}(z_k,z_j)|+|\Phi _{M,b}(z'_k,z_j)| \\ &{}\quad \le 4\bigl (1+ X_*/\sqrt{\lambda }\bigr )/n. \end{array} \end{aligned}$$

Applying McDiarmid’s inequality (McDiarmid 1989) (see Lemma 6 in the “Appendix”) to the term \(\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\), with probability \(1-{\delta }\) there holds

$$\begin{aligned} \displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\le & {} {\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ] \nonumber \\&+ 2\bigl (1+ X_*/\sqrt{\lambda }\bigr )\left( {2\ln \bigl ({1\over \delta }\bigr )\over n}\right) ^{1\over 2}. \end{aligned}$$
(14)

Now we only need to estimate the first term on the right-hand side of the above inequality, which is in expectation form, by symmetrization techniques.

Step 2: To estimate \({\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\), applying Lemma 7 with \(q_{(M,b)} (z_i,z_j) ={\mathcal {E}}(M,b) - (1+ r(y_i,y_j)(d_M(x_i,x_j)-b))_+\) implies that

$$\begin{aligned} {\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\le {\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\overline{{\mathcal {E}}}}_\mathbf{z}(M,b)\Bigr ], \end{aligned}$$
(15)

where \(\overline{{\mathcal {E}}}_\mathbf{z}(M,b)={1\over \lfloor {n\over 2}\rfloor } \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\Phi _{M,b}(z_{i},z_{\lfloor {n\over 2}\rfloor +i}).\) Now let \(\bar{\mathbf{z}} = \{\bar{z}_1,\bar{z}_2,\ldots , \bar{z}_{n}\}\) be i.i.d. samples which are independent of \(\mathbf{z}\), then

$$\begin{aligned} {\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - \overline{{\mathcal {E}}}_\mathbf{z}(M,b)\Bigr ]= & {} {\mathbb {E}}_{\mathbf{z}}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [ {\mathbb {E}}_{\bar{\mathbf{z}}} \bigl [ \overline{{\mathcal {E}}}_{\bar{\mathbf{z}}}(M,b)\bigr ] - \overline{{\mathcal {E}}}_\mathbf{z}(M,b)\Bigr ] \nonumber \\\le & {} {\mathbb {E}}_{\mathbf{z},\bar{\mathbf{z}}}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [ \ \overline{{\mathcal {E}}}_{\bar{\mathbf{z}}}(M,b) - \overline{{\mathcal {E}}}_\mathbf{z}(M,b)\Bigr ]. \end{aligned}$$
(16)

By standard symmetrization techniques (see e.g. Bartlett and Mendelson 2002), for i.i.d. Rademacher variables \(\{\sigma _i \in \{\pm 1\}: i\in \mathbb {N}_{\lfloor {n\over 2}\rfloor }\}\), we have that

$$\begin{aligned}&{\mathbb {E}}_{\mathbf{z},\bar{\mathbf{z}}}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [ {\overline{{\mathcal {E}}}}_{\bar{\mathbf{z}}}(M,b) - {\overline{{\mathcal {E}}}}_\mathbf{z}(M,b)\Bigr ] \nonumber \\&\quad ={\mathbb {E}}_{\mathbf{z},\bar{\mathbf{z}},\sigma } {1\over \lfloor {n\over 2}\rfloor }\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Bigl [\Phi _{M,b}(\bar{z}_{i},\bar{z}_{\lfloor {n\over 2}\rfloor +i}) - \Phi _{M,b}({z}_{i},{z}_{\lfloor {n\over 2}\rfloor +i})\Bigr ] \nonumber \\&\quad \le 2{\mathbb {E}}_{\mathbf{z},\sigma }{1\over \lfloor {n\over 2}\rfloor }\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Phi _{M,b}({z}_{i},{z}_{\lfloor {n\over 2}\rfloor +i})\nonumber \\&\quad \le 2{\mathbb {E}}_{\mathbf{z},\sigma }{1\over \lfloor {n\over 2}\rfloor }\displaystyle \sup _{(M,b)\in {\mathcal {F}}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Phi _{M,b}({z}_{i},{z}_{\lfloor {n\over 2}\rfloor +i})\Bigr |. \end{aligned}$$
(17)

Applying the contraction property of Rademacher averages (see Lemma 8 in the “Appendix”) with \(\Psi _i(t) = \bigl (1+ r(y_i, y_{\lfloor {n\over 2}\rfloor +i}) t \bigr )_+ - 1\), we have the following estimation for the last term on the right-hand side of the above inequality:

$$\begin{aligned}&{\mathbb {E}}_{\sigma }{1\over \lfloor {n\over 2}\rfloor }\displaystyle \sup _{(M,b)\in {\mathcal {F}}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Phi _{M,b}({z}_{i},{z}_{\lfloor {n\over 2}\rfloor +i})\Bigr | \nonumber \\&\quad \le {\mathbb {E}}_{\sigma } {1\over \lfloor {n\over 2}\rfloor } \displaystyle \sup _{(M,b)\in {\mathcal {F}}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (\Phi _{M,b}({z}_{i},{z}_{\lfloor {n\over 2}\rfloor +i}) -1 )\Bigr | + {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_{\sigma }\Bigl |\displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i \Bigr | \nonumber \\&\quad \le {2\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_{\sigma } \displaystyle \sup _{(M,b)\in {\mathcal {F}}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i \bigl ( d_M(x_i,x_{\lfloor {n\over 2}\rfloor +i}) -b \bigr )\Bigr | + {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_{\sigma }\Bigl |\displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Bigr | \nonumber \\&\quad \le {2\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_{\sigma } \displaystyle \sup _{\Vert M\Vert \le {1\over \sqrt{\lambda }}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i d_M(x_i,x_{\lfloor {n\over 2}\rfloor +i})\Bigr | + {(3+ 2X_*/\sqrt{\lambda })\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_{\sigma }\Bigl |\displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Bigr |. \end{aligned}$$
(18)

Step 3: It remains to estimate the terms on the right-hand side of inequality (18). To this end, observe that

$$\begin{aligned} {\mathbb {E}}_{\sigma }\Bigl |\displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Bigr | \le \left( {\mathbb {E}}_{\sigma }\Bigl |\displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\Bigr |^2\right) ^{1\over 2}\le \sqrt{\lfloor {n\over 2}\rfloor }. \end{aligned}$$

Moreover,

$$\begin{aligned} \begin{array}{ll}{\mathbb {E}}_\sigma \displaystyle \sup _{\Vert M\Vert \le {1\over \sqrt{\lambda }}}\Bigl | \displaystyle \sum \limits _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i d_M(x_i,x_{\lfloor {n\over 2}\rfloor +i})\Bigr | &{} = {\mathbb {E}}_\sigma \displaystyle \sup _{\Vert M\Vert \le {1\over \sqrt{\lambda }}}\Bigl |\big \langle \displaystyle \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x_i-x_{\lfloor {n\over 2}\rfloor +i})(x_i-x_{\lfloor {n\over 2}\rfloor +i})^\top , M\big \rangle \Bigr | \\ &{} \le {1\over \sqrt{\lambda }}{\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i X_{i {(\lfloor {n\over 2}\rfloor +i})}\Bigr \Vert _*. \end{array} \end{aligned}$$

Putting the above estimations and inequalities (17), (18) together yields that

$$\begin{aligned} {\mathbb {E}}_{\mathbf{z},\bar{\mathbf{z}}}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [ {\overline{{\mathcal {E}}}}_{\bar{\mathbf{z}}}(M,b) - {\overline{{\mathcal {E}}}}_\mathbf{z}(M,b)\Bigr ] \le {2(3+ 2X_*/\sqrt{\lambda }) \over \sqrt{\lfloor {n\over 2}\rfloor }} + {4R_n \over \sqrt{\lambda }} \le {4(3+ 2X_*/\sqrt{\lambda }) \over \sqrt{n}} + {4R_n \over \sqrt{\lambda }}. \end{aligned}$$

Consequently, combining this with inequalities (15), (16) implies that

$$\begin{aligned} {\mathbb {E}}_\mathbf{z}\displaystyle \sup _{(M,b)\in {\mathcal {F}}} \Bigl [{\mathcal {E}}(M,b) - {\mathcal {E}}_\mathbf{z}(M,b)\Bigr ]\le {4(3+ 2X_*/\sqrt{\lambda }) \over \sqrt{n}} + {4R_n \over \sqrt{\lambda }}. \end{aligned}$$

Combining this estimation with (14) completes the proof of the theorem. \(\square \)

In the setting of similarity learning, \(X_*\) and \(R_n\) are replaced by

$$\begin{aligned} \widetilde{X}_*= \sup _{x,t\in {\mathcal {X}}} \Vert x t^\top \Vert _*\quad \hbox {and} \quad \widetilde{R}_n = {1\over \lfloor {n\over 2}\rfloor }{\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i \widetilde{X}_{i ({\lfloor {n\over 2}\rfloor +i})}\Bigr \Vert _*, \end{aligned}$$
(19)

where \(\widetilde{X}_{i ({\lfloor {n\over 2}\rfloor +i})} = x_i x^\top _{\lfloor {n\over 2}\rfloor +i}.\) Let \(\widetilde{{\mathcal {F}}} = \Bigl \{(M,b): \Vert M\Vert \le {1/\sqrt{\lambda }}\), \(|b| \le 1+ \widetilde{X}_*\Vert M\Vert \Bigr \}\). Using exactly the same argument as above, we can prove the following bound for the similarity learning formulation (4).

Theorem 4

Let \((\widetilde{M}_\mathbf{z}, \widetilde{b}_\mathbf{z})\) be the solution of formulation (4). Then, for any \(0<\delta <1\), with probability \(1-\delta \) we have that

$$\begin{aligned}&\widetilde{{\mathcal {E}}}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z}) - \widetilde{{\mathcal {E}}}_\mathbf{z}(\widetilde{M}_\mathbf{z},\widetilde{b}_\mathbf{z}) \le \displaystyle \sup _{(M,b)\in \widetilde{{\mathcal {F}}}} \Bigl [\widetilde{{\mathcal {E}}}(M,b) - \widetilde{{\mathcal {E}}}_\mathbf{z}(M,b)\Bigr ] \nonumber \\&\quad \le {4\widetilde{R}_n \over \sqrt{\lambda }} + {4(3+ 2\widetilde{X}_*/\sqrt{\lambda }) \over \sqrt{n}} + 2\bigl (1+ \widetilde{X}_*/\sqrt{\lambda }\bigr )\left( {2\ln \bigl ({1\over \delta }\bigr )\over {n}}\right) ^{1\over 2}. \end{aligned}$$
(20)

4 Estimation of \(R_n\) and discussion

From Theorem 3, we need to estimate the Rademacher average for metric learning, i.e. \(R_n \), and the quantity \(X_*\) for different matrix regularization terms. In the following, we focus on popular matrix norms such as the Frobenius norm (Jin et al. 2009), \(L^1\)-norm (Rosales and Fung 2006), trace-norm (Ying et al. 2009; Shen et al. 2009) and mixed (2, 1)-norm (Ying et al. 2009).

Example 1

(Frobenius norm) Let the matrix norm be the Frobenius norm, i.e. \(\Vert M \Vert = \Vert M\Vert _F\). Then the quantity \(X_*= \sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert ^2_F\) and the Rademacher complexity \(R_n\) is estimated as follows:

$$\begin{aligned} R_n \le {2 X_*\over \sqrt{n }}= {2\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_F \over \sqrt{n}}. \end{aligned}$$

Let \((M_\mathbf{z},b_\mathbf{z})\) be a solution of formulation (2) with Frobenius norm regularization. For any \(0<\delta <1\), with probability \(1-\delta \) there holds

$$\begin{aligned} {\mathcal {E}}(M_\mathbf{z},b_\mathbf{z}) - {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z})\le & {} 2\Big (1+ {\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_F \over \sqrt{\lambda }}\Big )\sqrt{{2\ln \bigl ({1\over \delta }\bigr )\over n}} \nonumber \\&+ {16\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_F \over \sqrt{n\lambda }} + { 12 \over \sqrt{n}}. \end{aligned}$$
(21)

Proof

Note that the dual norm of the Frobenius norm is itself. The estimation of \(X_*\) is straightforward. The Rademacher complexity \(R_n\) is estimated as follows:

$$\begin{aligned} R_n= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}\left( \sum _{i,j=1}^{\lfloor {n\over 2}\rfloor } \sigma _i\sigma _j \langle x_i - x_{\lfloor {n\over 2}\rfloor +i}, x_j - x_{\lfloor {n\over 2}\rfloor +j}\rangle ^2 \right) ^{1\over 2}\nonumber \\\le & {} {1\over \lfloor {n\over 2}\rfloor }{\mathbb {E}}_\mathbf{z}\left( {\mathbb {E}}_\sigma \sum _{i,j=1}^{\lfloor {n\over 2}\rfloor }\sigma _i\sigma _j \langle x_i - x_{\lfloor {n\over 2}\rfloor +i}, x_j - x_{\lfloor {n\over 2}\rfloor +j}\rangle ^2 \right) ^{1\over 2} \nonumber \\= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}\left( \sum _{i=1}^{\lfloor {n\over 2}\rfloor } \Vert x_i - x_{\lfloor {n\over 2}\rfloor +i}\Vert _F^4 \right) ^{1\over 2} \nonumber \\\le & {} {X_*\big /\sqrt{\lfloor {n\over 2}\rfloor }} \le {2 X_*\over \sqrt{n }}. \end{aligned}$$

Putting the above estimation back into Eq. (13) completes the proof of Example 1. \(\square \)

Other popular matrix norms for metric learning are the \(L^1\)-norm, trace-norm and mixed (2, 1)-norm. Their dual norms are, respectively, the \(L^\infty \)-norm, the spectral norm (i.e. the maximum of the singular values) and the mixed \((2,\infty )\)-norm. All these dual norms are bounded by the Frobenius norm. Hence, the following estimation always holds true for all the norms mentioned above:

$$\begin{aligned} X_*\le \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_F, \quad \hbox {and} \quad R_n \le {2\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_F \over \sqrt{n}}. \end{aligned}$$

Consequently, the generalization bound (21) holds true for metric learning formulation (2) with \(L^1\)-norm, trace-norm or mixed (2, 1)-norm regularization. However, in some cases, the above upper bounds are too conservative. For instance, in the following examples we show that a more refined estimation of \(R_n\) can be obtained by applying the Khinchin inequalities for Rademacher averages (Peña and Giné 1999).
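
The comparison of dual norms used above can be checked numerically in a few lines; the snippet below (purely illustrative) verifies, for a random symmetric matrix, that the \(L^\infty \)-norm, the spectral norm and the mixed \((2,\infty )\)-norm are all bounded by the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
M = (A + A.T) / 2                                 # a random symmetric matrix

fro = np.linalg.norm(M)                           # Frobenius norm
linf = np.max(np.abs(M))                          # dual of the element-wise L^1 norm
spec = np.linalg.norm(M, 2)                       # spectral norm, dual of the trace norm
mixed_2inf = np.max(np.linalg.norm(M, axis=1))    # dual of the mixed (2,1)-norm

assert max(linf, spec, mixed_2inf) <= fro
```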

Example 2

(Sparse \(L^1\)-norm) Let the matrix norm be the \(L^1\)-norm i.e. \(\Vert M \Vert \!=\! \sum _{\ell ,k\in \mathbb {N}_d} |M_{\ell k}|\). Then, \(X_*= \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^2\) and

$$\begin{aligned} R_n \le 4 \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^2\sqrt{e \log d \over n}. \end{aligned}$$

Let \((M_\mathbf{z}, b_\mathbf{z})\) be a solution of formulation (2) with \(L^1\)-norm regularization. For any \(0<\delta <1\), with probability \(1-\delta \) there holds

$$\begin{aligned} {\mathcal {E}}(M_\mathbf{z},b_\mathbf{z})-{\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z})\le & {} 2\Big (1+ {\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_\infty \over \sqrt{\lambda }}\Big )\sqrt{{2\ln \bigl ({1\over \delta }\bigr )\over n}}\nonumber \\&+{8\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert ^2_\infty (1 + 2\sqrt{e\log d})\over \sqrt{n\lambda }} + { 12 \over \sqrt{n}}. \end{aligned}$$
(22)

Proof

The dual norm of the \(L^1\)-norm is the \(L^\infty \)-norm. Hence, \( X_*= \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^2.\) To estimate \(R_n\), we observe, for any \(1<q<\infty \), that

$$\begin{aligned} {R}_n= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i X_{i ({\lfloor {n\over 2}\rfloor +i})}\Bigr \Vert _\infty \le {1\over \lfloor {n\over 2}\rfloor }{\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i X_{i ({\lfloor {n\over 2}\rfloor +i})}\Bigr \Vert _q \nonumber \\:= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \bigg ( {\sum }_{\ell ,k\in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i})\bigr |^q \bigg )^{1\over q}\nonumber \\\le & {} {1\over \lfloor {n\over 2}\rfloor }{\mathbb {E}}_\mathbf{z}\bigg ({\sum }_{\ell , k \in \mathbb {N}_d} {\mathbb {E}}_\sigma \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor } \sigma _i(x^k_i-x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i-x^\ell _{{\lfloor {n\over 2}\rfloor }+i})\bigr |^q \bigg )^{1\over q}, \end{aligned}$$
(23)

where \(x_i^k\) represents the k-th coordinate of the vector \(x_i\in \mathbb {R}^d.\) To estimate the term on the right-hand side of inequality (23), applying the Khinchin–Kahane inequality (see Lemma 9 in the “Appendix”) with \(p=2<q<\infty \) yields that

$$\begin{aligned}&{\mathbb {E}}_\sigma \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor } \sigma _i(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^q \nonumber \\&\quad \le q^{q\over 2}\left( {\mathbb {E}}_\sigma \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor } \sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^2\right) ^{q\over 2} \nonumber \\&\quad = q^{q\over 2} \left( \sum _{i=1}^{\lfloor {n\over 2}\rfloor }(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})^2 (x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i})^2\right) ^{q\over 2} \nonumber \\&\quad \le \max _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^{2q} \left( \lfloor {n\over 2}\rfloor \right) ^{q\over 2} q^{q\over 2}. \end{aligned}$$
(24)

Putting the above estimation back into (23) and letting \(q= 4\log d\) implies that

$$\begin{aligned} \begin{array}{ll} {R}_n &{} \le \max _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^{2} d^{2 \over q}\sqrt{q} \big / \sqrt{ \lfloor {n\over 2}\rfloor } = 2 \displaystyle \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^{2}\sqrt{ e \log d \big / \lfloor {n\over 2}\rfloor } \\ &{} \le 4 \displaystyle \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty ^{2}\sqrt{ e \log d \big / n}. \end{array} \end{aligned}$$

Putting the estimations for \(X_*\) and \(R_n\) into Theorem 3 yields inequality (22). This completes the proof of Example 2. \(\square \)

Example 3

(Mixed (2, 1)-norm) Consider \(\Vert M \Vert = \sum _{\ell \in \mathbb {N}_d} \sqrt{\sum _{k\in \mathbb {N}_d}|M_{\ell k}|^2}.\) Then, we have \(X_*= \bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\bigr ]\bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \bigr ],\) and

$$\begin{aligned} R_n \le 4 \bigl [ \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \bigr ]\bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\bigr ] \sqrt{e \log d \over n}. \end{aligned}$$

Let \((M_\mathbf{z}, b_\mathbf{z})\) be a solution of formulation (2) with mixed (2, 1)-norm. For any \(0<\delta <1\), with probability \(1-\delta \) there holds

$$\begin{aligned}&{\mathcal {E}}(M_\mathbf{z},b_\mathbf{z}) - {\mathcal {E}}_\mathbf{z}(M_\mathbf{z},b_\mathbf{z}) \nonumber \\&\quad \le 2\left( 1+{\bigl [ \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \bigr ]\bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\bigr ] \over \sqrt{\lambda }}\right) \sqrt{{2\ln \bigl ({1\over \delta }\bigr )\over n}} \nonumber \\&\quad + {8\bigl [ \sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \bigr ]\bigl [\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\bigr ] (1 + 2\sqrt{e\log d}) \over \sqrt{n\lambda }} + { 12 \over \sqrt{n}}. \end{aligned}$$
(25)

Proof

The estimation of \(X_*\) is straightforward and we estimate \(R_n\) as follows. There holds

$$\begin{aligned} {R}_n= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \Bigl \Vert \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i X_{i ({\lfloor {n\over 2}\rfloor +i})}\Bigr \Vert _{(2,\infty )}\nonumber \\= & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}{\mathbb {E}}_\sigma \sup \nolimits _{\ell \in \mathbb {N}_d} \left( {\sum }_{k\in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^2\right) ^{1\over 2}\nonumber \\\le & {} {1\over \lfloor {n\over 2}\rfloor } {\mathbb {E}}_\mathbf{z}\left( {\sum }_{k\in \mathbb {N}_d} {\mathbb {E}}_\sigma \sup \nolimits _{\ell \in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^2\right) ^{1\over 2}. \end{aligned}$$
(26)

It remains to estimate the terms inside the parenthesis on the right-hand side of the above inequality. To this end, we observe, for any \(q'>1\), that

$$\begin{aligned} \begin{array}{ll} &{} {\mathbb {E}}_\sigma \sup \nolimits _{\ell \in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^2 \\ &{}\quad \le {\mathbb {E}}_\sigma \left( {\sum }_{\ell \in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^{2q'}\right) ^{1\over q'}\\ &{}\quad \le \left( {\sum }_{\ell \in \mathbb {N}_d} {\mathbb {E}}_\sigma \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^{2q'}\right) ^{1\over q'}. \end{array} \end{aligned}$$

Applying the Khinchin–Kahane inequality (Lemma 9 in the “Appendix”) with \(q = 2q' =4 \log d\) and \(p=2\) to the above inequality yields that

$$\begin{aligned} \begin{array}{ll}&{} {\mathbb {E}}_\sigma \sup \nolimits _{\ell \in \mathbb {N}_d} \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^2 \\ &{}\quad \le \left( {\sum }_{\ell \in \mathbb {N}_d} (2q')^{q'} \bigl [{\mathbb {E}}_\sigma \bigl | \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\sigma _i (x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i}) \bigr |^{2}\bigr ]^{q'}\right) ^{1\over q'}\\ &{}\quad = \left( {\sum }_{\ell \in \mathbb {N}_d} (2q')^{q'} \bigl [\sum _{i=1}^{\lfloor {n\over 2}\rfloor }(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})^2(x^\ell _i- x^\ell _{{\lfloor {n\over 2}\rfloor }+i})^2 \bigr ]^{q'}\right) ^{1\over q'}\\ &{}\quad \le 2q' \sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert ^2_\infty d^{1\over q'}\bigl [\sum _{i=1}^{\lfloor {n\over 2}\rfloor }(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})^2\bigr ] \\ &{}\quad \le 4 e(\log d)\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert ^2_\infty \bigl [\sum _{i=1}^{\lfloor {n\over 2}\rfloor }(x^k_i- x^k_{{\lfloor {n\over 2}\rfloor }+i})^2\bigr ]. \end{array} \end{aligned}$$

Putting the above estimation back into (26) implies that

$$\begin{aligned} \begin{array}{ll} R_n &{} \le {\sqrt{4 e\log d} \bigl [\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert _\infty \bigr ]{\mathbb {E}}_\mathbf{z}\left( \sum _{i=1}^{\lfloor {n\over 2}\rfloor }\Vert x_i - x_{{\lfloor {n\over 2}\rfloor }+i}\Vert _F^2\right) ^{1\over 2}\big /{\lfloor {n\over 2}\rfloor }}\\ &{}\le {\sqrt{4 e\log d} \bigl [\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert _\infty \bigr ]\bigl [\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert _F\bigr ]\big / \sqrt{\lfloor {n\over 2}\rfloor }} \\ &{} \le {4\sqrt{ e\log d}\bigl [\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert _\infty \bigr ]\bigl [\sup \nolimits _{x,x'\in {\mathcal {X}}} \Vert x-x' \Vert _F\bigr ] \big / \sqrt{n}}. \end{array} \end{aligned}$$

Combining this with Theorem 3 implies the inequality (25). This completes the proof of the example. \(\square \)

In the Frobenius-norm case, the main term of the bound (21) is \({\mathcal {O}}\bigl ( {\sup _{x,x'\in {\mathcal {X}}}\Vert x-x'\Vert _F^{2} \over \sqrt{n \lambda }}\bigr )\). This bound is consistent with that given by Jin et al. (2009), where \(\sup _{x\in {\mathcal {X}}} \Vert x\Vert _F\) is assumed to be bounded by some constant B. Comparing the generalization bounds in the above examples, we see that the key terms \(X_*\) and \(R_n\) mainly differ in two quantities, i.e. \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F\) and \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty .\) We argue that \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \) can be much less than \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F.\) For instance, consider the input space \({\mathcal {X}}= [0,1]^d.\) It is easy to see that \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _F = \sqrt{d}\) while \(\sup _{x,x'\in {\mathcal {X}}} \Vert x-x'\Vert _\infty \equiv 1.\) Consequently, we can summarize the estimations as follows:

  • Frobenius-norm: \(X_*=d\), and \(R_n \le {2 d \over \sqrt{n}}\).

  • Sparse \(L^1\)-norm: \(X_*= 1\), and \(R_n \le {4 \sqrt{e \log d}\over \sqrt{n}}.\)

  • Mixed (2, 1)-norm: \(X_*= \sqrt{d}\), and \(R_n \le {4 \sqrt{e d\log d}\over \sqrt{n}}.\)

Therefore, when d is large, the generalization bound with sparse \(L^1\)-norm regularization is much better than that with Frobenius-norm regularization, while the bound with mixed (2, 1)-norm regularization lies between the two. These theoretical results are nicely consistent with the rationale that sparse methods are more effective in dealing with high-dimensional data.
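
To make the comparison concrete, one can simply evaluate the three upper bounds on \(R_n\) listed above for \({\mathcal {X}}=[0,1]^d\) at a fixed sample size; the sketch below (ours, illustrative only) prints them for a few values of d.

```python
import numpy as np

n = 1000  # sample size
for d in (10, 100, 1000, 10000):
    fro = 2 * d / np.sqrt(n)                                 # Frobenius-norm regularization
    l1 = 4 * np.sqrt(np.e * np.log(d)) / np.sqrt(n)          # sparse L^1-norm regularization
    mixed = 4 * np.sqrt(np.e * d * np.log(d)) / np.sqrt(n)   # mixed (2,1)-norm regularization
    print(f"d={d:6d}  Frobenius: {fro:9.3f}  L1: {l1:6.3f}  (2,1)-norm: {mixed:9.3f}")
```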

We end this section with two remarks. Firstly, in the setting of trace-norm regularization, it remains a question to us how to establish a more accurate estimation of \(R_n\) by using the Khinchin–Kahane inequality. Secondly, the bounds in the above examples also hold true for similarity learning with the corresponding matrix-norm regularization. Indeed, the generalization bound for similarity learning in Theorem 4 tells us that it suffices to estimate \(\widetilde{X}_*\) and \(\widetilde{R}_n\). In analogy to the arguments in the above examples, we can get the following results. For similarity learning formulation (4) with Frobenius-norm regularization, there holds

$$\begin{aligned} \widetilde{X}_*= \sup _{x\in {\mathcal {X}}} \Vert x\Vert ^2_F, \quad \widetilde{R}_n \le {2\sup _{x\in {\mathcal {X}}} \Vert x\Vert ^2_F\over \sqrt{n}}. \end{aligned}$$

For \(L^1\)-norm regularization, we have

$$\begin{aligned} \widetilde{X}_*= \sup _{x\in {\mathcal {X}}} \Vert x\Vert ^2_\infty , \quad \widetilde{R}_n \le {4\sup _{x\in {\mathcal {X}}} \Vert x\Vert ^2_\infty \sqrt{e\log d }\big / \sqrt{n}}. \end{aligned}$$

In the setting of (2, 1)-norm, we obtain

$$\begin{aligned} \widetilde{X}_*=\sup _{x\in {\mathcal {X}}} \Vert x\Vert _\infty \sup _{x\in {\mathcal {X}}} \Vert x\Vert _F, \quad \widetilde{R}_n \le {4\bigl [\sup _{x\in {\mathcal {X}}} \Vert x\Vert _F \sup _{x\in {\mathcal {X}}} \Vert x\Vert _\infty \bigr ] \sqrt{e\log d}\big / \sqrt{n}}. \end{aligned}$$

Putting these estimations back into Theorem 4 yields generalization bounds for similarity learning with different matrix norms. For simplicity, we omit the details here.

5 Conclusion and discussion

In this paper we are mainly concerned with the theoretical generalization analysis of regularized metric and similarity learning. In particular, we first showed that the generalization analysis for metric/similarity learning reduces to the estimation of the Rademacher average over “sums-of-i.i.d.” sample-blocks. Then, we derived generalization bounds with different matrix regularization terms. Our analysis indicates that sparse metric/similarity learning with \(L^1\)-norm regularization could lead to significantly better bounds than those with Frobenius-norm regularization, especially when the dimensionality of the input data is high. Our novel generalization analysis develops the techniques of U-statistics (Peña and Giné 1999; Clémencon et al. 2008) and Rademacher complexity analysis (Bartlett and Mendelson 2002; Koltchinskii and Panchenko 2002). Below we mention several questions remaining to be further studied.

Firstly, in Sect. 4, the derived bounds for metric and similarity learning with trace-norm regularization were the same as those with Frobenius-norm regularization. It would be very interesting to derive bounds similar to those with sparse \(L^1\)-norm regularization. The key issue is to estimate the Rademacher complexity term (12) related to the spectral norm using the Khinchin–Kahane inequality. However, we are not aware of such Khinchin–Kahane inequalities for general matrix spectral norms.

Secondly, this study only investigated the generalization bounds for metric and similarity learning. We can get a consistency estimation for \(\Vert M_\mathbf{z} - M_*\Vert _F^2\) under very strong assumptions on the loss function and the underlying distribution. In particular, assume that the loss function is the least squares loss and that the bias term b is fixed (e.g. \(b \equiv 0\)), and let \(M_*= \arg \min _{M\in \mathbb {S}^d}{\mathcal {E}}(M,0)\); then we have

$$\begin{aligned} {\mathcal {E}}(M_\mathbf{z},0)- {\mathcal {E}}(M_*,0)= & {} \iint \langle M_\mathbf{z}-M_*, x (x')^T \rangle ^2 d\rho (x)d\rho (x') \nonumber \\= & {} \langle {\mathcal {C}}(M_\mathbf{z}-M_*), M_\mathbf{z}-M_*\rangle . \end{aligned}$$
(27)

Here, \({{\mathcal {C}}}\) is a \(d^2\times d^2\) matrix representing a linear mapping from \(\mathbb {S}^d\) to \(\mathbb {S}^d\):

$$\begin{aligned} {\mathcal {C}}= \iint (x(x')^T)\otimes (x (x')^T) d\rho (x)d\rho (x'). \end{aligned}$$

Here, the notation \(\otimes \) represents the tensor product of matrices. Equation (27) implies that \({\mathcal {E}}(M_\mathbf{z},0)- {\mathcal {E}}(M_*,0)=\iint \langle M_\mathbf{z}-M_*, x (x')^T \rangle ^2 d\rho (x)d\rho (x')\ge \lambda _{\min } ({\mathcal {C}}) \Vert M_\mathbf{z} - M_*\Vert _F^2,\) where \(\lambda _{\min }({\mathcal {C}})\) is the minimum eigenvalue of the \(d^2\times d^2\) matrix \({\mathcal {C}}\). Consequently, under the assumption that \({\mathcal {C}}\) is non-singular, we can get the consistency estimation for \(\Vert M_\mathbf{z} - M_*\Vert _F^2\) for the least squares loss. For the hinge loss, the equality (27) no longer holds. Hence, it remains an open question how to get the consistency estimation for metric and similarity learning under general loss functions.

Thirdly, in many applications involving multi-media data, different aspects of the data may lead to several different, and apparently equally valid, notions of similarity. This leads to a natural question on how to combine multiple similarities and metrics for a unified data representation. An extension of the multiple kernel learning approach was proposed in McFee and Lanckriet (2011) to address this issue. It would be very interesting to investigate the theoretical generalization analysis for this multi-modal similarity learning framework. A possible starting point would be the techniques established for learning the kernel problem (Ying and Campbell 2009, 2010).

Finally, the target of supervised metric learning is to improve the generalization performance of kNN classifiers. It remains a challenging question to investigate how the generalization performance of kNN classifiers relates to the generalization bounds of metric learning given here.