1 Introduction

The performance of many machine learning and data mining algorithms depends on the metric used to compute distances between data points. Generic measures such as the Euclidean distance or Cosine similarity in the input space often fail to discriminate between different classes or clusters of data. Therefore, learning an optimal Distance/Similarity function from training information has been actively studied over the last decade.

Distance Metric Learning (DML) methods aim to bring semantically similar data items together while keeping dissimilar ones far apart. One major challenge for DML algorithms is scalability to both the size and the dimension of the input data [1]. To process the massive volumes of data generated in today's applications, online methods have been proposed.

Many of these algorithms are based on the Passive-Aggressive (PA) approach [2,3,4,5,6,7,8]. The main advantages of PA-based methods are 1) a closed-form solution and 2) an adaptive learning rate leading to a high convergence rate. However, the following challenges still need to be addressed:

  1. These algorithms are based on the Hinge loss and hence are not robust against outliers and noisy labeled data. Nowadays, many modern datasets are collected from the Internet using crowdsourcing or similar techniques, so examples with wrong labels are common and can considerably degrade the performance of existing online DML methods.

  2. Most DML algorithms learn a metric from pair or triplet side information defined as:

$$ {\displaystyle \begin{array}{c}S=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{and}\ {\boldsymbol{x}}_j\ \text{are similar}\right\}\\ {}D=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{and}\ {\boldsymbol{x}}_j\ \text{are dissimilar}\right\}\\ {}T=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_i^{+},{\boldsymbol{x}}_i^{-}\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{should be closer to}\ {\boldsymbol{x}}_i^{+}\ \text{than to}\ {\boldsymbol{x}}_i^{-}\right\}\end{array}} $$

Existing online methods [3,4,5,6,7,8] usually assume that training triplets or pairs are available in advance. However, this assumption does not always hold, and generating constraints with available batch sampling methods is both time and space consuming. Thus, an efficient one-pass sampling algorithm is needed for online tasks.

  3. Another important challenge in online DML applications, particularly in the machine vision domain, is the high dimensionality of the input data. Many existing methods learn a Mahalanobis distance [3, 5, 6, 8] or a bilinear similarity [2, 3] that requires O(d²) parameters (d denotes the data dimension). Therefore, these methods are infeasible in high-dimensional environments.

The main contributions of the paper to overcome these issues are as follows:

  1. We address the first challenge by formulating the online Distance/Similarity learning task using the robust Rescaled Hinge loss [9]. The proposed model is rather general, and we can easily apply it to any existing PA-based method. It significantly improves the robustness of existing methods in the presence of label noise without increasing their computational complexity.

  2. The second challenge is tackled by developing an efficient robust one-pass triplet construction algorithm.

  3. Finally, we overcome the third challenge by developing low-rank versions of the proposed methods that learn a rectangular projection matrix instead of a full Mahalanobis matrix. These approaches not only decrease the computational cost significantly but also retain the predictive performance of the learned metrics. Moreover, the low-rank projection matrix can easily be replaced with a nonlinear deep neural network model, so extending our methods to online deep metric learning is straightforward.

Table 1 summarizes the main notations used throughout the paper. The rest of the paper is organized as follows: Section 2 reviews related work. In Section 3, we present the formulation of the online Distance/Similarity learning problem using the Rescaled Hinge loss as well as the development of the proposed algorithms. Experiments conducted to evaluate the proposed methods are discussed in Section 4. Finally, Section 5 concludes with remarks and recommendations for future work.

Table 1 Summary of the main notations

2 Related work

DML is a well-studied problem that has attracted much interest over the last decade. We refer interested readers to the surveys [1, 10] for a complete review of existing work. In this section, we focus only on related online Distance/Similarity learning algorithms. Most existing online learning methods learn a Mahalanobis distance [4,5,6,7,8] or a bilinear similarity [2, 3], although some more generic measures [5, 11] have also been presented.

Mahalanobis-based methods learn a matrix M ≽ 0 that parameterizes the squared distance:

$$ {d}_{\boldsymbol{M}}{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)}^2={\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right)}^{\top}\boldsymbol{M}\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right) $$
(1)

Since M ≽ 0, it can be decomposed as M = LL^⊤, where L ∈ ℝ^{d×r} and r = rank(M). Therefore, Mahalanobis distance learning is equivalent to finding a linear transformation L of the input space. In contrast, bilinear similarity-based methods learn a similarity matrix M given by:

$$ {S}_{\boldsymbol{M}}{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)}^2={{\boldsymbol{x}}_i}^{\top }{\boldsymbol{M}\boldsymbol{x}}_j $$
(2)

The optimization problem of both Mahalanobis and bilinear methods is formulated based on the PA approach as follows:

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}=\arg \underset{\boldsymbol{M}}{\ \min }\ reg\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern1em l\left(\boldsymbol{M},{R}_t\right)\le \xi, \kern1em \xi \ge 0,\kern1em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(3)

where Mt is the current Distance/Similarity matrix at time step t, reg(M, Mt ) is a regularization term, and l(M, Rt) indicates the margin-based Hinge loss function. In distance-based methods, the Hinge loss is defined as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1+{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)}^2-{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)}^2\right\} $$
(4)

whereas it is defined in similarity-based methods as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1-{S}_M{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)}^2+{S}_M{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)}^2\right\} $$
(5)

OASIS [2] is a popular bilinear similarity learning method that uses the Frobenius norm as a regularization term, i.e., \( reg\left(\boldsymbol{M},{\boldsymbol{M}}_t\ \right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2. \) OASIS drops the p.s.d. (positive semi-definite) constraint for scalability reasons. However, this constraint is very useful for producing a low-rank metric as well as for preventing overfitting.

OKS [3] extends OASIS to the feature space of an RKHS kernel. In addition, [3] presents OMKS, an extension of OKS for multi-modal data.

ODML and OMDML [4] are similar to OKS and OMKS respectively, but they learn a Mahalanobis distance instead of a bilinear similarity. To enforce the p.s.d. constraint, these methods use a full eigenvalue decomposition that requires O(d³) operations at each step. Therefore, they are infeasible for high-dimensional DML tasks. To address this problem, LSMDML [8] utilizes DRP (Dual Random Projection) [12] in an online multi-modal environment to enforce the p.s.d. constraint.

SLMOML [5] is the online version of the seminal ITML [13] method. It uses the logdet regularization term, which automatically enforces the p.s.d. constraint at each time step. However, it has a low convergence rate and still requires O(d²) parameters.

In [6], a large-scale local online Distance/Similarity framework is presented. It learns multiple metrics for the task at hand, one metric per class in the dataset. Each metric in this framework consists of a global and a local component learned simultaneously. Sharing a common component among the local metrics spreads discriminative information among them and efficiently reduces overfitting.

OPML [7] is an online DML method that learns the projection matrix L (see eq. (1)) directly, so it does not require imposing the p.s.d. constraint. In practice, L has a rectangular form (L ∈ ℝ^{d×r}, r ≪ d). However, OPML learns a square d × d matrix and obtains a closed-form solution with O(d²) time complexity. It adopts the Frobenius norm regularization term and the popular Hinge loss function. An interesting feature of OPML is its triplet sampling strategy, which constructs triplets from incoming data in an online setting.

OAHU [14] aims to dynamically adapt the complexity of the model and to effectively utilize the input constraints during the learning process. For this purpose, it introduces the Adaptive-Bound Triplet Loss (ABTL) instead of the commonly used Hinge loss. It also uses an over-complete neural network and connects a separate MEL (Metric Embedding Layer) to each hidden layer of the network. The overall loss is a weighted average of the losses of the MELs.

Table 2 summarizes the advantages and limitations of existing online metric learning methods. As seen, none of the studied online Distance/Similarity models is robust against label noise. These methods assume that the input training information is perfect. However, this assumption may fail in practical machine vision applications where the training information is collected from the Internet by crowdsourcing or similar techniques. Although some robust DML methods such as [7, 8, 15, 16] have been presented, they focus on batch settings. Among them, only the Bayesian approaches [7, 16] can be extended to online settings. However, although Bayesian learning helps to avoid overfitting on small datasets or datasets with noisy features, it is less effective against the more complicated problem of label noise.

Table 2 Advantages and limitations of existing online metric learning methods

Many metric learning algorithms, such as [8, 11, 17, 18], generate triplets from the training data using the following batch procedure. Each data point xi is considered similar to its k nearest neighbors with the same label (called the target neighbors of xi). Suppose xj is a target neighbor of xi. The imposters of xi are the data points xl from a different class (i.e., yi ≠ yl) that violate the following condition:

$$ d\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)+ margin<d\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_l\right) $$

where d is a distance measure such as Euclidean.

The data point xi is set dissimilar to each of its imposters. The triplets are then formed by the natural join of the similar and dissimilar pairs. Figure 1 illustrates the concepts of target neighbors and imposters.
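
For concreteness, the following NumPy sketch implements this batch procedure under the definitions above; the function name, the use of the Euclidean distance, and the default k and margin are illustrative choices rather than details taken from the cited methods.

```python
import numpy as np

def batch_triplets(X, y, k=3, margin=1.0):
    """Form (anchor, target, imposter) index triplets from a labeled batch.

    Each x_i is similar to its k nearest same-class neighbors (targets); any
    point of a different class violating d(x_i, x_j) + margin < d(x_i, x_l)
    is an imposter of x_i with respect to target x_j.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
    triplets = []
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        targets = same[np.argsort(D[i, same])[:k]]              # k target neighbors of x_i
        for j in targets:
            # imposters: different-class points that violate the margin condition
            imposters = np.where((y != y[i]) & (D[i] < D[i, j] + margin))[0]
            triplets.extend((i, j, l) for l in imposters)
    return triplets
```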

Fig. 1 Illustration of target neighbors and imposters of xi [17]

Generating triplets using this procedure is both time and space consuming and is not feasible for online tasks. Although OPML presents an online triplet construction algorithm that is very efficient in terms of computational cost, it does not consider the distribution and structure of the data. Therefore, its performance is lower than that of the batch algorithm.

3 Proposed methods

In this section, we first derive a general form of the objective functions of existing online Similarity/Distance methods. Then, we propose a robust variant based on the Rescaled Hinge loss. Finally, we develop algorithms that efficiently solve the resulting problem using Half-Quadratic (HQ) optimization.

3.1 General Form of Objective Functions in Online Similarity/Distance Methods

As observed, many Distance/Similarity algorithms are based on the margin-based Hinge loss function (lhinge). Let us define the variable zt as follows:

$$ {z}_t=\left\{\begin{array}{ll}{S}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2-{S}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2, & \text{for similarity-based methods}\kern2em (6)\\ {d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2-{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2, & \text{for Mahalanobis-based methods}\kern2em (7)\end{array}\right. $$

The Hinge loss can then be written as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1-{z}_t\right\} $$
(8)

Figure 2 shows this loss function. As seen, the loss grows linearly and without bound for z ≤ 1. The unboundedness of the Hinge loss allows noisy labeled data and outliers to have a large effect on the training process, which results in a poorly performing learned Distance/Similarity measure.

Fig. 2 The margin-based Hinge loss function. The loss grows linearly and without bound for z ≤ 1

Most existing Distance/Similarity learning methods can be formulated as follows:

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}=\arg \kern0.10em \underset{\boldsymbol{M}}{\min}\kern0.10em \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{Cl}_{hinge}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(9)

Note that the constraint M ≽ 0 is not adopted in all methods, so we enclose it in brackets. Many existing methods can be derived from this generic optimization problem. For example, if we set \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), omit the M ≽ 0 constraint, and define zt according to (6), we obtain the OASIS [2] and OKS [3] optimization problems. If we instead define zt according to (7), the optimization problem (9) reduces to OPML [7]. Similarly, by setting \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), defining zt according to (7), and enforcing M ≽ 0, we reach the optimization problem in [4]. Finally, if we set \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)={D}_{ld}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\mathrm{trace}\left(\boldsymbol{M}{\boldsymbol{M}}_t^{-1}\right)-\log \det \left(\boldsymbol{M}{\boldsymbol{M}}_t^{-1}\right)-d \) and drop the explicit M ≽ 0 constraint, we obtain the optimization problem of [5].

One approach to alleviating the effect of noisy labeled data in PA-based problems (such as eq. (9)) is to select a small value for the hyper-parameter C. However, this also lowers the adaptive learning rate. Instead, we propose to replace the Hinge loss with the robust Rescaled Hinge loss.

3.2 Robust Variant of the General Objective Function

The Rescaled Hinge loss is defined as:

$$ {l}_{rhinge}(z)=\beta \left[1-\exp \left(-\eta {l}_{hinge}(z)\right)\right] $$
(10)

Figure 3 plots the lrhinge(z) loss function for different values of η. Here, η is a rescaling parameter and β = 1/(1 − exp(−η)) is a normalizing constant that ensures lrhinge(0) = 1. As seen, this loss function is more robust than the Hinge loss against outliers and data contaminated with label noise, and the degree of robustness can be adjusted through η. Moreover, the Hinge loss can be regarded as a limiting case of the Rescaled Hinge: lrhinge(z) approaches lhinge(z) as η → 0.
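
The following short Python snippet, provided only as an illustration of eq. (10), contrasts the unbounded Hinge loss with the bounded Rescaled Hinge loss; the function names are ours.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def rescaled_hinge(z, eta=1.0):
    """Rescaled Hinge loss of eq. (10): bounded above by beta, so a single
    noisy triplet cannot contribute an arbitrarily large loss."""
    beta = 1.0 / (1.0 - np.exp(-eta))        # ensures rescaled_hinge(0) == 1
    return beta * (1.0 - np.exp(-eta * hinge(z)))

z = np.linspace(-5, 2, 8)
print(hinge(z))                  # grows without bound as z decreases
print(rescaled_hinge(z, eta=1))  # saturates at beta ~= 1.58
```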

Fig. 3 The robust Rescaled Hinge loss function vs. z for different values of η

By replacing the Hinge loss function with the Rescaled Hinge loss in eq. (9), we obtain the following optimization problem for online robust Distance/Similarity learning.

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{\boldsymbol{t}+1}=\arg\ \underset{\boldsymbol{M}}{\min }\ \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{Cl}_{\boldsymbol{rhinge}}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(11)

In the next subsections, we derive two algorithms that efficiently solve the above optimization problem in an online fashion.

3.3 The proposed Robust methods

Since the Rescaled Hinge loss is not convex, we need an efficient algorithm to solve the optimization problem (11). The proposed algorithms are based on HQ (Half-Quadratic) optimization, an efficient alternating approach for non-convex problems. The main idea of HQ is to add an auxiliary variable v to the problem using conjugate function theory [19], such that the new optimization problem becomes quadratic in the main variable (with the same optimal solution as the original non-convex problem).

Since lrhinge(z) = β[1 − exp(−ηlhinge(z))], we can obtain the following problem which is equivalent to (11).

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{\boldsymbol{t}+1}=\underset{\boldsymbol{M}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(12)

According to the definition of the conjugate function, we have (refer to Appendix A of [9]):

$$ \exp \left(-\eta {l}_{hinge}(z)\right)=\underset{v<0}{\sup}\left(\eta {l}_{hinge}(z)v-g(v)\right) $$
(13)

where g(v) =  − v log(−v) + v,   (v < 0). By substituting eq. (13) in (12), we obtain

$$ {\displaystyle \begin{array}{l}f\left(\boldsymbol{M}\right)=-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\\ {}\kern3em =-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \underset{v_t<0}{\sup}\left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\kern3em =\underset{v_t<0}{\sup}\left(-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\right)\end{array}} $$
(14)

The third equality in (14) holds since reg(M, Mt) is constant with respect to v. Using (14), we can rewrite (12) as:

$$ {\displaystyle \begin{array}{c}\left({\boldsymbol{M}}_{\boldsymbol{t}+1},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(15)

To solve the above problem, we use an alternating optimization approach. First, given M, we optimize (15) over vt; then, given vt, we optimize it over M. Suppose M(s) is given (the superscript s indicates the iteration number); then (15) is equivalent to:

$$ {v}_t^{(s)}=\underset{\ {v}_t}{\arg\ \max}\kern0.50em \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(16)

The above equation has a closed-form solution obtained by setting its derivative with respect to vt equal to zero.

$$ {v}_t^{(s)}=-\exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\kern0.5em $$
(17)

After obtaining \( {v}_t^{(s)} \), we optimize eq. (15) with respect to M as follows:

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \eta {v}_t\ {l}_{hinge}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(18)

The above problem is equivalent to:

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{C}_t\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em \left[\boldsymbol{M}\succcurlyeq 0\right],\kern0.75em {l}_{hinge}\left({z}_t\right)\le \xi, \kern1.5em \xi \ge 0\end{array}} $$
(19)

where Ct = Cβη(−vt). The robustness of the optimization problem (19) can be explained through the penalty factor Ct. Suppose the current triplet Rt contains noisy labeled data; then the hinge function lhinge(zt) returns a large loss for Rt, and Ct = Cβη(−vt) = Cβη exp(−ηlhinge(zt)) approaches zero. Therefore, Rt has little effect on the learning process.
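
A minimal Python sketch of this adaptive weight, computed exactly as written above, is given below; the function name and the default hyper-parameter values are illustrative.

```python
import numpy as np

def triplet_weight(z_t, C=1.0, eta=1.0):
    """Adaptive weight C_t = C * beta * eta * exp(-eta * l_hinge(z_t)), cf. eqs. (17) and (19).
    Large hinge losses (typical of noisy triplets) drive the weight toward zero."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    loss = max(0.0, 1.0 - z_t)
    v_t = -np.exp(-eta * loss)          # closed-form HQ auxiliary variable, eq. (17)
    return C * beta * eta * (-v_t)

print(triplet_weight(z_t=0.9))   # small loss  -> weight close to C*beta*eta
print(triplet_weight(z_t=-8.0))  # large loss  -> weight near zero
```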

The obtained optimization problem, unlike existing models, assigns an adaptive weight Ct to each incoming triplet. By adjusting reg(M, Mt), the p.s.d. constraint, and zt, we can obtain a family of robust Distance/Similarity learning methods. For instance, we develop two algorithms named Robust-OASIS and Robust-ODML, which can be considered robust variants of OASIS [2] and ODML [4], respectively.

3.3.1 Robust-OASIS

The robust similarity-based algorithm can be derived from the general optimization problem (15) by the following settings:

\( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), drop the M ≽ 0 constraint, and define zt according to (6).

Then, the following optimization problem is achieved:

$$ \left({\boldsymbol{M}}_{\boldsymbol{t}+\mathbf{1}},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max }-\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(20)

The solution of the above problem is obtained by iteratively computing vt from eq. (17) and then optimizing M by solving the following optimization problem.

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+{C}_t\xi \\ {}\mathrm{subject}\kern0.5em \mathrm{to}\kern0.5em l\left({p}_t,{p}_t^{+},{p}_t^{-}\right)=1-{S}_{\boldsymbol{M}}\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)+{\boldsymbol{S}}_{\boldsymbol{M}}\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)\le \xi, \kern1.5em \xi \ge 0\end{array}} $$
(21)

The problem (21) has a similar solution to that obtained in [2].

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}={\boldsymbol{M}}_t+\tau {\boldsymbol{A}}_t\\ {}\mathrm{where}\kern0.5em \tau =\min \left({C}_t,\frac{l\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)}{{\left\Vert {\boldsymbol{A}}_{\boldsymbol{t}}\right\Vert}_F^2}\right)\kern0.5em \mathrm{and}\kern0.5em {\boldsymbol{A}}_t={\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\top}\end{array}} $$
(22)

The main difference is that the learning rate τ is now bounded by the adaptive triplet weight Ct instead of the constant C used in OASIS. Algorithm 1 summarizes the steps of Robust-OASIS.

Algorithm 1 The Robust-OASIS algorithm
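
Since the algorithm listing is not reproduced here, the following NumPy sketch outlines one Robust-OASIS time step following eqs. (17), (21), and (22); the function signature, the default hyper-parameters, and the placement of the passive check are our assumptions.

```python
import numpy as np

def robust_oasis_step(M_t, p, p_pos, p_neg, C=0.1, eta=1.0, max_hq_iter=2):
    """One Robust-OASIS time step for the triplet (p, p_pos, p_neg)."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    A_t = np.outer(p, p_pos - p_neg)                              # A_t of eq. (22)
    loss_at_Mt = max(0.0, 1.0 - p @ M_t @ p_pos + p @ M_t @ p_neg)
    if loss_at_Mt == 0.0:
        return M_t                                                # passive step: constraint satisfied
    M = M_t
    for _ in range(max_hq_iter):
        # HQ step 1: adaptive weight from the auxiliary variable at the current iterate (eq. (17))
        loss = max(0.0, 1.0 - p @ M @ p_pos + p @ M @ p_neg)
        C_t = C * beta * eta * np.exp(-eta * loss)
        # HQ step 2: PA update anchored at M_t with aggressiveness bounded by C_t (eqs. (21)-(22))
        tau = min(C_t, loss_at_Mt / np.linalg.norm(A_t, 'fro') ** 2)
        M = M_t + tau * A_t
    return M
```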

3.3.2 Robust-ODML

The robust Mahalanobis distance learning algorithm can be derived from the general optimization problem (15) by the following settings:

\( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), enforce the M ≽ 0 constraint, and define zt according to (7).

We then obtain the following optimization problem:

$$ {\displaystyle \begin{array}{c}\left({\boldsymbol{M}}_{\boldsymbol{t}+1},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max}\kern0.5em -\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+ C\beta \left(\eta {l}_{hinge}\kern0.35em \left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(23)

In a similar way to Robust-OASIS, we obtain the solution by iteratively computing vt from eq. (17) and then optimizing M by solving the following optimization problem.

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+{C}_t\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em l\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)=1+{d}_{\boldsymbol{M}}^2\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)-{d}_{\boldsymbol{M}}^2\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)\le \xi, \kern1.5em \xi \ge 0,\kern1em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(24)

The solution of the above problem is similar to that of [4].

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}={\boldsymbol{M}}_t+\tau {\boldsymbol{A}}_t\\ {}\mathrm{where}\kern0.5em \tau =\min \left({C}_t,\frac{l\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)}{{\left\Vert {\boldsymbol{A}}_{\boldsymbol{t}}\right\Vert}_F^2}\right)\kern0.5em \mathrm{and}\kern0.5em \\ {}{\boldsymbol{A}}_t=\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\top }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\top}\end{array}} $$
(25)

Algorithm 2 summarizes the steps of Robust-ODML.

Algorithm 2 The Robust-ODML algorithm

To enforce the p.s.d. constraint, the naive approach is to perform a full eigenvalue decomposition of the matrix M and set its negative eigenvalues to zero. This approach requires O(d³) operations, so it is infeasible for high-dimensional DML tasks. Although some improved methods are available [6, 8, 12], we address this problem by developing low-rank versions of the proposed algorithms in the following subsections.
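
For reference, this naive projection onto the p.s.d. cone can be sketched as follows in NumPy; the explicit symmetrization step and the toy example are our additions.

```python
import numpy as np

def project_psd(M):
    """Naive O(d^3) projection onto the p.s.d. cone: eigendecompose the
    symmetric matrix M and clip its negative eigenvalues to zero."""
    M_sym = 0.5 * (M + M.T)                 # symmetrize to guard against round-off
    w, V = np.linalg.eigh(M_sym)
    return (V * np.clip(w, 0.0, None)) @ V.T

M = np.array([[2.0, 0.0], [0.0, -1.0]])
print(project_psd(M))                       # [[2, 0], [0, 0]]
```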

3.4 Low-rank Robust Distance/Similarity learning methods

Instead of a full Mahalanobis matrix M ∈ ℝ^{d×d}, the proposed low-rank methods learn a rectangular projection matrix L ∈ ℝ^{d×r}, where M = LL^⊤ and r is the rank of M. We follow this approach to obtain the low-rank variants of the proposed methods because 1) it automatically enforces the p.s.d. constraint, and 2) in many real applications the data lie on a latent subspace with dimensionality r ≪ d, so this approach requires fewer parameters. An important issue is how to adjust the hyperparameter r. While sophisticated methods such as Bayesian variational inference [16] or low-rank approximation [20] can adjust the value of r automatically, here we simply use cross-validation.

The optimization problem for low-rank online Distance/Similarity learning is formulated as:

$$ {\boldsymbol{L}}_{\boldsymbol{t}+\mathbf{1}}=\underset{\boldsymbol{L}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)+ C\beta \left(\eta {l}_{hinge}\ \left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(26)

If we rewrite both the bilinear similarity and the Mahalanobis distance as functions of L as follows:

$$ {S}_{\boldsymbol{L}}{\left(\boldsymbol{p},\boldsymbol{q}\right)}^2={\boldsymbol{p}}^{\top}\boldsymbol{Mq}={\boldsymbol{p}}^{\top }{\boldsymbol{L}\boldsymbol{L}}^{\top}\boldsymbol{q}={\left({\boldsymbol{L}}^{\boldsymbol{T}}\boldsymbol{p}\right)}^{\top}\left({\boldsymbol{L}}^{\top}\boldsymbol{q}\right) $$
(27)
$$ {d}_{\boldsymbol{L}}{\left(\boldsymbol{p},\boldsymbol{q}\right)}^2={\left(\boldsymbol{p}-\boldsymbol{q}\right)}^{\top}\boldsymbol{M}\left(\boldsymbol{p}-\boldsymbol{q}\right)={\left(\boldsymbol{p}-\boldsymbol{q}\right)}^{\top }{\boldsymbol{L}\boldsymbol{L}}^{\top}\left(\boldsymbol{p}-\boldsymbol{q}\right)={\left\Vert {\boldsymbol{L}}^{\top}\boldsymbol{p}-{\boldsymbol{L}}^{\top}\boldsymbol{q}\right\Vert}_2^2, $$
(28)

Then, bilinear similarity learning is equivalent to finding a linear projection L and applying the dot product to the inputs in the projected space. Similarly, Mahalanobis distance learning corresponds to computing the Euclidean distance after transforming the inputs by L.

The zt variable can be expressed in terms of SL and dL as:

$$ {z}_t=\left\{\begin{array}{ll}{S}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2-{S}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2, & \text{for similarity-based methods}\kern1em (29)\\ {d}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2-{d}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2, & \text{for Mahalanobis-based methods}\kern1em (30)\end{array}\right. $$

We can now derive the proposed low-rank robust similarity learning algorithm, named Robust-LOSL, from the generic optimization problem (26) with the following settings:

\( \mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2 \), and define zt according to (29).

The obtained optimization problem can be solved by iteratively computing vt from eq. (17) and then optimizing L by solving the following optimization problem:

$$ {\boldsymbol{L}}_{t+1}=\underset{\boldsymbol{L}}{\arg \kern0.15em \min}\kern0.50em \frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2+{C}_t{l}_{hinge}\left({z}_t\right) $$
(31)

The above unconstrained optimization problem is non-convex. However, we can solve it efficiently by optimizing a simple linear neural network model parameterized by L as illustrated in Fig. 4.

Fig. 4 The proposed neural network model for low-rank robust online Distance/Similarity learning

The sub-gradient of the loss function with respect to L can be computed from the following equation:

$$ {\displaystyle \begin{array}{c}\frac{\partial {l}_t}{\partial \boldsymbol{L}}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)+{C}_t\left[{\boldsymbol{p}}_t{{\boldsymbol{p}}_t^{-}}^{\intercal }+{\boldsymbol{p}}_t^{-}{\boldsymbol{p}}_t^{\intercal }-{\boldsymbol{p}}_t{{\boldsymbol{p}}_t^{+}}^{\intercal }-{\boldsymbol{p}}_t^{+}{\boldsymbol{p}}_t^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-{C}_t\left[{\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\intercal }+\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right){\boldsymbol{p}}_t^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-{C}_t\left[{\boldsymbol{A}}_t+{\boldsymbol{A}}_t^{\intercal}\right]\boldsymbol{L},\kern1em \text{where}\ {\boldsymbol{A}}_t={\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\intercal}\end{array}} $$
(32)

Thus, we can train the network using backpropagation or more sophisticated optimizers such as Adam. The steps of Robust-LOSL are summarized in Algorithm 3.

Algorithm 3 The Robust-LOSL algorithm
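
As with Algorithm 1, the listing is not reproduced here; the sketch below performs one Robust-LOSL step with a plain gradient update on L using the sub-gradient of eq. (32). The learning rate, the stopping check, and the use of a fixed step size instead of Adam are our assumptions.

```python
import numpy as np

def robust_losl_step(L_t, p, p_pos, p_neg, C=0.1, eta=1.0, lr=0.05, max_hq_iter=2):
    """One Robust-LOSL time step: HQ weight (eq. (17)) plus gradient steps on eq. (31)."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    A_t = np.outer(p, p_pos - p_neg)
    L = L_t.copy()
    for _ in range(max_hq_iter):
        z = (L.T @ p) @ (L.T @ p_pos) - (L.T @ p) @ (L.T @ p_neg)   # z_t of eq. (29)
        loss = max(0.0, 1.0 - z)
        if loss == 0.0:
            break                                                   # margin satisfied: keep L
        C_t = C * beta * eta * np.exp(-eta * loss)                  # adaptive triplet weight
        grad = (L - L_t) - C_t * (A_t + A_t.T) @ L                  # sub-gradient, eq. (32)
        L = L - lr * grad
    return L
```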

Similarly, we can derive the proposed low-rank robust distance learning algorithm, named Robust-LODML, from the generic optimization problem (26) with the following settings:

\( \mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2 \), define zt according to (30).

We solve the obtained problem iteratively by computing vt from eq. (17) and then updating L by optimizing the neural network model presented in Fig. 4. The sub-gradient of the loss function with respect to L is:

$$ {\displaystyle \begin{array}{c}\frac{\partial {l}_t}{\partial \boldsymbol{L}}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)+2{C}_t\left[\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\intercal }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-2{C}_t{\boldsymbol{A}}_t\boldsymbol{L},\kern1em \text{where}\ {\boldsymbol{A}}_t=\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\intercal }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\intercal}\end{array}} $$
(33)

Algorithm 4 summarizes the steps of Robust-LODML. We can easily replace the linear module in the proposed low-rank model with a nonlinear deep neural network module; thus, extending our methods to online deep Distance/Similarity learning is straightforward. The experimental results in the next section also confirm that Robust-LODML reduces the computational cost significantly while preserving the predictive performance of the learned metric.

Algorithm 4 The Robust-LODML algorithm

3.5 Convergence Analysis

This subsection establishes the convergence of our methods with an analysis similar to that in [9]. According to (14),

$$ {\displaystyle \begin{array}{l}f\left(\boldsymbol{M},\boldsymbol{v}\right)=-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\kern4em =-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\le C\beta \end{array}} $$

The inequality holds since reg(M, Mt) ≥ 0 and exp(−ηlhinge(zt)) ≤ 1. Thus, our objective function f(M, v) is upper bounded. Let \( f\left({\boldsymbol{M}}_t^{(s)},{\boldsymbol{v}}_t^{(s)}\right) \) denote the objective value in the s-th iteration of the HQ loop. According to (16) and (19), we have

$$ f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{(s)}\right)\le f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{\left(s+1\right)}\right)\le f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}+\mathbf{1}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{\left(s+1\right)}\right) $$

This means that the sequence \( \left\{f\left({\boldsymbol{M}}_t^{(s)},{\boldsymbol{v}}_t^{(s)}\right),s=1,2,\dots, MaxHQIter\ \right\} \) generated by our algorithms is nondecreasing and upper bounded, so it converges. Consequently, by considering the convergence property of gradient descent methods [21], the convergence of the proposed algorithms is established.

3.6 Run Time Analysis

As seen, the proposed robust online Distance/Similarity learning model is general and can easily be applied to existing online Distance/Similarity algorithms. Let A be an online Distance/Similarity algorithm with time complexity TA. By applying our method to A, besides optimizing the Distance/Similarity measure, we need to compute the weight of the incoming triplet (Ct) using eq. (17). Since Ct only requires evaluating lhinge(zt), which is also needed for updating the metric, it does not incur additional cost, and the overall time complexity of the robust method is O(MaxHQIter × TA). The experimental results confirm that the alternating loop converges quickly, and the best results are obtained with MaxHQIter ≤ 3 in all experiments. Therefore, the obtained robust method has essentially the same time complexity as the corresponding algorithm A.
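
A schematic of this wrapping might look as follows; it is a sketch of the scheme described above, not a reproduction of any listed algorithm, and `base_update` and `hinge_loss` are placeholders standing in for the underlying algorithm A.

```python
import numpy as np

def robust_update(M, triplet, base_update, hinge_loss, C=0.1, eta=1.0, max_hq_iter=3):
    """Wrap an existing PA-style metric update with the HQ loop. The only extra
    work per iteration is the weight C_t, which reuses the hinge loss that the
    base update already needs."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    for _ in range(max_hq_iter):
        loss = hinge_loss(M, triplet)                  # also needed by the base update
        C_t = C * beta * eta * np.exp(-eta * loss)     # eq. (17): C_t = C*beta*eta*(-v_t)
        M = base_update(M, triplet, C_t)               # PA step with adaptive aggressiveness
    return M
```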

3.7 Online Triplet Construction Algorithm

Generating triplets using available batch algorithms is both time and space consuming. Also, the one-pass triplet construction strategy adopted in OPML performs poorly, especially in noisy environments. To this end, we propose an online triplet construction algorithm named OCTG, which is not only very efficient but also competitive with the available batch methods. By utilizing the distribution and clusters of the input data, the proposed algorithm can effectively detect outliers and noisy labeled data; therefore, its performance surpasses that of existing methods in noisy environments.

Suppose {Vi | i = 1, 2, …, K} is the set of cluster centers, initialized from a sample of data at the beginning of the online algorithm. Here, we use the k-means algorithm to obtain c cluster centers per class in the dataset. At time step t, OCTG receives the incoming data point (xt, yt) and finds its closest cluster center Vt with the same class label. Then, it considers any cluster center Vi from a different class (i.e., yi ≠ yt) that violates the following condition as an imposter (see Fig. 5):

$$ d\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_t\right)+ margin<d\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_i\ \right) $$
Fig. 5 Illustration of imposters of the data point xt

The triplet set constructed at time step t is formed as:

$$ {T}_t=\left\{\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_t,{\boldsymbol{V}}_i\right)\;\middle|\;{\boldsymbol{V}}_i\ \text{is an imposter of}\ {\boldsymbol{x}}_t\right\} $$

As seen, the proposed methods assign a weight Ct to each incoming triplet. We assign to xt a weight wt equal to the minimum of the weights of the triplets generated at time t. A small value of wt indicates that xt is a potential outlier or a noisy labeled instance. The weights and input data are then used to update the cluster centers using any existing online clustering method.
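
A minimal NumPy sketch of one OCTG step, assuming the cluster centers and their class labels are kept in arrays, is given below; the handling of the center updates and of the instance weights is omitted, and the function name is ours.

```python
import numpy as np

def octg_step(x_t, y_t, centers, center_labels, margin=1.0):
    """One OCTG step: pair x_t with its nearest same-class cluster center and
    with every different-class center that violates the margin condition."""
    d = np.linalg.norm(centers - x_t, axis=1)
    same = np.where(center_labels == y_t)[0]
    V_t = same[np.argmin(d[same])]                       # nearest same-class center
    # imposters: different-class centers with d(x_t, V_i) <= d(x_t, V_t) + margin
    imposters = [i for i in np.where(center_labels != y_t)[0]
                 if d[i] < d[V_t] + margin]
    return [(x_t, centers[V_t], centers[i]) for i in imposters]
```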

The obtained weights can be used to enhance the performance of any metric-based algorithm, such as kNN or CBIR (Content-Based Image Retrieval), in noisy environments. For example, in the experiments we use the following version of kNN, named Robust-kNN, instead of the standard kNN to classify objects.

The Robust-kNN algorithm
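
The Robust-kNN listing is not reproduced here; the sketch below shows one plausible reading of it, in which each neighbor's vote is scaled by its instance weight wt from Section 3.7. The weighting rule and the projection by L are our assumptions, not details taken from the paper.

```python
import numpy as np

def robust_knn_predict(x, X_train, y_train, w_train, L, k=3):
    """Classify x with a kNN vote in the projected space, where each neighbor's
    vote is scaled by its instance weight w (suspected noisy points count less)."""
    Z = X_train @ L                                    # project training data
    dist = np.linalg.norm(Z - x @ L, axis=1)
    nn = np.argsort(dist)[:k]
    votes = {}
    for i in nn:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w_train[i]
    return max(votes, key=votes.get)
```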

Figure 6 depicts the system flow of the proposed learning/test schemes.

Fig. 6 The system flow of the proposed learning/test schemes

4 Experimental Results

This section presents the experiments performed to evaluate the proposed methods in noisy environments. First, we study the effect of label noise on the generated triplets and discuss how these noisy triplets affect the performance of online DML methods. Subsequently, we evaluate the proposed methods on real datasets at different levels of label noise and compare the results with those of peer methods.

Bilinear (dot product) similarity learning is equivalent to Mahalanobis distance learning when each instance in the input triplets has unit norm (i.e., ‖p‖₂ = 1). Thus, the experiments focus on Mahalanobis distance learning.

4.1 Effect of Label Noise on the Generated Triplets

As depicted in Fig. 7, we can distinguish between three different types of noisy triplets: anchor, positive, and negative noisy triplets.

Fig. 7 Three different types of noisy triplets in the form (xi, xj, xl): (a) anchor noisy triplet, where xi is contaminated with label noise; (b) positive noisy triplet, where xj has label noise; (c) negative noisy triplet, where xl has a wrong label

To study the effects of different types of noisy triplets, we apply 10% label noise to the Wine dataset. The noisy dataset is visualized using the T-SNE algorithm [22] in Fig. 8.

Fig. 8 t-SNE visualization of the Wine dataset after applying 10% label noise

The statistics of the generated triplets using both batch [17] and OCTG methods are summarized in Table 3.

Table 3 Statistics of generated triplets in the Wine dataset contaminated with 10% label noise

As the results in Table 3 indicate, with only 10% label noise, 68% and 46% of the triplets generated by the batch and OPML triplet construction methods, respectively, are contaminated. In contrast, OCTG constructs only 25% contaminated triplets (all of the anchor noisy type). This is because OCTG selects the positive and negative points from cluster centers rather than from data instances that may have been contaminated by label noise. The noisy triplets generated by OCTG also incur large losses compared with the normal ones (1.67 vs. 0.39), which can be explained by the fact that a noisy labeled example is often far from its own cluster center while being close to a center of another class. Hence, the proposed robust methods assign very small weights (Ct = Cβη exp(−ηlhinge(zt))) to them in the learning process, so they have a negligible effect on the learned metric.

To analyze the effect of the different types of triplet noise in a typical DML task, we run ODML [4] with the following settings on the triplets generated by the batch method.

ODML: The ODML algorithm.

Ideal ODML: The ideal algorithm which knows the noisy triplets in advance and so ignores them in the training process.

Anchor Ideal ODML: The ideal algorithm that only knows the anchor noisy triplets in advance.

Pos Ideal ODML: The ideal algorithm that only knows the positive noisy triplets in advance.

Neg Ideal ODML: The ideal algorithm that only knows the negative noisy triplets in advance.

In this experiment, we divide the dataset into train/test sets with a 70/30 ratio and run the above algorithms ten times. Figure 9 depicts the mean of the results obtained by the various algorithms.

Fig. 9 The kNN (k = 3) accuracy of the learned metric of various algorithms on the Wine dataset with 10% label noise

For small values of C, the results indicate that the metric learned by ODML shows no meaningful difference from the Euclidean metric. For large values of C, ODML performs worse than Euclidean, and its accuracy degrades substantially in this noisy environment. Moreover, among the ideal methods (which cannot be implemented in practice), Anchor Ideal ODML performs the same as Ideal ODML, while the others (Pos Ideal ODML, Neg Ideal ODML) are ineffective. Thus, anchor noisy triplets are the main cause of the low performance in this experiment.

We repeat the experiment by running Robust-LODML on the triplets generated by our mechanism. The mean accuracy of kNN-Robust-LODML (k = 3, η = 3) and the weights assigned to the instances by Robust-LODML are depicted in Figs. 10 and 11, respectively. As the results show, the proposed method is robust against label noise, and its performance surpasses the Euclidean metric even for large values of C. Moreover, Robust-LODML effectively identifies the contaminated instances and considerably reduces their weights (Ct) during training.

Fig. 10 The kNN accuracy of the metric learned by the Robust-LODML algorithm (η = 3) on the Wine dataset with 10% label noise

Fig. 11 t-SNE visualization of the Wine dataset with 10% label noise, where data points are displayed (a) with equal sizes and (b) with sizes proportional to their weights

As shown in Fig. 3, the parameter η controls the robustness of the loss function against outliers and data with noisy labels. To study its effect in a real experiment, we apply 20% label noise to the Wine dataset and evaluate Robust-LODML in a 5-fold cross-validation setting. Figure 12 depicts the mean accuracy of kNN-Robust-LODML (k = 3). As the results show, low η values considerably degrade the performance of Robust-LODML, whereas a properly chosen η substantially increases the performance of our method in the noisy environment.

Fig. 12 Mean accuracy of kNN-Robust-LODML (k = 3) vs. η on the Wine dataset with 20% label noise

These results were obtained on a single dataset. In the next subsections, we evaluate the proposed methods on a variety of datasets at different label noise levels and compare the results with state-of-the-art methods.

4.2 Experimental Setup

Table 4 summarizes the statistics of the datasets evaluated in the experiments. All datasets except Letters are normalized so that each attribute has zero mean and unit standard deviation. Moreover, the dimension of the images in Extended Yale Faces has been reduced to 100 by applying PCA to alleviate feature noise effects. The parameter d in Table 4 denotes the input dimension after feature reduction.

Table 4 Statistics and explanations of evaluated datasets

In the experiments, triplet side information is generated using OCTG for the proposed methods whereas the one-pass triplet construction [7] is adopted for the other methods.

The results are obtained by k-fold cross-validation (k = 5 for Letters and Extended Yale Faces, and k = 10 for the other datasets). The results are compared with the peer distance-based methods ODML [4], LPA-ODML [6], and OPML [7].

The hyperparameters of the competing methods are adjusted by k-fold cross-validation as follows. The parameter C in ODML and the proposed methods is selected from (10⁻⁶, 30), η in the proposed methods is chosen from (0.01, 5), and λ in OPML is selected from (10⁻⁶, 0.05). We evaluate the performance of the learned metrics with a kNN classifier (k = 3).

4.3 Results and Analysis

Table 5 presents the classification accuracy of kNN using the metrics learned by the competing methods. Here, the parameter nl denotes the label noise level (in percent). Figure 13 depicts the mean 5-fold cross-validation accuracy of the competing methods versus nl (ranging from 0% to 20%). To make the comparison meaningful, a statistical significance test with p-value = 5% was performed on the results. In Table 5, our results are marked with * when the differences from the other methods are statistically significant. Boxplots of some statistically different results are depicted in Fig. 14.

Table 5 The classification accuracy of the kNN using the learned metric of the competing methods

As the results in Table 5 and Fig. 13 indicate, the proposed robust methods (i.e., Robust-ODML and Robust-LODML) significantly outperform the other DML methods in the presence of label noise. Moreover, their performance declines more slowly than that of the other methods as the noise level increases. This confirms our claim that using the robust loss function and robust sampling preserves the discriminative power of the learned metric in a noisy environment.

Fig. 13 Comparison of the classification accuracy of RDML with other DML methods versus label noise

Fig. 14 Boxplots of some statistically different results (p-value = 5%)

Fig. 15 Four images from the COVID-19 dataset. First row: normal cases; second row: COVID-19 patients

Fig. 16 2×Sensitivity + Precision and G-mean of the competing methods on the COVID-19 dataset

Fig. 17 Mean run time of the evaluated methods in a 5-fold cross-validation setting on the COVID-19 dataset

Besides, the low-rank version of the proposed method (i.e., Robust-LODML) has almost the same accuracy as Robust-ODML. This confirms that in real datasets the data lie on a latent subspace with dimensionality r ≪ d. Thus, learning the projection matrix L ∈ ℝ^{d×r} instead of the full Mahalanobis matrix M yields the same performance while being more efficient in terms of time and space.

In the next subsection, we evaluate our proposed methods in a more challenging dataset for identifying COVID-19 patients from Chest-X-ray images.

4.4 Detecting COVID-19 Patients from Chest X-ray Images

4.4.1 Dataset description

The dataset used in our experiments is publicly available in the Kaggle repository [25]. It contains 219 COVID-19 cases and 1341 normal images; Figure 15 depicts some examples from both classes. As seen, the dataset is imbalanced and too small to train a deep CNN model from scratch.

4.4.2 Experimental setup

To extract features from the images, we use a pretrained ResNet18 [26]. This network was trained on the ImageNet dataset (1.4 million labeled images, 1000 classes). It has 71 layers, and its input layer requires images of size 224-by-224-by-3. We resize the images accordingly and obtain 512 features from the global pooling layer ('pool5') at the end of the model.
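
The 'pool5' layer name and the 71-layer count suggest a MATLAB-style model; a roughly equivalent PyTorch sketch, assuming torchvision ≥ 0.13 for the weights enum and an illustrative image path, is shown below. The output of the global average pooling layer gives the same 512-dimensional descriptor.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.eval()
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("chest_xray.png").convert("RGB")            # path is illustrative
with torch.no_grad():
    feat = feature_extractor(preprocess(img).unsqueeze(0))   # shape (1, 512, 1, 1)
features = feat.flatten(1).numpy()                           # 512-d descriptor
```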

In addition to the online methods, we also compare the proposed methods with the batch method BLMNN [18]. The λ and maxIter hyperparameters of BLMNN are selected from {1, 3, 5, 10, 20} and {1, 3, 5}, respectively, using 5-fold cross-validation.

We use 5-fold cross-validation to obtain the results. The main concern in this task is to limit the number of missed COVID-19 cases. Hence, in addition to accuracy, we use Sensitivity (Recall), Precision, F1 score, and G-mean (geometric mean) to evaluate our work. Here, COVID-19 and Normal are considered the positive and negative classes, respectively. The metrics are defined as follows:

$$ Accuracy=\left( TP+ TN\right)/ All\ Predictions $$
(34)
$$ Sensitivity\ (Recall)= TP/\left( FN+ TP\right) $$
(35)
$$ Precision= TP/\left( TP+ FP\right) $$
(36)
$$ F1- Score=2\ \left( Precision\times Sensitivity\right)/\left( Precision+ Sensitivity\right) $$
(37)
$$ Specificity= TN/\left( TN+ FP\right) $$
(38)
$$ G- mean=\sqrt{Sensitivity\times Specificity} $$
(39)
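
For completeness, these metrics can be computed directly from the entries of a binary confusion matrix, as in the following Python sketch (the function name is ours):

```python
import math

def covid_metrics(tp, fn, fp, tn):
    """Metrics of eqs. (34)-(39) from a binary confusion matrix
    (COVID-19 = positive class, Normal = negative class)."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                      # recall
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    specificity = tn / (tn + fp)
    g_mean      = math.sqrt(sensitivity * specificity)
    return dict(accuracy=accuracy, sensitivity=sensitivity, precision=precision,
                f1=f1, specificity=specificity, g_mean=g_mean)
```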

4.4.3 Results and analysis

Table 6 presents the classification results of kNN using the learned metrics at different levels of label noise. The sensitivity and precision of the competing methods versus the noise level are shown in Fig. 16(a); since sensitivity is more important in this task, it is multiplied by 2. Figure 16(b) presents the G-mean results versus the noise level. A high G-mean indicates that the accuracy on both classes is high and balanced.

Table 6 Classification metrics of kNN using the learned metrics of competing methods on the COVID-19 dataset

As the results indicate, all methods achieve high performance at low label noise levels. However, as the noise level increases, the performance of the competing methods declines more sharply than that of our proposed methods. In particular, although BLMNN (the batch method) has the advantage of processing each data point multiple times, it does not perform well at high noise levels. This can be explained as follows: 1) the batch triplet sampling utilized in BLMNN is vulnerable to label noise, as discussed in subsection 4.1, and 2) while Bayesian learning is effective against feature noise, it is less helpful for the more complicated problem of label noise.

The proposed methods achieve high sensitivity for COVID-19 patients in noisy environments. This is very important since the primary goal of this task is to limit the number of misclassified COVID-19 cases as much as possible. For example, the confusion matrices of the proposed methods at a noise level of 20% are shown in Table 7. As seen, only 1.8 and 1 COVID-19 patients (averaged over 5-fold cross-validation) are misclassified as Normal by the proposed methods. Our methods also obtain good precision (positive predictive value). High precision is crucial since a high number of false positives increases the burden on the healthcare system for additional care and tests such as PCR (Polymerase Chain Reaction). Therefore, based on the results, we conclude that the proposed methods perform well in detecting COVID-19 cases in the presence of label noise. However, the gap between the sensitivity and specificity values indicates that further improvements are possible by adopting balancing techniques for this imbalanced dataset.

Table 7 Mean of confusion matrices of proposed methods obtained by 5-fold cross validation on the COVID-19 dataset (label noise = 20%)

We also studied the mean run time of the competing methods in a 5-fold cross-validation setting. The results are depicted in Fig. 17, and Table 8 summarizes the statistics of the experiment. In the "hyper-parameters" column, we only report the values of the time-related hyper-parameters. The parameter r indicates the number of columns of the projection matrix L ∈ ℝ^{d×r}. Note that OPML can only learn a square projection matrix (r = d = 512 in these experiments), while Robust-LODML can learn a rectangular low-rank matrix; for Robust-LODML, we select r from {128, 256, 512}. The #active column shows the mean number of active triplets, i.e., the number of times the algorithm has to update the metric.

Table 8 Summary of statistics and run-time of the competing methods in a noise free (nl = 0%) and high-level noisy (nl = 20%) settings

The overall execution time of a DML method depends on the efficiency of its triplet sampling mechanism, the time required to update the metric, and its convergence rate. In the noise-free experiment, the average number of triplets generated by the one-pass triplet construction algorithm is 1231 (see Table 8), but only a few of them violate the margin constraint. The mean numbers of active triplets for LPA-ODML, ODML, and OPML are 65.00, 100.60, and 33.20, respectively; thus, OPML achieves a low runtime in this experiment. On the other hand, the OCTG mechanism utilized in our methods generates only 43.40 triplets on average, and the average numbers of active triplets for Robust-ODML and Robust-LODML are 21.00 and 26.80, respectively. As seen, the execution times of both Robust-ODML and Robust-LODML are acceptable in this experiment.

In the high-noise environment (nl = 20%), the convergence rate of the non-robust methods (i.e., LPA-ODML, ODML, and OPML) is low. Therefore, their numbers of active constraints are high, and their execution times exceed those of the robust algorithms. Here, we found that the best hyper-parameter setting for Robust-LODML is r = 128, MaxHQIter = 1. Hence, its number of parameters is a quarter of that of the other methods, and it has only 292.60 active constraints on average. Thus, its run time is considerably smaller than that of the other competing methods.

5 Conclusion and Future work

Existing online Distance/Similarity learning methods are usually formulated with the Hinge loss and are therefore not robust against outliers and noisy labeled data. They also often rely on the assumption that training triplets or pairwise constraints are available in advance, while generating triplets with available batch algorithms is both time and space consuming. To address these challenges, we formulated the online Distance/Similarity learning problem using the robust Rescaled Hinge loss [9] and developed an efficient robust one-pass triplet sampling algorithm that takes the data distribution and its clusters into account.

We further extended our work by providing low-rank variants of the proposed methods that learn a rectangular projection matrix instead of a full Mahalanobis matrix.

We studied the effects of label noise in a DML task and conducted several experiments to measure the performance of the proposed methods at different noise levels. Extensive experimental results show that the proposed methods can effectively detect wrongly labeled data and reduce their influence in DML tasks. Thus, they consistently outperform related online Distance/Similarity learning algorithms in noisy environments.

We intend to extend this work to online deep distance/similarity learning. Other directions for future work are:

  I. Examining the performance of the proposed methods in other applications such as CBIR.

  II. Extending the proposed methods to imbalanced environments.

  III. Enhancing the performance of the proposed online triplet sampling algorithm.