1 Introduction

The performance of many machine learning and data mining algorithms depends on the metric used to compute distances between data points. Generic measures such as the Euclidean distance or Cosine similarity in the input space often fail to discriminate between different classes or clusters of data. Therefore, learning an optimal Distance/Similarity function from training information has been actively studied over the last decade.

Distance Metric Learning (DML) methods aim to bring semantically similar data items together while keeping dissimilar ones far apart. One major challenge for DML algorithms is scalability to both the size and the dimension of the input data [1]. To process the massive volumes of data generated in today's applications, online methods have been proposed.

Many of these algorithms are based on the Passive-Aggressive (PA) approach [2,3,4,5,6,7,8]. The main advantages of PA-based methods are 1) a closed-form solution and 2) an adaptive learning rate leading to a high convergence rate. However, the following challenges still need to be addressed:

  1. These algorithms are based on the Hinge loss and hence are not robust against outliers and noisy labeled data. Nowadays, many modern datasets are collected from the Internet using crowdsourcing or similar techniques, so examples with wrong labels are common and can considerably degrade the performance of existing online DML methods.

  2. Most DML algorithms learn a metric from pair or triplet side information defined as:

$$ {\displaystyle \begin{array}{c}S=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{and}\ {\boldsymbol{x}}_j\ \text{are similar}\right\}\\ {}D=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{and}\ {\boldsymbol{x}}_j\ \text{are dissimilar}\right\}\\ {}T=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_i^{+},{\boldsymbol{x}}_i^{-}\right)\;\middle|\;{\boldsymbol{x}}_i\ \text{should be closer to}\ {\boldsymbol{x}}_i^{+}\ \text{than to}\ {\boldsymbol{x}}_i^{-}\right\}\end{array}} $$

Existing online methods [3,4,5,6,7,8] usually assume that training triplets or pairs are available in advance. However, this assumption does not always hold, and generating constraints with available batch sampling methods is both time and space consuming. Thus, an efficient one-pass sampling algorithm is needed for online tasks.

  3. Another important challenge in online DML applications, particularly in the machine vision domain, is the high dimensionality of the input data. Many existing methods learn a Mahalanobis distance [3, 5, 6, 8] or a bilinear similarity [2, 3] that requires O(d²) parameters (d denotes the data dimension). Therefore, these methods are infeasible in high-dimensional environments.

The main contributions of the paper to overcome these issues are as follows:

  1. We address the first challenge by formulating the online Distance/Similarity learning task using the robust Rescaled Hinge loss [9]. The proposed model is rather general, and we can easily apply it to any existing PA-based method. It significantly improves the robustness of existing methods in the presence of label noise without increasing their computational complexity.

  2. The second challenge is tackled by developing an efficient robust one-pass triplet construction algorithm.

  3. Finally, we overcome the third challenge by developing low-rank versions of the proposed methods that learn a rectangular projection matrix instead of a full Mahalanobis matrix. These approaches not only decrease the computational cost significantly but also retain the predictive performance of the learned metrics. Moreover, the low-rank projection matrix can easily be replaced with a nonlinear deep neural network model, so extending our methods to online deep metric learning is straightforward.

Table 1 summarizes the main notations used throughout the paper. The rest of the paper is organized as follows: Section 2 reviews related work. In Section 3, we present the formulation of the online Distance/Similarity learning problem using the Rescaled Hinge loss as well as the development of the proposed algorithms. Experiments conducted to evaluate the proposed methods are discussed in Section 4. Finally, Section 5 concludes with remarks and recommendations for future work.

Table 1 Summary of the main notations

2 Related work

DML is a well-studied problem that has attracted much interest over the last decade. We refer interested readers to the surveys [1, 10] for a complete review of existing work. In this section, we focus only on related online Distance/Similarity learning algorithms. Most existing online learning methods learn a Mahalanobis distance [4,5,6,7,8] or a bilinear similarity [2, 3], although some more generic measures [5, 11] have also been presented.

Mahalanobis-based methods learn a matrix M ≽ 0 that parameterizes the squared distance:

$$ {d}_{\boldsymbol{M}}{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)}^2={\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right)}^{\top}\boldsymbol{M}\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right) $$
(1)

Since M ≽ 0, it can be decomposed as M = LL^⊤, where L ∈ ℝ^{d×r} and r = rank(M). Therefore, Mahalanobis distance learning is equivalent to finding a linear transformation L of the input space. In contrast, bilinear similarity-based methods learn a similarity matrix M given by:

$$ {S}_{\boldsymbol{M}}{\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)}^2={{\boldsymbol{x}}_i}^{\top }{\boldsymbol{M}\boldsymbol{x}}_j $$
(2)

The optimization problem of both Mahalanobis and bilinear methods is formulated based on the PA approach as follows:

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}=\arg \underset{\boldsymbol{M}}{\ \min }\ reg\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern1em l\left(\boldsymbol{M},{R}_t\right)\le \xi, \kern1em \xi \ge 0,\kern1em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(3)

where Mt is the current Distance/Similarity matrix at time step t, reg(M, Mt ) is a regularization term, and l(M, Rt) indicates the margin-based Hinge loss function. In distance-based methods, the Hinge loss is defined as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1+{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)}^2-{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)}^2\right\} $$
(4)

whereas it is defined in similarity-based methods as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1-{S}_M{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)}^2+{S}_M{\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)}^2\right\} $$
(5)

OASIS [2] is a popular bilinear similarity learning method that uses the Frobenius norm as a regularization term, i.e., \( reg\left(\boldsymbol{M},{\boldsymbol{M}}_t\ \right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2. \) OASIS drops the p.s.d. (positive semi-definite) constraint for scalability reasons. However, this constraint is very useful for producing a low-rank metric as well as for preventing overfitting.

OKS [3] extends OASIS to the feature space of an RKHS kernel. In addition, [3] presents OMKS, an extension of OKS for multi-modal data.

ODML and OMDML [4] are similar to OKS and OMKS respectively, but they learn a Mahalanobis distance instead of a bilinear similarity. To enforce the p.s.d. constraint, these methods use a full eigenvalue decomposition that requires O(d³) operations at each step. Therefore, they are infeasible for high-dimensional DML tasks. To address this problem, LSMDML [8] utilizes DRP (Dual Random Projection) [12] in an online multi-modal environment to enforce the p.s.d. constraint.

SLMOML [5] is the online version of the seminal ITML [13] method. It uses the logdet regularization term, which automatically enforces the p.s.d. constraint at each time step. However, it has a low convergence rate and still requires O(d²) parameters.

In [6], a large-scale local online Distance/Similarity framework is presented. It learns multiple metrics for the task at hand, one metric per class in the dataset. Each metric in this framework consists of a global and a local component learned simultaneously. Sharing a common component among the local metrics spreads discriminative information among them and efficiently reduces overfitting.

OPML [7] is an online DML method that learns the projection matrix L (see eq. (1)) directly, so it does not require imposing the p.s.d. constraint. In practice, L has a rectangular form (L ∈ ℝ^{d×r}, r ≪ d). However, OPML learns a square d × d matrix and obtains a closed-form solution with O(d²) time complexity. It adopts the Frobenius norm regularization term and the popular Hinge loss function. An interesting feature of OPML is its triplet sampling strategy, which constructs triplets from incoming data in an online setting.

OAHU [14] aims to dynamically adapt the complexity of the model and to effectively utilize the input constraints during the learning process. For this purpose, it introduces the Adaptive-Bound Triplet Loss (ABTL) instead of the commonly used Hinge loss. It also uses an over-complete neural network and connects a separate MEL (Metric Embedding Layer) to each hidden layer of the network. The overall loss is a weighted average of the losses of the MELs.

Table 2 summarizes the advantages and limitations of existing online metric learning methods. As seen, none of the studied online Distance/Similarity models is robust against label noise. These methods assume that the input training information is perfect. However, this assumption may fail in practical machine vision applications where the training information is collected from the Internet by crowdsourcing or similar techniques. Although some robust DML methods such as [7, 8, 15, 16] have been presented, they focus on batch settings. Among them, only the Bayesian approaches [7, 16] can be extended to online settings. However, although Bayesian learning helps to avoid overfitting on small datasets or datasets with noisy features, it is less effective against the more complicated problem of label noise.

Table 2 Advantages and limitations of existing online metric learning methods

Many metric learning algorithms, such as [8, 11, 17, 18], generate triplets from the training data using the following batch procedure. Each data point xi is considered similar to its k nearest neighbors with the same label (called the target neighbors of xi). Suppose xj is a target neighbor of xi. The imposters of xi are the data points xl from a different class (i.e., yi ≠ yl) that violate the following condition:

$$ d\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)+ margin<d\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_l\right) $$

where d is a distance measure such as Euclidean.

The data point xi is set dissimilar to each of its imposters. The triplets are then formed by the natural join of the similar and dissimilar pairs. Figure 1 illustrates the concepts of target neighbors and imposters.
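
For concreteness, the following NumPy sketch implements this batch procedure under the definitions above; the function name, the use of the Euclidean distance, and the default k and margin are illustrative choices rather than details taken from the cited methods.

```python
import numpy as np

def batch_triplets(X, y, k=3, margin=1.0):
    """Form (anchor, target, imposter) index triplets from a labeled batch.

    Each x_i is similar to its k nearest same-class neighbors (targets); any
    point of a different class violating d(x_i, x_j) + margin < d(x_i, x_l)
    is an imposter of x_i with respect to target x_j.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
    triplets = []
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        targets = same[np.argsort(D[i, same])[:k]]              # k target neighbors of x_i
        for j in targets:
            # imposters: different-class points that violate the margin condition
            imposters = np.where((y != y[i]) & (D[i] < D[i, j] + margin))[0]
            triplets.extend((i, j, l) for l in imposters)
    return triplets
```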

Fig. 1 Illustration of target neighbors and imposters of xi [17]

Generating triplets using this procedure is both time and space consuming and is not feasible for online tasks. Although OPML presents an online triplet construction algorithm that is very efficient in terms of computational cost, it does not consider the distribution and structure of the data. Therefore, its performance is lower than that of the batch algorithm.

3 Proposed methods

In this section, we first derive a general form of the objective functions of existing online Similarity/Distance methods. Then, we propose a robust variant based on the Rescaled Hinge loss. Finally, we develop algorithms that efficiently solve the resulting problem using Half-Quadratic (HQ) optimization.

3.1 General Form of Objective Functions in Online Similarity/Distance Methods

As observed, many Distance/Similarity algorithms are based on the margin-based Hinge loss function (lhinge). Let us define the variable zt as follows:

$$ {z}_t=\left\{\begin{array}{ll}{S}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2-{S}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2, & \text{for similarity-based methods}\kern2em (6)\\ {d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2-{d}_{\boldsymbol{M}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2, & \text{for Mahalanobis-based methods}\kern2em (7)\end{array}\right. $$

The Hinge loss can then be written as:

$$ l\left(\boldsymbol{M},\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)\right)=\max \left\{0,1-{z}_t\right\} $$
(8)

Figure 2 shows this loss function. As seen, the loss grows linearly and without bound for z ≤ 1. The unboundedness of the Hinge loss allows noisy labeled data and outliers to have a large effect on the training process, which results in a poorly performing learned Distance/Similarity measure.

Fig. 2 The margin-based Hinge loss function. The loss grows linearly and without bound for z ≤ 1

Most existing Distance/Similarity learning methods can be formulated as follows:

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}=\arg \kern0.10em \underset{\boldsymbol{M}}{\min}\kern0.10em \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{Cl}_{hinge}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(9)

Note that the constraint M ≽ 0 is not adopted in all methods, so we enclose it in brackets. Many existing methods can be derived from this generic optimization problem. For example, if we set \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), omit the M ≽ 0 constraint, and define zt according to (6), we obtain the OASIS [2] and OKS [3] optimization problems. If we instead define zt according to (7), the optimization problem (9) reduces to OPML [7]. Similarly, by setting \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), defining zt according to (7), and enforcing M ≽ 0, we reach the optimization problem in [4]. Finally, if we set \( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)={D}_{ld}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\mathrm{trace}\left(\boldsymbol{M}{\boldsymbol{M}}_t^{-1}\right)-\log \det \left(\boldsymbol{M}{\boldsymbol{M}}_t^{-1}\right)-d \) and drop the explicit M ≽ 0 constraint, we obtain the optimization problem of [5].

One approach to alleviating the effect of noisy labeled data in PA-based problems (such as eq. (9)) is to select a small value for the hyper-parameter C. However, this also lowers the adaptive learning rate. Instead, we propose to replace the Hinge loss with the robust Rescaled Hinge loss.

3.2 Robust Variant of the General Objective Function

The Rescaled Hinge loss is defined as:

$$ {l}_{rhinge}(z)=\beta \left[1-\exp \left(-\eta {l}_{hinge}(z)\right)\right] $$
(10)

Figure 3 plots the lrhinge(z) loss function for different values of η. Here, η is a rescaling parameter and β = 1/(1 − exp(−η)) is a normalizing constant that ensures lrhinge(0) = 1. As seen, this loss function is more robust than the Hinge loss against outliers and data contaminated with label noise, and the degree of robustness can be adjusted through η. Moreover, the Hinge loss can be regarded as a limiting case of the Rescaled Hinge: lrhinge(z) approaches lhinge(z) as η → 0.
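
The following short Python snippet, provided only as an illustration of eq. (10), contrasts the unbounded Hinge loss with the bounded Rescaled Hinge loss; the function names are ours.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def rescaled_hinge(z, eta=1.0):
    """Rescaled Hinge loss of eq. (10): bounded above by beta, so a single
    noisy triplet cannot contribute an arbitrarily large loss."""
    beta = 1.0 / (1.0 - np.exp(-eta))        # ensures rescaled_hinge(0) == 1
    return beta * (1.0 - np.exp(-eta * hinge(z)))

z = np.linspace(-5, 2, 8)
print(hinge(z))                  # grows without bound as z decreases
print(rescaled_hinge(z, eta=1))  # saturates at beta ~= 1.58
```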

Fig. 3 The robust Rescaled Hinge loss function vs. z for different values of η

By replacing the Hinge loss function with the Rescaled Hinge loss in eq. (9), we obtain the following optimization problem for online robust Distance/Similarity learning.

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{\boldsymbol{t}+1}=\arg\ \underset{\boldsymbol{M}}{\min }\ \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{Cl}_{\boldsymbol{rhinge}}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(11)

In the next subsections, we derive two algorithms that efficiently solve the above optimization problem in an online fashion.

3.3 The proposed Robust methods

Since the Rescaled Hinge loss is not convex, we need an efficient algorithm to solve the optimization problem (11). The proposed algorithms are based on HQ (Half-Quadratic) optimization, an efficient alternating approach for non-convex problems. The main idea of HQ is to add an auxiliary variable v to the problem using conjugate function theory [19], such that the new optimization problem becomes quadratic in the main variable (with the same optimal solution as the original non-convex problem).

Since lrhinge(z) = β[1 − exp(−ηlhinge(z))], we can obtain the following problem which is equivalent to (11).

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{\boldsymbol{t}+1}=\underset{\boldsymbol{M}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(12)

According to the definition of the conjugate function, we have (refer to Appendix A of [9]):

$$ \exp \left(-\eta {l}_{hinge}(z)\right)=\underset{v<0}{\sup}\left(\eta {l}_{hinge}(z)v-g(v)\right) $$
(13)

where g(v) =  − v log(−v) + v,   (v < 0). By substituting eq. (13) in (12), we obtain

$$ {\displaystyle \begin{array}{l}f\left(\boldsymbol{M}\right)=-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\\ {}\kern3em =-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \underset{v_t<0}{\sup}\left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\kern3em =\underset{v_t<0}{\sup}\left(-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\right)\end{array}} $$
(14)

The third equality in (14) holds since reg(M, Mt) is constant with respect to v. Using (14), we can rewrite (12) as:

$$ {\displaystyle \begin{array}{c}\left({\boldsymbol{M}}_{\boldsymbol{t}+1},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(15)

To solve the above problem, we use an alternating optimization approach. First, given M, we optimize (15) over vt; then, given vt, we optimize it over M. Suppose M(s) is given (the superscript s indicates the iteration number); then (15) is equivalent to:

$$ {v}_t^{(s)}=\underset{\ {v}_t}{\arg\ \max}\kern0.50em \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(16)

The above equation has a closed-form solution obtained by setting its derivative with respect to vt equal to zero.

$$ {v}_t^{(s)}=-\exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\kern0.5em $$
(17)

After obtaining \( {v}_t^{(s)} \), we optimize eq. (15) with respect to M as follows:

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \eta {v}_t\ {l}_{hinge}\left({z}_t\right)\\ {}\left[\mathrm{subject}\ \mathrm{to}\ \boldsymbol{M}\succcurlyeq 0\right]\end{array}} $$
(18)

The above problem is equivalent to:

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+{C}_t\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em \left[\boldsymbol{M}\succcurlyeq 0\right],\kern0.75em {l}_{hinge}\left({z}_t\right)\le \xi, \kern1.5em \xi \ge 0\end{array}} $$
(19)

where Ct = Cβη(−vt). The robustness of the optimization problem (19) can be explained through the penalty factor Ct. Suppose the current triplet Rt contains noisy labeled data; then the hinge function lhinge(zt) returns a large loss for Rt, and Ct = Cβη(−vt) = Cβη exp(−ηlhinge(zt)) approaches zero. Therefore, Rt has little effect on the learning process.
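
A minimal Python sketch of this adaptive weight, computed exactly as written above, is given below; the function name and the default hyper-parameter values are illustrative.

```python
import numpy as np

def triplet_weight(z_t, C=1.0, eta=1.0):
    """Adaptive weight C_t = C * beta * eta * exp(-eta * l_hinge(z_t)), cf. eqs. (17) and (19).
    Large hinge losses (typical of noisy triplets) drive the weight toward zero."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    loss = max(0.0, 1.0 - z_t)
    v_t = -np.exp(-eta * loss)          # closed-form HQ auxiliary variable, eq. (17)
    return C * beta * eta * (-v_t)

print(triplet_weight(z_t=0.9))   # small loss  -> weight close to C*beta*eta
print(triplet_weight(z_t=-8.0))  # large loss  -> weight near zero
```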

The obtained optimization problem, unlike existing models, assigns an adaptive weight Ct to each incoming triplet. By adjusting reg(M, Mt), the p.s.d. constraint, and zt, we can obtain a family of robust Distance/Similarity learning methods. For instance, we develop two algorithms named Robust-OASIS and Robust-ODML, which can be considered robust variants of OASIS [2] and ODML [4], respectively.

3.3.1 Robust-OASIS

The robust similarity-based algorithm can be derived from the general optimization problem (15) by the following settings:

\( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), drop the M ≽ 0 constraint, and define zt according to (6).

Then, the following optimization problem is achieved:

$$ \left({\boldsymbol{M}}_{\boldsymbol{t}+\mathbf{1}},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max }-\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(20)

The solution of the above problem is obtained by iteratively computing vt from eq. (17) and then optimizing M by solving the following optimization problem.

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+{C}_t\xi \\ {}\mathrm{subject}\kern0.5em \mathrm{to}\kern0.5em l\left({p}_t,{p}_t^{+},{p}_t^{-}\right)=1-{S}_{\boldsymbol{M}}\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)+{\boldsymbol{S}}_{\boldsymbol{M}}\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)\le \xi, \kern1.5em \xi \ge 0\end{array}} $$
(21)

The problem (21) has a similar solution to that obtained in [2].

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}={\boldsymbol{M}}_t+\tau {\boldsymbol{A}}_t\\ {}\mathrm{where}\kern0.5em \tau =\min \left({C}_t,\frac{l\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)}{{\left\Vert {\boldsymbol{A}}_{\boldsymbol{t}}\right\Vert}_F^2}\right)\kern0.5em \mathrm{and}\kern0.5em {\boldsymbol{A}}_t={\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\top}\end{array}} $$
(22)

The main difference is that the learning rate τ is now bounded by the adaptive triplet weight Ct instead of the constant C used in OASIS. Algorithm 1 summarizes the steps of Robust-OASIS.

Algorithm 1 The Robust-OASIS algorithm
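
Since the algorithm listing is not reproduced here, the following NumPy sketch outlines one Robust-OASIS time step following eqs. (17), (21), and (22); the function signature, the default hyper-parameters, and the placement of the passive check are our assumptions.

```python
import numpy as np

def robust_oasis_step(M_t, p, p_pos, p_neg, C=0.1, eta=1.0, max_hq_iter=2):
    """One Robust-OASIS time step for the triplet (p, p_pos, p_neg)."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    A_t = np.outer(p, p_pos - p_neg)                              # A_t of eq. (22)
    loss_at_Mt = max(0.0, 1.0 - p @ M_t @ p_pos + p @ M_t @ p_neg)
    if loss_at_Mt == 0.0:
        return M_t                                                # passive step: constraint satisfied
    M = M_t
    for _ in range(max_hq_iter):
        # HQ step 1: adaptive weight from the auxiliary variable at the current iterate (eq. (17))
        loss = max(0.0, 1.0 - p @ M @ p_pos + p @ M @ p_neg)
        C_t = C * beta * eta * np.exp(-eta * loss)
        # HQ step 2: PA update anchored at M_t with aggressiveness bounded by C_t (eqs. (21)-(22))
        tau = min(C_t, loss_at_Mt / np.linalg.norm(A_t, 'fro') ** 2)
        M = M_t + tau * A_t
    return M
```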

3.3.2 Robust-ODML

The robust Mahalanobis distance learning algorithm can be derived from the general optimization problem (15) by the following settings:

\( \mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2 \), enforce the M ≽ 0 constraint, and define zt according to (7).

We then obtain the following optimization problem:

$$ {\displaystyle \begin{array}{c}\left({\boldsymbol{M}}_{\boldsymbol{t}+1},{\boldsymbol{v}}_{\boldsymbol{t}}^{\ast}\right)=\underset{\boldsymbol{M},{\boldsymbol{v}}_{\boldsymbol{t}}}{\arg\ \max}\kern0.5em -\frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+ C\beta \left(\eta {l}_{hinge}\kern0.35em \left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(23)

In a similar way to Robust-OASIS, we obtain the solution by iteratively computing vt from eq. (17) and then optimizing M by solving the following optimization problem.

$$ {\displaystyle \begin{array}{c}\boldsymbol{M}=\underset{\boldsymbol{M}}{\arg\ \min}\kern0.5em \frac{1}{2}{\left\Vert \boldsymbol{M}-{\boldsymbol{M}}_t\right\Vert}_F^2+{C}_t\xi \\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em l\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)=1+{d}_{\boldsymbol{M}}^2\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{+}\right)-{d}_{\boldsymbol{M}}^2\left({\boldsymbol{p}}_{\boldsymbol{t}},{\boldsymbol{p}}_{\boldsymbol{t}}^{-}\right)\le \xi, \kern1.5em \xi \ge 0,\kern1em \boldsymbol{M}\succcurlyeq 0\end{array}} $$
(24)

The solution of the above problem is similar to that of [4].

$$ {\displaystyle \begin{array}{c}{\boldsymbol{M}}_{t+1}={\boldsymbol{M}}_t+\tau {\boldsymbol{A}}_t\\ {}\mathrm{where}\kern0.5em \tau =\min \left({C}_t,\frac{l\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+},{\boldsymbol{p}}_t^{-}\right)}{{\left\Vert {\boldsymbol{A}}_{\boldsymbol{t}}\right\Vert}_F^2}\right)\kern0.5em \mathrm{and}\kern0.5em \\ {}{\boldsymbol{A}}_t=\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\top }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\top}\end{array}} $$
(25)

Algorithm 2 summarizes the steps of Robust-ODML.

Algorithm 2 The Robust-ODML algorithm

To enforce the p.s.d. constraint, the naive approach is to perform a full eigenvalue decomposition of the matrix M and set its negative eigenvalues to zero. This approach requires O(d³) operations, so it is infeasible for high-dimensional DML tasks. Although some improved methods are available [6, 8, 12], we address this problem by developing low-rank versions of the proposed algorithms in the following subsections.
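
For reference, this naive projection onto the p.s.d. cone can be sketched as follows in NumPy; the explicit symmetrization step and the toy example are our additions.

```python
import numpy as np

def project_psd(M):
    """Naive O(d^3) projection onto the p.s.d. cone: eigendecompose the
    symmetric matrix M and clip its negative eigenvalues to zero."""
    M_sym = 0.5 * (M + M.T)                 # symmetrize to guard against round-off
    w, V = np.linalg.eigh(M_sym)
    return (V * np.clip(w, 0.0, None)) @ V.T

M = np.array([[2.0, 0.0], [0.0, -1.0]])
print(project_psd(M))                       # [[2, 0], [0, 0]]
```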

3.4 Low-rank Robust Distance/Similarity learning methods

Instead of a full Mahalanobis matrix M ∈ ℝ^{d×d}, the proposed low-rank methods learn a rectangular projection matrix L ∈ ℝ^{d×r}, where M = LL^⊤ and r is the rank of M. We follow this approach to obtain the low-rank variants of the proposed methods because 1) it automatically enforces the p.s.d. constraint, and 2) in many real applications the data lie on a latent subspace with dimensionality r ≪ d, so this approach requires fewer parameters. An important issue is how to adjust the hyperparameter r. While sophisticated methods such as Bayesian variational inference [16] or low-rank approximation [20] can adjust the value of r automatically, here we simply use cross-validation.

The optimization problem for low-rank online Distance/Similarity learning is formulated as:

$$ {\boldsymbol{L}}_{\boldsymbol{t}+\mathbf{1}}=\underset{\boldsymbol{L}}{\arg\ \max }-\mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)+ C\beta \left(\eta {l}_{hinge}\ \left({z}_t\right){v}_t-g\left({v}_t\right)\right) $$
(26)

If we rewrite both the bilinear similarity and the Mahalanobis distance as functions of L as follows:

$$ {S}_{\boldsymbol{L}}{\left(\boldsymbol{p},\boldsymbol{q}\right)}^2={\boldsymbol{p}}^{\top}\boldsymbol{Mq}={\boldsymbol{p}}^{\top }{\boldsymbol{L}\boldsymbol{L}}^{\top}\boldsymbol{q}={\left({\boldsymbol{L}}^{\boldsymbol{T}}\boldsymbol{p}\right)}^{\top}\left({\boldsymbol{L}}^{\top}\boldsymbol{q}\right) $$
(27)
$$ {d}_{\boldsymbol{L}}{\left(\boldsymbol{p},\boldsymbol{q}\right)}^2={\left(\boldsymbol{p}-\boldsymbol{q}\right)}^{\top}\boldsymbol{M}\left(\boldsymbol{p}-\boldsymbol{q}\right)={\left(\boldsymbol{p}-\boldsymbol{q}\right)}^{\top }{\boldsymbol{L}\boldsymbol{L}}^{\top}\left(\boldsymbol{p}-\boldsymbol{q}\right)={\left\Vert {\boldsymbol{L}}^{\top}\boldsymbol{p}-{\boldsymbol{L}}^{\top}\boldsymbol{q}\right\Vert}_2^2, $$
(28)

Then, bilinear similarity learning is equivalent to finding a linear projection L and applying the dot product to the inputs in the projected space. Similarly, Mahalanobis distance learning corresponds to computing the Euclidean distance after transforming the inputs by L.

The zt variable can be expressed in terms of SL and dL as:

$$ {z}_t=\left\{\begin{array}{ll}{S}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2-{S}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2, & \text{for similarity-based methods}\kern1em (29)\\ {d}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{-}\right)}^2-{d}_{\boldsymbol{L}}{\left({\boldsymbol{p}}_t,{\boldsymbol{p}}_t^{+}\right)}^2, & \text{for Mahalanobis-based methods}\kern1em (30)\end{array}\right. $$

We can now derive the proposed low-rank robust similarity learning algorithm, named Robust-LOSL, from the generic optimization problem (26) with the following settings:

\( \mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2 \), and define zt according to (29).

The obtained optimization problem can be solved by iteratively computing vt from eq. (17) and then optimizing L by solving the following optimization problem:

$$ {\boldsymbol{L}}_{t+1}=\underset{\boldsymbol{L}}{\arg \kern0.15em \min}\kern0.50em \frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2+{C}_t{l}_{hinge}\left({z}_t\right) $$
(31)

The above unconstrained optimization problem is non-convex. However, we can solve it efficiently by optimizing a simple linear neural network model parameterized by L as illustrated in Fig. 4.

Fig. 4 The proposed neural network model for low-rank robust online Distance/Similarity learning

The sub-gradient of the loss function with respect to L can be computed from the following equation:

$$ {\displaystyle \begin{array}{c}\frac{\partial {l}_t}{\partial \boldsymbol{L}}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)+{C}_t\left[{\boldsymbol{p}}_t{{\boldsymbol{p}}_t^{-}}^{\intercal }+{\boldsymbol{p}}_t^{-}{\boldsymbol{p}}_t^{\intercal }-{\boldsymbol{p}}_t{{\boldsymbol{p}}_t^{+}}^{\intercal }-{\boldsymbol{p}}_t^{+}{\boldsymbol{p}}_t^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-{C}_t\left[{\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\intercal }+\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right){\boldsymbol{p}}_t^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-{C}_t\left[{\boldsymbol{A}}_t+{\boldsymbol{A}}_t^{\intercal}\right]\boldsymbol{L},\kern1em \text{where}\ {\boldsymbol{A}}_t={\boldsymbol{p}}_t{\left({\boldsymbol{p}}_t^{+}-{\boldsymbol{p}}_t^{-}\right)}^{\intercal}\end{array}} $$
(32)

Thus, we can train the network using backpropagation or more sophisticated optimizers such as Adam. The steps of Robust-LOSL are summarized in Algorithm 3.

Algorithm 3 The Robust-LOSL algorithm
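
As with Algorithm 1, the listing is not reproduced here; the sketch below performs one Robust-LOSL step with a plain gradient update on L using the sub-gradient of eq. (32). The learning rate, the stopping check, and the use of a fixed step size instead of Adam are our assumptions.

```python
import numpy as np

def robust_losl_step(L_t, p, p_pos, p_neg, C=0.1, eta=1.0, lr=0.05, max_hq_iter=2):
    """One Robust-LOSL time step: HQ weight (eq. (17)) plus gradient steps on eq. (31)."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    A_t = np.outer(p, p_pos - p_neg)
    L = L_t.copy()
    for _ in range(max_hq_iter):
        z = (L.T @ p) @ (L.T @ p_pos) - (L.T @ p) @ (L.T @ p_neg)   # z_t of eq. (29)
        loss = max(0.0, 1.0 - z)
        if loss == 0.0:
            break                                                   # margin satisfied: keep L
        C_t = C * beta * eta * np.exp(-eta * loss)                  # adaptive triplet weight
        grad = (L - L_t) - C_t * (A_t + A_t.T) @ L                  # sub-gradient, eq. (32)
        L = L - lr * grad
    return L
```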

Similarly, we can derive the proposed low-rank robust distance learning algorithm, named Robust-LODML, from the generic optimization problem (26) with the following settings:

\( \mathrm{reg}\left(\boldsymbol{L},{\boldsymbol{L}}_t\right)=\frac{1}{2}{\left\Vert \boldsymbol{L}-{\boldsymbol{L}}_t\right\Vert}_F^2 \), define zt according to (30).

We solve the obtained problem iteratively by computing vt from eq. (17) and then updating L by optimizing the neural network model presented in Fig. 4. The sub-gradient of the loss function with respect to L is:

$$ {\displaystyle \begin{array}{c}\frac{\partial {l}_t}{\partial \boldsymbol{L}}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)+2{C}_t\left[\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\intercal }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\intercal}\right]\boldsymbol{L}\\ {}=\left(\boldsymbol{L}-{\boldsymbol{L}}_t\right)-2{C}_t{\boldsymbol{A}}_t\boldsymbol{L},\kern1em \text{where}\ {\boldsymbol{A}}_t=\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{-}\right)}^{\intercal }-\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right){\left({\boldsymbol{p}}_t-{\boldsymbol{p}}_t^{+}\right)}^{\intercal}\end{array}} $$
(33)

Algorithm 4 summarizes the steps of Robust-LODML. We can easily replace the linear module in the proposed low-rank model with a nonlinear deep neural network module; thus, extending our methods to online deep Distance/Similarity learning is straightforward. The experimental results in the next section also confirm that Robust-LODML reduces the computational cost significantly while preserving the predictive performance of the learned metric.

Algorithm 4 The Robust-LODML algorithm

3.5 Convergence Analysis

This subsection establishes the convergence of our methods with an analysis similar to that in [9]. According to (14),

$$ {\displaystyle \begin{array}{l}f\left(\boldsymbol{M},\boldsymbol{v}\right)=-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \left(\eta {l}_{hinge}\left({z}_t\right){v}_t-g\left({v}_t\right)\right)\\ {}\kern4em =-\mathrm{reg}\left(\boldsymbol{M},{\boldsymbol{M}}_t\right)+ C\beta \exp \left(-\eta {l}_{hinge}\left({z}_t\right)\right)\le C\beta \end{array}} $$

The inequality holds since reg(M, Mt) ≥ 0 and exp(−ηlhinge(zt)) ≤ 1. Thus, our objective function f(M, v) is upper bounded. Let \( f\left({\boldsymbol{M}}_t^{(s)},{\boldsymbol{v}}_t^{(s)}\right) \) denote the objective value in the s-th iteration of the HQ loop. According to (16) and (19), we have

$$ f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{(s)}\right)\le f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{\left(s+1\right)}\right)\le f\left({\boldsymbol{M}}_{\boldsymbol{t}}^{\left(\boldsymbol{s}+\mathbf{1}\right)},{\boldsymbol{v}}_{\boldsymbol{t}}^{\left(s+1\right)}\right) $$

This means that the sequence \( \left\{f\left({\boldsymbol{M}}_t^{(s)},{\boldsymbol{v}}_t^{(s)}\right),s=1,2,\dots, MaxHQIter\ \right\} \) generated by our algorithms is nondecreasing and upper bounded, so it converges. Consequently, by considering the convergence property of gradient descent methods [21], the convergence of the proposed algorithms is established.

3.6 Run Time Analysis

As seen, the proposed robust online Distance/Similarity learning model is general and can easily be applied to existing online Distance/Similarity algorithms. Let A be an online Distance/Similarity algorithm with time complexity TA. By applying our method to A, besides optimizing the Distance/Similarity measure, we need to compute the weight of the incoming triplet (Ct) using eq. (17). Since Ct only requires evaluating lhinge(zt), which is also needed for updating the metric, it does not incur additional cost, and the overall time complexity of the robust method is O(MaxHQIter × TA). The experimental results confirm that the alternating loop converges quickly, and the best results are obtained with MaxHQIter ≤ 3 in all experiments. Therefore, the obtained robust method has essentially the same time complexity as the corresponding algorithm A.
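
A schematic of this wrapping might look as follows; it is a sketch of the scheme described above, not a reproduction of any listed algorithm, and `base_update` and `hinge_loss` are placeholders standing in for the underlying algorithm A.

```python
import numpy as np

def robust_update(M, triplet, base_update, hinge_loss, C=0.1, eta=1.0, max_hq_iter=3):
    """Wrap an existing PA-style metric update with the HQ loop. The only extra
    work per iteration is the weight C_t, which reuses the hinge loss that the
    base update already needs."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    for _ in range(max_hq_iter):
        loss = hinge_loss(M, triplet)                  # also needed by the base update
        C_t = C * beta * eta * np.exp(-eta * loss)     # eq. (17): C_t = C*beta*eta*(-v_t)
        M = base_update(M, triplet, C_t)               # PA step with adaptive aggressiveness
    return M
```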

3.7 Online Triplet Construction Algorithm

Generating triplets using available batch algorithms is both time and space consuming. Also, the one-pass triplet construction strategy adopted in OPML performs poorly, especially in noisy environments. To this end, we propose an online triplet construction algorithm named OCTG, which is not only very efficient but also competitive with the available batch methods. By utilizing the distribution and clusters of the input data, the proposed algorithm can effectively detect outliers and noisy labeled data; therefore, its performance surpasses that of existing methods in noisy environments.

Suppose {Vi | i = 1, 2, …, K} is the set of cluster centers, initialized from a sample of data at the beginning of the online algorithm. Here, we use the k-means algorithm to obtain c cluster centers per class in the dataset. At time step t, OCTG receives the incoming data point (xt, yt) and finds its closest cluster center Vt with the same class label. Then, it considers any cluster center Vi from a different class (i.e., yi ≠ yt) that violates the following condition as an imposter (see Fig. 5):

$$ d\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_t\right)+ margin<d\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_i\ \right) $$
Fig. 5 Illustration of imposters of the data point xt

The triplet set constructed at time step t is formed as:

$$ {T}_t=\left\{\left({\boldsymbol{x}}_t,{\boldsymbol{V}}_t,{\boldsymbol{V}}_i\right)\;\middle|\;{\boldsymbol{V}}_i\ \text{is an imposter of}\ {\boldsymbol{x}}_t\right\} $$

As seen, the proposed methods assign a weight Ct to each incoming triplet. We assign to xt a weight wt equal to the minimum of the weights of the triplets generated at time t. A small value of wt indicates that xt is a potential outlier or a noisy labeled instance. The weights and input data are then used to update the cluster centers using any existing online clustering method.
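
A minimal NumPy sketch of one OCTG step, assuming the cluster centers and their class labels are kept in arrays, is given below; the handling of the center updates and of the instance weights is omitted, and the function name is ours.

```python
import numpy as np

def octg_step(x_t, y_t, centers, center_labels, margin=1.0):
    """One OCTG step: pair x_t with its nearest same-class cluster center and
    with every different-class center that violates the margin condition."""
    d = np.linalg.norm(centers - x_t, axis=1)
    same = np.where(center_labels == y_t)[0]
    V_t = same[np.argmin(d[same])]                       # nearest same-class center
    # imposters: different-class centers with d(x_t, V_i) <= d(x_t, V_t) + margin
    imposters = [i for i in np.where(center_labels != y_t)[0]
                 if d[i] < d[V_t] + margin]
    return [(x_t, centers[V_t], centers[i]) for i in imposters]
```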

The obtained weights can be used to enhance the performance of any metric-based algorithm, such as kNN or CBIR (Content-Based Image Retrieval), in noisy environments. For example, in the experiments we use the following version of kNN, named Robust-kNN, instead of the standard kNN to classify objects.

The Robust-kNN algorithm
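
The Robust-kNN listing is not reproduced here; the sketch below shows one plausible reading of it, in which each neighbor's vote is scaled by its instance weight wt from Section 3.7. The weighting rule and the projection by L are our assumptions, not details taken from the paper.

```python
import numpy as np

def robust_knn_predict(x, X_train, y_train, w_train, L, k=3):
    """Classify x with a kNN vote in the projected space, where each neighbor's
    vote is scaled by its instance weight w (suspected noisy points count less)."""
    Z = X_train @ L                                    # project training data
    dist = np.linalg.norm(Z - x @ L, axis=1)
    nn = np.argsort(dist)[:k]
    votes = {}
    for i in nn:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w_train[i]
    return max(votes, key=votes.get)
```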

Figure 6 depicts the system flow of the proposed learning/test schemes.

Fig. 6 The system flow of the proposed learning/test schemes

4 Experimental Results

This section presents the experiments performed to evaluate the proposed methods in noisy environments. First, we study the effect of label noise on the generated triplets and discuss how these noisy triplets affect the performance of online DML methods. Subsequently, we evaluate the proposed methods on real datasets at different levels of label noise and compare the results with those of peer methods.

Bilinear (dot product) similarity learning is equivalent to Mahalanobis distance learning when each instance in the input triplets has unit norm (i.e., ‖p‖₂ = 1). Thus, the experiments focus on Mahalanobis distance learning.

4.1 Effect of Label Noise on the Generated Triplets

As depicted in Fig. 7, we can distinguish between three different types of noisy triplets: anchor, positive, and negative noisy triplets.

Fig. 7 Three different types of noisy triplets in the form (xi, xj, xl): (a) anchor noisy triplet, where xi is contaminated with label noise; (b) positive noisy triplet, where xj has label noise; (c) negative noisy triplet, where xl has a wrong label

To study the effects of different types of noisy triplets, we apply 10% label noise to the Wine dataset. The noisy dataset is visualized using the T-SNE algorithm [22] in Fig. 8.

Fig. 8 t-SNE visualization of the Wine dataset after applying 10% label noise

The statistics of the generated triplets using both batch [17] and OCTG methods are summarized in Table 3.

Table 3 Statistics of generated triplets in the Wine dataset contaminated with 10% label noise

As the results in Table 3 indicate, with only 10% label noise, 68% and 46% of the triplets generated by the batch and OPML triplet construction methods, respectively, are contaminated. In contrast, OCTG constructs only 25% contaminated triplets (all of the anchor noisy type). This is because OCTG selects the positive and negative points from cluster centers rather than from data instances that may have been contaminated by label noise. The noisy triplets generated by OCTG also incur large losses compared with the normal ones (1.67 vs. 0.39), which can be explained by the fact that a noisy labeled example is often far from its own cluster center while being close to a center of another class. Hence, the proposed robust methods assign very small weights (Ct = Cβη exp(−ηlhinge(zt))) to them in the learning process, so they have a negligible effect on the learned metric.

To analyze the effect of the different types of triplet noise in a typical DML task, we run ODML [4] with the following settings on the triplets generated by the batch method.

ODML: The ODML algorithm.

Ideal ODML: The ideal algorithm which knows the noisy triplets in advance and so ignores them in the training process.

Anchor Ideal ODML: The ideal algorithm that only knows the anchor noisy triplets in advance.

Pos Ideal ODML: The ideal algorithm that only knows the positive noisy triplets in advance.

Neg Ideal ODML: The ideal algorithm that only knows the negative noisy triplets in advance.

In this experiment, we divide the dataset into train/test sets with a 70/30 ratio and run the above algorithms ten times. Figure 9 depicts the mean of the results obtained by the various algorithms.

Fig. 9 The kNN (k = 3) accuracy of the learned metric of various algorithms on the Wine dataset with 10% label noise

For small values of C, the results indicate that the metric learned by ODML shows no meaningful difference from the Euclidean metric. For large values of C, ODML performs worse than Euclidean, and its accuracy degrades substantially in this noisy environment. Moreover, among the ideal methods (which cannot be implemented in practice), Anchor Ideal ODML performs the same as Ideal ODML, while the others (Pos Ideal ODML, Neg Ideal ODML) are ineffective. Thus, anchor noisy triplets are the main cause of the low performance in this experiment.

We repeat the experiment by running Robust-LODML on the triplets generated by our mechanism. The mean accuracy of kNN-Robust-LODML (k = 3, η = 3) and the weights assigned to the instances by Robust-LODML are depicted in Figs. 10 and 11, respectively. As the results show, the proposed method is robust against label noise, and its performance surpasses the Euclidean metric even for large values of C. Moreover, Robust-LODML effectively identifies the contaminated instances and considerably reduces their weights (Ct) during training.

Fig. 10 The kNN accuracy of the metric learned by the Robust-LODML algorithm (η = 3) on the Wine dataset with 10% label noise

Fig. 11 t-SNE visualization of the Wine dataset with 10% label noise, where data points are displayed (a) with equal sizes and (b) with sizes proportional to their weights

As shown in Fig. 3, the parameter η controls the robustness of the loss function against outliers and data with noisy labels. To study its effect in a real experiment, we apply 20% label noise to the Wine dataset and evaluate Robust-LODML in a 5-fold cross-validation setting. Figure 12 depicts the mean accuracy of kNN-Robust-LODML (k = 3). As the results show, low η values considerably degrade the performance of Robust-LODML, whereas a properly chosen η substantially increases the performance of our method in the noisy environment.

Fig. 12 Mean accuracy of kNN-Robust-LODML (k = 3) vs. η on the Wine dataset with 20% label noise

These results were obtained on a single dataset. In the next subsections, we evaluate the proposed methods on a variety of datasets at different label noise levels and compare the results with state-of-the-art methods.

4.2 Experimental Setup

Table 4 summarizes the statistics of the datasets evaluated in the experiments. All datasets except Letters are normalized so that each attribute has zero mean and unit standard deviation. Moreover, the dimension of the images in Extended Yale Faces has been reduced to 100 by applying PCA to alleviate feature noise effects. The parameter d in Table 4 denotes the input dimension after feature reduction.

Table 4 Statistics and explanations of evaluated datasets

In the experiments, triplet side information is generated using OCTG for the proposed methods whereas the one-pass triplet construction [7] is adopted for the other methods.

The results are obtained by k-fold cross-validation (k = 5 for Letters and Extended Yale Faces, and k = 10 for the other datasets). The results are compared with the peer distance-based methods ODML [4], LPA-ODML [6], and OPML [7].

The hyperparameters of the competing methods are adjusted by k-fold cross-validation as follows. The parameter C in ODML and the proposed methods is selected from (10⁻⁶, 30), η in the proposed methods is chosen from (0.01, 5), and λ in OPML is selected from (10⁻⁶, 0.05). We evaluate the performance of the learned metrics with a kNN classifier (k = 3).

4.3 Results and Analysis

Table 5 presents the classification accuracy of kNN using the metrics learned by the competing methods. Here, the parameter nl denotes the label noise level (in percent). Figure 13 depicts the mean 5-fold cross-validation accuracy of the competing methods versus nl (ranging from 0% to 20%). To make the comparison meaningful, a statistical significance test with p-value = 5% was performed on the results. In Table 5, our results are marked with * when the differences from the other methods are statistically significant. Boxplots of some statistically different results are depicted in Fig. 14.

Table 5 The classification accuracy of the kNN using the learned metric of the competing methods

As the results in Table 5 and Fig. 13 indicate, the proposed robust methods (i.e., Robust-ODML and Robust-LODML) significantly outperform the other DML methods in the presence of label noise. Moreover, their performance declines more slowly than that of the other methods as the noise level increases. This confirms our claim that using the robust loss function and robust sampling preserves the discriminative power of the learned metric in a noisy environment.

Fig. 13 Comparison of the classification accuracy of RDML with other DML methods versus label noise

Fig. 14 Boxplots of some statistically different results (p-value = 5%)

Fig. 15 Four images from the COVID-19 dataset. First row: normal cases; second row: COVID-19 patients

Fig. 16 2×Sensitivity + Precision and G-mean of the competing methods on the COVID-19 dataset

Fig. 17 Mean run time of the evaluated methods in a 5-fold cross-validation setting on the COVID-19 dataset

Besides, the low-rank version of the proposed method (i.e., Robust-LODML) has almost the same accuracy as Robust-ODML. This confirms that in real datasets the data lie on a latent subspace with dimensionality r ≪ d. Thus, learning the projection matrix L ∈ ℝ^{d×r} instead of the full Mahalanobis matrix M yields the same performance while being more efficient in terms of time and space.

In the next subsection, we evaluate our proposed methods in a more challenging dataset for identifying COVID-19 patients from Chest-X-ray images.

4.4 Detecting COVID-19 Patients from Chest X-ray Images

4.4.1 Dataset description

The dataset used in our experiments is publicly available in the Kaggle repository [25]. It contains 219 COVID-19 cases and 1341 normal images; Figure 15 depicts some examples from both classes. As seen, the dataset is imbalanced and too small to train a deep CNN model from scratch.

4.4.2 Experimental setup

To extract features from the images, we use a pretrained ResNet18 [26]. This network was trained on the ImageNet dataset (1.4 million labeled images, 1000 classes). It has 71 layers, and its input layer requires images of size 224-by-224-by-3. We resize the images accordingly and obtain 512 features from the global pooling layer ('pool5') at the end of the model.
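
The 'pool5' layer name and the 71-layer count suggest a MATLAB-style model; a roughly equivalent PyTorch sketch, assuming torchvision ≥ 0.13 for the weights enum and an illustrative image path, is shown below. The output of the global average pooling layer gives the same 512-dimensional descriptor.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.eval()
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("chest_xray.png").convert("RGB")            # path is illustrative
with torch.no_grad():
    feat = feature_extractor(preprocess(img).unsqueeze(0))   # shape (1, 512, 1, 1)
features = feat.flatten(1).numpy()                           # 512-d descriptor
```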

In addition to the online methods, we also compare the proposed methods with the batch method BLMNN [18]. The λ and maxIter hyperparameters of BLMNN are selected from {1, 3, 5, 10, 20} and {1, 3, 5}, respectively, using 5-fold cross-validation.

We use 5-fold cross-validation to obtain the results. The main concern in this task is to limit the number of missed COVID-19 cases. Hence, in addition to accuracy, we use Sensitivity (Recall), Precision, F1 score, and G-mean (geometric mean) to evaluate our work. Here, COVID-19 and Normal are considered the positive and negative classes, respectively. The metrics are defined as follows:

$$ Accuracy=\left( TP+ TN\right)/ All\ Predictions $$
(34)
$$ Sensitivity\ (Recall)= TP/\left( FN+ TP\right) $$
(35)
$$ Precision= TP/\left( TP+ FP\right) $$
(36)
$$ F1- Score=2\ \left( Precision\times Sensitivity\right)/\left( Precision+ Sensitivity\right) $$
(37)
$$ Specificity= TN/\left( TN+ FP\right) $$
(38)
$$ G- mean=\sqrt{Sensitivity\times Specificity} $$
(39)
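
For completeness, these metrics can be computed directly from the entries of a binary confusion matrix, as in the following Python sketch (the function name is ours):

```python
import math

def covid_metrics(tp, fn, fp, tn):
    """Metrics of eqs. (34)-(39) from a binary confusion matrix
    (COVID-19 = positive class, Normal = negative class)."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                      # recall
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    specificity = tn / (tn + fp)
    g_mean      = math.sqrt(sensitivity * specificity)
    return dict(accuracy=accuracy, sensitivity=sensitivity, precision=precision,
                f1=f1, specificity=specificity, g_mean=g_mean)
```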

4.4.3 Results and analysis

Table 6 presents the classification results of kNN using the learned metrics at different levels of label noise. The sensitivity and precision of the competing methods versus the noise level are shown in Fig. 16(a); since sensitivity is more important in this task, it is multiplied by 2. Figure 16(b) presents the G-mean results versus the noise level. A high G-mean indicates that the accuracy on both classes is high and balanced.

Table 6 Classification metrics of kNN using the learned metrics of competing methods on the COVID-19 dataset

As the results indicate, all methods achieve high performance at low label noise levels. However, as the noise level increases, the performance of the competing methods declines more sharply than that of our proposed methods. In particular, although BLMNN (the batch method) has the advantage of processing each data point multiple times, it does not perform well at high noise levels. This can be explained as follows: 1) the batch triplet sampling utilized in BLMNN is vulnerable to label noise, as discussed in subsection 4.1, and 2) while Bayesian learning is effective against feature noise, it is less helpful for the more complicated problem of label noise.

The proposed methods achieve high sensitivity for COVID-19 patients in noisy environments. This is very important since the primary goal of this task is to limit the number of misclassified COVID-19 cases as much as possible. For example, the confusion matrices of the proposed methods at a noise level of 20% are shown in Table 7. As seen, only 1.8 and 1 COVID-19 patients (averaged over 5-fold cross-validation) are misclassified as Normal by the proposed methods. Our methods also obtain good precision (positive predictive value). High precision is crucial since a high number of false positives increases the burden on the healthcare system for additional care and tests such as PCR (Polymerase Chain Reaction). Therefore, based on the results, we conclude that the proposed methods perform well in detecting COVID-19 cases in the presence of label noise. However, the gap between the sensitivity and specificity values indicates that further improvements are possible by adopting balancing techniques for this imbalanced dataset.

Table 7 Mean of confusion matrices of proposed methods obtained by 5-fold cross validation on the COVID-19 dataset (label noise = 20%)

We also studied the mean run time of the competing methods in a 5-fold cross-validation setting. The results are depicted in Fig. 17, and Table 8 summarizes the statistics of the experiment. In the "hyper-parameters" column, we only report the values of the time-related hyper-parameters. The parameter r indicates the number of columns of the projection matrix L ∈ ℝ^{d×r}. Note that OPML can only learn a square projection matrix (r = d = 512 in these experiments), while Robust-LODML can learn a rectangular low-rank matrix; for Robust-LODML, we select r from {128, 256, 512}. The #active column shows the mean number of active triplets, i.e., the number of times the algorithm has to update the metric.

Table 8 Summary of statistics and run-time of the competing methods in a noise free (nl = 0%) and high-level noisy (nl = 20%) settings

The overall execution time of a DML method depends on the efficiency of its triplet sampling mechanism, the time required to update the metric, and its convergence rate. In the noise-free experiment, the average number of triplets generated by the one-pass triplet construction algorithm is 1231 (see Table 8), but only a few of them violate the margin constraint. The mean numbers of active triplets for LPA-ODML, ODML, and OPML are 65.00, 100.60, and 33.20, respectively; thus, OPML achieves a low runtime in this experiment. On the other hand, the OCTG mechanism utilized in our methods generates only 43.40 triplets on average, and the average numbers of active triplets for Robust-ODML and Robust-LODML are 21.00 and 26.80, respectively. As seen, the execution times of both Robust-ODML and Robust-LODML are acceptable in this experiment.

In the high-noise environment (nl = 20%), the convergence rate of the non-robust methods (i.e., LPA-ODML, ODML, and OPML) is low. Therefore, their numbers of active constraints are high, and their execution times exceed those of the robust algorithms. Here, we found that the best hyper-parameter setting for Robust-LODML is r = 128, MaxHQIter = 1. Hence, its number of parameters is a quarter of that of the other methods, and it has only 292.60 active constraints on average. Thus, its run time is considerably smaller than that of the other competing methods.

5 Conclusion and Future work

Existing online Distance/Similarity learning methods are usually formulated with the Hinge loss and are therefore not robust against outliers and noisy labeled data. They also often rely on the assumption that training triplets or pairwise constraints are available in advance, while generating triplets with available batch algorithms is both time and space consuming. To address these challenges, we formulated the online Distance/Similarity learning problem using the robust Rescaled Hinge loss [9] and developed an efficient robust one-pass triplet sampling algorithm that takes the data distribution and its clusters into account.

We further extended our work by providing low-rank variants of the proposed methods that learn a rectangular projection matrix instead of a full Mahalanobis matrix.

We studied the effects of label noise in a DML task and conducted several experiments to measure the performance of the proposed methods at different noise levels. Extensive experimental results show that the proposed methods can effectively detect wrongly labeled data and reduce their influence in DML tasks. Thus, they consistently outperform related online Distance/Similarity learning algorithms in noisy environments.

We intend to extend this work to online deep distance/similarity learning. Other directions for future work are:

  I. Examining the performance of the proposed methods in other applications such as CBIR.

  II. Extending the proposed methods to imbalanced environments.

  III. Enhancing the performance of the proposed online triplet sampling algorithm.