Online binary classification from similar and dissimilar data

Abstract

Similar-dissimilar (SD) classification aims to train a binary classifier from only similar and dissimilar data pairs, which indicate whether two instances belong to the same class (similar) or not (dissimilar). Although effective learning methods have been proposed for SD classification, they cannot deal with online learning scenarios involving sequential data, which are frequently encountered in real-world applications. In this paper, we provide the first attempt to investigate the online SD classification problem. Specifically, we first adapt the unbiased risk estimator of SD classification to online learning scenarios with a conservative regularization term, which can serve as a naive method to solve the online SD classification problem. Then, by further introducing a margin criterion that determines whether to update the classifier with the received cost, we propose two improvements (one with linearly scaled cost and the other with quadratically scaled cost) that result in two online SD classification methods. Theoretically, we derive regret, mistake, and relative loss bounds for our proposed methods, which guarantee their performance on sequential data. Extensive experiments on various datasets validate the effectiveness of our proposed methods.
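
To make the margin-gated, cost-scaled update concrete, here is a minimal illustrative sketch, not the authors' released implementation. It assumes a linear scorer \(f(x)=\varvec{w}^{\top }x\) and a pairwise hinge surrogate; the names `pa_sd_step`, `C`, and `margin` are our own illustrative choices. It shows a linearly scaled (capped) step, with the quadratically scaled variant noted in a comment.

```python
import numpy as np

def pa_sd_step(w, x, x_prime, s, C=1.0, margin=1.0):
    """One illustrative online step on a pair (x, x') with similarity
    label s (+1 = similar, -1 = dissimilar). Schematic only."""
    f, f_p = w @ x, w @ x_prime
    # Pairwise hinge surrogate: the two scores should agree (s = +1)
    # or disagree (s = -1) by at least `margin`.
    loss = max(0.0, margin - s * f * f_p)
    if loss == 0.0:
        return w  # margin criterion satisfied: be conservative, skip the update
    grad = -s * (f_p * x + f * x_prime)         # subgradient of the loss in w
    tau = min(C, loss / (grad @ grad + 1e-12))  # linearly scaled (capped) cost
    # Quadratically scaled variant: tau = loss / (grad @ grad + 1 / (2 * C))
    return w - tau * grad

# Toy usage on synthetic pairs whose similarity depends on one coordinate.
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(1000):
    x, x_p = rng.normal(size=5), rng.normal(size=5)
    s = 1.0 if x[0] * x_p[0] > 0 else -1.0
    w = pa_sd_step(w, x, x_p, s)
```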

Data availability

Not applicable.

Code availability

Not applicable.

References

  • Bao, H., Niu, G., & Sugiyama, M. (2018). Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461.

  • Bao, H., Shimada, T., Xu, L., Sato, I., & Sugiyama, M. (2020). Similarity-based classification: Connecting similarity learning to binary classification. arXiv preprint arXiv:2006.06207.

  • Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/index.php.

  • Cao, Y., Feng, L., Xu, Y., An, B., Niu, G., & Sugiyama, M. (2021). Learning from similarity-confidence data. In ICML, pp. 1272–1282.

  • Cao, Y., Wan, Z., Ren, D., Yan, Z., & Zuo, W. (2022). Incorporating semi-supervised and positive-unlabeled learning for boosting full reference image quality assessment. In CVPR, pp. 5851–5861.

  • Chen, R., Tang, Y., Zhang, W., & Feng, W. (2022). Deep multi-view semi-supervised clustering with sample pairwise constraints. Neurocomputing, 500, 832–845.

  • Crammer, K., Kulesza, A., & Dredze, M. (2009). Adaptive regularization of weight vectors. In NeurIPS, pp. 414–422.

  • Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3(Jan), 951–991.

  • Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar), 551–585.

  • Dekel, O., Gilad-Bachrach, R., Shamir, O., & Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1).

  • Er, M. J., Venkatesan, R., & Wang, N. (2016). An online universal classifier for binary, multi-class and multi-label classification. In ICSMC, pp. 003701–003706. IEEE.

  • Feng, L., Lv, J.-Q., Han, B., Xu, M., Niu, G., Geng, X., An, B., & Sugiyama, M. (2020). Provably consistent partial-label learning. In NeurIPS.

  • Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.

  • Hoi, S. C., Sahoo, D., Lu, J., & Zhao, P. (2021). Online learning: A comprehensive survey. Neurocomputing, 459, 249–289.

  • Ishida, T., Niu, G., & Sugiyama, M. (2018). Binary classification for positive-confidence data. In NeurIPS, pp. 5917–5928.

  • Jian, L., Gao, F., Ren, P., Song, Y., & Luo, S. (2018). A noise-resilient online learning algorithm for scene classification. Remote Sensing, 10(11), 1836.

  • Kaneko, T., Sato, I., & Sugiyama, M. (2019). Online multiclass classification based on prediction margin for partial feedback. arXiv preprint arXiv:1902.01056.

  • Kiryo, R., Niu, G., Du Plessis, M. C., & Sugiyama, M. (2017). Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685.

  • Kivinen, J., Smola, A. J., & Williamson, R. C. (2004). Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2165–2176.

  • Koçak, M. A., Shasha, D. E., & Erkip, E. (2016). Conjugate conformal prediction for online binary classification. In UAI.

  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

  • Li, Z., & Liu, J. (2009). Constrained clustering by spectral kernel learning. In ICCV, pp. 421–427.

  • Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In KDD, pp. 661–670.

  • Liu, D., Zhang, P., & Zheng, Q. (2015). An efficient online active learning algorithm for binary classification. Pattern Recognition Letters, 68, 22–26.

  • Lu, N., Niu, G., Menon, A. K., & Sugiyama, M. (2019). On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR.

  • Lu, N., Zhang, T., Niu, G., & Sugiyama, M. (2020). Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, pp. 1115–1125.

  • Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5), 823–870.

  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Berkeley symposium on mathematical statistics and probability, pp. 281–297.

  • Maheshwara, S. S., & Manwani, N. (2023). RoLNiP: Robust learning using noisy pairwise comparisons. In ACML, pp. 706–721.

  • Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. In NeurIPS, pp. 1196–1204.

  • du Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394.

  • Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.

  • Shimada, T., Bao, H., Sato, I., & Sugiyama, M. (2020). Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Computation.

  • Shinoda, K., Kaji, H., & Sugiyama, M. (2020). Binary classification from positive data with skewed confidence. In IJCAI, pp. 3328–3334.

  • Tao, Q., Scott, S., Vinodchandran, N., & Osugi, T. T. (2004). SVM-based generalized multiple-instance learning via approximate box counting. In ICML, p. 101.

  • Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001). Constrained k-means clustering with background knowledge. In ICML, pp. 577–584.

  • Wang, H., Qiang, Y., Chen, C., Liu, W., Hu, T., Li, Z., & Chen, G. (2020). Online partial label learning. In ECML PKDD.

  • Wu, D.-D., Wang, D.-B., & Zhang, M.-L. (2022). Revisiting consistency regularization for deep partial label learning. In ICML, pp. 24212–24225.

  • Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

  • Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193.

  • Zhang, C., Gong, C., Liu, T., Lu, X., Wang, W., & Yang, J. (2020). Online positive and unlabeled learning. In IJCAI, pp. 2248–2254.

Funding

This research is supported by the National Natural Science Foundation of China (No. 62106028), the Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and the CAAI-Huawei MindSpore Open Fund.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: S-S; Methodology: S-S; Theoretical analysis: F-L, H-W; Writing - original draft preparation: S-S, F-L; Writing - review and editing: S-S, Z-W; Funding acquisition: B-H, T-X, B-A, F-L.

Corresponding author

Correspondence to Lei Feng.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Vu Nguyen and Dana Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Proof of Theorem 1

As we have shown, our proposed OSD-OGD algorithm employs the following update rule:

$$\begin{aligned} \varvec{w}_{t+1} = \mathop {\textrm{arg}\,\textrm{min}}\limits _{\varvec{w} \in {\mathcal {W}}}\sum \nolimits _{i=1}^t R_{i}^{\textrm{SD}}(\varvec{w}) + \frac{1}{2\gamma }\left\| \varvec{w}\right\| _2^2, \end{aligned}$$

which is exactly the follow-the-regularized-leader procedure with Euclidean regularization (Shalev-Shwartz, 2011). As can be easily verified, the Euclidean regularizer \(\frac{1}{2\gamma } \left\| \varvec{w}\right\| _2^2\) is \(\frac{1}{\gamma }\)-strongly convex with respect to \(\left\| \cdot \right\| _2\). Recall the assumptions that \({\mathcal {L}}_{t}\) is \(\rho _t\)-Lipschitz with respect to \(\left\| \cdot \right\| _2\) and that \(\frac{1}{T}\sum \nolimits _{t=1}^T\rho _t^2\le \rho ^2\). Then, by Theorem 2.11 in Shalev-Shwartz (2011), we have, for any \(\varvec{w}_{\star }\in {\mathcal {W}}\),

$$\begin{aligned} \sum \nolimits _{t=1}^T {\mathcal {L}}_{t}(\varvec{w}_t) -\sum \nolimits _{t=1}^T {\mathcal {L}}_{t}(\varvec{w}_{\star })&\le \frac{1}{2\gamma }(\Vert \varvec{w}_{\star }\Vert _{2}^{2} -\min _{\varvec{v}\in {\mathcal {W}}}\Vert \varvec{v}\Vert _{2}^{2}) + \gamma T \rho ^{2} \le \frac{1}{2\gamma }\Vert \varvec{w}_{\star } \Vert _{2}^{2} + \gamma T \rho ^{2}, \end{aligned}$$

because \(\min _{\varvec{v}\in {\mathcal {W}}}\Vert \varvec{v}\Vert _{2}^{2}\ge 0\) always holds. In particular, if every hypothesis \(\varvec{w}\in {\mathcal {W}}\) satisfies \(\left\| \varvec{w}\right\| _2\le B\) and we set \(\gamma =B/(\rho \sqrt{2T})\), which balances the two terms of the bound, we have

$$\begin{aligned} \sum \nolimits _{t=1}^T {\mathcal {L}}_{t}(\varvec{w}_t) -\sum \nolimits _{t=1}^T {\mathcal {L}}_{t}(\varvec{w}_{\star }) \le B\rho \sqrt{2T}, \end{aligned}$$

which completes the proof of Theorem 1.
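
As a quick sanity check of the final step (ours, not part of the original proof), substituting \(\gamma =B/(\rho \sqrt{2T})\) makes the two terms of the bound equal:

$$\begin{aligned} \frac{1}{2\gamma }\Vert \varvec{w}_{\star }\Vert _{2}^{2} + \gamma T \rho ^{2} \le \frac{\rho \sqrt{2T}}{2B}\,B^{2} + \frac{B}{\rho \sqrt{2T}}\,T\rho ^{2} = \frac{B\rho \sqrt{2T}}{2} + \frac{B\rho \sqrt{2T}}{2} = B\rho \sqrt{2T}. \end{aligned}$$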

Appendix 2: Proof of Theorem 2

Following Crammer et al. (2006), for any \(\varvec{v}\in {\mathcal {W}}\), we define

$$\begin{aligned} \Delta _{t} = \Vert \varvec{w}_{t} - \varvec{v}\Vert ^{2}_{2} - \Vert \varvec{w}_{t+1} - \varvec{v}\Vert ^{2}_{2} \end{aligned}$$

and consider upper and lower bounds of \(\sum \nolimits _{t=1}^{T}\Delta _{t}\). By initializing \(\varvec{w}_{1}\) to the zero vector and using a telescoping sum, we can obtain

$$\begin{aligned} \sum \nolimits _{t=1}^{T}\Delta _{t}&= \sum \nolimits _{t=1}^{T}(\Vert \varvec{w}_{t} - \varvec{v}\Vert ^{2}_{2} - \Vert \varvec{w}_{t+1} - \varvec{v}\Vert ^{2}_{2}) \\&= \Vert \varvec{w}_{1} - \varvec{v}\Vert ^{2}_{2} - \Vert \varvec{w}_{T+1} - \varvec{v}\Vert ^{2}_{2}\\&\le \Vert \varvec{v}\Vert ^{2}_{2}. \end{aligned}$$

Since \(\varvec{w}_{t+1} =\varvec{w}_{t} - \tau _{t}\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\), we can obtain

$$\begin{aligned} \Delta _{t}&= \Vert \varvec{w}_{t} - \varvec{v}\Vert ^{2}_{2} - \Vert \varvec{w}_{t} - \tau _{t}\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t}) - \varvec{v}\Vert ^{2}_{2} \\&= 2\tau _{t}(\varvec{w}_{t} - \varvec{v})^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t}) - \tau _{t}^{2}\Vert \nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\Vert ^{2}_{2}. \end{aligned}$$

Since \(R_{t}^{\textrm{SD}}(\varvec{w})\) is \(\lambda\)-strongly convex, we have

$$\begin{aligned} R_{t}^{\textrm{SD}}(\varvec{v})- R_{t}^{\textrm{SD}}(\varvec{w}_{t}) \ge (\varvec{v} - \varvec{w}_{t})^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t}) + \frac{\lambda }{2} \Vert \varvec{v} - \varvec{w}_{t}\Vert ^{2}_{2}. \end{aligned}$$

Combining the above inequalities, we have

$$\begin{aligned} \Vert \varvec{v}\Vert ^{2}_{2} \ge \sum \nolimits _{t=1}^{T}\tau _{t}\left( 2R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2} -2R_{t}^{\textrm{SD}} (\varvec{v})\right) . \end{aligned}$$
(14)

For OSD-LSPA, if a prediction mistake occurs, then \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) \ge G\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}\ge 0\). Therefore, we can obtain

$$\begin{aligned} \sum \nolimits _{t=1}^{T}\tau _{t}R_{t}^{\textrm{SD}} (\varvec{w}_{t})\le \Vert \varvec{v}\Vert ^{2}_{2} +2C\sum \nolimits _{t=1}^{T}R_{t}^{\textrm{SD}}(\varvec{v}). \end{aligned}$$
(15)

Using our assumption that \(\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}\le r^2\) and the definitions \(\tau _{t} = \min \left( C, \max \left( 0,\frac{A+ \varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})}{\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}}\right) \right)\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) = A + \varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\), we conclude that if a prediction mistake occurs, then it holds that

$$\begin{aligned} \min \left( CG, \frac{G^2}{r^2}\right) \le \tau _{t}R_{t}^{\textrm{SD}} (\varvec{w}_{t}). \end{aligned}$$

Since \(\sum \nolimits _{t=1}^{T}E_{t}(\varvec{w}_{t})\) denotes the number of prediction mistakes made on the entire sequence, it holds that

$$\begin{aligned} \min \left( CG, \frac{G^2}{r^2}\right) \sum \nolimits _{t=1}^{T}E_{t} (\varvec{w}_{t}) \le \sum \nolimits _{t=1}^{T} \tau _{t}R_{t}^{\textrm{SD}}(\varvec{w}_{t}). \end{aligned}$$
(16)

Combining Eq. (15) with Eq. (16), we conclude that

$$\begin{aligned} \sum \nolimits _{t=1}^{T}E_{t}(\varvec{w}_{t}) \le \max \left( \frac{1}{CG}, \frac{r^2}{G^2}\right) \left( \Vert \varvec{v}\Vert ^{2}_{2} + 2C\sum \nolimits _{t=1}^{T}R_{t}^{\textrm{SD}}(\varvec{v})\right) , \end{aligned}$$

which completes the proof of Theorem 2.
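
The pivotal step above, \(\min (CG, G^2/r^2)\le \tau _{t}R_{t}^{\textrm{SD}}(\varvec{w}_{t})\) on mistake rounds, follows because on such rounds \(\tau _{t}R_{t}^{\textrm{SD}}(\varvec{w}_{t})=\min \bigl (C R_{t}^{\textrm{SD}}(\varvec{w}_{t}),\, (R_{t}^{\textrm{SD}}(\varvec{w}_{t}))^2/\Vert \nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\Vert _2^2\bigr )\). A small numerical sanity check (our own script with arbitrary constants, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
C, G, r = 0.5, 1.0, 3.0            # arbitrary constants for the check
lower = min(C * G, G**2 / r**2)
for _ in range(100_000):
    R = G + rng.exponential()      # a mistake round: R_t^SD(w_t) >= G
    g2 = rng.uniform(1e-3, r**2)   # squared gradient norm, at most r^2
    tau = min(C, R / g2)           # linearly scaled (capped) step size
    assert lower <= tau * R + 1e-12
print("min(CG, G^2/r^2) <= tau_t * R_t^SD(w_t) held on all sampled rounds")
```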

Appendix 3: Proof of Theorem 3

Recalling Eq. (14), we have

$$\begin{aligned} \Vert \varvec{v}\Vert ^{2}_{2}&\ge \sum \nolimits _{t=1}^{T}\tau _{t}\left( 2R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2} -2R_{t}^{\textrm{SD}} (\varvec{v})\right) . \end{aligned}$$

Defining \(\alpha =1/\sqrt{2C}\), we subtract the non-negative term \((\alpha \tau _{t} - R_{t}^{\textrm{SD}}(\varvec{v})/\alpha )^{2}\) from each summand on the right-hand side of the above inequality to obtain

$$\begin{aligned} \Vert \varvec{v}\Vert ^{2}_{2}&\ge \sum \nolimits _{t=1}^{T}\left( 2\tau _{t}R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}^{2}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2} -2\tau _{t}R_{t}^{\textrm{SD}} (\varvec{v}) - (\alpha \tau _{t} - R_{t}^{\textrm{SD}} (\varvec{v})/\alpha )^{2}\right) \\&= \sum \nolimits _{t=1}^{T}\left( 2\tau _{t}R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}^{2}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}-2\tau _{t}R_{t}^{\textrm{SD}} (\varvec{v})- (\alpha \tau _{t})^{2}\right. \\&\qquad \qquad \qquad \left. - \left( \frac{R_{t}^{\textrm{SD}} (\varvec{v})}{\alpha }\right) ^{2} + 2\tau _{t}R_{t}^{\textrm{SD}}(\varvec{v})\right) \\&= \sum \nolimits _{t=1}^{T}\left( 2\tau _{t}R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}^{2}\left( \Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}+ \frac{1}{2C}\right) - 2C (R_{t}^{\textrm{SD}}(\varvec{v}))^{2}\right) . \end{aligned}$$

Now we use the definitions \(\tau _{t}=\max \left( 0, \frac{2C(A+\varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t}))}{2C\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2} +1}\right)\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) = \varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t}) +A\). When \(R_{t}^{\textrm{SD}} (\varvec{w}_{t})\le 0\), the classifier is not updated, so we consider the case where \(R_{t}^{\textrm{SD}}(\varvec{w}_{t})\ge 0\). Together with the assumption \(\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}\le r^2\), we obtain

$$\begin{aligned} \Vert \varvec{v}\Vert ^{2}_{2} \ge \sum \nolimits _{t=1}^{T} \left( \frac{(R_{t}^{\textrm{SD}}(\varvec{w}_{t}))^2}{r^2 +\frac{1}{2C}} -2C (R_{t}^{\textrm{SD}}(\varvec{v}))^{2}\right) . \end{aligned}$$

Rearranging the terms above, we obtain

$$\begin{aligned} \sum \nolimits _{t=1}^{T}(R_{t}^{\textrm{SD}}(\varvec{w}_{t}))^2 \le \left( r^2 +\frac{1}{2C}\right) \left( \Vert \varvec{v}\Vert ^{2}_{2} + 2C \sum \nolimits _{t=1}^{T}(R_{t}^{\textrm{SD}}(\varvec{v}))^{2}\right) , \end{aligned}$$

which completes the proof of Theorem 3.
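
The completing-the-square step in the first display above can be verified symbolically. A short check (ours, not part of the proof), writing \(L_w=R_{t}^{\textrm{SD}}(\varvec{w}_{t})\), \(L_v=R_{t}^{\textrm{SD}}(\varvec{v})\), and \(g^2=\Vert \nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\Vert _2^2\):

```python
import sympy as sp

tau, Lw, Lv, g2, C = sp.symbols('tau L_w L_v g2 C', positive=True)
alpha = 1 / sp.sqrt(2 * C)
# Summand after subtracting the non-negative completed square ...
lhs = 2*tau*Lw - tau**2*g2 - 2*tau*Lv - (alpha*tau - Lv/alpha)**2
# ... equals the final form used in the proof.
rhs = 2*tau*Lw - tau**2*(g2 + 1/(2*C)) - 2*C*Lv**2
print(sp.simplify(lhs - rhs))  # prints 0
```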

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shu, S., Wang, H., Wang, Z. et al. Online binary classification from similar and dissimilar data. Mach Learn (2023). https://doi.org/10.1007/s10994-023-06434-6
