Abstract
Similar-dissimilar (SD) classification aims to train a binary classifier from only similar and dissimilar data pairs, which indicate whether two instances belong to the same class (similar) or not (dissimilar). Although effective learning methods have been proposed for SD classification, they cannot deal with online learning scenarios with sequential data, which are frequently encountered in real-world applications. In this paper, we provide the first attempt to investigate the online SD classification problem. Specifically, we first adapt the unbiased risk estimator of SD classification to online learning scenarios with a conservative regularization term, which serves as a naive method to solve the online SD classification problem. Then, by further introducing a margin criterion that decides whether to update the classifier with the received cost, we propose two improvements (one with linearly scaled cost and the other with quadratically scaled cost) that result in two online SD classification methods. Theoretically, we derive regret, mistake, and relative loss bounds for our proposed methods, which guarantee the performance on sequential data. Extensive experiments on various datasets validate the effectiveness of our proposed methods.
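The margin criterion with linearly versus quadratically scaled costs echoes the classical passive-aggressive updates (Crammer et al., 2006). As a minimal illustrative sketch only — using a plain hinge loss on single labeled examples rather than the paper's unbiased SD risk estimator over similar/dissimilar pairs — the two cost scalings look like:

```python
import numpy as np

def pa_update(w, x, y, C=1.0, variant="linear"):
    """One margin-based online update (PA-I / PA-II style).

    Illustrative sketch only: the paper's methods operate on an unbiased
    SD risk over similar/dissimilar pairs; here a plain hinge loss on a
    single (x, y) example shows the margin criterion and cost scalings.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # margin criterion
    if loss == 0.0:
        return w  # passive step: margin satisfied, no update
    sq = np.dot(x, x)
    if variant == "linear":       # linearly scaled cost (PA-I style)
        tau = min(C, loss / sq)
    else:                         # quadratically scaled cost (PA-II style)
        tau = loss / (sq + 1.0 / (2.0 * C))
    return w + tau * y * x

# Toy sequential run on a linearly separable stream.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x = rng.normal(size=2)
    y = 1.0 if x[0] + x[1] > 0 else -1.0
    w = pa_update(w, x, y, variant="quadratic")
```

The passive branch is what the margin criterion buys: rounds whose loss is already zero leave the classifier untouched, so updates concentrate on informative rounds.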
Data availability
Not applicable.
Code availability
Not applicable.
References
Bao, H., Niu, G., & Sugiyama, M. (2018). Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461.
Bao, H., Shimada, T., Xu, L., Sato, I., & Sugiyama, M. (2020). Similarity-based classification: Connecting similarity learning to binary classification. arXiv preprint arXiv:2006.06207.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/index.php.
Cao, Y., Feng, L., Xu, Y., An, B., Niu, G., & Sugiyama, M. (2021). Learning from similarity-confidence data. In ICML, pp. 1272–1282.
Cao, Y., Wan, Z., Ren, D., Yan, Z., & Zuo, W. (2022). Incorporating semi-supervised and positive-unlabeled learning for boosting full reference image quality assessment. In CVPR, pp. 5851–5861.
Chen, R., Tang, Y., Zhang, W., & Feng, W. (2022). Deep multi-view semi-supervised clustering with sample pairwise constraints. Neurocomputing, 500, 832–845.
Crammer, K., Kulesza, A., & Dredze, M. (2009). Adaptive regularization of weight vectors. In NeurIPS, pp. 414–422.
Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3(Jan), 951–991.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar), 551–585.
Dekel, O., Gilad-Bachrach, R., Shamir, O., & Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1).
Er, M. J., Venkatesan, R., & Wang, N. (2016). An online universal classifier for binary, multi-class and multi-label classification. In ICSMC, pp. 003701–003706. IEEE.
Feng, L., Lv, J.-Q., Han, B., Xu, M., Niu, G., Geng, X., An, B., & Sugiyama, M. (2020). Provably consistent partial-label learning. In NeurIPS.
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
Hoi, S. C., Sahoo, D., Lu, J., & Zhao, P. (2021). Online learning: A comprehensive survey. Neurocomputing, 459, 249–289.
Ishida, T., Niu, G., & Sugiyama, M. (2018). Binary classification for positive-confidence data. In NeurIPS, pp. 5917–5928.
Jian, L., Gao, F., Ren, P., Song, Y., & Luo, S. (2018). A noise-resilient online learning algorithm for scene classification. Remote Sensing, 10(11), 1836.
Kaneko, T., Sato, I., & Sugiyama, M. (2019). Online multiclass classification based on prediction margin for partial feedback. arXiv preprint arXiv:1902.01056.
Kiryo, R., Niu, G., Du Plessis, M. C., & Sugiyama, M. (2017). Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685.
Kivinen, J., Smola, A. J., & Williamson, R. C. (2004). Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2165–2176.
Koçak, M. A., Shasha, D. E., & Erkip, E. (2016). Conjugate conformal prediction for online binary classification. In UAI. Citeseer.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Li, Z., & Liu, J. (2009). Constrained clustering by spectral kernel learning. In ICCV, pp. 421–427.
Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In KDD, pp. 661–670.
Liu, D., Zhang, P., & Zheng, Q. (2015). An efficient online active learning algorithm for binary classification. Pattern Recognition Letters, 68, 22–26.
Lu, N., Niu, G., Menon, A. K., & Sugiyama, M. (2019). On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR.
Lu, N., Zhang, T., Niu, G., & Sugiyama, M. (2020). Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, pp. 1115–1125.
Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5), 823–870.
MacQueen, J., et al. (1967). Some methods for classification and analysis of multivariate observations. In Berkeley symposium on mathematical statistics and probability, pp. 281–297.
Maheshwara, S. S., & Manwani, N. (2023). Rolnip: Robust learning using noisy pairwise comparisons. In ACML, pp. 706–721.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. In NeurIPS, pp. 1196–1204.
Du Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394.
Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Shimada, T., Bao, H., Sato, I., & Sugiyama, M. (2020). Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Computation.
Shinoda, K., Kaji, H., & Sugiyama, M. (2020). Binary classification from positive data with skewed confidence. In IJCAI, pp. 3328–3334.
Tao, Q., Scott, S., Vinodchandran, N., & Osugi, T. T. (2004). Svm-based generalized multiple-instance learning via approximate box counting. In ICML, p. 101.
Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. In ICML, pp. 577–584.
Wang, H., Qiang, Y., Chen, C., Liu, W., Hu, T., Li, Z., & Chen, G. (2020). Online partial label learning. In ECML PKDD.
Wu, D.-D., Wang, D.-B., & Zhang, M.-L. (2022). Revisiting consistency regularization for deep partial label learning. In ICML, pp. 24212–24225.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193.
Zhang, C., Gong, C., Liu, T., Lu, X., Wang, W., & Yang, J. (2020). Online positive and unlabeled learning. In IJCAI, pp. 2248–2254.
Funding
This research is supported by Natural Science Foundation of China (No. 62106028), Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and CAAI-Huawei MindSpore Open Fund.
Author information
Authors and Affiliations
Contributions
Conceptualization: S-S; methodology: S-S; Theoretical analysis: F-L, H-W; Writing-original draft preparation: S-S, F-L; Writing-review and editing: S-S, Z-W; Funding acquisition: B-H, T-X, B-A, F-L.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Vu Nguyen and Dana Yogatama.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Proof of Theorem 1
As we have shown, our proposed OSD-OGD algorithm actually employs the following update method:
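Under the notation of the proof (with \(\gamma\) the regularization parameter and \({\mathcal {L}}_{s}\) the per-round loss), the update described here can be written in follow-the-regularized-leader form as:

```latex
\[
\varvec{w}_{t+1}
= \mathop{\mathrm{argmin}}_{\varvec{w}\in {\mathcal {W}}}
  \sum_{s=1}^{t}{\mathcal {L}}_{s}(\varvec{w})
  + \frac{1}{2\gamma }\left\| \varvec{w}\right\| _2^2 ,
\]
```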
which is exactly the follow-the-regularized-leader procedure with Euclidean regularization (Shalev-Shwartz, 2011). As can be easily verified, the Euclidean regularization \(\frac{1}{2\gamma } \left\| \varvec{w}\right\| _2^2\) is \(\frac{1}{\gamma }\)-strongly-convex with respect to \(\left\| \cdot \right\| _2\). Recall the assumptions that \({\mathcal {L}}_{t}\) is \(\rho _t\)-Lipschitz with respect to \(\left\| \cdot \right\| _2\) and \(\frac{1}{T}\sum \nolimits _{t=1}^T\rho _t^2\le \rho ^2\). Then, by Theorem 2.11 of Shalev-Shwartz (2011), we have that for \(\varvec{w}_{\star }\in {\mathcal {W}}\),
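In this setting, Theorem 2.11 of Shalev-Shwartz (2011) yields a bound of the form (consistent with the remark that follows about dropping the nonnegative minimum term):

```latex
\[
\sum_{t=1}^{T}{\mathcal {L}}_{t}(\varvec{w}_{t})
 - \sum_{t=1}^{T}{\mathcal {L}}_{t}(\varvec{w}_{\star })
\le \frac{1}{2\gamma }\left\| \varvec{w}_{\star }\right\| _2^2
 - \frac{1}{2\gamma }\min_{\varvec{v}\in {\mathcal {W}}}\left\| \varvec{v}\right\| _2^2
 + \gamma \sum_{t=1}^{T}\rho _t^2
\le \frac{1}{2\gamma }\left\| \varvec{w}_{\star }\right\| _2^2
 + \gamma \sum_{t=1}^{T}\rho _t^2 ,
\]
```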
because \(\min _{\varvec{v}\in {\mathcal {W}}}\Vert \varvec{v}\Vert _{2}^{2}\ge 0\) always holds. In particular, if every hypothesis \(\varvec{w}\in {\mathcal {W}}\) satisfies \(\left\| \varvec{w}\right\| _2\le B\) and we set \(\gamma =B/(\rho \sqrt{2T})\), we have
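Substituting \(\gamma =B/(\rho \sqrt{2T})\) and bounding \(\sum _{t=1}^T\rho _t^2\le T\rho ^2\) gives the final \(O(\sqrt{T})\) regret:

```latex
\[
\sum_{t=1}^{T}{\mathcal {L}}_{t}(\varvec{w}_{t})
 - \sum_{t=1}^{T}{\mathcal {L}}_{t}(\varvec{w}_{\star })
\le \frac{B^{2}}{2\gamma } + \gamma T\rho ^{2}
= \frac{B\rho \sqrt{2T}}{2} + \frac{B\rho \sqrt{2T}}{2}
= B\rho \sqrt{2T},
\]
```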
which completes the proof of Theorem 1.
Appendix 2: Proof of Theorem 2
Following Crammer et al. (2006), for some \(\varvec{v}\in {\mathcal {W}}\), we define
and consider upper and lower bounds of \(\sum \nolimits _{t=1}^{T}\Delta _{t}\). By initializing \(\varvec{w}_{1}\) to the zero vector and using a telescoping sum, we can obtain
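Concretely, taking \(\Delta _{t}=\Vert \varvec{w}_{t}-\varvec{v}\Vert _{2}^{2}-\Vert \varvec{w}_{t+1}-\varvec{v}\Vert _{2}^{2}\) as in Crammer et al. (2006), the telescoping sum collapses to:

```latex
\[
\sum_{t=1}^{T}\Delta _{t}
= \left\| \varvec{w}_{1}-\varvec{v}\right\| _{2}^{2}
 - \left\| \varvec{w}_{T+1}-\varvec{v}\right\| _{2}^{2}
\le \left\| \varvec{v}\right\| _{2}^{2},
\]
```

where the inequality uses \(\varvec{w}_{1}=\varvec{0}\) and the nonnegativity of the dropped term.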
Since \(\varvec{w}_{t+1} =\varvec{w}_{t} - \tau _{t}\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\), we can obtain
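Expanding \(\Delta _{t}\) with this update rule (again taking \(\Delta _{t}\) as the distance-difference term of Crammer et al. (2006)) gives:

```latex
\[
\Delta _{t}
= \left\| \varvec{w}_{t}-\varvec{v}\right\| _{2}^{2}
 - \left\| \varvec{w}_{t}-\tau _{t}\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})-\varvec{v}\right\| _{2}^{2}
= 2\tau _{t}\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})^{\top }(\varvec{w}_{t}-\varvec{v})
 - \tau _{t}^{2}\left\| \nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\right\| _{2}^{2}.
\]
```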
Since \(R_{t}^{\textrm{SD}}(\varvec{w})\) is \(\lambda\)-convex, we have
Combining the above inequalities, we have
For OSD-LSPA, if a prediction mistake occurs, then \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) \ge G\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t})-\tau _{t}\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}\ge 0\). Therefore, we can obtain
Using our assumption that \(\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}\le r^2\) and the definitions \(\tau _{t} = \min (C, \max (0,\frac{A+ \varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})}{\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2}}))\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) = A + \varvec{w}_{t}^{\top }\nabla R_{t}^{\textrm{SD}}(\varvec{w}_{t})\), we conclude that if a prediction mistake occurs, then it holds that
Since \(\sum \nolimits _{t=1}^{T}E_{t}(\varvec{w}_{t})\) denotes the number of prediction mistakes made on the entire sequence, it holds that
Combining Eq. (15) with Eq. (16), we conclude that
which completes the proof of Theorem 2.
Appendix 3: Proof of Theorem 3
Recalling Eq. (14), we have
Defining \(\alpha =1/\sqrt{2C}\), we subtract the non-negative term \((\alpha \tau _{t} - R_{t}^{\textrm{SD}}(\varvec{v})/\alpha )^{2}\) from each summand on the right-hand side of the above inequality, to obtain
Using the definitions \(\tau _{t}=\max (0, \frac{2C(A+\varvec{w}_{t} ^{\top }\nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t}))}{2C\Vert \nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t})\Vert ^{2}_{2} +1})\) and \(R_{t}^{\textrm{SD}} (\varvec{w}_{t}) = \varvec{w}_{t} ^{\top }\nabla R_{t}^{\textrm{SD}} (\varvec{w}_{t}) +A\), it is clear that when \(R_{t}^{\textrm{SD}} (\varvec{w}_{t})\le 0\), the classifier is not updated, so we consider the case that \(R_{t}^{\textrm{SD}}(\varvec{w}_{t})> 0\). Then we obtain
Rearranging terms above, we can obtain
which completes the proof of Theorem 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shu, S., Wang, H., Wang, Z. et al. Online binary classification from similar and dissimilar data. Mach Learn (2023). https://doi.org/10.1007/s10994-023-06434-6