
Transporting Labels via Hierarchical Optimal Transport for Semi-Supervised Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)

Abstract

Semi-Supervised Learning (SSL) based on Convolutional Neural Networks (CNNs) has recently proven to be a powerful tool for standard tasks such as image classification when sufficient labeled data is not available during training. In this work, we consider the general setting of the SSL problem for image classification, where the labeled and unlabeled data come from the same underlying distribution. We propose a new SSL method that adopts a hierarchical Optimal Transport (OT) technique to find a mapping from empirical unlabeled measures to corresponding labeled measures that minimizes the transportation cost in the label space. Based on this mapping, pseudo-labels for the unlabeled data are inferred and then used, along with the labeled data, to train the CNN. We evaluate our method against state-of-the-art SSL approaches on standard datasets to demonstrate its superior performance.
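
To make the pipeline in the abstract concrete, the sketch below illustrates the pseudo-labeling step: build an empirical measure over the labeled classes and another over clusters of unlabeled features, solve an entropy-regularized OT problem between them, and read pseudo-labels off the transport plan. This is a simplified, single-level stand-in for the paper's hierarchical scheme, not the authors' implementation; it assumes the POT library (ot) and scikit-learn, and all features, dimensions, and hyperparameters are toy placeholders.

```python
# Minimal sketch of OT-based pseudo-labeling (illustrative, not the paper's exact algorithm).
import numpy as np
import ot  # Python Optimal Transport (pip install pot)
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy features; in the paper these would be CNN embeddings.
n_classes, dim = 10, 64
X_lab = rng.normal(size=(200, dim))            # labeled features
y_lab = rng.integers(0, n_classes, size=200)   # their labels
X_unl = rng.normal(size=(1000, dim))           # unlabeled features

# Labeled measure: one centroid per class, weighted by label frequency.
class_centroids = np.stack([X_lab[y_lab == c].mean(axis=0)
                            for c in range(n_classes)])
b = np.bincount(y_lab, minlength=n_classes).astype(float)
b /= b.sum()

# Unlabeled measure: k-means cluster centroids, weighted by cluster size.
k = 20
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_unl)
a = np.bincount(km.labels_, minlength=k).astype(float)
a /= a.sum()

# Ground cost between the two supports (squared Euclidean by default),
# then an entropy-regularized transport plan via Sinkhorn iterations.
M = ot.dist(km.cluster_centers_, class_centroids)
G = ot.sinkhorn(a, b, M / M.max(), reg=0.05)   # plan of shape (k, n_classes)

# Each unlabeled point inherits the class its cluster sends the most mass to.
cluster_to_class = G.argmax(axis=1)
pseudo_labels = cluster_to_class[km.labels_]
print(pseudo_labels[:10])
```

In the full method, these pseudo-labels would be fed back into CNN training together with the labeled data; here random features merely exercise the transport step.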

Keywords

Semi-Supervised Learning · Hierarchical Optimal Transport


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. West Virginia University, Morgantown, USA
