Unsupervised Discrete Representation Learning

  • Weihua Hu
  • Takeru Miyato
  • Seiya Tokui
  • Eiichi Matsumoto
  • Masashi SugiyamaEmail author
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11700)


Learning discrete representations of data is a central machine learning task because of the compactness of the representations and ease of interpretation. The task includes clustering and hash learning as special cases. Deep neural networks are promising to be used because they can model the non-linearity of data and scale to large datasets. However, their model complexity is huge, and therefore, we need to carefully regularize the networks in order to learn useful and interpretable representations that exhibit intended invariance for applications of interest. To this end, we propose a method called Information Maximizing Self-Augmented Training (IMSAT). In IMSAT, we use data augmentation to impose the invariance on discrete representations. More specifically, we encourage the predicted representations of augmented data points to be close to those of the original data points in an end-to-end fashion. At the same time, we maximize the information-theoretic dependency between data and their predicted discrete representations. Our IMSAT is able to discover interpretable representations that exhibit intended invariance. Extensive experiments on benchmark datasets show that IMSAT produces state-of-the-art results for both clustering and unsupervised hash learning.


Discrete representation learning Clustering Hash learning 



MS was supported by KAKENHI 17H01760.

Supplementary material


  1. 1.
    Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: NIPS (2014)Google Scholar
  2. 2.
    Berkhin, P.: A survey of clustering data mining techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds)Grouping Multidimensional Data, pp. 25–71. Springer, Heidelberg (2006).
  3. 3.
    Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Belmont (1999)Google Scholar
  4. 4.
    Bishop, C.M.: Training with noise is equivalent to tikhonov regularization. Neural Comput. 7(1), 108–116 (1995)CrossRefGoogle Scholar
  5. 5.
    Bridle, J.S., Heading, A.J.R., MacKay, D.J.C.: Unsupervised classifiers, mutual information and ‘phantom targets’. In: NIPS, pp. 1096–1101 (1991)Google Scholar
  6. 6.
    Brown, G.: A new perspective for information theoretic feature selection. In: AISTATS (2009)Google Scholar
  7. 7.
    Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. Ann Arbor 1001(48109), 2 (2010)Google Scholar
  8. 8.
    Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)zbMATHGoogle Scholar
  9. 9.
    Dilokthanakul, N., et al.: Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648 (2016)
  10. 10.
    Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NIPS, pp. 766–774 (2014)Google Scholar
  11. 11.
    Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact binary codes learning. In: CVPR (2015)Google Scholar
  12. 12.
    Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: AISTATS (2011)Google Scholar
  13. 13.
    Gomes, R., Krause, A., Perona, P.: Discriminative clustering by regularized information maximization. In: NIPS (2010)Google Scholar
  14. 14.
    Gong, Y., Lazebnik, S., Gordo, A., Perronnin, F.: Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2916–2929 (2013)CrossRefGoogle Scholar
  15. 15.
    Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)zbMATHGoogle Scholar
  16. 16.
    Grandvalet, Y., Bengio, Y., et al.: Semi-supervised learning by entropy minimization. In: NIPS (2004)Google Scholar
  17. 17.
    He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization method for learning binary compact codes. In: CVPR (2013)Google Scholar
  18. 18.
    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: CVPR (2015)Google Scholar
  19. 19.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  20. 20.
    Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
  21. 21.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  22. 22.
    Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: ICCV (2009)Google Scholar
  23. 23.
    Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  24. 24.
    Koch, G.: Siamese neural networks for one-shot image recognition. Ph.D. thesis, University of Toronto (2015)Google Scholar
  25. 25.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)Google Scholar
  26. 26.
    Kuhn, H.W.: The hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955)MathSciNetCrossRefGoogle Scholar
  27. 27.
    Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In: NIPS (2009)Google Scholar
  28. 28.
    Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: CVPR (2015)Google Scholar
  29. 29.
    Lake, B.M., Salakhutdinov, R., Gross, J., Tenenbaum, J.B.: One shot learning of simple visual concepts. In: Cognitive Science (2011)Google Scholar
  30. 30.
    Lang, K.: Newsweeder: learning to filter netnews. In: ICML, pp. 331–339 (1995)Google Scholar
  31. 31.
    LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)CrossRefGoogle Scholar
  32. 32.
    Leen, T.K.: From data distributions to regularization in invariant learning. Neural Comput. 7(5), 974–981 (1995)MathSciNetCrossRefGoogle Scholar
  33. 33.
    Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: Rcv1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)Google Scholar
  34. 34.
    Li, W.J., Wang, S., Kang, W.C.: Feature learning based deep supervised hashing with pairwise labels. In: IJCAI (2015)Google Scholar
  35. 35.
    McGill, W.J.: Multivariate information transmission. Psychometrika 19(2), 97–116 (1954)CrossRefGoogle Scholar
  36. 36.
    Miyato, T., Maeda, S.I., Koyama, M., Nakae, K., Ishii, S.: Distributional smoothing with virtual adversarial training. In: ICLR (2016)Google Scholar
  37. 37.
    Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)Google Scholar
  38. 38.
    Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)Google Scholar
  39. 39.
    Ng, A.Y., Jordan, M.I., Weiss, Y., et al.: On spectral clustering: analysis and an algorithm. In: NIPS (2001)Google Scholar
  40. 40.
    Norouzi, M., Blei, D.M.: Minimal loss hashing for compact binary codes. In: ICML (2011)Google Scholar
  41. 41.
    Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)CrossRefGoogle Scholar
  42. 42.
    van den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: Advances in Neural Information Processing Systems, pp. 6306–6315 (2017)Google Scholar
  43. 43.
    Reed, R., Oh, S., Marks, R.: Regularization using jittered training data. In: IJCNN, vol. 3, pp. 147–152. IEEE (1992)Google Scholar
  44. 44.
    Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: ICML (2011)Google Scholar
  45. 45.
    Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: NIPS (2016)Google Scholar
  46. 46.
    Salakhutdinov, R., Hinton, G.: Semantic hashing. Int. J. Approximate Reasoning 50(7), 969–978 (2009)CrossRefGoogle Scholar
  47. 47.
    Springenberg, J.T.: Unsupervised and semi-supervised learning with categorical generative adversarial networks. In: ICLR (2015)Google Scholar
  48. 48.
    Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop on Machine Learning Systems (LearningSys) (2015)Google Scholar
  49. 49.
    Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)CrossRefGoogle Scholar
  50. 50.
    Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML (2008)Google Scholar
  51. 51.
    Wang, J., Liu, W., Kumar, S., Chang, S.F.: Learning to hash for indexing big data–a survey. Proc. IEEE 104(1), 34–57 (2016)CrossRefGoogle Scholar
  52. 52.
    Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS (2009)Google Scholar
  53. 53.
    Xia, R., Pan, Y., Lai, H., Liu, C., Yan, S.: Supervised hashing for image retrieval via image representation learning. In: AAAI (2014)Google Scholar
  54. 54.
    Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: ICML (2016)Google Scholar
  55. 55.
    Xu, J., et al.: Convolutional neural networks for text hashing. In: IJCAI (2015)Google Scholar
  56. 56.
    Xu, L., Neufeld, J., Larson, B., Schuurmans, D.: Maximum margin clustering. In: NIPS (2004)Google Scholar
  57. 57.
    Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: NIPS (2004)Google Scholar
  58. 58.
    Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process. 24(12), 4766–4779 (2015)MathSciNetCrossRefGoogle Scholar
  59. 59.
    Zheng, Y., Tan, H., Tang, B., Zhou, H., et al.: Variational deep embedding: a generative approach to clustering. arXiv preprint arXiv:1611.05148 (2016)
  60. 60.
    Zhou, Z.H.: A brief introduction to weakly supervised learning. Nat. Sci. Rev. 5(1), 44–53 (2017)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Weihua Hu
    • 1
  • Takeru Miyato
    • 2
  • Seiya Tokui
    • 2
  • Eiichi Matsumoto
    • 2
  • Masashi Sugiyama
    • 3
    • 4
    Email author
  1. 1.Stanford UniversityStanfordUSA
  2. 2.Preferred NetworksTokyoJapan
  3. 3.RIKEN AIPTokyoJapan
  4. 4.The University of TokyoTokyoJapan

Personalised recommendations