Visualization-based disentanglement of latent space

Original Article, published in Neural Computing and Applications

Abstract

In recent years, the selective manipulation of data attributes by editing latent codes with auto-encoders has received considerable scholarly attention. However, the representation of the data encoded by an auto-encoder cannot be observed visually. Furthermore, attribute values and the corresponding latent-code dimensions do not follow a linear, monotonic relationship. From a practical point of view, we propose a novel method that uses an encoder–decoder architecture to disentangle data into two visualizable representations that are encoded as latent spaces. Consequently, the encoded latent spaces can be used to manipulate data attributes in a simple and intuitive way. Experiments on an image dataset and a music dataset show that the proposed approach produces fully interpretable latent spaces, which can be used to manipulate a wide range of data attributes and to generate realistic music via analogy.
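The following minimal PyTorch sketch illustrates the general kind of latent-code editing described above: an input is encoded, one latent dimension is shifted, and the edited code is decoded. The module definitions, layer sizes, and the index of the attribute dimension are illustrative assumptions, not the paper's actual implementation.

import torch

# Hypothetical encoder/decoder modules standing in for the paper's architecture;
# names, shapes, and the "attribute" dimension index are illustrative only.
class Encoder(torch.nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(torch.nn.Module):
    def __init__(self, latent_dim=16, out_dim=784):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, out_dim), torch.nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

def manipulate_attribute(x, encoder, decoder, attr_dim=0, delta=1.0):
    """Encode x, shift one latent dimension, and decode the edited code."""
    with torch.no_grad():
        z = encoder(x)
        z[:, attr_dim] += delta  # move along the dimension tied to the attribute
        return decoder(z)

# Usage:
# x = torch.rand(1, 784)
# x_edited = manipulate_attribute(x, Encoder(), Decoder(), attr_dim=2, delta=0.5)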


Notes

  1. https://github.com/Runze-huang/VD-ED.

  2. https://github.com/Runze-huang/VD-ED/music/sample.


Acknowledgements

The authors would like to acknowledge the support of the National Natural Science Foundation of China (Grant No. 61471124) and the Key Industrial Guidance Projects of the Fujian Science and Technology Department (Grant No. 2020H0007).

Author information

Corresponding author

Correspondence to Qianying Zheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Network architecture

Image-based models: For the MNIST digits dataset, a stacked convolutional encoder–decoder architecture is used. The encoder consists of four two-dimensional convolutional layers followed by a stack of three linear layers. The decoder mirrors the encoder: a stack of three linear layers followed by four two-dimensional convolutional layers. The network details are shown in Table 3.
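The exact layer specifications are given in Table 3 and are not reproduced here; the sketch below, assuming 28x28 MNIST inputs and arbitrary channel widths and latent size, only illustrates the overall layout (four convolutions plus three linear layers, mirrored in the decoder).

import torch.nn as nn

# Minimal sketch of a stacked convolutional encoder-decoder; all widths,
# kernel sizes, and the latent size are assumptions standing in for Table 3.
class ConvEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.conv = nn.Sequential(                                   # four 2-D conv layers
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),     # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),    # 14x14 -> 7x7
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),   # 7x7   -> 4x4
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),  # 4x4   -> 2x2
        )
        self.fc = nn.Sequential(                                     # stack of three linear layers
            nn.Linear(256 * 2 * 2, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class ConvDecoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.fc = nn.Sequential(                                     # three linear layers first
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 256 * 2 * 2), nn.ReLU(),
        )
        self.deconv = nn.Sequential(                                 # four transposed convolutions
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),  # 2x2 -> 4x4
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1), nn.ReLU(),                     # 4x4 -> 7x7
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),    # 7x7 -> 14x14
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),  # 14x14 -> 28x28
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 256, 2, 2))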

Music-based models: For the music dataset, the model architecture follows previous work and uses a hierarchical recurrent architecture built from GRUs. Figure 6 shows a schematic of the decoder architecture, and the network details are shown in Table 4.
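The exact configuration lives in Table 4 and Fig. 6; the sketch below shows one plausible two-level GRU decoder of this kind, in which a high-level GRU emits one embedding per bar and a low-level GRU rolls out the note events within each bar. All sizes, sequence lengths, and the bar/step split are assumptions rather than the paper's settings.

import torch
import torch.nn as nn

# Rough sketch of a two-level hierarchical GRU decoder; hyperparameters are assumed.
class HierarchicalGRUDecoder(nn.Module):
    def __init__(self, latent_dim=16, hidden=128, vocab=130, n_bars=4, steps_per_bar=16):
        super().__init__()
        self.n_bars, self.steps_per_bar = n_bars, steps_per_bar
        self.to_hidden = nn.Linear(latent_dim, hidden)
        self.bar_rnn = nn.GRU(hidden, hidden, batch_first=True)   # high level: one step per bar
        self.note_rnn = nn.GRU(hidden, hidden, batch_first=True)  # low level: steps within a bar
        self.out = nn.Linear(hidden, vocab)

    def forward(self, z):
        h0 = torch.tanh(self.to_hidden(z)).unsqueeze(0)             # initial state from the latent code
        bar_inputs = h0.transpose(0, 1).repeat(1, self.n_bars, 1)   # one input per bar
        bar_embeds, _ = self.bar_rnn(bar_inputs, h0)                # (batch, n_bars, hidden)
        notes = []
        for b in range(self.n_bars):
            # each bar embedding conditions the note-level GRU that rolls out that bar
            step_in = bar_embeds[:, b:b + 1, :].repeat(1, self.steps_per_bar, 1)
            bar_out, _ = self.note_rnn(step_in)
            notes.append(self.out(bar_out))                         # per-step logits over note events
        return torch.cat(notes, dim=1)                              # (batch, n_bars*steps_per_bar, vocab)

# Usage: logits = HierarchicalGRUDecoder()(torch.randn(2, 16))  # shape (2, 64, 130)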

Appendix 2: Additional results

Some additional generated examples of the MNIST handwritten digits are shown in Fig. 21.

About this article

Cite this article

Huang, R., Zheng, Q. & Zhou, H. Visualization-based disentanglement of latent space. Neural Comput & Applic 33, 16213–16228 (2021). https://doi.org/10.1007/s00521-021-06223-z
