Abstract
In recent years, the selective manipulation of data attributes by editing the latent codes of auto-encoders has received considerable scholarly attention. However, the representation an auto-encoder learns for the data cannot be observed visually. Furthermore, attribute values and individual latent dimensions do not follow a linear, monotonic relationship. From a practical point of view, we propose a novel method that uses an encoder–decoder architecture to disentangle data into two visualizable representations encoded as latent spaces. The encoded latent spaces can then be used to manipulate data attributes in a simple and intuitive way. Experiments on an image dataset and a music dataset show that the proposed approach produces fully interpretable latent spaces, which can be used to manipulate a wide range of data attributes and to generate realistic music via analogy.
Acknowledgements
The authors would like to acknowledge the support of the National Natural Science Foundation of China (Grant No. 61471124) and the Key Industrial Guidance Projects of the Fujian Science and Technology Department (Grant No. 2020H0007).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Appendix 1: Network architecture
Image-based models: For the MNIST digits dataset, a stacked convolutional encoder–decoder architecture is used. The encoder consists of four two-dimensional convolutional layers followed by a stack of three linear layers. The decoder mirrors the encoder: a stack of three linear layers followed by four two-dimensional convolutional layers. The network details are shown in Table 3.
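As an illustration of this layout, the following PyTorch sketch builds a comparable stacked convolutional encoder–decoder for 28×28 MNIST inputs. The channel counts, kernel sizes, and latent dimension below are assumptions made for the sketch, not the values given in Table 3.

```python
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Four 2-D conv layers followed by a stack of three linear layers."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 7x7   -> 4x4
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # 4x4   -> 2x2
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 2 * 2, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class ConvDecoder(nn.Module):
    """Mirror of the encoder: three linear layers, then four transposed convs."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64 * 2 * 2), nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 2  -> 4
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1), nn.ReLU(),                     # 4  -> 7
            nn.ConvTranspose2d(32, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 7  -> 14
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 14 -> 28
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 64, 2, 2)
        return self.deconv(h)
```

Note the use of `output_padding` in the transposed convolutions, which resolves the ambiguity of strided downsampling (both 7×7 and 8×8 inputs map to 4×4 under stride 2) so the decoder reproduces the encoder's spatial sizes exactly.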
Music-based models: For the music dataset, the model architecture follows previous work: a hierarchical architecture of recurrent GRU layers is used. Figure 6 shows a schematic of the decoder architecture, and the network details are shown in Table 4.
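A minimal PyTorch sketch of such a hierarchical recurrent decoder is given below, in the spirit of the two-level conductor/decoder design used in prior hierarchical music models. All layer sizes, the bar count, the steps per bar, and the output vocabulary are assumptions for illustration, not the configuration in Table 4.

```python
import torch
import torch.nn as nn


class HierarchicalGRUDecoder(nn.Module):
    """Two-level GRU decoder: a top-level "conductor" GRU unrolls the latent
    code into one embedding per bar, and a bottom-level GRU decodes the note
    steps of each bar from that bar's embedding."""

    def __init__(self, latent_dim=32, hidden=64, n_bars=4, steps_per_bar=16, vocab=130):
        super().__init__()
        self.n_bars, self.steps_per_bar = n_bars, steps_per_bar
        self.conductor = nn.GRU(latent_dim, hidden, batch_first=True)
        self.note_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, z):
        b = z.size(0)
        # Feed the latent code at every conductor step to get bar embeddings.
        cond_in = z.unsqueeze(1).expand(b, self.n_bars, -1).contiguous()
        bar_embs, _ = self.conductor(cond_in)            # (b, n_bars, hidden)
        bars = []
        for i in range(self.n_bars):
            # Repeat the i-th bar embedding across that bar's time steps.
            step_in = bar_embs[:, i:i + 1, :].expand(b, self.steps_per_bar, -1).contiguous()
            notes, _ = self.note_rnn(step_in)            # (b, steps, hidden)
            bars.append(self.out(notes))                 # (b, steps, vocab)
        return torch.cat(bars, dim=1)                    # (b, n_bars*steps, vocab)
```

The hierarchy is the key design choice: the conductor captures long-range, bar-level structure, while the bottom-level GRU only has to model short within-bar note sequences.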
Appendix 2: Additional results
Additional generated examples of MNIST handwritten digits are shown in Fig. 21.
Cite this article
Huang, R., Zheng, Q. & Zhou, H. Visualization-based disentanglement of latent space. Neural Comput & Applic 33, 16213–16228 (2021). https://doi.org/10.1007/s00521-021-06223-z