
An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones

  • Conference paper in: Explainable Artificial Intelligence (xAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1903)


Abstract

Variational Autoencoders (VAEs) constitute one of the most significant deep generative models for the creation of synthetic samples. In the field of audio synthesis, VAEs have been widely used for the generation of natural and expressive sounds, such as music or speech. However, VAEs are often considered black boxes, and it remains unclear which attributes of the latent space contribute to the synthesis of a sound. Existing research on how input data shape the latent space, and on how this latent space generates synthetic data, is still insufficient. In this manuscript, we investigate the interpretability of the latent space of VAEs and the impact of each attribute of this space on the generation of synthetic instrumental notes. The contribution of this research is to offer, for both the XAI and the sound-synthesis communities, an approach for interpreting how the latent space generates new samples, based on sensitivity and feature ablation analyses and descriptive statistics.
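To make the kind of analysis the abstract describes concrete, the sketch below illustrates per-dimension sensitivity and feature-ablation scoring of a latent space, summarised with descriptive statistics. It is a minimal sketch, not the authors' implementation: the decoder here is a placeholder linear map, and the names, sizes, and scoring choices (`decode`, `LATENT_DIM`, mean absolute change for sensitivity, mean squared change for ablation) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

LATENT_DIM = 16   # hypothetical latent size; the paper's actual dimensionality may differ
SPEC_BINS = 128   # stand-in for the flattened spectrogram output size

# Stand-in decoder: a fixed random linear map replacing the trained
# convolutional VAE decoder (spectrogram = decode(z)).
W = rng.normal(size=(SPEC_BINS, LATENT_DIM))

def decode(z):
    """Placeholder for the trained decoder mapping a latent vector to a spectrogram."""
    return W @ z

def sensitivity(z, dim, eps=0.1):
    """Normalised change in the decoded output when latent attribute `dim` is nudged by eps."""
    z_pert = z.copy()
    z_pert[dim] += eps
    return float(np.mean(np.abs(decode(z_pert) - decode(z))) / eps)

def ablation(z, dim):
    """Mean squared change in the decoded output when latent attribute `dim` is zeroed out."""
    z_abl = z.copy()
    z_abl[dim] = 0.0
    return float(np.mean((decode(z_abl) - decode(z)) ** 2))

# Score every latent attribute over a batch of latent codes, then summarise
# each attribute's influence with descriptive statistics across the batch.
zs = rng.normal(size=(32, LATENT_DIM))  # stand-in for encoder outputs on real notes
sens = np.array([[sensitivity(z, d) for d in range(LATENT_DIM)] for z in zs])
abl = np.array([[ablation(z, d) for d in range(LATENT_DIM)] for z in zs])
print("per-attribute sensitivity, mean:", sens.mean(axis=0), "std:", sens.std(axis=0))
print("per-attribute ablation impact, mean:", abl.mean(axis=0), "std:", abl.std(axis=0))
```

In the paper's setting, `decode` would be the trained convolutional VAE decoder and `zs` the latent codes of encoded instrument notes; the per-attribute statistics then indicate which latent dimensions most influence the synthesized tone.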



Notes

  1. https://magenta.tensorflow.org/datasets/nsynth.

  2. https://www.tensorflow.org/.


Acknowledgement

This work was funded by Science Foundation Ireland and its Centre for Research Training in Machine Learning (18/CRT/6183).

Author information


Corresponding author

Correspondence to Anastasia Natsiou.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Natsiou, A., O’Leary, S., Longo, L. (2023). An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1903. Springer, Cham. https://doi.org/10.1007/978-3-031-44070-0_24


  • DOI: https://doi.org/10.1007/978-3-031-44070-0_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44069-4

  • Online ISBN: 978-3-031-44070-0

  • eBook Packages: Computer Science, Computer Science (R0)
