
An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones

  • Conference paper
Explainable Artificial Intelligence (xAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1903)



Variational Autoencoders (VAEs) are among the most significant deep generative models for the creation of synthetic samples. In the field of audio synthesis, VAEs have been widely used to generate natural and expressive sounds, such as music or speech. However, VAEs are often treated as black boxes, and the attributes that contribute to the synthesis of a sound remain poorly understood. Research into how input data shapes the latent space, and how that latent space in turn produces synthetic data, is still insufficient. In this manuscript, we investigate the interpretability of the latent space of VAEs and the impact of each attribute of this space on the generation of synthetic instrumental notes. The contribution of this research is to offer, to both the XAI and sound communities, an approach for interpreting how the latent space generates new samples, based on sensitivity analysis, feature ablation, and descriptive statistics.
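The abstract describes probing latent dimensions via sensitivity and feature ablation analyses. As a rough illustration of that general idea only (not the paper's actual model or implementation), the sketch below substitutes a toy linear decoder for a trained convolutional VAE decoder; the latent size, output shape, perturbation step, and all names are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for a trained VAE decoder: maps a latent vector z (dim 4)
# to an 8x8 "spectrogram". A real study would use the trained decoder.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))

def decode(z):
    return np.tanh(z @ W).reshape(8, 8)

z = rng.normal(size=4)
baseline = decode(z)

# Sensitivity analysis: nudge each latent dimension by a small epsilon
# and record the mean absolute change in the decoded output.
eps = 0.1
sensitivity = []
for i in range(len(z)):
    z_pert = z.copy()
    z_pert[i] += eps
    sensitivity.append(np.mean(np.abs(decode(z_pert) - baseline)))

# Feature ablation: zero out each latent dimension in turn and record
# how much the decoded output deviates from the baseline.
ablation = []
for i in range(len(z)):
    z_abl = z.copy()
    z_abl[i] = 0.0
    ablation.append(np.mean(np.abs(decode(z_abl) - baseline)))

print("sensitivity per latent dim:", np.round(sensitivity, 4))
print("ablation impact per latent dim:", np.round(ablation, 4))
```

Dimensions with large sensitivity or ablation scores are candidates for interpretation; descriptive statistics over many latent samples would then summarize each dimension's typical influence.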





This work was funded by Science Foundation Ireland and its Centre for Research Training in Machine Learning (18/CRT/6183).



Corresponding author

Correspondence to Anastasia Natsiou.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Natsiou, A., O’Leary, S., Longo, L. (2023). An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones. In: Longo, L. (ed.) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1903. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44069-4

  • Online ISBN: 978-3-031-44070-0

  • eBook Packages: Computer Science, Computer Science (R0)
