Abstract
Variational Autoencoders (VAEs) are among the most significant deep generative models for the creation of synthetic samples. In audio synthesis, VAEs have been widely used to generate natural and expressive sounds, such as music or speech. However, VAEs are often treated as black boxes, and the attributes that contribute to the synthesis of a sound remain poorly understood. Existing research on how input data shape the latent space, and on how that latent space in turn produces synthetic data, is still insufficient. In this manuscript, we investigate the interpretability of the latent space of a VAE and the impact of each attribute of this space on the generation of synthetic instrumental notes. The contribution of this research is to offer both the XAI and sound communities an approach for interpreting how the latent space generates new samples, based on sensitivity analysis, feature ablation, and descriptive statistics.
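The paper's model and analyses are not reproduced on this page, but the feature-ablation idea mentioned in the abstract can be illustrated in general terms: zero out one latent dimension at a time, decode, and measure how much the output changes relative to the unablated baseline. The sketch below is hypothetical; it uses a toy linear stand-in for a trained VAE decoder (the `decode` function and its weights are assumptions, not the authors' model).

```python
import numpy as np

# Toy stand-in for a trained VAE decoder: maps a 3-dim latent vector
# to an 8-dim output. A real study would use the trained decoder network.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))

def decode(z):
    return W @ z

def ablation_importance(z):
    """Score each latent dimension by how much zeroing it perturbs the decoded output."""
    baseline = decode(z)
    scores = []
    for i in range(len(z)):
        z_ablated = z.copy()
        z_ablated[i] = 0.0                 # ablate dimension i
        perturbed = decode(z_ablated)
        scores.append(float(np.mean((baseline - perturbed) ** 2)))  # MSE vs. baseline
    return scores

z = np.array([1.0, 0.1, -2.0])
scores = ablation_importance(z)
ranking = list(np.argsort(scores)[::-1])   # most influential latent dimension first
```

Dimensions whose ablation produces the largest reconstruction change are interpreted as the most influential for the generated sample; in the paper this kind of per-dimension score is combined with sensitivity analysis and descriptive statistics over the latent space.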
Acknowledgement
This work was funded by Science Foundation Ireland and its Centre for Research Training in Machine Learning (18/CRT/6183).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Citation
Natsiou, A., O’Leary, S., Longo, L. (2023). An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1903. Springer, Cham. https://doi.org/10.1007/978-3-031-44070-0_24
DOI: https://doi.org/10.1007/978-3-031-44070-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44069-4
Online ISBN: 978-3-031-44070-0
eBook Packages: Computer Science, Computer Science (R0)