
Latent Timbre Synthesis

Audio-based variational auto-encoders for music composition and sound design applications

  • S.I.: Neural Networks in Art, Sound and Design
  • Published in Neural Computing and Applications

Abstract

We present Latent Timbre Synthesis, a new audio synthesis method using deep learning. The synthesis method allows composers and sound designers to interpolate and extrapolate between the timbres of multiple sounds using the latent space of audio frames. We provide the details of two Variational Autoencoder architectures for Latent Timbre Synthesis and compare their advantages and drawbacks. The implementation includes a fully working application with a graphical user interface, called interpolate_two, which enables practitioners to generate timbres between two audio excerpts of their selection using interpolation and extrapolation in the latent space of audio frames. Our implementation is open source, and we aim to improve the accessibility of this technology by providing a guide for users with any technical background. Our study includes a qualitative analysis in which nine composers evaluated Latent Timbre Synthesis and the interpolate_two application within their practices.


Notes

  1. The source code is available at https://www.gitlab.com/ktatar/latent-timbre-synthesis.

  2. We provide sound examples at https://kivanctatar.com/Latent-Timbre-Synthesis.

  3. Appendix A summarizes the calculation and parameters of CQT.

  4. https://librosa.github.io/librosa/.

  5. We outline the inverse CQT algorithm in Appendix B.

  6. We summarize the details of CQT calculation in Appendix A.

  7. https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.windows.hann.html.

  8. Example audio reconstructions using trained models, training statistics with loss values, and hyper-parameter settings are available on the project page: https://kivanctatar.com/latent-timbre-synthesis.

  9. Exploration and exploitation are two search strategies in optimization applications [35, Sect. 5.3].

  10. The samples are available to download at the following two links: https://freesound.org/people/Erokia/packs/26656/ and https://freesound.org/people/Erokia/packs/26994/.

  11. Pre-trained models and example sounds are available at https://kivanctatar.com/latent-timbre-synthesis.

  12. The complete set of answers given by the composers are available at https://medienarchiv.zhdk.ch/entries/40dda1c8-6287-4356-adf4-ecdccec46119.

References

  1. Akten M (2018) Grannma MagNet. https://www.memo.tv/works/grannma-magnet/

  2. Briot JP, Pachet F (2020) Deep learning for music generation: challenges and directions. Neural Computing and Applications 32(4):981–993. https://doi.org/10.1007/s00521-018-3813-6

  3. Dieleman S: Generating music in the raw audio domain. https://www.youtube.com/watch?v=y8mOZSJA7Bc

  4. Dieleman S, Oord Avd, Simonyan K (2018) The challenge of realistic music generation: modelling raw audio at scale. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), p. 11. Montreal QC, Canada

  5. Engel J, Hantrakul LH, Gu C, Roberts A (2020) Ddsp: Differentiable digital signal processing. In: International Conference on Learning Representations. https://openreview.net/forum?id=B1x1ma4tDr

  6. Esling P, Chemla-Romeu-Santos A, Bitton A (2018) Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv:1805.08501 [cs, eess]. http://arxiv.org/abs/1805.08501

  7. Gabor D (1947) Acoustical Quanta and the Theory of Hearing. Nature 159(4044):591–594. https://doi.org/10.1038/159591a0

  8. Grey JM (1977) Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America 61(5):1270–1277. https://doi.org/10.1121/1.381428

  9. Griffin DW, Lim JS (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2):236–243. https://doi.org/10.1109/TASSP.1984.1164317

  10. Hantrakul L, Engel J, Roberts A, Gu C (2019) Fast and Flexible Neural Audio Synthesis. In: Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR 2019), p. 7

  11. He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE, Las Vegas, NV, USA. https://doi.org/10.1109/CVPR.2016.90

  12. Iverson P, Krumhansl CL (1993) Isolating the dynamic attributes of musical timbre. The Journal of the Acoustical Society of America 94(5):2595–2603

  13. Kingma DP, Welling M (2014) Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat]. http://arxiv.org/abs/1312.6114

  14. Kingma DP, Welling M (2019) An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning 12(4):307–392. http://arxiv.org/abs/1906.02691

  15. Krumhansl CL (1989) Why is musical timbre so hard to understand? Structure and perception of electroacoustic sound and music 9:43–53

  16. Kumar K, Kumar R, de Boissiere T, Gestin L, Teoh WZ, Sotelo J, de Brebisson A, Bengio Y, Courville A (2019) MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), p. 12. Vancouver, BC, Canada

  17. Lakatos S (2000) A common perceptual space for harmonic and percussive timbres. Perception & Psychophysics 62(7):1426–1439

  18. LeCun Y, Cortes C, Burges C: MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/

  19. Russolo L (1967) The Art of Noise. A Great Bear Pamphlet

  20. Maaten Lvd (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15(1):3221–3245

  21. McAdams S, Winsberg S, Donnadieu S, De Soete G, Krimphoff J (1995) Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research 58(3):177–192

  22. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: Audio and Music Signal Analysis in Python. In: Proceedings of The 14th Python in Science Conference (SCIPY 2015)

  23. Müller M (2015) Fundamentals of Music Processing. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-21945-5

  24. Nieto O, Bello JP (2016) Systematic Exploration Of Computational Music Structure Research. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), p. 7. New York, NY, USA

  25. Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499

  26. Oord Avd, Li Y, Babuschkin I, Simonyan K, Vinyals O, Kavukcuoglu K, Driessche Gvd, Lockhart E, Cobo LC, Stimberg F, Casagrande N, Grewe D, Noury S, Dieleman S, Elsen E, Kalchbrenner N, Zen H, Graves A, King H, Walters T, Belov D, Hassabis D (2017) Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv:1711.10433 [cs]. http://arxiv.org/abs/1711.10433

  27. Perraudin N, Balazs P, Sondergaard PL (2013) A fast Griffin-Lim algorithm. In: 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4. IEEE, New Paltz, NY. https://doi.org/10.1109/WASPAA.2013.6701851

  28. Roads C (2004) Microsound. The MIT Press, Cambridge, Mass

  29. Roads C (2015) Composing electronic music: a new aesthetic. Oxford University Press, Oxford

  30. Schaeffer P (1964) Traité des objets musicaux, nouv. edn. Seuil

  31. Schörkhuber C, Klapuri A (2010) Constant-Q Transform Toolbox For Music Processing. In: Proceedings of the 7th Sound and Music Computing Conference (SMC 2010), p. 8. Barcelona, Spain

  32. Smalley D (1997) Spectromorphology: explaining sound-shapes. Organised Sound 2(2):107–126. https://doi.org/10.1017/S1355771897009059

  33. Stockhausen K (1972) Four Criteria of Electronic Music with Examples from Kontakte. https://www.youtube.com/watch?v=7xyGtI7KKIY&list=PLRBdTyZ76lvAFOtZvocPjpRVTL6htJzoP

  34. Sønderby CK, Raiko T, Maaløe L, Sønderby SK, Winther O (2016) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks. In: Proceedings of the 23rd International Conference on Machine Learning (ICML 2016). ACM Press, Pittsburgh, Pennsylvania

  35. Tatar K, Macret M, Pasquier P (2016) Automatic Synthesizer Preset Generation with PresetGen. Journal of New Music Research 45(2):124–144. https://doi.org/10.1080/09298215.2016.1175481

  36. Tatar K, Pasquier P (2017) MASOM: A Musical Agent Architecture based on Self Organizing Maps, Affective Computing, and Variable Markov Models. In: Proceedings of the 5th International Workshop on Musical Metacreation (MUME 2017). Atlanta, Georgia, USA

  37. Tatar K, Pasquier P (2019) Musical agents: A typology and state of the art towards Musical Metacreation. Journal of New Music Research 48(1):56–105. https://doi.org/10.1080/09298215.2018.1511736

  38. Tatar K, Pasquier P, Siu R (2019) Audio-based Musical Artificial Intelligence and Audio-Reactive Visual Agents in Revive. In: Proceedings of the joint International Computer Music Conference and New York City Electroacoustic Music Festival 2019 (ICMC-NYCEMF 2019), p. 8. International Computer Music Association, New York City, NY, USA

  39. Technavio: Global Music Synthesizers Market 2019-2023. https://www.technavio.com/report/global-music-synthesizers-market-industry-analysis

  40. Vaggione H (2001) Some ontological remarks about music composition processes. Computer Music Journal 25(1):54–61

  41. Varese E, Wen-chung C (1966) The Liberation of Sound. Perspectives of New Music 5(1):11–19. https://www.jstor.org/stable/832385?origin=JSTOR-pdf&seq=1#page_scan_tab_contents

  42. Velasco GA, Holighaus N, Dörfler M, Grill T (2011) Constructing An Invertible Constant-Q Transform With Nonstationary Gabor Frames. In: Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), p. 7. Paris, France

  43. Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122

Acknowledgements

This research has been supported by the Swiss National Science Foundation, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada, and Compute Canada.

Author information

Corresponding author

Correspondence to Kıvanç Tatar.

Ethics declarations

Conflict of interest

To the best of our knowledge, there is no potential conflict of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Constant-Q transform

We can calculate the CQT of an audio recording [31], a discrete time domain signal x(n), using the following formula:

$$\begin{aligned} X^{CQ} (k,n) = \sum _{j=n- \lfloor N_k /2 \rfloor } ^{n+ \lfloor N_k /2 \rfloor } x(j) a_k ^*(j-n+N_k /2) \end{aligned}$$
(2)

where k indexes the CQT frequency bins in the range [1, K], and \(X^{CQ} (k,n)\) is the CQT transform. \(N_k\) is the window length of CQT bin k, which is inversely proportional to the center frequency \(f_k\) defined in Eq. 4. Note that \(\lfloor \cdot \rfloor\) denotes rounding towards negative infinity, and \(a_k ^*\) is the complex conjugate of the basis function \(a_k (n)\),

$$\begin{aligned} a_k (n) = \frac{1}{N_k} w \left(\frac{n}{N_k}\right) \exp \left[-i 2\pi n \frac{f_k}{f_s} \right] \end{aligned}$$
(3)

where w(t) is the window function, \(f_k\) is the center frequency of bin k, and \(f_s\) is the sampling rate. CQT requires a fundamental frequency parameter \(f_1\), which is the center frequency of the lowest bin. The center frequencies of the remaining bins are calculated using,

$$\begin{aligned} f_k = f_1 2 ^ {\frac{k-1}{B}} \end{aligned}$$
(4)

where B is the number of bins per octave.
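
As a quick illustration of Eq. 4, the Python sketch below computes the bin center frequencies from \(f_1\) and B; the values used here (\(f_1 \approx 32.70\) Hz, B = 48, six octaves) are placeholder settings for illustration, not necessarily those used in our experiments.

    import numpy as np

    def cqt_center_frequencies(f1, n_bins, bins_per_octave):
        """Center frequencies f_k = f1 * 2**((k - 1) / B) for k = 1, ..., K (Eq. 4)."""
        k = np.arange(1, n_bins + 1)
        return f1 * 2.0 ** ((k - 1) / bins_per_octave)

    # Example: 48 bins per octave starting at ~32.70 Hz (C1), spanning six octaves.
    freqs = cqt_center_frequencies(32.70, n_bins=6 * 48, bins_per_octave=48)
    print(freqs[0], freqs[48])  # ~32.70 Hz and ~65.41 Hz, i.e. one octave apart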

CQT is a wavelet-based transform because the window size is inversely proportional to \(f_k\) while ensuring the same Q-factor for all bins k. We can calculate the Q-factor using,

$$\begin{aligned} Q = \frac{qf_s}{f_k (2^\frac{1}{B} - 1)} \end{aligned}$$
(5)

where q is a scaling factor in the range [0, 1], equal to 1 by default. We direct readers to the original publication for the specific details of the CQT [31], which also proposes a fast algorithm to compute the CQT and inverse CQT (i-CQT), given in Fig. 7.

Fig. 7 A fast algorithm to compute CQT and i-CQT, described in [31] and implemented in Librosa [22]
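
For completeness, both the forward and inverse CQT are available in Librosa [22] (see note 4), which we use in our implementation. The following is a minimal sketch; the file name and parameter values are placeholders rather than the exact settings of our experiments.

    import numpy as np
    import librosa

    # Load an audio file; "input.wav" is a placeholder path.
    y, sr = librosa.load("input.wav", sr=None)

    # Forward CQT (Eq. 2): fmin corresponds to f_1 and bins_per_octave to B in Eq. 4.
    C = librosa.cqt(y, sr=sr, fmin=32.70, n_bins=6 * 48, bins_per_octave=48)

    # Magnitude spectrogram used as the model input; the phase is discarded here
    # and has to be estimated at synthesis time (Appendix B).
    magnitude = np.abs(C)

    # Inverse CQT (i-CQT) back to a time-domain signal.
    y_hat = librosa.icqt(C, sr=sr, fmin=32.70, bins_per_octave=48)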

Appendix B Phase estimation algorithms

Given an audio signal x(n) and its frequency transform X(i),

Algorithm 1: The Griffin-Lim algorithm (GLA)

where N is the total number of GLA iterations, and T and IT are the frequency transform and inverse frequency transform functions, respectively, such as the Short-Time Fourier Transform or, in our case, the Constant-Q Transform. Note that the space of audio spectrograms is a subset of the complex number space. The iterative process of Griffin-Lim moves the complex spectrogram of the estimated signal \({\hat{x}}(n)\) towards the complex number space of audio signals in each iteration, as proven in [9].
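
For reference, a minimal Python sketch of the GLA loop in Algorithm 1 is given below. It uses the STFT and inverse STFT as the transform pair T/IT for simplicity; the CQT/i-CQT pair from Appendix A can be substituted, and Librosa also ships its own librosa.griffinlim implementation.

    import numpy as np
    import librosa

    def griffin_lim(S, n_iter=100, n_fft=2048, hop_length=512):
        """Estimate a time-domain signal from a magnitude spectrogram S (GLA sketch)."""
        # Start from a random phase estimate.
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        for _ in range(n_iter):
            # IT: back to the time domain using the current phase estimate.
            x_hat = librosa.istft(S * angles, hop_length=hop_length)
            # T: re-analyze the estimate and keep only its phase.
            rebuilt = librosa.stft(x_hat, n_fft=n_fft, hop_length=hop_length)
            angles = np.exp(1j * np.angle(rebuilt))
        return librosa.istft(S * angles, hop_length=hop_length)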

Algorithm 2: The Fast Griffin-Lim algorithm (F-GLA)

The Fast Griffin-Lim algorithm (F-GLA) is a revision of the original Griffin-Lim algorithm. A previous study [27] showed that the F-GLA revision significantly improves the signal-to-noise ratio (SNR) compared to the GLA, where the setting \(\alpha = 1\) (a constant in Algorithm 2) resulted in the highest SNR value.
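
A sketch of the F-GLA modification is shown below; the only change with respect to the GLA sketch above is the momentum term controlled by \(\alpha\), and setting \(\alpha = 0\) recovers the plain GLA update. Librosa's librosa.griffinlim exposes the same idea through its momentum argument.

    import numpy as np
    import librosa

    def fast_griffin_lim(S, n_iter=100, alpha=1.0, n_fft=2048, hop_length=512):
        """F-GLA sketch [27]: the GLA update accelerated by a momentum term alpha."""
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        prev = np.zeros_like(S, dtype=complex)
        for _ in range(n_iter):
            x_hat = librosa.istft(S * angles, hop_length=hop_length)
            rebuilt = librosa.stft(x_hat, n_fft=n_fft, hop_length=hop_length)
            # Momentum step: extrapolate from the previous projection.
            accel = rebuilt + alpha * (rebuilt - prev)
            prev = rebuilt
            angles = np.exp(1j * np.angle(accel))
        return librosa.istft(S * angles, hop_length=hop_length)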

Appendix C Interview questions

  1. Describe your compositional process when working with the Timbre Space tools.

  2. What was the theme and concept of your composition?

  3. How did you incorporate the Timbre Space tools into your work?

  4. How did working with the Timbre Space tools change your composition workflow? What was unique?

  5. What additional tools/technologies apart from the Timbre Space tools were involved in your work?

  6. How would you describe the sound qualities of Timbre Space?

  7. What were the unique aesthetic possibilities of the Timbre Space tools?

  8. What kind of dataset(s) did you train Timbre Space with?

  9. If you trained Timbre Space with several datasets, what kind of relationship did you notice between the datasets and the musical results obtained from Timbre Space?

  10. Did you feel control and authorship over the musical material generated?

  11. Did you achieve the aesthetic result you intended?

  12. What were the positive aspects when working with the tool?

  13. What were the frustrations when working with the tool? How can it be improved?

  14. Would you use it again (if the above were addressed)?

  15. For whom else or what musical genres/sectors would this tool be particularly useful (if the criticism was addressed)?

About this article

Cite this article

Tatar, K., Bisig, D. & Pasquier, P. Latent Timbre Synthesis. Neural Comput & Applic 33, 67–84 (2021). https://doi.org/10.1007/s00521-020-05424-2
