Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges

  • Muhammad Huzaifah bin Md Shahrin
  • Lonce Wyse
Deep learning for music and audio


Style transfer is a technique for combining two images based on the activations and feature statistics in a deep neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems that arise when adapting the original vision-based framework to handle spectrogram representations. We conclude that CNN architectures with features based on 2D representations and convolutions are better suited for visual images than for time–frequency representations of audio. Despite the awkward fit, experiments show that the Gram-matrix-determined "style" for audio is more closely aligned with timbral signatures without temporal structure, whereas the network layer activity determining audio "content" seems to capture more of the pitch and rhythmic structures. We offer insight into several reasons for the domain differences, with illustrative examples. We motivate the use of several types of one-dimensional CNNs that generate results better aligned with intuitive notions of audio texture than those based on existing architectures built for images. These ideas also prompt an exploration of audio texture synthesis with architectural variants for extensions to infinite textures, multi-textures, parametric control of receptive fields, and the constant-Q transform as an alternative frequency scaling for the spectrogram.
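The two core ideas in the abstract can be made concrete with a minimal NumPy sketch (all shapes and values here are illustrative assumptions, not the paper's actual network): the Gram matrix averages filter co-occurrences over all positions, which is why it preserves timbre-like statistics but discards temporal structure, and a 1D convolution over a spectrogram treats the frequency bins as input channels so that filters span the whole frequency axis rather than sliding over it as a 2D image convolution would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature map from one CNN layer: 64 filters, with the
# spatial/temporal axes flattened into 128 positions.
features = rng.standard_normal((64, 128))

# Gram matrix "style" statistic: channel-by-channel inner products,
# averaged over positions. Position information is summed out, so the
# statistic keeps filter co-occurrences but loses temporal ordering.
gram = features @ features.T / features.shape[1]  # shape (64, 64)

# 1D convolutional view of a spectrogram: the F frequency bins act as
# input channels and the filter slides along time only.
F, T = 257, 100                          # frequency bins, time frames
spec = rng.standard_normal((F, T))
kernel = rng.standard_normal((F, 5))     # one filter spanning all F bins
out = np.array([np.sum(kernel * spec[:, t:t + 5]) for t in range(T - 4)])
# `out` has one value per time step; no filter ever straddles a partial
# band of frequencies, unlike a small 2D kernel on an image.
```

The design point is that translation invariance, which 2D convolution assumes in both image axes, holds along time in a spectrogram but not along frequency, motivating the 1D architectures the paper explores.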


Style transfer · Texture synthesis · Sound modelling · Convolutional neural networks



This research was supported by an NVIDIA Corporation Academic Programs GPU Grant.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. National University of Singapore, NUS Graduate School for Integrative Sciences and Engineering, Singapore, Singapore
  2. National University of Singapore, Communications and New Media Department, Singapore, Singapore
