MITT: Musical Instrument Timbre Transfer Based on the Multichannel Attention-Guided Mechanism

  • Conference paper

In: Intelligent Computing Theories and Application (ICIC 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12836)

Abstract

Research on neural style transfer and domain translation has clearly demonstrated the ability of deep learning algorithms to manipulate images based on their artistic style. The idea of image translation has been applied to music-style transfer and to the timbre transfer of musical instrument recordings; however, the results have not been ideal. Generally, the task of instrument timbre transfer depends on the ability to extract a separated, manipulable instrument timbre feature. However, because the distinction between a musical note and its timbre is often not sufficiently clear, samples generated by current timbre transfer models usually contain irrelevant waveforms. Here, we propose a method of timbre transfer for musical instrument sounds that converts one instrument sound into another while preserving note information (duration, pitch, rhythm, etc.). A multichannel attention-guided mechanism enables timbre transfer between spectrograms, enhancing the model's ability to guide the generator toward the most distinguishable components (the harmonic components). The proposed model uses a Markov discriminator to optimize the generator, enabling it to accurately learn a spectrogram's higher-order features. Experimental results demonstrate that the proposed instrument timbre transfer model effectively captures the harmonic components of the target domain and produces explicit high-frequency detail.
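The multichannel attention-guided mechanism sketched in the abstract follows the general pattern of attention-guided image translation: the generator predicts several candidate content maps together with per-channel attention maps, and a softmax over the attention channels blends the generated content with the unchanged input spectrogram, so note structure is preserved wherever no timbre change is needed. A minimal NumPy sketch of that fusion step (the shapes and the `attention_fuse` helper are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(input_spec, content_maps, attention_logits):
    """Blend n generated content maps with the input spectrogram.

    input_spec:       (H, W) source spectrogram
    content_maps:     (n, H, W) candidate translated spectrograms
    attention_logits: (n + 1, H, W) unnormalized attention; the last
                      channel attends to the (unchanged) input.
    """
    attn = softmax(attention_logits, axis=0)    # per-pixel weights sum to 1
    fg = (attn[:-1] * content_maps).sum(axis=0)  # generated foreground
    bg = attn[-1] * input_spec                  # preserved background
    return fg + bg

rng = np.random.default_rng(0)
H, W, n = 4, 5, 3
spec = rng.standard_normal((H, W))
out = attention_fuse(spec,
                     rng.standard_normal((n, H, W)),
                     rng.standard_normal((n + 1, H, W)))
print(out.shape)  # (4, 5)
```

Driving the background attention channel high reproduces the input spectrogram unchanged, which is what lets the generator spend its capacity on the harmonic components it actually needs to rewrite.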


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, H., Chen, Y. (2021). MITT: Musical Instrument Timbre Transfer Based on the Multichannel Attention-Guided Mechanism. In: Huang, D.S., Jo, K.H., Li, J., Gribova, V., Bevilacqua, V. (eds.) Intelligent Computing Theories and Application. ICIC 2021. Lecture Notes in Computer Science, vol. 12836. Springer, Cham. https://doi.org/10.1007/978-3-030-84522-3_47

  • DOI: https://doi.org/10.1007/978-3-030-84522-3_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-84521-6

  • Online ISBN: 978-3-030-84522-3

  • eBook Packages: Computer Science (R0)
