Abstract
Generative modelling has improved considerably in recent years. State-of-the-art unsupervised models can transfer the style of existing media with photo-realistic quality. However, these improvements have largely been limited to graphical data, and music has proven more difficult to model. Magenta’s MusicVAE can quite successfully generate abstract rhythms and melodies, but it is a large model that requires vast amounts of computing power before it starts to make realistic predictions. Moreover, its input is heavily quantized, which makes it impossible to model musical variations such as swing. This paper proposes a lightweight but high-resolution variational recurrent autoencoder that can transfer the style of input samples while maintaining characteristics of the original sample. The model can be trained in a few hours on small datasets, which allows researchers and musicians to experiment with musical style transfer. In addition, a novel technique based on normalized compression distance is used to evaluate the model by measuring the similarity of generated samples to target classes.
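As a rough illustration of the evaluation idea, the normalized compression distance between two byte strings x and y is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed size. The sketch below is not the evaluation procedure of this paper. It assumes zlib as a stand-in compressor, and the function names `compressed_size`, `ncd` and `mean_ncd_to_class` are hypothetical names chosen for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def mean_ncd_to_class(sample: bytes, class_samples: list) -> float:
    """Average NCD of one sample to every sample of a target class.

    A lower value suggests the sample is more similar to that class.
    """
    return sum(ncd(sample, s) for s in class_samples) / len(class_samples)
```

A generated pattern, serialized to bytes, could then be compared against the serialized samples of each target class, with the class of lowest mean distance taken as the closest match.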
Acknowledgments
The author wishes to thank Stefan Schlobach, Albert Meroño Peñuela and Peter Bloem for inspiration and useful discussions.
A Appendix
Both the implementation of the model described in this paper and a number of synthesized examples of generated MIDI files can be found at https://github.com/voschezang/drum-style-transfer.
A.1 Parameters
Table 1 shows the values of the most important parameters.
A.2 Structure of the Model
The encoder and decoders can be seen as a pipeline where a sequence of transformations is applied to an input. Table 2 shows a brief overview of each layer.
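The pipeline view can be sketched as plain function composition. This is only an illustration under stated assumptions: the stages `scale`, `shift`, `unshift` and `unscale` are hypothetical placeholders and do not correspond to the actual layers of the model, and the helper `pipeline` is a name introduced for this example.

```python
from functools import reduce
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def pipeline(layers: Sequence[Layer]) -> Layer:
    """Compose layers so that each one transforms the previous output."""
    def run(x: List[float]) -> List[float]:
        return reduce(lambda acc, layer: layer(acc), layers, x)
    return run


# Hypothetical placeholder stages (not the layers of the actual model).
def scale(x: List[float]) -> List[float]:
    return [0.5 * v for v in x]


def shift(x: List[float]) -> List[float]:
    return [v + 1.0 for v in x]


def unshift(x: List[float]) -> List[float]:
    return [v - 1.0 for v in x]


def unscale(x: List[float]) -> List[float]:
    return [2.0 * v for v in x]


# The encoder applies its stages in order; the decoder undoes them in reverse.
encoder = pipeline([scale, shift])
decoder = pipeline([unshift, unscale])
```

In this toy setting, `decoder(encoder(x))` returns the original input, which mirrors the reconstruction role of an autoencoder without modelling any of its learned behaviour.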
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Voschezang, M. (2019). Style Transfer of Abstract Drum Patterns Using a Light-Weight Hierarchical Autoencoder. In: Atzmueller, M., Duivesteijn, W. (eds) Artificial Intelligence. BNAIC 2018. Communications in Computer and Information Science, vol 1021. Springer, Cham. https://doi.org/10.1007/978-3-030-31978-6_10
DOI: https://doi.org/10.1007/978-3-030-31978-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31977-9
Online ISBN: 978-3-030-31978-6
eBook Packages: Computer Science (R0)