
Bar transformer: a hierarchical model for learning long-term structure and generating impressive pop music

Applied Intelligence

Abstract

Recently, many deep learning-based automatic music generation models have been proposed, yet generating long pieces of pop music with distinctive musical characteristics remains challenging because it relies heavily on musical structure. Some transformer-based models exploit self-attention to generate long music sequences; however, most pay little attention to well-organized musical structure. In this article, we propose a novel note-to-bar hierarchical model, named the Bar Transformer, to address long-term dependency issues and generate impressive, structurally meaningful music. In particular, we propose a note-to-bar approach that pre-processes the notes within each individual bar, imposing a strong structural constraint that increases the model's awareness of the note-to-bar structure of music. The Bar Transformer is built on an encoder-decoder framework comprising a two-layer encoder and an arrangement decoder. In the two-layer encoder, the bottom layer is a note-level encoder, which outputs embeddings by learning the relations between notes within an individual bar; the top layer is a bar-level encoder, which uses these embeddings to encode each bar of the melody and chords. The arrangement decoder captures the interrelationships among the bars and generates melodies and chords simultaneously. Experimental results from structural analyses and aural evaluations demonstrate that our approach outperforms the Music Transformer and other autoregressive models for music generation.
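To make the note-to-bar hierarchy described above concrete, the following is a minimal PyTorch sketch of the three components: a note-level encoder that pools the notes of each bar into one embedding, a bar-level encoder that attends over those bar embeddings, and an arrangement decoder that generates tokens conditioned on them. All module names, dimensions, and the mean-pooling strategy are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NoteLevelEncoder(nn.Module):
    """Encodes the notes within one bar into a single bar embedding."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, notes):                 # notes: (batch, notes_per_bar)
        h = self.encoder(self.embed(notes))   # (batch, notes_per_bar, d_model)
        return h.mean(dim=1)                  # pool a bar's notes into one vector

class BarTransformerSketch(nn.Module):
    """Note-level encoder -> bar-level encoder -> arrangement decoder."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.note_encoder = NoteLevelEncoder(vocab_size, d_model, n_heads, n_layers)
        bar_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.bar_encoder = nn.TransformerEncoder(bar_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.arrangement_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)  # next melody/chord token logits

    def forward(self, bars, tgt):
        # bars: (batch, n_bars, notes_per_bar) -- input notes grouped bar by bar
        # tgt:  (batch, tgt_len)               -- tokens generated so far
        b, n_bars, n_notes = bars.shape
        bar_emb = self.note_encoder(bars.reshape(b * n_bars, n_notes))
        memory = self.bar_encoder(bar_emb.reshape(b, n_bars, -1))
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.arrangement_decoder(self.tgt_embed(tgt), memory, tgt_mask=mask)
        return self.out(h)

model = BarTransformerSketch(vocab_size=512)
bars = torch.randint(0, 512, (2, 32, 16))  # 2 songs, 32 bars, 16 notes per bar
tgt = torch.randint(0, 512, (2, 64))
logits = model(bars, tgt)                  # (2, 64, 512)
```

The key design point the sketch illustrates is that grouping notes bar by bar before the bar-level encoder shortens the sequence the upper levels must attend over, which is what makes the long-term structure tractable.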




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants No. 62076077 and No. 61903090, the Guangxi Science and Technology Major Project under Grant No. AA22068057, the Guangxi Scientific Research Basic Ability Enhancement Program for Young and Middle-aged Teachers under Grant No. 2022KY0183, and the School Foundation of Guilin University of Aerospace Technology under Grant No. XJ21KT32.

Author information


Corresponding author

Correspondence to Shuxue Ding.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Additional Figures

To facilitate reading and comparison, we provide additional three-dimensional structure histograms corresponding to Figs. 5 and 6, computed from the real melodies in the dataset and from the sample melodies generated by the different models. Each histogram represents the structure of the first 32 bars of a melody. An element at position i on the X-axis and j on the Y-axis denotes that the i-th bar repeats the j-th bar, where i > j, and the Z-axis gives the repetition distance; the different repetition distances are represented by the heights of the colored cylinders. Best viewed in color and zoomed in.
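As a concrete reading of this definition, the sketch below computes such a bar-repetition matrix, assuming that "repeats" means an exact note-for-note match between bars; the paper's actual similarity criterion may be softer, so this is an illustration of the histogram's semantics, not the authors' exact procedure.

```python
import numpy as np

def repetition_structure(bars):
    """bars: sequence of per-bar note tuples; returns an n x n distance matrix."""
    n = len(bars)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i):              # only pairs with i > j (lower triangle)
            if bars[i] == bars[j]:      # bar i exactly repeats bar j
                dist[i, j] = i - j      # repetition distance -> cylinder height
    return dist

# Toy example: an A A B A phrase repeated eight times gives 32 bars.
bars = [('C4', 'E4'), ('C4', 'E4'), ('G4',), ('C4', 'E4')] * 8
print(repetition_structure(bars)[:4, :4])
```

Plotting this matrix as a 3D bar chart, with the distance values as cylinder heights, reproduces the kind of histogram shown in the appendix figures.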


Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Qin, Y., Xie, H., Ding, S. et al. Bar transformer: a hierarchical model for learning long-term structure and generating impressive pop music. Appl Intell 53, 10130–10148 (2023). https://doi.org/10.1007/s10489-022-04049-3

