
Bar transformer: a hierarchical model for learning long-term structure and generating impressive pop music

Applied Intelligence

Abstract

Recently, many deep learning-based automatic music generation models have been proposed, yet generating long pieces of pop music with distinctive musical characteristics remains challenging because it relies heavily on musical structure. Some transformer-based models exploit self-attention to generate long music sequences; however, most pay little attention to well-organized musical structure. In this article, we propose a novel note-to-bar hierarchical model, named the Bar Transformer, to address long-term dependency issues and generate impressive, structurally meaningful music. In particular, we propose a note-to-bar approach that pre-processes the notes within each individual bar, imposing a strong structural constraint that increases the model's awareness of the note-to-bar structure of music. The Bar Transformer is built on an encoder-decoder framework comprising a two-layer encoder and an arrangement decoder. In the two-layer encoder, the bottom layer is a note-level encoder, which outputs embeddings by learning the relations between notes within an individual bar; the top layer is a bar-level encoder, which uses these embeddings to encode each bar of the melody and chords. The arrangement decoder captures the interrelationships among the bars and generates melodies and chords simultaneously. Experimental results from structural analyses and aural evaluations demonstrate that our approach outperforms the Music Transformer and other autoregressive models for music generation.
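To make the note-to-bar hierarchy described above concrete, the following is a minimal PyTorch sketch of the three components: a note-level encoder that pools the notes of each bar into one embedding, a bar-level encoder that attends over those bar embeddings, and an arrangement decoder that generates tokens conditioned on them. All module names, dimensions, and the mean-pooling strategy are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NoteLevelEncoder(nn.Module):
    """Encodes the notes within one bar into a single bar embedding."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, notes):                 # notes: (batch, notes_per_bar)
        h = self.encoder(self.embed(notes))   # (batch, notes_per_bar, d_model)
        return h.mean(dim=1)                  # pool a bar's notes into one vector

class BarTransformerSketch(nn.Module):
    """Note-level encoder -> bar-level encoder -> arrangement decoder."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.note_encoder = NoteLevelEncoder(vocab_size, d_model, n_heads, n_layers)
        bar_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.bar_encoder = nn.TransformerEncoder(bar_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.arrangement_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)  # next melody/chord token logits

    def forward(self, bars, tgt):
        # bars: (batch, n_bars, notes_per_bar) -- input notes grouped bar by bar
        # tgt:  (batch, tgt_len)               -- tokens generated so far
        b, n_bars, n_notes = bars.shape
        bar_emb = self.note_encoder(bars.reshape(b * n_bars, n_notes))
        memory = self.bar_encoder(bar_emb.reshape(b, n_bars, -1))
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.arrangement_decoder(self.tgt_embed(tgt), memory, tgt_mask=mask)
        return self.out(h)

model = BarTransformerSketch(vocab_size=512)
bars = torch.randint(0, 512, (2, 32, 16))  # 2 songs, 32 bars, 16 notes per bar
tgt = torch.randint(0, 512, (2, 64))
logits = model(bars, tgt)                  # (2, 64, 512)
```

The key design point the sketch illustrates is that grouping notes bar by bar before the bar-level encoder shortens the sequence the upper levels must attend over, which is what makes the long-term structure tractable.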




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants No. 62076077 and No. 61903090, the Guangxi Science and Technology Major Project under Grant No. AA22068057, the Guangxi Scientific Research Basic Ability Enhancement Program for Young and Middle-aged Teachers under Grant No. 2022KY0183, and the School Foundation of Guilin University of Aerospace Technology under Grant No. XJ21KT32.

Author information


Corresponding author

Correspondence to Shuxue Ding.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Additional Figures

To facilitate reading and comparison, we provide additional three-dimensional structure histograms corresponding to Figs. 5 and 6, computed from the real melodies in the dataset and from the sample melodies generated by the different models. Each histogram represents the structure of the first 32 bars of a melody. An element at position i on the X-axis and j on the Y-axis denotes that the i-th bar repeats the j-th bar, where i > j, and the Z-axis gives the repetition distance; the different repetition distances are represented by the heights of the colored cylinders. Best viewed in color and zoomed in.
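As a concrete reading of this definition, the sketch below computes such a bar-repetition matrix, assuming that "repeats" means an exact note-for-note match between bars; the paper's actual similarity criterion may be softer, so this is an illustration of the histogram's semantics, not the authors' exact procedure.

```python
import numpy as np

def repetition_structure(bars):
    """bars: sequence of per-bar note tuples; returns an n x n distance matrix."""
    n = len(bars)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i):              # only pairs with i > j (lower triangle)
            if bars[i] == bars[j]:      # bar i exactly repeats bar j
                dist[i, j] = i - j      # repetition distance -> cylinder height
    return dist

# Toy example: an A A B A phrase repeated eight times gives 32 bars.
bars = [('C4', 'E4'), ('C4', 'E4'), ('G4',), ('C4', 'E4')] * 8
print(repetition_structure(bars)[:4, :4])
```

Plotting this matrix as a 3D bar chart, with the distance values as cylinder heights, reproduces the kind of histogram shown in the appendix figures.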


Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Qin, Y., Xie, H., Ding, S. et al. Bar transformer: a hierarchical model for learning long-term structure and generating impressive pop music. Appl Intell 53, 10130–10148 (2023). https://doi.org/10.1007/s10489-022-04049-3

