HRPE: Hierarchical Relative Positional Encoding for Transformer-Based Structured Symbolic Music Generation

  • Conference paper
Music Intelligence (SOMI 2023)

Abstract

Musicians often structure their compositions hierarchically to imbue them with rich expressiveness. Consequently, generating musically meaningful pieces with well-organized structures has been a significant research goal. Existing approaches typically rely on multi-step generation pipelines or sophisticated, domain-knowledge-driven model architectures, which increase model complexity and hinder generalization. In this study, we demonstrate that a hierarchical positional encoding adapted to music is sufficient to enhance model performance and generate coherent music with hierarchical structure. We incorporate hierarchical positional information into the Transformer by modifying the attention matrix with relative position biases at different levels, enabling the model to learn long- and short-term dependencies jointly and making it less sensitive to positional shifts of a few notes. We further investigate the design of section-level relative positional encoding through ablation studies. To validate our approach, we annotate two datasets (POP909-S and POP2000-S) with music sections and present evidence on both single-track monophonic and multi-track polyphonic music generation tasks. Experimental results demonstrate that our approach outperforms state-of-the-art Transformer models in both subjective and objective evaluations. We plan to release the source code and annotated datasets upon acceptance.
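The abstract's core mechanism, adding relative position biases at multiple hierarchy levels (e.g., token level and section level) to the attention matrix, can be sketched as follows. This is an illustrative simplification, not the authors' exact formulation: the function name, the bias-table parameterization, and the clipping distance `max_dist` are assumptions for demonstration.

```python
import numpy as np

def hierarchical_attention_scores(q, k, token_ids, section_ids,
                                  token_bias, section_bias, max_dist):
    """Attention logits with token- and section-level relative position biases.

    q, k:          (T, d) query and key matrices for one attention head.
    token_ids:     absolute token positions, shape (T,).
    section_ids:   index of the music section each token belongs to, shape (T,).
    token_bias, section_bias: learned bias tables indexed by the clipped
                   relative distance, each of shape (2 * max_dist + 1,).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # standard content-based term

    # Relative distance at each level, clipped and shifted to a valid index.
    rel_tok = np.clip(token_ids[:, None] - token_ids[None, :],
                      -max_dist, max_dist) + max_dist
    rel_sec = np.clip(section_ids[:, None] - section_ids[None, :],
                      -max_dist, max_dist) + max_dist

    # Add the relative position bias from each hierarchy level.
    return scores + token_bias[rel_tok] + section_bias[rel_sec]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
```

Because both bias terms depend only on relative distances, shifting every position by the same offset leaves the attention logits unchanged, which matches the paper's stated goal of reduced sensitivity to positional shifts.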



Author information

Correspondence to Pengfei Li.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, P., Wu, J., Ji, Z. (2024). HRPE: Hierarchical Relative Positional Encoding for Transformer-Based Structured Symbolic Music Generation. In: Li, X., Guan, X., Tie, Y., Zhang, X., Zhou, Q. (eds) Music Intelligence. SOMI 2023. Communications in Computer and Information Science, vol 2007. Springer, Singapore. https://doi.org/10.1007/978-981-97-0576-4_9

  • DOI: https://doi.org/10.1007/978-981-97-0576-4_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0575-7

  • Online ISBN: 978-981-97-0576-4

  • eBook Packages: Computer Science, Computer Science (R0)
