
1 Introduction

Deep learning is a rapidly developing technology in the field of AI. From the sky to the ocean, from drones to unmanned vehicles, deep learning is demonstrating its huge potential and capability. In the medical field, machine recognition of diseases in lung images has surpassed that of humans, and the images and music generated by GAN technology can pass for the real thing; in the commercial field, micropayments can already be made through facial recognition; and AlphaGo has defeated top human Go players in official competition.

Music, on the other hand, is an art that conveys emotion through sound and a form of human self-expression. Creating music helps people entertain themselves and express their feelings. It is feasible to use deep learning to imitate the patterns and behaviors of existing songs and to create content that sounds like real music to the human ear. Many researchers have already produced results in the field of music generation based on artificial intelligence and deep learning.

Multi-track music composition [1] requires professional knowledge and a command of the interfaces of digital music software. Moreover, few systems have focused on emotion-driven multi-track composition without great human involvement. Accordingly, the authors present a platform that composes music from elements of everyday life. The system can be roughly split into three main parts.

An end-to-end generation framework called XiaoIce Band was proposed in [2], which generates a song with several tracks. The CRMCG model uses an encoder-decoder framework to generate both rhythm and melody. For rhythm generation, in order to keep the generated rhythm in harmony with the existing part of the music, the previously generated melody and rhythm are taken into consideration. For melody generation, the previous melody, the currently generated rhythm, and the corresponding chord are used to generate the melody sequence. Since rhythm is closely related to melody, the rhythm loss only updates the parameters related to rhythm generation, whereas the melody loss updates all parameters. The MICA model addresses the multi-track arrangement task: it treats the melody sequence as the input of the encoder and the multiple track sequences as the outputs of the decoders, with cells designed between the hidden layers to learn the relationships and keep the harmony between different tracks.

The Attention Cell is used to capture the parts of the other tracks that are relevant to the current track. The authors conducted melody generation and arrangement generation tasks to evaluate the effectiveness of CRMCG and MICA. For the melody generation task, Magenta and GANMidi were chosen as baselines, and chord progression analysis and rest analysis were used to evaluate the CRMCG model. For the arrangement generation task, HRNN was chosen as the baseline, and harmony analysis and arrangement analysis were used for evaluation.
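To make the cross-track attention idea concrete, the following is a minimal sketch of an attention cell in which the decoder state of the current track attends to the hidden states of the other tracks' decoders. All module names and layer sizes here are illustrative assumptions and are not taken from the XiaoIce Band implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackAttentionCell(nn.Module):
    """Illustrative attention cell: lets the decoder of the current track
    attend to the hidden states of the other tracks' decoders at each step."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h_current, h_others):
        # h_current: (batch, hidden)           current track's decoder state
        # h_others:  (batch, n_tracks, hidden) states of the other tracks
        q = self.query(h_current).unsqueeze(1)            # (batch, 1, hidden)
        k = self.key(h_others)                            # (batch, n_tracks, hidden)
        v = self.value(h_others)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)               # (batch, 1, n_tracks)
        context = torch.bmm(weights, v).squeeze(1)        # (batch, hidden)
        # Fuse the cross-track context with the current hidden state.
        return torch.tanh(self.out(torch.cat([h_current, context], dim=-1)))


# Toy usage: the current track attends to 3 other tracks, hidden size 128.
cell = TrackAttentionCell(hidden_size=128)
h_cur = torch.randn(8, 128)
h_oth = torch.randn(8, 3, 128)
fused = cell(h_cur, h_oth)   # (8, 128), fed back into the current decoder
```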

The paper [3] proposed a method to generate multi-track chord music using a GAN. The model transforms MIDI files and chord music into multi-track piano rolls for bass, piano, drums, and guitar, whose dimension is K. After standard preprocessing of the MIDI files, all music is divided into more than one hundred parts according to the beat, and the pitch is restricted to a certain range, giving data of dimension [K × 5 × 192 × 84]. The model contains a generator and a discriminator built on a convolutional neural network architecture; their structures are mirror-symmetric, and a sigmoid activation is used at the discriminator output to separate real from generated data. Since the music data are not discrete and multiple chord notes often sound at the same time, the convolutional part adopts a full-channel architecture, which helps the network converge quickly. ReLU and tanh are used in the generator and LeakyReLU in the discriminator to deal with the gradient problem, and Adam is used for optimization.
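As a rough illustration of a symmetric convolutional generator/discriminator pair with the activations mentioned above (ReLU and tanh in the generator, LeakyReLU and a final sigmoid in the discriminator, Adam for optimization), the sketch below builds networks for piano rolls of shape 5 × 192 × 84. The layer counts and kernel sizes are assumptions, not the configuration used in [3].

```python
import torch
import torch.nn as nn

# Piano-roll shape assumed from the text: 5 tracks x 192 time steps x 84 pitches.
Z_DIM = 128

generator = nn.Sequential(
    nn.Linear(Z_DIM, 256 * 12 * 21), nn.ReLU(),
    nn.Unflatten(1, (256, 12, 21)),
    nn.ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2)), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2)), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=(2, 1), stride=(2, 1)), nn.ReLU(),
    nn.ConvTranspose2d(32, 5, kernel_size=(2, 1), stride=(2, 1)), nn.Tanh(),
)   # output: (batch, 5, 192, 84)

discriminator = nn.Sequential(
    nn.Conv2d(5, 32, kernel_size=(2, 1), stride=(2, 1)), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=(2, 1), stride=(2, 1)), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=(2, 2), stride=(2, 2)), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=(2, 2), stride=(2, 2)), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(256 * 12 * 21, 1), nn.Sigmoid(),   # real vs. generated
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

# Sanity check: generate a batch of fake piano rolls and score them.
fake = generator(torch.randn(4, Z_DIM))
score = discriminator(fake)          # (4, 1) probabilities
```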

Although there are many music generation technologies, existing methods remain unsatisfactory: most of the music and songs they generate can easily be distinguished from real music by the human ear. There are many reasons for this. For example, due to the lack of "aligned" data in which the same song appears in different styles [4], music style conversion can only use unsupervised methods. In addition, the GAN (RaGAN) loss used during training cannot guarantee that the original music structure will be retained after conversion [5].

This paper proposes an improved time-series network structure based on the multi-track music model MuseGAN and adds a correction mapping model after the generators to pull the predicted results towards the correct results. Experiments on standard datasets show that the proposed method improves subjective and objective evaluation indicators such as Qualified Rhythm Frequency.

2 Symbolic Music Generation and Genre Transfer

Furthermore, when style conversion and classification are required, style alignment is needed first, with the goal of realizing a VAE and style classification in a shared latent space [6]. While converting the style of music data, this method can also change the type of instrument, such as piano to violin, and can change auditory characteristics such as pitch. The model has a wide range of applications, such as music mixing, music and song blending, and music insertion. Each data file is in MIDI format with a specific style tag, and information such as pitch, meter, and tempo is extracted from the file and converted. The VAE is trained with a Kullback-Leibler divergence term, weighted by a hyperparameter, together with a cross-entropy reconstruction loss. In order to obtain the joint distribution of the overall data, three encoder-decoder pairs are used to form the shared space.
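To illustrate the objective just described, the following is a minimal sketch of a VAE loss in which a Kullback-Leibler term, weighted by a hyperparameter (here called `beta`, an assumed name), is added to a cross-entropy reconstruction loss; it is not the exact formulation of [6].

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target_roll, mu, logvar, beta=1.0):
    """Illustrative VAE objective: cross-entropy reconstruction plus a
    KL-divergence term weighted by the hyperparameter beta.
    recon_logits and target_roll are piano-roll tensors of the same shape;
    mu and logvar parameterize the approximate posterior q(z|x)."""
    recon = F.binary_cross_entropy_with_logits(recon_logits, target_roll,
                                               reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```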

Another model for musical style conversion is CycleGAN [7]; the structure of its generator and discriminator is shown in Fig. 1. In order to perform style transfer while retaining the tune and structure of the original music, a discriminator is needed to balance the intensity difference between input and output. The generator extracts features from the original data and can also take noise as input, but this method can only handle the transformation between two domains. The goal of the generator is to learn a variety of high-level features, so the discriminator is required to distinguish between the source data and the generated data. The loss function includes a cycle-consistency term, which helps to retain the overall information during two-way conversion so that the output remains close to a realistic form. In the experiments on the dataset, LeakyReLU with normalization is used, and the final output is a classification distribution.

Fig. 1. Architecture of the CycleGAN model.
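A minimal sketch of the cycle-consistency term discussed above, assuming two generators `g_ab` and `g_ba` that map between the two style domains; the loss weight `lam` and the use of an L1 distance are assumptions rather than the exact choices of [7].

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, real_a, real_b, lam=10.0):
    """Illustrative cycle-consistency term for two-way style transfer:
    music mapped A->B->A (and B->A->B) should reconstruct the original,
    which preserves the tune and overall structure. lam is an assumed
    weighting hyperparameter."""
    rec_a = g_ba(g_ab(real_a))   # A -> B -> A
    rec_b = g_ab(g_ba(real_b))   # B -> A -> B
    return lam * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))
```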

Deep learning can also be used to generate rhythm patterns for electronic dance music with novel rhythms and interesting patterns that are not found in the training dataset. The authors extend the GAN framework with an additional genre classifier and a genre ambiguity loss that pushes generated samples away from the genre distributions inherent in the training data [8]. Two methods are proposed in that paper (Fig. 2).

Fig. 2. GAN with genre ambiguity loss.
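One plausible way to implement such a genre ambiguity loss is to penalize the generator whenever a genre classifier assigns the generated rhythm pattern confidently to any single genre, for example by measuring the cross-entropy against a uniform genre distribution. The sketch below is an assumption-laden illustration, not the exact loss of [8].

```python
import torch
import torch.nn.functional as F

def genre_ambiguity_loss(genre_logits):
    """Illustrative genre-ambiguity term: pushes the genre classifier's
    prediction on a generated rhythm pattern towards the uniform
    distribution, so the sample does not fit neatly into any training
    genre. (The exact formulation in [8] may differ; this is a sketch.)"""
    log_probs = F.log_softmax(genre_logits, dim=-1)
    n_genres = genre_logits.size(-1)
    uniform = torch.full_like(log_probs, 1.0 / n_genres)
    # Cross-entropy of the predicted genre distribution against uniform.
    return -(uniform * log_probs).sum(dim=-1).mean()
```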

3 Improved Time Series Model Network on Multitrack Music

The paper [9] proposed a GAN-based approach together with a quantitative measure estimating the interpretability of a set of generated examples, and applied the method to a state-of-the-art deep audio classification model that predicts singing voice activity in music excerpts. The method is designed to provide examples that activate a given neuron activation pattern ("classifier response"), where a generator is trained to map a noise vector drawn from a known noise distribution to a generated example. To optimize the prior weight and the optimization parameters, as well as the number of update steps, a novel automatic metric for quickly evaluating a set of generated explanations is introduced. For the prior over the noise space, a standard normal likelihood is chosen, and activation maximization (AM) is performed in this space. A melody composition method that enhances the original GAN has also been proposed [10].
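The following sketch shows the general shape of activation maximization in a generator's noise space: a noise vector is optimized so that the classifier response on the generated example is maximized while a standard-normal prior keeps the vector in a likely region. Function names, the prior weight, and the number of update steps are illustrative assumptions, not values from [9].

```python
import torch

def activation_maximization(generator, classifier, steps=256,
                            prior_weight=0.1, z_dim=128, lr=0.05):
    """Illustrative activation maximization (AM) in the noise space of a
    pretrained generator: optimize a noise vector so that the generated
    excerpt maximizes the classifier response (e.g. singing-voice activity),
    regularized by a standard-normal prior on z. All hyperparameters here
    are assumptions."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        example = generator(z)                        # generated example
        response = classifier(example).mean()         # classifier response
        prior = prior_weight * 0.5 * (z ** 2).sum()   # -log N(0, I) up to a constant
        loss = -response + prior                      # maximize response, keep z likely
        loss.backward()
        opt.step()
    return z.detach(), generator(z).detach()
```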

Fig. 3. An improved time series model with multiple generators.

INCO-GAN [11] is designed mainly to address two problems: 1) existing models cannot judge by themselves when to end generation; 2) there is no apparent time relationship between the notes or bars. The automatic music generation is divided into two phases: training and generation. The training phase consists of three steps: preprocessing, conditional vector generator (CVG) training, and conditional GAN training. The CVG provides the conditional vector required by the generator. It consists of two parts: one generates a relative position vector representing the progress of generation, and the other predicts whether generation should end. In the training phase, the CVG training and the conditional GAN training are independent of each other. The generation phase comprises three steps: CVG execution, phrase generation, and postprocessing. To evaluate the generated music, the pitch frequency of the music generated by the proposed model was compared with that of human composers' music.
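A minimal sketch of how such a conditional vector generator could be structured, producing a relative-position vector and an end-of-generation probability from the phrases generated so far; the layer types and sizes are assumptions, not the INCO-GAN configuration.

```python
import torch
import torch.nn as nn

class ConditionalVectorGenerator(nn.Module):
    """Illustrative CVG: from the states summarizing what has been generated
    so far, produce (a) a relative-position vector describing how far the
    generation process has progressed and (b) the probability that
    generation should end. Sizes are assumptions."""

    def __init__(self, state_size=128, position_size=16):
        super().__init__()
        self.rnn = nn.GRU(state_size, state_size, batch_first=True)
        self.position_head = nn.Linear(state_size, position_size)
        self.end_head = nn.Linear(state_size, 1)

    def forward(self, phrase_states):
        # phrase_states: (batch, n_generated_phrases, state_size)
        _, h = self.rnn(phrase_states)
        h = h.squeeze(0)
        position = torch.sigmoid(self.position_head(h))   # relative position vector
        p_end = torch.sigmoid(self.end_head(h))           # probability of ending
        return position, p_end
```

In such a setup, the position vector (and possibly the end flag) would be concatenated into the conditional vector fed to the conditional GAN generator.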

In summary, the music generation technologies described above are all based on deep learning. A deep network learns features from a large number of music samples, builds an effective function approximation of the original sample distribution, and finally generates new music samples. Since music is time-series data like speech and text, it can be generated by a variety of deep neural networks designed to capture long-range dependencies in sequences.

This paper proposes an improved time-series network structure based on the multi-track music model MuseGAN. The generator sub-networks build on the MuseGAN architecture: in addition to the time structure generator and the bar generator, a context generator is added, and after these generators a correction mapping model is added to further refine the prediction results. The architecture of the improved network model is shown in Fig. 3. The time structure generator characterizes the unique time-based architecture of music; the bar generator is responsible for generating a single bar in the different tracks, with the timing relationship between bars coming from structures such as Scratch; the context generator is responsible for generating music features that are context-sensitive across tracks. The combination of these three generators can better generate single-track and multi-track music features and tunes in time and space.
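To make the description above more tangible, the following is a hedged sketch of how the three generators and the correction mapping could be wired together; every interface, layer size, and the way the latent codes are fused are illustrative assumptions rather than the exact design of the proposed model.

```python
import torch
import torch.nn as nn

class ImprovedMultiTrackGenerator(nn.Module):
    """Hedged sketch of the three-generator layout described above: a
    time structure generator, a per-bar generator, and a cross-track
    context generator, followed by a correction mapping applied to the
    predicted piano-roll bars. All interfaces and sizes are illustrative."""

    def __init__(self, z_dim=64, n_tracks=5, n_bars=4, bar_shape=(48, 84)):
        super().__init__()
        self.n_bars, self.n_tracks, self.bar_shape = n_bars, n_tracks, bar_shape
        bar_size = bar_shape[0] * bar_shape[1]
        # Time structure generator: one latent per bar, modeling bar-to-bar structure.
        self.temporal = nn.GRU(z_dim, z_dim, batch_first=True)
        # Context generator: a shared latent describing cross-track context.
        self.context = nn.Linear(z_dim, z_dim)
        # Bar generator: produces one bar for every track from the fused latents.
        self.bar = nn.Sequential(nn.Linear(3 * z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_tracks * bar_size), nn.Tanh())
        # Correction mapping: refines the predicted bars towards valid output.
        self.correction = nn.Sequential(nn.Conv2d(n_tracks, n_tracks, 3, padding=1),
                                        nn.Tanh())

    def forward(self, z_time, z_bar, z_context):
        # z_time: (batch, n_bars, z_dim), z_bar: (batch, n_bars, z_dim),
        # z_context: (batch, z_dim)
        t, _ = self.temporal(z_time)                 # bar-wise temporal codes
        c = self.context(z_context)                  # shared cross-track code
        c = c.unsqueeze(1).expand(-1, self.n_bars, -1)
        fused = torch.cat([t, z_bar, c], dim=-1)
        bars = self.bar(fused)                       # (batch, n_bars, tracks * bar)
        roll = bars.view(-1, self.n_tracks, *self.bar_shape)
        roll = self.correction(roll)                 # corrected piano-roll bars
        return roll.view(z_time.size(0), self.n_bars, self.n_tracks, *self.bar_shape)
```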

4 Experiments and Results

As described in Sect. 3, the automatic music generation of [11] is divided into a training phase (preprocessing, CVG training, and conditional GAN training) and a generation phase (CVG execution, phrase generation, and postprocessing), with the CVG and the conditional GAN trained independently of each other. To evaluate the generated music, the pitch frequency of the music generated by the proposed model was compared with that of human composers' music. The paper [3] uses two sets of programs to track the experimental results.

Table 1. The average score of each model on each indicator of Qualified Rhythm Frequency.

In this paper, we generate more than 1000 music sequences with each model and then use subjective and objective indicators (Qualified Rhythm Frequency and Consecutive Pitch Repetitions) to evaluate the performance of each model [12]. It can be seen from Table 1 that the improved model outperforms the traditional model with two generators on two of the Qualified Rhythm Frequency indicators, but is worse than the traditional model with two generators on the Beat indicator. The reason may be that the context generator has an adverse effect on the Beat indicator.
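As a toy illustration of the kind of statistic behind the Consecutive Pitch Repetitions indicator, the function below counts runs of identical consecutive pitches in a melody line; the actual definitions of the indicators follow [12] and may differ in detail.

```python
def consecutive_pitch_repetitions(pitches, min_run=3):
    """Toy version of a Consecutive Pitch Repetitions statistic: count runs
    of identical consecutive pitches of at least min_run notes in a melody.
    The precise definition used for Table 2 follows [12]."""
    runs, length = 0, 1
    for prev, cur in zip(pitches, pitches[1:]):
        if cur == prev:
            length += 1
        else:
            if length >= min_run:
                runs += 1
            length = 1
    if length >= min_run:
        runs += 1
    return runs

# Example: one qualifying run of three repeated pitches (60, 60, 60).
print(consecutive_pitch_repetitions([60, 60, 60, 62, 64, 64]))  # -> 1
```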

Table 2. The average score of each model on each indicator of Consecutive Pitch Repetitions.

It can be seen from Table 2 that the improved model outperforms the traditional model with two generators on two of the Consecutive Pitch Repetitions indicators, and is still worse than the traditional model with two generators on the Beat indicator. The reason may again be the influence of the context generator on the Beat indicator.

5 Conclusion

Music generation technology based on deep learning has been widely applied, but it is still affected by problems such as the loss of musical structure during training. This paper proposes an improved time-series network structure that adds a context generator to the traditional architecture and a correction mapping model that further refines the prediction results. Our experiments indicate that the proposed method can partially improve the Qualified Rhythm Frequency and Consecutive Pitch Repetitions indicators.