
1 Introduction

Deep learning is a rapidly developing technology in the field of AI. From the sky to the ocean, from drones to unmanned vehicles, deep learning is demonstrating its huge potential and capability. In the medical field, machine recognition of diseases in lung images has surpassed that of humans, and the images and music generated by GAN technology can pass for the real thing; in the commercial field, micropayments can already be made through facial recognition; and AlphaGo has defeated top human Go players in official competition.

Music, on the other hand, is an art that conveys emotion through sound and a form of human self-expression. Creating music helps people entertain themselves and express their feelings. It is feasible to use deep learning to imitate the patterns and behaviors of existing songs and to create content that sounds like real music to the human ear. Many researchers have already produced results in the field of music generation based on artificial intelligence and deep learning.

Multi-track music composition [1] requires professional knowledge and a command of the interfaces of digital music software. Moreover, few systems have focused on emotion-driven multi-track composition without great human involvement. Accordingly, the authors present a platform that composes music from elements of everyday life. The system can be roughly split into three main parts.

An end-to-end generation framework called XiaoIce Band was proposed in [2], which generates a song with several tracks. The CRMCG model uses an encoder-decoder framework to generate both rhythm and melody. For rhythm generation, in order to keep the generated rhythm in harmony with the existing part of the music, the previously generated melody and rhythm are taken into consideration. For melody generation, the previous melody, the currently generated rhythm, and the corresponding chord are used to generate the melody sequence. Since rhythm is closely related to melody, the rhythm loss only updates the parameters related to rhythm generation, whereas the melody loss updates all parameters. The MICA model addresses the multi-track arrangement task: it treats the melody sequence as the input of the encoder and the multiple track sequences as the outputs of the decoders, with cells designed between the hidden layers to learn the relationships and keep the harmony between different tracks.

The Attention Cell is used to capture the parts of the other tracks that are relevant to the current track. The authors conducted melody generation and arrangement generation tasks to evaluate the effectiveness of CRMCG and MICA. For the melody generation task, Magenta and GANMidi were chosen as baselines, and chord progression analysis and rest analysis were used to evaluate the CRMCG model. For the arrangement generation task, HRNN was chosen as the baseline, and harmony analysis and arrangement analysis were used for evaluation.
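To make the cross-track attention idea concrete, the following is a minimal sketch of an attention cell in which the decoder state of the current track attends to the hidden states of the other tracks' decoders. All module names and layer sizes here are illustrative assumptions and are not taken from the XiaoIce Band implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackAttentionCell(nn.Module):
    """Illustrative attention cell: lets the decoder of the current track
    attend to the hidden states of the other tracks' decoders at each step."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h_current, h_others):
        # h_current: (batch, hidden)           current track's decoder state
        # h_others:  (batch, n_tracks, hidden) states of the other tracks
        q = self.query(h_current).unsqueeze(1)            # (batch, 1, hidden)
        k = self.key(h_others)                            # (batch, n_tracks, hidden)
        v = self.value(h_others)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)               # (batch, 1, n_tracks)
        context = torch.bmm(weights, v).squeeze(1)        # (batch, hidden)
        # Fuse the cross-track context with the current hidden state.
        return torch.tanh(self.out(torch.cat([h_current, context], dim=-1)))


# Toy usage: the current track attends to 3 other tracks, hidden size 128.
cell = TrackAttentionCell(hidden_size=128)
h_cur = torch.randn(8, 128)
h_oth = torch.randn(8, 3, 128)
fused = cell(h_cur, h_oth)   # (8, 128), fed back into the current decoder
```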

The paper [3] proposed a method to generate multi-track chord music using a GAN. The model transforms MIDI files and chord music into multi-track piano rolls for bass, piano, drums, and guitar, whose dimension is K. After standard preprocessing of the MIDI files, all music is divided into more than one hundred parts according to the beat, and the pitch is restricted to a certain range, giving data of dimension [K × 5 × 192 × 84]. The model contains a generator and a discriminator built on a convolutional neural network architecture; their structures are mirror-symmetric, and a sigmoid activation is used at the discriminator output to separate real from generated data. Since the music data are not discrete and multiple chord notes often sound at the same time, the convolutional part adopts a full-channel architecture, which helps the network converge quickly. ReLU and tanh are used in the generator and LeakyReLU in the discriminator to deal with the gradient problem, and Adam is used for optimization.
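As a rough illustration of a symmetric convolutional generator/discriminator pair with the activations mentioned above (ReLU and tanh in the generator, LeakyReLU and a final sigmoid in the discriminator, Adam for optimization), the sketch below builds networks for piano rolls of shape 5 × 192 × 84. The layer counts and kernel sizes are assumptions, not the configuration used in [3].

```python
import torch
import torch.nn as nn

# Piano-roll shape assumed from the text: 5 tracks x 192 time steps x 84 pitches.
Z_DIM = 128

generator = nn.Sequential(
    nn.Linear(Z_DIM, 256 * 12 * 21), nn.ReLU(),
    nn.Unflatten(1, (256, 12, 21)),
    nn.ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2)), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2)), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=(2, 1), stride=(2, 1)), nn.ReLU(),
    nn.ConvTranspose2d(32, 5, kernel_size=(2, 1), stride=(2, 1)), nn.Tanh(),
)   # output: (batch, 5, 192, 84)

discriminator = nn.Sequential(
    nn.Conv2d(5, 32, kernel_size=(2, 1), stride=(2, 1)), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=(2, 1), stride=(2, 1)), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=(2, 2), stride=(2, 2)), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=(2, 2), stride=(2, 2)), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(256 * 12 * 21, 1), nn.Sigmoid(),   # real vs. generated
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

# Sanity check: generate a batch of fake piano rolls and score them.
fake = generator(torch.randn(4, Z_DIM))
score = discriminator(fake)          # (4, 1) probabilities
```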

Although there are many music generation technologies, existing methods remain unsatisfactory: most of the music and songs they generate can easily be distinguished from real music by the human ear. There are many reasons for this. For example, due to the lack of "aligned" data in which the same song appears in different styles [4], music style conversion can only use unsupervised methods. In addition, the GAN (RaGAN) loss used during training cannot guarantee that the original music structure will be retained after conversion [5].

This paper proposes an improved time-series network structure based on the multi-track music model MuseGAN and adds a correction mapping model after the generators to pull the predicted results towards the correct results. Experiments on standard datasets show that the proposed method improves subjective and objective evaluation indicators such as Qualified Rhythm Frequency.

2 Symbolic Music Generation and Genre Transfer

Furthermore, when style conversion and classification are required, style alignment is needed first, with the goal of realizing a VAE and style classification in a shared latent space [6]. While converting the style of music data, this method can also change the type of instrument, such as piano to violin, and can change auditory characteristics such as pitch. The model has a wide range of applications, such as music mixing, music and song blending, and music insertion. Each data file is in MIDI format with a specific style tag, and information such as pitch, meter, and tempo is extracted from the file and converted. The VAE is trained with a Kullback-Leibler divergence term, weighted by a hyperparameter, together with a cross-entropy reconstruction loss. In order to obtain the joint distribution of the overall data, three encoder-decoder pairs are used to form the shared space.
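To illustrate the objective just described, the following is a minimal sketch of a VAE loss in which a Kullback-Leibler term, weighted by a hyperparameter (here called `beta`, an assumed name), is added to a cross-entropy reconstruction loss; it is not the exact formulation of [6].

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target_roll, mu, logvar, beta=1.0):
    """Illustrative VAE objective: cross-entropy reconstruction plus a
    KL-divergence term weighted by the hyperparameter beta.
    recon_logits and target_roll are piano-roll tensors of the same shape;
    mu and logvar parameterize the approximate posterior q(z|x)."""
    recon = F.binary_cross_entropy_with_logits(recon_logits, target_roll,
                                               reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```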

Another model for musical style conversion is CycleGAN [7]; the structure of its generator and discriminator is shown in Fig. 1. In order to perform style transfer while retaining the tune and structure of the original music, a discriminator is needed to balance the intensity difference between input and output. The generator extracts features from the original data and can also take noise as input, but this method can only handle the transformation between two domains. The goal of the generator is to learn a variety of high-level features, so the discriminator is required to distinguish between the source data and the generated data. The loss function includes a cycle-consistency term, which helps to retain the overall information during two-way conversion so that the output remains close to a realistic form. In the experiments on the dataset, LeakyReLU with normalization is used, and the final output is a classification distribution.

Fig. 1. Architecture of the CycleGAN model.
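A minimal sketch of the cycle-consistency term discussed above, assuming two generators `g_ab` and `g_ba` that map between the two style domains; the loss weight `lam` and the use of an L1 distance are assumptions rather than the exact choices of [7].

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, real_a, real_b, lam=10.0):
    """Illustrative cycle-consistency term for two-way style transfer:
    music mapped A->B->A (and B->A->B) should reconstruct the original,
    which preserves the tune and overall structure. lam is an assumed
    weighting hyperparameter."""
    rec_a = g_ba(g_ab(real_a))   # A -> B -> A
    rec_b = g_ab(g_ba(real_b))   # B -> A -> B
    return lam * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))
```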

Deep learning can also be used to generate rhythm patterns for electronic dance music with novel rhythms and interesting patterns that are not found in the training dataset. The authors extend the GAN framework with an additional genre classifier and a genre ambiguity loss that pushes generated samples away from the genre distributions inherent in the training data [8]. Two methods are proposed in that paper (Fig. 2).

Fig. 2. GAN with genre ambiguity loss.
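One plausible way to implement such a genre ambiguity loss is to penalize the generator whenever a genre classifier assigns the generated rhythm pattern confidently to any single genre, for example by measuring the cross-entropy against a uniform genre distribution. The sketch below is an assumption-laden illustration, not the exact loss of [8].

```python
import torch
import torch.nn.functional as F

def genre_ambiguity_loss(genre_logits):
    """Illustrative genre-ambiguity term: pushes the genre classifier's
    prediction on a generated rhythm pattern towards the uniform
    distribution, so the sample does not fit neatly into any training
    genre. (The exact formulation in [8] may differ; this is a sketch.)"""
    log_probs = F.log_softmax(genre_logits, dim=-1)
    n_genres = genre_logits.size(-1)
    uniform = torch.full_like(log_probs, 1.0 / n_genres)
    # Cross-entropy of the predicted genre distribution against uniform.
    return -(uniform * log_probs).sum(dim=-1).mean()
```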

3 Improved Time Series Model Network on Multitrack Music

The paper [9] proposed a GAN-based approach together with a quantitative measure estimating the interpretability of a set of generated examples, and applied the method to a state-of-the-art deep audio classification model that predicts singing voice activity in music excerpts. The method is designed to provide examples that activate a given neuron activation pattern ("classifier response"), where a generator is trained to map a noise vector drawn from a known noise distribution to a generated example. To optimize the prior weight and the optimization parameters, as well as the number of update steps, a novel automatic metric for quickly evaluating a set of generated explanations is introduced. For the prior over the noise space, a standard normal likelihood is chosen, and activation maximization (AM) is performed in this space. A melody composition method that enhances the original GAN has also been proposed [10].
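The following sketch shows the general shape of activation maximization in a generator's noise space: a noise vector is optimized so that the classifier response on the generated example is maximized while a standard-normal prior keeps the vector in a likely region. Function names, the prior weight, and the number of update steps are illustrative assumptions, not values from [9].

```python
import torch

def activation_maximization(generator, classifier, steps=256,
                            prior_weight=0.1, z_dim=128, lr=0.05):
    """Illustrative activation maximization (AM) in the noise space of a
    pretrained generator: optimize a noise vector so that the generated
    excerpt maximizes the classifier response (e.g. singing-voice activity),
    regularized by a standard-normal prior on z. All hyperparameters here
    are assumptions."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        example = generator(z)                        # generated example
        response = classifier(example).mean()         # classifier response
        prior = prior_weight * 0.5 * (z ** 2).sum()   # -log N(0, I) up to a constant
        loss = -response + prior                      # maximize response, keep z likely
        loss.backward()
        opt.step()
    return z.detach(), generator(z).detach()
```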

Fig. 3. An improved time series model with multiple generators.

INCO-GAN [11] is designed mainly to address two problems: 1) existing models cannot judge by themselves when to end generation; 2) there is no apparent time relationship between the notes or bars. The automatic music generation is divided into two phases: training and generation. The training phase consists of three steps: preprocessing, conditional vector generator (CVG) training, and conditional GAN training. The CVG provides the conditional vector required by the generator. It consists of two parts: one generates a relative position vector representing the progress of generation, and the other predicts whether generation should end. In the training phase, the CVG training and the conditional GAN training are independent of each other. The generation phase comprises three steps: CVG execution, phrase generation, and postprocessing. To evaluate the generated music, the pitch frequency of the music generated by the proposed model was compared with that of human composers' music.
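A minimal sketch of how such a conditional vector generator could be structured, producing a relative-position vector and an end-of-generation probability from the phrases generated so far; the layer types and sizes are assumptions, not the INCO-GAN configuration.

```python
import torch
import torch.nn as nn

class ConditionalVectorGenerator(nn.Module):
    """Illustrative CVG: from the states summarizing what has been generated
    so far, produce (a) a relative-position vector describing how far the
    generation process has progressed and (b) the probability that
    generation should end. Sizes are assumptions."""

    def __init__(self, state_size=128, position_size=16):
        super().__init__()
        self.rnn = nn.GRU(state_size, state_size, batch_first=True)
        self.position_head = nn.Linear(state_size, position_size)
        self.end_head = nn.Linear(state_size, 1)

    def forward(self, phrase_states):
        # phrase_states: (batch, n_generated_phrases, state_size)
        _, h = self.rnn(phrase_states)
        h = h.squeeze(0)
        position = torch.sigmoid(self.position_head(h))   # relative position vector
        p_end = torch.sigmoid(self.end_head(h))           # probability of ending
        return position, p_end
```

In such a setup, the position vector (and possibly the end flag) would be concatenated into the conditional vector fed to the conditional GAN generator.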

In summary, the music generation technologies described above are all based on deep learning. A deep network learns features from a large number of music samples, builds an effective function approximation of the original sample distribution, and finally generates new music samples. Since music is time-series data like speech and text, it can be generated by a variety of deep neural networks designed to capture long-range dependencies in sequences.

This paper proposes an improved time-series network structure based on the multi-track music model MuseGAN. The generator sub-networks build on the MuseGAN architecture: in addition to the time structure generator and the bar generator, a context generator is added, and after these generators a correction mapping model is added to further refine the prediction results. The architecture of the improved network model is shown in Fig. 3. The time structure generator characterizes the unique time-based architecture of music; the bar generator is responsible for generating a single bar in the different tracks, with the timing relationship between bars coming from structures such as Scratch; the context generator is responsible for generating music features that are context-sensitive across tracks. The combination of these three generators can better generate single-track and multi-track music features and tunes in time and space.
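To make the description above more tangible, the following is a hedged sketch of how the three generators and the correction mapping could be wired together; every interface, layer size, and the way the latent codes are fused are illustrative assumptions rather than the exact design of the proposed model.

```python
import torch
import torch.nn as nn

class ImprovedMultiTrackGenerator(nn.Module):
    """Hedged sketch of the three-generator layout described above: a
    time structure generator, a per-bar generator, and a cross-track
    context generator, followed by a correction mapping applied to the
    predicted piano-roll bars. All interfaces and sizes are illustrative."""

    def __init__(self, z_dim=64, n_tracks=5, n_bars=4, bar_shape=(48, 84)):
        super().__init__()
        self.n_bars, self.n_tracks, self.bar_shape = n_bars, n_tracks, bar_shape
        bar_size = bar_shape[0] * bar_shape[1]
        # Time structure generator: one latent per bar, modeling bar-to-bar structure.
        self.temporal = nn.GRU(z_dim, z_dim, batch_first=True)
        # Context generator: a shared latent describing cross-track context.
        self.context = nn.Linear(z_dim, z_dim)
        # Bar generator: produces one bar for every track from the fused latents.
        self.bar = nn.Sequential(nn.Linear(3 * z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_tracks * bar_size), nn.Tanh())
        # Correction mapping: refines the predicted bars towards valid output.
        self.correction = nn.Sequential(nn.Conv2d(n_tracks, n_tracks, 3, padding=1),
                                        nn.Tanh())

    def forward(self, z_time, z_bar, z_context):
        # z_time: (batch, n_bars, z_dim), z_bar: (batch, n_bars, z_dim),
        # z_context: (batch, z_dim)
        t, _ = self.temporal(z_time)                 # bar-wise temporal codes
        c = self.context(z_context)                  # shared cross-track code
        c = c.unsqueeze(1).expand(-1, self.n_bars, -1)
        fused = torch.cat([t, z_bar, c], dim=-1)
        bars = self.bar(fused)                       # (batch, n_bars, tracks * bar)
        roll = bars.view(-1, self.n_tracks, *self.bar_shape)
        roll = self.correction(roll)                 # corrected piano-roll bars
        return roll.view(z_time.size(0), self.n_bars, self.n_tracks, *self.bar_shape)
```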

4 Experiments and Results

As described in Sect. 3, the automatic music generation of [11] is divided into a training phase (preprocessing, CVG training, and conditional GAN training) and a generation phase (CVG execution, phrase generation, and postprocessing), with the CVG and the conditional GAN trained independently of each other. To evaluate the generated music, the pitch frequency of the music generated by the proposed model was compared with that of human composers' music. The paper [3] uses two sets of programs to track the experimental results.

Table 1. The average score of each model on each indicator of Qualified Rhythm Frequency.

In this paper, we generate more than 1000 music sequences with each model and then use subjective and objective indicators (Qualified Rhythm Frequency and Consecutive Pitch Repetitions) to evaluate the performance of each model [12]. It can be seen from Table 1 that the improved model outperforms the traditional model with two generators on two of the Qualified Rhythm Frequency indicators, but is worse than the traditional model with two generators on the Beat indicator. The reason may be that the context generator has an adverse effect on the Beat indicator.
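As a toy illustration of the kind of statistic behind the Consecutive Pitch Repetitions indicator, the function below counts runs of identical consecutive pitches in a melody line; the actual definitions of the indicators follow [12] and may differ in detail.

```python
def consecutive_pitch_repetitions(pitches, min_run=3):
    """Toy version of a Consecutive Pitch Repetitions statistic: count runs
    of identical consecutive pitches of at least min_run notes in a melody.
    The precise definition used for Table 2 follows [12]."""
    runs, length = 0, 1
    for prev, cur in zip(pitches, pitches[1:]):
        if cur == prev:
            length += 1
        else:
            if length >= min_run:
                runs += 1
            length = 1
    if length >= min_run:
        runs += 1
    return runs

# Example: one qualifying run of three repeated pitches (60, 60, 60).
print(consecutive_pitch_repetitions([60, 60, 60, 62, 64, 64]))  # -> 1
```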

Table 2. The average score of each model on each indicator of Consecutive Pitch Repetitions.

It can be seen from Table 2 that the improved model outperforms the traditional model with two generators on two of the Consecutive Pitch Repetitions indicators, and is still worse than the traditional model with two generators on the Beat indicator. The reason may again be the influence of the context generator on the Beat indicator.

5 Conclusion

Music generation technology based on deep learning has been widely applied, but it is still affected by problems such as the loss of musical structure during training. This paper proposes an improved time-series network structure that adds a context generator to the traditional architecture and a correction mapping model that further refines the prediction results. Our experiments indicate that the proposed method can partially improve the Qualified Rhythm Frequency and Consecutive Pitch Repetitions indicators.