1 Introduction

Synthetic content generation with artificial intelligence has been a popular research area over the past decade. Various kinds of content such as images, videos, text and music have been generated with fully-automatic or semi-automatic methods in recent years [3]. Still, modelling musical pieces and generating synthetic musical content pose a unique challenge, as musical content has an artistic structure. Thus, both the representation of such musical data and the assessment of the produced musical content are hard to tackle.

Musical content is examined under two major representation classes: symbolic representation and signal representation [1]. In signal representation, audio signals are used directly and the problem is treated as a signal processing problem. Signal representation provides strong low-level modelling capabilities but requires high computational resources; compared to symbolic representation, it makes up only a small portion of all studies.

Being a higher-level interpretation of musical knowledge, symbolic representation can capture various musical features such as pitch, rhythm and dynamics. Symbolic content can be easily processed and played by computers and poses less computational complexity in terms of both time and memory. The main disadvantage of symbolic representation is the lack of expressiveness: while the content can be played very accurately by computers, it may not capture all the nuances of a human performance. Still, the advantages of symbolic representation are hard to overlook, and it receives the majority of the attention and yields most of the research in the field of music information retrieval (MIR).

The basic components of symbolic music content are high-level features such as notes, rests and chords. Notes and chords come together in certain sequences to form musical pieces [14]. The main research problem is to find a combination of these elements that yields pleasant-sounding music. Two main sub-problems therefore become important when modelling musical content.

The first question is how the inter-dependencies within local groups of notes can be resolved. Inter-dependencies between notes refer to the local relationships between notes in a musical piece [8]. These relationships can be classified as melodic, harmonic or rhythmic dependencies. Melodic dependencies are relationships between successive notes, harmonic dependencies are relationships between notes played simultaneously, and rhythmic dependencies are relationships between the timings of notes. The melodic structure is the main focus of the proposed study, and melodic dependencies will be referred to as inter-dependencies throughout.

The second question is how long-term dependencies between notes can be defined. Music is often structured in a way that relies on long-term dependencies, such as chord changes, melodic themes and rhythmic patterns. In order to generate a melody that is consistent with a song's structure, a machine learning model needs to be able to remember the chord progression even though it may not have appeared in the recent input.

Another important feature of music is that it follows a hierarchical structure (Fig. 1). This natural hierarchical representation in music can be examined from multiple angles. Music is organized into different levels, such as individual notes, phrases which consist of several notes, sections, and movements. Each level of the hierarchy is built on the levels below it, and the relationships between levels are important for understanding and generating music. Another perspective is the relation between melodic and harmonic structures, as shown in Fig. 1. These hierarchical relations allow us to employ new methods such as attentional networks to model musical information in a more natural way.

Fig. 1 Hierarchical nature of music: melody and accompaniment

1.1 Novelty

The current literature heavily employs recurrent networks to model symbolic melodic structures, as the architecture offers a good representation for sequential and temporal data. While RNNs are powerful at modelling local sequential relations with low computational complexity, their power declines as the sequence length increases: the long-term harmony dissolves and the global relations weaken. Recent applications of attention-based architectures show that distant relations in long sequences can be captured with high precision. Attention networks thus provide new opportunities to improve musical quality while strengthening the global dependencies between notes and chords, allowing even longer sequences to be represented [4].

In this study, we propose an MTHA-LSTM deep neural network to tackle the melody generation problem. We introduce a novel approach that generates polyphonic symbolic melodies in two phases, so that melodic sequences and accompaniment sequences are treated in an abstract and hierarchical manner. In the first phase, named the "monophonic melody generation" phase, monophonic pitch sequences are generated. In the second phase, the generated melodies are enriched with chord sequences and the final music is output. In this way, the relations between notes with respect to the provided features are analyzed successively along the temporal axis. To further demonstrate the applicability of the solution, distinct datasets have been analyzed with the model and the results are supported by a human evaluation.

The hierarchical features of music can be used to isolate different levels of information and solve each sub-problem with a divide-and-conquer approach. Most existing music generation models use only monophonic melodies and do not evaluate the accompaniment separately when generating musical content; instead, they process melody and accompaniment together in a single data source. This leads to increased complexity in the feature space and forces researchers to impose limitations, such as smaller sequence lengths or the omission of several pitch symbols, to reduce the level of complexity. In this study, the datasets are pre-processed into melody sequence and chord sequence (accompaniment) datasets.

The first obstacle in developing such structures is obtaining an applicable dataset. The symbolic music datasets used in state-of-the-art studies consist of MIDI pieces which provide only a sequential pitch structure. We have pre-processed the datasets to separate melodic sequences and chord sequences. The melodic sequence dataset consists of monophonic pitch sequences and does not include chords; the chord sequence dataset consists of polyphonic chord sequences.

Both the melodic generation phase and the chord accompaniment phase run on a hierarchical Multi-Head Attention (MTHA) and LSTM model, sequentially. Starting with the modelling of local structures, the sequential monophonic melody data is provided to an LSTM layer. The encoded sequence vector is then supplied to an MTHA layer where the long-term relations are resolved. The result is a monophonic melody that is transferred to the second phase for accompaniment.

The accompaniment generation phase uses the same architecture as the melody generation phase, but its chord sequence is processed along with the synthetically generated melody from the previous phase. This time the model generates a polyphonic sequence, resulting in more realistic and pleasant-sounding music.

A summary of our contributions is as follows:

  • Melodic and chord accompaniment structures are investigated in an abstract and hierarchical manner which provides reduced computational complexity.

  • Two state-of-the-art MIDI datasets are pre-processed in hierarchical form to obtain melody sequence datasets and chord sequence datasets.

  • A pipelined melody generation and accompaniment generation structure is proposed.

  • An improved long-term dependency representation for symbolic musical structures is proposed through the use of MTHA-LSTM networks.

1.2 Context settings

There are many different types of accompaniment that can be applied to melodies to generate music, such as chord accompaniment, bass accompaniment or countermelody [8]. Each has distinct properties and requires a different treatment. In this study, we use the chord accompaniment methodology: a chord sequence involves playing chords that support the melody. Chord accompaniment is one of the most basic types of accompaniment, and it is easier to obtain and process a dataset of chord sequences for a musical piece.

The proposed study uses two Western classical music datasets, which consist of pieces meant to be played on the piano. Thus, while the applicable area is expected to be wider, the focus of the paper is on Western tonal music. Three features are extracted from the MIDI datasets, namely "pitch", "step" and "duration". 88 different pitch values are analyzed, as the datasets are limited to playable piano pieces. Rest notes are omitted from the dataset. Also, only pieces in the key of C Major are analyzed, to further reduce complexity.

The paper is structured as follows. Previous works are briefly discussed in Section 2. The data representation, the theoretical background of the methodology and the experimental environment are presented in Section 3; that section also covers the selection of hyperparameters, the choice of evaluation metrics and the preparation of the dataset. Section 4 presents our experimental results with respect to mathematical scores and the ratings of human listeners, demonstrating the effectiveness of the proposed model in generating high-quality music. Finally, Section 5 presents the conclusion.

2 Literature review

The modelling of long-term dependencies in musical content is a challenging and popular research field. Generating musical content of a given sequence length with high fluency requires careful observation and a multi-disciplinary study. Previous works have improved the overall quality of generated musical content over the past decade, thanks to the development of popular deep networks. Recurrent neural network (RNN) based deep learning solutions show high capacity in modelling symbolic music content [12, 21]. However, RNN-based models face the problem of vanishing gradients when tasked with generating longer sequences [9]. The vanishing gradients problem means that the weights converge to zero and become neutral as the time axis moves through the hidden layers [20]. LSTM and GRU-based solutions address the vanishing gradients problem by adding a memory structure to the neural cells of standard RNNs and show better representational power. Biaxial LSTM networks extend this approach by employing two LSTM architectures operating on the same sequence to model musical content. These bi-axial networks operate in different directions, one from the start of the sequence to the end and one from the end to the start. The study by Mao et al. proposes an improvement over standard LSTMs for style-conditioned music generation; the approach shows improved results at modelling the edges of the sequences. Still, Bi-LSTMs do not make a significant contribution to the long-term dependency problem [22]. Another limitation of RNN-based solutions is that the algorithms quickly overfit.

Another approach to musical modelling is the use of encoder-decoder networks. Encoder-decoder networks take the original content as input and encode it into a dense representation called a latent (hidden) vector. This latent vector is then used together with a noise vector to generate new content in the decoder network. The noise vector is often a random vector, although several works try to improve model capability through better noise vector selection [23]. Still, the performance of encoder-decoder networks depends heavily on the latent vector, and the use of discrete data in the latent vector reduces the efficiency of the method due to the probabilistic nature of the algorithm. Hennig et al. propose an LSTM-based method that aims to transform the data space and improve the performance of the decoder by mapping discrete values to continuous values; the melody generator is then created by applying a VAE model in the mapped data space [11]. Koh et al. propose a representation replacement method based on Convolutional Neural Networks (CNN) for the discrete representation problem. They extract the repetitive patterns in the new CNN-based representation by modelling them with an RNN, and produce musical content with high originality using the obtained information in their Convolutional Variational Recurrent Neural Network model [16]. Another approach to discrete tokenization is to use a hierarchical representation of the latent vector. Roberts et al. propose a hierarchical representation of the latent vector to better match the nature of musical content with the data features [24]. The improved latent vector also improves the long-term representation and provides clear advantages for long-term structure. Still, the fixed size of the latent vector proposed in these studies limits variable-size content generation. Liang et al. propose a pipelined two-level solution where a Variational Autoencoder based first-level encoding model converts musical data into bar-level musical feature vectors [19]. A second level based on Generative Adversarial Networks (GANs) is pipelined to process the features encoded in the first level and generate synthetic musical content. A later study (MIDI-Sandwich2) improves on the single-instrument nature of MIDI-Sandwich and contributes to the problem of music restoration while simultaneously performing multi-instrument musical content representation [18]. MIDI-Sandwich2 uses RNNs to run multiple VAEs simultaneously; in this way, it has been shown that the length of the content can be increased while maintaining long-term structure. Still, the limitations of autoencoder-based architectural approaches (a fixed-length context vector and the need to optimize the noise vector) remain in these works: the produced content length is fixed and the ability to model longer sequences is limited.

In the last decade, an important approach to inferring long-term relationships has been the use of attentional networks. Originally proposed in the field of Natural Language Processing (NLP), attentional networks are now heavily used in musical content production [28]. RNN-based methods require knowledge of the previous n-1 tokens in order to extract information about the n-th token. The attention approach, on the other hand, takes the entire input vector at once and creates a hidden vector whose weights can be changed dynamically. In this way, attention networks, which offer a highly parallelizable solution, can learn and represent much longer input sequences [28].

Dai et al. divided the melody production problem into the sub-problems of rhythm production and basic melody production, which work under a hierarchical general architecture [7]. Rhythm and melody patterns are turned into features using the attention method in the rhythm and basic melody generator models. The study shows that conditional training on the feature space through attention produces a highly successful long-term structure. Hierarchical representation of musical features has also been investigated in attention-based research. Huang et al. succeeded in producing musical content of up to 60 seconds from a dataset of about 2000 different tokens by using attention [13].

Wu et al. propose an algorithm that combines two different neural networks into a hierarchical solution [26]. The study first encodes repetitions in the music content using a CNN-based Structure Generation Network and then uses these encoded features to generate the musical content with an RNN-based Melody Generation Network. Zhang et al., on the other hand, suggested using attention as the discriminator of a GAN network [28]. The integration of attention improves the local and global representation of musical pieces generated with GANs in symbolic representations, but the produced content also shows low originality. Studies in the literature prove that attention methods are an effective tool for modelling long sequences. However, the matrix-based encoding of data and the increase in the Number of Parameters (PN) in deep learning networks posed by the use of attention architectures significantly increase the required memory resources and generally limit the size of the data that can be processed [2]. Researchers often limit the number of tokens, filter the data (for example, employing only pieces in the key of C Major) or keep the sequence length short. This also leads to an increased number of studies concerning Western classical music and reduces the analyses of Eastern or oriental music, as Western music represents pitch classes with certain frequency widths and can be encoded with ease.

When the development of the proposed solutions for the musical content production problem is examined in the context of methods, it can be seen that solutions presented in a hierarchical or hybrid structure produce very productive results. In this context, solutions developed by applying a deductive approach to the musical content problem are likely to be much more robust. The most basic structures of musical content are notes, chords, rhythm, melody and harmony (accompaniment). Rhythms, melodies and chords are all represented by vectors formed by stringing notes and chords one after the other, and a polyphonic sequence can contain rhythm, melody and harmony at the same time. Wu et al. experimented with encoding repetitive expressions (rhythm, melody) with a CNN-based notation coding method. Dai et al. coded the pieces rhythmically and melodically as attributes, and then carried out their training on this enriched dataset [7]. Thus, studies show an inclination towards hierarchical processing of music. Still, a large portion of studies leaves the learning of melodic and harmonic structures to the model by looking at the data as a whole. One reason for this approach is the lack of datasets containing separate rhythmic, melodic and harmonic structures. An overview of relevant papers is summarized in Table 1.

Table 1 An overview of relevant papers

3 Methodology

3.1 Data representation

Two public datasets have been investigated to tackle the melody generation problem. The first dataset, named Classical Piano Midi (CPM), includes the piano compositions of 25 world-famous classical music composers in MIDI format, shared under a CC-BY-SA license [17]. These pieces were obtained from the Web with web scraping methods. The second dataset is the MAESTRO dataset, which was processed by the Google Magenta team and made available to the public [10]. The MAESTRO dataset is also published in MIDI format and contains approximately 200 hours of piano performances; it thus covers a much larger data space than CPM in terms of both the number of pieces and the total number of notes. The CPM dataset is used to optimize the hyperparameters and the training has been done on the MAESTRO dataset.

The MIDI format includes valuable information but cannot be directly supplied to a deep learning architecture as input. Therefore, the pieces in the datasets are pre-processed into an array-type format using the "music21" music processing library [5]. Three main features (pitch, step, duration) are extracted from the pieces. Pitch indicates the frequency class of the played sound and is a categorical feature; only the 88 pitches present on a piano are selected for the analyses. Chords are reduced to notes using their base note. Furthermore, pieces are grouped by key, and only the pieces in C Major are evaluated in this study, to further reduce the computational complexity. The second feature, step, is the starting time of the note and is a numerical feature. Finally, the third feature, duration, is the playing duration of the respective note.
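
As an illustration of this pre-processing step, the following is a minimal sketch built on the music21 API; the file name is a hypothetical placeholder and the exact feature conventions of the study may differ.

```python
# Minimal sketch of the pitch/step/duration extraction described above,
# using the music21 library; the file name is a hypothetical placeholder.
from music21 import converter

def extract_features(midi_path):
    """Return a list of (pitch, step, duration) dictionaries for a MIDI file."""
    score = converter.parse(midi_path)
    rows = []
    for n in score.flatten().notes:            # notes and chords in temporal order
        if n.isChord:
            midi_pitch = n.root().midi         # reduce a chord to a single base note (its root)
        else:
            midi_pitch = n.pitch.midi
        rows.append({
            "pitch": midi_pitch,                         # categorical feature (piano range: 88 keys)
            "step": float(n.offset),                     # starting time of the note, in quarter lengths
            "duration": float(n.duration.quarterLength)  # playing duration of the note
        })
    return rows

# Example (hypothetical file):
# features = extract_features("piece_in_c_major.mid")
```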

Initial analyses showed that some of the chords and notes are rarely used. Thus, a minimum-use threshold of 100 occurrences has been applied to remove underused chords and notes. This process narrowed the data space with minimal compromise on quality and reduced the computational burden as well as the memory requirements of the training phase. As the removed chords and notes are very sparsely represented and often belong to the harmonic structure, the effect on the generated melody remains negligible. The dataset features obtained for CPM and MAESTRO after the pre-processing phase are shown in detail in Table 2. The repetition counts are computed with a sequence length of 64 for the pieces.
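
A sketch of this thresholding step, assuming `pieces` is a list of pitch-token sequences produced by the extraction step (a hypothetical variable name), might look as follows.

```python
# Sketch of the minimum-use threshold described above: pitch tokens occurring
# fewer than 100 times in the whole corpus are dropped.
from collections import Counter

def filter_rare_tokens(pieces, threshold=100):
    """pieces: list of pitch-token sequences; returns the sequences without rare tokens."""
    counts = Counter(token for piece in pieces for token in piece)
    kept = {token for token, count in counts.items() if count >= threshold}
    return [[token for token in piece if token in kept] for piece in pieces]
```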

Table 2 CPM and MAESTRO Dataset Features

3.2 Proposed model

The proposed model tackles the melody generation problem with a two-tiered hierarchical architecture. The musical pieces are pre-processed and three temporal features are extracted: "pitch", "step" and "duration". These features are encoded in embedding layers and merged to form an input vector. The obtained input vector is then fed into a Dense layer, a fully-connected neural layer in which all features contribute to the calculation of a single non-linear function; it serves to project the features to the desired dimensions. The outputs of the Dense layer are fed directly into the LSTM layers. We use two LSTM layers to obtain a context vector with a sequence length of 32; this sequence contains 8 bars, with each bar having 4 notes.

The context vector strongly encodes the inter-dependencies between notes. It is then fed into a multi-head attention module where long-term dependencies are resolved. The proposed hierarchical deep learning architecture is depicted in Fig. 2.
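
A minimal Keras sketch of this data flow is given below; the layer widths, head count and output heads are illustrative assumptions rather than the exact published configuration.

```python
# Illustrative TensorFlow/Keras sketch of the embedding -> Dense -> 2x LSTM -> MTHA
# pipeline described above; sizes and heads are assumptions, not the tuned values.
import tensorflow as tf

SEQ_LEN, N_PITCH = 32, 88   # sequence length and pitch vocabulary (piano keys)

pitch_in = tf.keras.Input((SEQ_LEN,), dtype="int32", name="pitch")
step_in = tf.keras.Input((SEQ_LEN, 1), name="step")
dur_in = tf.keras.Input((SEQ_LEN, 1), name="duration")

# Embed the categorical pitch feature and merge it with the numeric features.
pitch_emb = tf.keras.layers.Embedding(N_PITCH, 64)(pitch_in)
x = tf.keras.layers.Concatenate()([pitch_emb, step_in, dur_in])
x = tf.keras.layers.Dense(128, activation="relu")(x)       # project to a common width

# Two stacked LSTM layers encode local inter-dependencies into a context sequence.
x = tf.keras.layers.LSTM(512, return_sequences=True)(x)
x = tf.keras.layers.LSTM(512, return_sequences=True)(x)

# Multi-head attention over the LSTM outputs resolves long-term dependencies.
x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)

outputs = {
    "pitch": tf.keras.layers.Dense(N_PITCH)(x),   # logits over the pitch vocabulary
    "step": tf.keras.layers.Dense(1)(x),
    "duration": tf.keras.layers.Dense(1)(x),
}
model = tf.keras.Model([pitch_in, step_in, dur_in], outputs)
```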

Fig. 2 Proposed hierarchical architecture

3.3 Long-short term memory layer

Recurrent networks are deep learning tools that excel at representing sequential features. The method is used in many fields with sequential and temporal data, such as signal processing, NLP and MIR. Its key feature is a memory structure that lets earlier time steps affect the prediction at the current step. The loss is optimized using gradient-based methods, which leads to the well-known vanishing gradients problem as the sequence length increases.

LSTM and Gated Recurrent Unit (GRU) address the vanishing gradients problem and improve the architecture by using neural level modifications. LSTM uses a 3-gate approach (forget, input, output) to optimize the balance between past and new information gained from the input sequence.

The forget gate evaluates which data from prior time steps should remain. As shown in (1), the forget gate activation is calculated by taking a weighted sum of the current input vector x(t) and the previous hidden state vector \(h(t-1)\), and then passing the sum through a sigmoid function. The forget gate f(t) yields output values between 0 and 1: an output of 0 means the previous information is discarded, while 1 means it is preserved and can affect the calculation of future values. The forget gate controls how much of the previous memory cell state is forgotten, and the input gate controls how much of the current input vector is stored in the memory cell. This allows the LSTM to learn long-term relationships between different parts of the sequence.

$$\begin{aligned} f(t) = \sigma (W(f_x) * x(t) + W(f_h) * h(t-1) + b(f)) \end{aligned}$$
(1)

An input gate is used to determine whether the information from the current input should be included, as shown in (2). The calculation is similar to the forget gate; only the weight and bias vectors are dedicated to the gate. Values close to 0 indicate that the information will not be kept, while values close to 1 indicate that the current input is accepted.

$$\begin{aligned} i(t) = \sigma (W(i_x) * x(t) + W(i_h) * h(t-1) + b(i)) \end{aligned}$$
(2)

The candidate memory cell update vector is calculated by taking a weighted sum of the current input vector and the previous hidden state vector, and this time passing the sum through a tanh function.

$$\begin{aligned} g(t) = \tanh (W(g_x) * x(t) + W(g_h) * h(t-1) + b(g)) \end{aligned}$$
(3)

The output gate activation is a sigmoid function, which again outputs a value between 0 and 1. The output gate activation is then multiplied by the memory cell state to produce the output vector for the LSTM cell. The output vector is then passed to the next layer of the neural network. The output gate allows the LSTM to decide what information from the memory cell state is most important to output to the next layer.

$$\begin{aligned} o(t) = \sigma (W(o_x) * x(t) + W(o_h) * h(t-1) + b(o)) \end{aligned}$$
(4)
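
Putting (1)-(4) together, a single LSTM cell step can be sketched in NumPy as follows; the cell-state update c(t) = f(t)·c(t-1) + i(t)·g(t) and h(t) = o(t)·tanh(c(t)) are the standard completions implied by the description above.

```python
# NumPy sketch of one LSTM cell step following (1)-(4) and the standard state update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W[g] = (W_x, W_h) weight pair and b[g] = bias vector, for g in {'f', 'i', 'g', 'o'}."""
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + b["f"])   # forget gate, (1)
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + b["i"])   # input gate, (2)
    g_t = np.tanh(W["g"][0] @ x_t + W["g"][1] @ h_prev + b["g"])   # candidate update, (3)
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + b["o"])   # output gate, (4)
    c_t = f_t * c_prev + i_t * g_t       # blend of preserved memory and new information
    h_t = o_t * np.tanh(c_t)             # hidden state passed to the next step/layer
    return h_t, c_t
```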

3.4 Self-attention

Self-attention is a mechanism used in deep learning models, particularly in attention-based models, to capture relations between different positions of a sequence [25]. In self-attention, a set of key-value pairs represents each position of the input sequence, and a query vector is used to compute a weighted sum of the values, where the weights are determined by the similarity between the query and each key. The input sequence is first transformed into three matrices: the query matrix Q, the key matrix K and the value matrix V. These matrices are then used to compute a set of attention weights, which represent the importance of the relationships between different positions in the input vector. Finally, the output of the self-attention layer is computed as a weighted sum of the value matrix, where the weights are given by the attention weights. The scaled dot-product attention, as calculated by [25], is shown in (5). The constant \(dim_k\) is used to scale the scores with respect to the key dimension.

$$\begin{aligned} attention\_scores(Q,K,V) = softmax(QK^T/\sqrt{dim_k})V \end{aligned}$$
(5)

The context vector is calculated by multiplying the attention scores by the value matrix as shown in (6). The context vector represents a weighted sum of the hidden state vectors of the entire sequence, where the weights are determined by the attention scores. The context vector can then be used as input to the next layer of the neural network.

$$\begin{aligned} context\_vector = attention\_scores * V \end{aligned}$$
(6)
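
The following NumPy sketch implements (5) and (6) directly; softmax is applied row-wise so that each query position receives a weight distribution over the keys.

```python
# NumPy sketch of scaled dot-product attention, (5)-(6).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, dim_k); V: (seq_len, dim_v). Returns context and attention weights."""
    dim_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dim_k)                        # similarity of every query to every key
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stabilization
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax -> attention weights
    context = weights @ V                                    # weighted sum of values, (6)
    return context, weights
```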

3.5 Multi-head attention

While a single attention head provides the functionality to represent a long sequence with good accuracy, the modelling power of self-attention decays when tasked with multiple features. The algorithm can be further improved to represent multi-featured datasets with even higher accuracy. Increasing the number of attention heads provides modelling power over multiple features with a linear computational overhead. The implementation of multi-head attention is similar to self-attention. The algorithm first divides the Q, K and V matrices into several heads as shown in (7), where \(n_{heads}\) denotes the number of heads into which each matrix is split.

$$\begin{aligned} {\begin{matrix} Q_{head_i} = split(Q, n_{heads}) \\ K_{head_i} = split(K, n_{heads}) \\ V_{head_i} = split(V, n_{heads}) \end{matrix}} \end{aligned}$$
(7)

Each head then uses its own Q, K and V matrices to calculate its own context vector. The context vector calculation uses the same approach as single-head self-attention, as shown in (8): the attention scores calculated in (5) are multiplied with the value matrix for each attention head. Several attention heads can thus run their own attention computations in parallel while focusing on different aspects of the data. The resulting vector captures the sequential relations: two different parts of the sequence, even if they are distant along the time axis, can be related to each other with ease through this context vector. Also, the concatenated context vector has a higher dimensionality than the context vector of any individual attention head, which allows the neural network to learn more complex relationships between different parts of the sequence.

$$\begin{aligned} context\_vectors = [attention\_scores_i * V_i], i \in n_{heads} \end{aligned}$$
(8)
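
A sketch of the head-splitting idea in (7)-(8), reusing the single-head routine from the previous sketch, is given below; it assumes the feature dimension is divisible by the number of heads.

```python
# Sketch of multi-head attention, (7)-(8): split along the feature axis, attend per head,
# then concatenate the per-head context vectors.
import numpy as np

def multi_head_attention(Q, K, V, n_heads):
    q_heads = np.split(Q, n_heads, axis=-1)      # (7): split Q, K and V into n_heads parts
    k_heads = np.split(K, n_heads, axis=-1)
    v_heads = np.split(V, n_heads, axis=-1)
    contexts = [scaled_dot_product_attention(q, k, v)[0]
                for q, k, v in zip(q_heads, k_heads, v_heads)]
    return np.concatenate(contexts, axis=-1)     # (8): concatenated context vectors
```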

3.6 Experimental environment

The pre-processing of the MIDI datasets, the data analysis of the newly prepared datasets and the training of the models were all implemented using Python 3.9. The proposed hierarchical deep learning framework was implemented using TensorFlow 2.12.0. The hyperparameter optimization was conducted at the B.T.U High-Performance Clustering Laboratory (HPCLAB). The utilized CPUs are Intel® Core™ i9-10900X chips. The pre-processing and data analysis steps were conducted on the CPUs, while the training phase utilized the GPUs. The model was trained on Nvidia 3090 GPUs with CUDA version 8. Eight GPUs were used in parallel to accelerate the optimization process.

3.7 Hyperparameter selection

The choice of hyperparameters has a significant impact on the performance of deep learning models; thus, fine-tuning the parameters before the training phase is crucial for producing a successful model. The hyperparameters optimized for the proposed model include the number of neurons in the LSTM layer, the dropout rate for regularization, the number of attention heads and the choice of optimizer. The search space was chosen by intuition and the optimization was done with grid search, where each hyperparameter is optimized one by one while the other hyperparameters are fixed. The experiments were run with a 10-fold cross-validation approach. The best performing hyperparameters are given in Table 3. The learning rate controls how quickly a deep learning model updates its weights during training. The number of iterations is given as "iteration". The dropout rate controls the amount of regularization applied to the model during training. The batch size controls the number of training examples used to update the model's parameters in each iteration. The number of attention heads is the hyperparameter of the multi-head attention algorithm: the heads split the original Q, K and V matrices and each deals with a distinct region of the encoded feature space. The number of neurons is the hyperparameter of the LSTM layers.

Table 3 Hyperparameter optimization

3.8 Evaluation metrics

The training phase of the DL algorithm uses the loss metrics Root Mean Square Error (RMSE) and \(R^2\). Both metrics allow a distance-based evaluation and are proven to be effective at optimizing mathematical problems. Equation (9) shows the calculation of the RMSE metric; a lower RMSE means a lower error between the original and the synthetic content.

$$\begin{aligned} RMSE=\sqrt{(1/N)*\sum _{L=1}^{N} (y_p-y_c)^2} \end{aligned}$$
(9)
Table 4 Qualitative experimental results

Equation (10) shows the calculation of the \(R^2\) metric. \(R^2\) is a powerful metric for regression-based analysis; scores close to 1.0 show that the results are close to the original feature space.

$$\begin{aligned} R^2= 1 - \frac{\sum _i (y_i-\hat{y}_i)^2}{\sum _i (y_i-\bar{y}_i)^2} \end{aligned}$$
(10)
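
Both metrics can be transcribed directly from (9) and (10); the sketch below assumes y_true and y_pred are NumPy arrays of the original and predicted feature values.

```python
# Direct NumPy transcriptions of the RMSE (9) and R^2 (10) metrics.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot
```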

4 Experimental results

Table 4 shows the experimental results with respect to MSE loss for the proposed model and the baseline models. The attention module greatly improves the overall scores in a multi-feature environment. Due to the multi-head structure of the attention architecture, each feature can specialize in a different part of the matrix, which leads to improved accuracy on the MSE metric.

Fig. 3 Loss curves for the features: (a) general loss (b) pitch loss (c) duration loss (d) step loss

Repetition is a feature that musical content shares with natural language. The produced synthetic musical content should have a number of repetitions comparable to the original content. Borrowing from NLP, we have conducted an extensive n-gram analysis of the pitches. The n-grams carry important information about the recurrent relations between the original and synthetic feature spaces. For the tests, bi-gram and tri-gram analyses have been conducted using the "pitch" feature; previous studies show that the repetitions forming the rhythmic structure are mostly found in consecutive groups of two or three notes. The repetition results are shown in the table, with the averages of the original pieces on the left and the models' outcomes on the right. The repetition alignment scores show better alignment with the original pieces compared to the baseline methods, whose produced pieces struggle to exhibit a strong repetitive structure.
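
The exact repetition metric is not spelled out here, but one simple way to count repeated bi-grams and tri-grams over a pitch sequence is sketched below; the study's precise definition may differ.

```python
# Sketch of an n-gram repetition count over a pitch sequence: how many n-grams occur
# more than once (the exact metric used in the study may differ).
from collections import Counter

def repeated_ngram_count(pitches, n):
    grams = Counter(tuple(pitches[i:i + n]) for i in range(len(pitches) - n + 1))
    return sum(count for count in grams.values() if count > 1)

# Example: bigram and trigram repetitions of a short pitch sequence.
sequence = [60, 62, 64, 60, 62, 64, 67]
print(repeated_ngram_count(sequence, 2), repeated_ngram_count(sequence, 3))
```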

Table 5 Comparative analysis of the proposed model and existing studies

The general loss curve as well as the feature-level loss curves for pitch, duration and step are given in Fig. 3. The loss curves show that the sharp decrease during the first epochs is followed by continuous improvement over the remaining epochs. Comparing the feature-wise loss curves, the pitch loss poses the highest complexity, as pitch is a categorical feature with more than 20k tokens. Thus, the focus on the categorical pitch feature through multi-head attention shows a significant gain over the baseline models. The features representing duration and step are numerical values, and while the proposed model allocates only a small portion of its neural capacity to modelling these features, the results show that this still provides enough capacity for modelling.

Table 6 Computational complexity of models

The model is also compared with other studies regarding qualitative and listening test evaluations to show the applicability of the proposed model. The results depicted in Table 5 show that the proposed MTHA-LSTM algorithm excels at qualitative metrics over the state-of-the-art solutions. Studies employ several different metrics for human evaluation, which shows a clear lack of consensus on the human evaluation of generative music. The majority of studies report more than one subjective metric; only the highlighted metrics are shown in the table to reduce complexity. While the proposed method is among the best performers in the survey scores, it is difficult to directly compare these numbers across studies. The scores of real data for musical-melodic structure are often near 4 on the Likert scale, which shows that the artistic nature of the field plays a big role in musical perception. The human listening test analysis for the proposed study is given in detail in Section 4.2.

While the qualitative analysis provides good scores on the formation of a melodic structure, the reliability of the analysis should also be inspected. We have conducted a Wilcoxon test to evaluate the reliability of the scores [6]. The Wilcoxon signed-rank test is a statistical test that evaluates the statistical relevance of two distributions. It is a non-parametric test, which means that it does not make any assumptions about the distribution of the data; this makes it a good choice for comparing two samples when the data is not normally distributed, which is often the case in music generation problems. The Wilcoxon test conducted between the original pieces and the MTHA-LSTM generated pieces indicates a score of 0.05, and the null hypothesis is accepted. The hypothesis is that, given two samples of human-composed music and AI-generated music, there is no difference between the two types of music in terms of the quality measure of melodic similarity.
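
A minimal sketch of such a test with SciPy is shown below; the paired rating arrays are hypothetical placeholders, not the study's data.

```python
# Sketch of a Wilcoxon signed-rank test on paired listener ratings (hypothetical values).
from scipy.stats import wilcoxon

human_scores = [4.1, 3.8, 4.3, 3.9, 4.0, 4.2, 3.7, 4.4]   # ratings of human-composed pieces
model_scores = [3.9, 3.9, 4.1, 3.8, 4.2, 4.0, 3.6, 4.3]   # ratings of generated pieces

stat, p_value = wilcoxon(human_scores, model_scores)
# If the p-value exceeds the chosen significance level, the null hypothesis of
# "no difference between the two groups" cannot be rejected.
print(stat, p_value)
```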

4.1 Computational and memory complexity analysis

DL algorithms offer sophisticated architectures for representing more complex problems. As the complexity increases, both the computational and the memory requirements increase and pose another challenge for researchers and engineers. Memory complexity limits the dataset representation, which can lead to less data being usable in the training phase; computational complexity, on the other hand, leads to longer training times. Thus, complexity analysis provides valuable information about the real-world applicability and scalability of the proposed approaches.

Table 6 shows the computational complexity of the proposed work along with the baseline approaches. The baseline algorithms RNN, LSTM and LSTM-Attention, as well as the proposed hierarchical LSTM-Attention, all consist of a Dense layer which merges the embedding layers, followed by two recurrent layers. Training time evaluations are limited to 200 epochs and execution time evaluations include generating a sequence of 64 notes. As the attention-based models have a higher memory requirement, the neuron sizes for the recurrent layers (RNN and LSTM) are kept at 512 and the batch size is set to 256.

Despite the fact that the attention-based methods contain additional attentional layers to process the data, they do not require more time to train. The architectural additions in the hierarchical LSTM-MTHA do not lead to a sharp increase in training time; the algorithm completes the training phase faster than the baseline models and reaches better loss and alignment scores. The RNN architecture trains over a significantly longer period, as the structure of the algorithm does not utilize modern GPU hardware well. The differences in inference times between the analyzed models show that the effect of the computational overhead is negligible: the attention-based methods do not require longer inference periods despite having a more complex architecture.

The table also includes a comparison of the memory requirements of the models. The number of parameters (PN) column shows the number of parameters to be calculated in each iteration of the given algorithm and directly affects its memory usage. Attention-based solutions require more memory than recurrent architectures but perform well on time-series data and yield more accurate results.

4.2 Human evaluations

Evaluating synthetic music generation models using mathematical models is an open and challenging topic [27]. Due to the artistic nature of the problem, scoring the content purely through mathematical metrics could produce misleading results. Several music generation studies have shown that, by employing subjective evaluations, the assessment of the model can be enhanced. Thus, according to the current state of the literature, a hybrid evaluation including both mathematical and subjective evaluations is the best way to evaluate such studies.

The human listening test is evaluated using a survey of four questions. The questions are scored on a Likert scale [15]. The Likert scale is unidirectional and lets respondents express an opinion together with its degree; the degree of opinion ranges from 1 to 5, where 1 and 5 represent the two opposite ends. The following four questions have been evaluated through the listening tests:

  • (AI) Is this piece composed by artificial intelligence or human?

  • (LT) Score the melodic structure of the piece.

  • (QA) Score the overall quality of the piece.

  • (FA) Score the fluency of the piece.

The respondents were selected from various backgrounds with two different levels of knowledge, ranging from casual listeners to professionals and people with a theoretical background. The survey was held open for one week and 57 people attended. As for the demographics, 61.4% of the respondents identified themselves as casual listeners and 38.6% as having a theoretical background or professional history. Of all attendees, 35.1% are between 18-25 years old, 47.4% are 26-40 years old and 17.5% are 41-65 years old.

Table 7 provides the results of the survey. The survey is evaluated from three different perspectives: the "All" columns show the survey results for all attendees, "Inx" shows the casual listener scores and "Exp" shows the experienced attendees' scores. For the overall scores, the model excels on all metrics except quality. For the experienced attendees, the proposed model shows higher overall scores on all metrics.

Table 7 Human evaluation scores for the proposed work and baseline methods

5 Conclusion

Musical content is of significant importance to human life. From entertainment to health, composers create artistic pieces to accomplish goals in the respective fields. While having a highly subjective nature, music also has many rules which help to mathematically represent the content along with its features at many levels, such as notes, chords, or melodic and harmonic structures. Current studies examine MIR in two major branches of representation: symbolic music representation and signal representation. Symbolic representation retains musical features as a high-level language, where the features can be processed and analyzed through mathematical models.

Sequence-based representation of musical content poses two major challenges. The first challenge is determining the temporal inter-dependencies between features; recurrent LSTM architectures show great success at modelling small-scale neighbouring relations. The second challenge is retaining the general characteristics as the sequence progresses, known as the long-term dependency problem. State-of-the-art approaches such as LSTMs or GRUs cannot cope with the complexity of relations in long sequences, and the content quality quickly deteriorates. Newer approaches, such as attention-based algorithms, offer powerful solutions to the long-term dependency problem.

Musical content also admits a hierarchical representation. Utilizing this hierarchical representation, we have distinctly categorized different levels of information, namely melodic and harmonic structures. This yields a different set of sub-problems, which reduces the complexity of the problem at hand. Thus, to enhance the long-term relation modelling capability, the power of attentional networks is combined with the idea of hierarchically modelling musical content.

A two-phase approach is developed to tackle the music generation problem. First, the melodic structure is generated; then, the generated melody is analyzed along with the accompaniment data to produce the final musical piece. Melodic structures can be enriched with accompaniment using several techniques, and we have used chord accompaniment to implement the accompaniment generation. Both phases use a two-layer deep learning algorithm where LSTM and MTHA are employed together. This allows the algorithm to first learn the local relations with the LSTM and then improve the long-term relations using MTHA. Multiple attention heads are employed to improve the modelling capability over a multi-featured feature space.

The study analyzes two popular datasets in MIR, namely "Classical Music MIDI" and "Google MAESTRO". These datasets are pre-processed so that the melodic and harmonic structures are split into two distinct datasets: a "melody sequence dataset" and an "accompaniment sequence dataset".

The algorithms are evaluated against the RMSE and \(R^2\) metrics. The results show a clear improvement over baseline methods as well as previous state-of-the-art solutions. Musical content has an artistic style, and mathematical assessments supported by human evaluations enhance the applicability of the solutions. A survey has been conducted to identify the performance of the proposed work along with the baseline approaches and to decide whether the proposed method is well suited to generating synthetic musical content.

The limitations of the proposed approach include the selection of pieces only in the key of C Major, the removal of rest notes and the use of a single instrument to cope with the increasing computational complexity and memory requirements. Future studies could overcome these limitations with sufficient computational resources and more advanced architectures. One approach would be to reduce the memory requirements of attention architectures: the attention layers often contain sparsely populated matrices, and the underlying architecture can be updated to offer better memory complexity. Increased memory efficiency could lead to better handling of even longer sequences. Also, the attention algorithms produce bilateral relations through the use of context vectors; the relational dimensionality of the algorithm could be enhanced so that trilateral or even higher-order relations can be modelled.

We believe that our model is a significant step forward in the field of polyphonic symbolic melody generation. Our model is able to generate melodies that are both musically coherent and expressive, and it outperforms previous methods by a significant margin. However, there are still some major challenges in this field, and we hope that our work will inspire other researchers to continue working on this important problem.