1 Introduction

Even though the Western music scale has only twelve notes per octave, it encompasses the vast majority of the music one is familiar with, from Bach and Beethoven to Elton John and Beyonce. The past ten years have seen growing interest in, and a diversity of methods for, algorithmically creating music with a variety of learning models [1]. Can deep learning models be used to study and utilize the genius of Mozart or Beethoven? How easy is it to recreate the particular styles of music that today's listeners are accustomed to hearing?

While music generation has attracted considerable attention, music prediction has received far less. In this research, given a musical snippet, predictions are made for the following N musical events (for instance, 10 quarter-note beats). The actual musical events are then compared with the predicted music, and the prediction is graded using a scoring system. The experiments leave one wondering whether music prediction or music generation is the trickier task.

2 Data

The Patterns for Prediction Development Dataset was used. This data was produced by analysing a randomly chosen portion of the Lakh MIDI Dataset, a dataset made up of one million popular music tracks. The symbolic MIDI format was used for both monophonic and polyphonic songs. Each input, or prime, corresponds to about 35 s of music, and each output, or continuation, comprises the following 10 quarter-note beats.

3 Feature Representation

3.1 Basic Music Notation

Notes are the fundamental components of music and the foundation from which all chords and melodies are built. Each note has a pitch and a duration. Pitch is essentially the note's sound frequency; the MIDI music file system can store 128 different pitches. Duration, the length of the note, is measured in quarter-note lengths. When several notes are played simultaneously, they form a chord. Monophonic music lacks chord structure, since only one note is played at a time; polyphonic music has chords.

3.2 Note and Chord Representation

Because notes have 128 pitches in the MIDI file system, each note was encoded as an integer ranging from 0 to 128. The number 129 was used to symbolize a rest, i.e., the amount of time between notes. A chord was represented in string format as a collection of pitches; for instance, “68.70.75” denotes the chord with the pitches 68, 70, and 75. All chords in the training set were sampled, and a dictionary was built that mapped each distinct combination to an integer value. For the medium-sized dataset, this corresponded to 101,039 distinct combinations, and consequently to an input dimensionality of that size.
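As a concrete illustration, a minimal sketch of such a chord dictionary is given below; the chord strings and the helper name build_chord_vocab are assumptions for illustration, not the exact code used in this work.

```python
# Minimal sketch: map each distinct chord string (e.g. "68.70.75") to an
# integer id, as described above. Names and data are illustrative only.

def build_chord_vocab(training_chords):
    """Assign a unique integer to every distinct chord string."""
    vocab = {}
    for chord in training_chords:
        if chord not in vocab:
            vocab[chord] = len(vocab)
    return vocab

# Example usage ("129" denotes a rest, as in the encoding above)
chords = ["60", "68.70.75", "129", "68.70.75"]
chord_to_int = build_chord_vocab(chords)
encoded = [chord_to_int[c] for c in chords]   # [0, 1, 2, 1]
```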

One of the difficulties in music prediction is how to handle inputs (chords) that are present in the testing set but absent from the training set. This issue does not arise when generating music, since the model only produces chords that are present in the training set. When making a prediction, however, there will invariably be note combinations in the testing data that the model has not “seen” before. The initial strategy was to replace the unfamiliar chord with the most prevalent chord in the training set.

A more sophisticated approach is to substitute the most frequent chord found in the prime sequence for the unknown chord. The combinatorial proliferation of chord possibilities in polyphonic music makes this process more challenging.
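A minimal sketch of both fallback strategies follows, assuming a chord_to_int vocabulary like the one above and a Counter of training-set chord frequencies; the function name and signature are hypothetical.

```python
from collections import Counter

def encode_with_fallback(chord, chord_to_int, training_counts, prime_chords):
    """Encode a chord, substituting a known chord when it was never seen in training.

    First try the most frequent known chord inside the prime sequence;
    otherwise fall back to the most frequent chord in the training set.
    """
    if chord in chord_to_int:
        return chord_to_int[chord]
    known_in_prime = [c for c in prime_chords if c in chord_to_int]
    if known_in_prime:
        substitute = Counter(known_in_prime).most_common(1)[0][0]
    else:
        substitute = training_counts.most_common(1)[0][0]
    return chord_to_int[substitute]
```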

Duration Representation: Durations are represented in quarter-note lengths. In principle, the number of distinct durations could be very large, but in practice durations are drawn from a limited set of values. All durations in the training data were sampled, and a dictionary was built that mapped each distinct duration to an integer. The dictionary's length varied depending on the dataset, although it typically contained fewer than 100 entries.

Combination Representation: As a pre-processing step, the combination of note pitch and duration was represented as a tuple. For instance, the tuple (60, 1) was used to represent a note with a pitch of 60 and a duration of one quarter-note length. Each distinct tuple was mapped to an integer value, just as was done for the chords and durations.
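Durations and (pitch, duration) tuples were encoded with the same dictionary pattern as the chords; a brief, hypothetical sketch for the tuple case is shown below.

```python
def build_tuple_vocab(note_events):
    """Map each distinct (pitch, duration) tuple to an integer id."""
    vocab = {}
    for event in note_events:
        if event not in vocab:
            vocab[event] = len(vocab)
    return vocab

events = [(60, 1.0), (62, 0.5), (60, 1.0)]
tuple_to_int = build_tuple_vocab(events)
sequence = [tuple_to_int[e] for e in events]   # [0, 1, 0]
```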

Interval Representation: Note intervals were used in the final input representation that was assessed. An interval is the pitch difference between two notes; each interval denotes the difference in semitones between the pitches of two adjacent notes. For instance, the pitch difference between the notes 60 and 65 is 5 semitones. Negative intervals represent drops in pitch. There are 256 total possible interval values because there are 128 total pitches. For this representation, interval sequences rather than pitch sequences were used for training and prediction. For instance, the sequence of intervals for a prime consisting of the four notes 65, 70, 45, and 60 would be 5, -25, 15. To recover note predictions from the list of predicted intervals, the last pitch of the prime was preserved and used as the starting point.
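The interval representation can be sketched as follows; the helper names are illustrative, and the reconstruction step mirrors the description above of keeping the prime's last pitch as the starting point.

```python
def pitches_to_intervals(pitches):
    """Convert a pitch sequence into semitone differences between adjacent notes."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def intervals_to_pitches(start_pitch, intervals):
    """Rebuild a pitch sequence from a starting pitch and predicted intervals."""
    pitches = [start_pitch]
    for step in intervals:
        pitches.append(pitches[-1] + step)
    return pitches[1:]   # the predicted notes only

prime = [65, 70, 45, 60]
print(pitches_to_intervals(prime))          # [5, -25, 15]
print(intervals_to_pitches(60, [2, 2, 1]))  # [62, 64, 65]
```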

4 Methods and Modeling

4.1 Markov Chain Model

One of the earliest models used for music generation was the Markov chain. A first-order and a second-order Markov chain were trained and evaluated, with notes and durations trained separately. This model can be combined with inference methods such as constrained inference and beam search [2,3,4].
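A minimal sketch of a first-order Markov chain over encoded notes is shown below (a second-order chain would use pairs of notes as states); the greedy decoding and the fallback for unseen states are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def train_first_order(sequences):
    """Count note-to-note transitions over all training sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict(transitions, prime, n_events):
    """Extend the prime by repeatedly taking the most likely next note."""
    predicted, current = [], prime[-1]
    for _ in range(n_events):
        if transitions[current]:
            current = transitions[current].most_common(1)[0][0]
        else:
            current = random.choice(list(transitions))  # unseen state fallback
        predicted.append(current)
    return predicted
```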

4.2 LSTM Model

Long Short-Term Memory (LSTM) networks were used for the next set of experiments. One model consisted of a single network with multivariate input (both notes and durations). The other model used two separate LSTM networks, one for the notes and one for the durations. It was established that notes and durations have a significant relationship. Both models contained an input layer, a 64-dimensional embedding layer, an LSTM layer with 512 hidden units, a batch-normalization layer, and a dropout layer with a dropout rate of 0.3. After concatenating the two inputs, the multivariate model has two dense output layers and is trained with a categorical cross-entropy loss and the RMSprop optimizer for gradient descent. The batch size was 64, and there were typically 100 epochs. Experiments were performed with more epochs, but the outcomes did not change, and over-fitting was observed in some cases. Figure 1 below displays the graph of the multivariate LSTM model.

Fig. 1. LSTM Multivariate Model
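A hedged sketch of such a multivariate model is given below using the Keras functional API; the vocabulary sizes, sequence length, and layer names are placeholder assumptions rather than the exact configuration behind Fig. 1.

```python
from tensorflow.keras import layers, models

SEQ_LEN, NOTE_VOCAB, DUR_VOCAB = 32, 130, 100   # assumed sizes

def branch(vocab_size, name):
    """Input -> 64-dim embedding -> 512-unit LSTM -> batch norm -> dropout 0.3."""
    inp = layers.Input(shape=(SEQ_LEN,), name=f"{name}_input")
    x = layers.Embedding(vocab_size, 64)(inp)
    x = layers.LSTM(512)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    return inp, x

note_in, note_feat = branch(NOTE_VOCAB, "note")
dur_in, dur_feat = branch(DUR_VOCAB, "duration")

# Concatenate the two branches, then predict the next note and duration.
merged = layers.concatenate([note_feat, dur_feat])
note_out = layers.Dense(NOTE_VOCAB, activation="softmax", name="note_out")(merged)
dur_out = layers.Dense(DUR_VOCAB, activation="softmax", name="dur_out")(merged)

model = models.Model([note_in, dur_in], [note_out, dur_out])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit([note_x, dur_x], [note_y, dur_y], batch_size=64, epochs=100)
```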

4.3 LSTM Encoder-Decoder Model

Two separate models for the notes and durations, as well as a single multivariate model, were trained. The parameters matched those of the LSTM model described above. The longest prime sequence in the training set served as the input sequence length, and the longest target sequence in the training set served as the output sequence length. The graph of the multivariate encoder-decoder model is shown in Fig. 2.

Fig. 2. LSTM Encoder-Decoder Multivariate Model
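For reference, a compact sketch of a univariate LSTM encoder-decoder for note sequences follows, in the spirit of Fig. 2; the dimensions and the teacher-forcing decoder input are assumptions, not the paper's exact architecture.

```python
from tensorflow.keras import layers, models

NOTE_VOCAB, IN_LEN, OUT_LEN = 130, 64, 16   # assumed sizes

# Encoder: embed the prime and keep only the final LSTM states.
enc_in = layers.Input(shape=(IN_LEN,))
enc_emb = layers.Embedding(NOTE_VOCAB, 64)(enc_in)
_, state_h, state_c = layers.LSTM(512, return_state=True)(enc_emb)

# Decoder: generate the continuation, initialized with the encoder states.
dec_in = layers.Input(shape=(OUT_LEN,))
dec_emb = layers.Embedding(NOTE_VOCAB, 64)(dec_in)
dec_seq = layers.LSTM(512, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_out = layers.Dense(NOTE_VOCAB, activation="softmax")(dec_seq)

model = models.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy")   # targets assumed one-hot encoded
```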

4.4 Inference Modeling

Prediction was performed using the standard beam search algorithm with beam sizes ranging from k = 2 to k = 4. For constrained inference, only the notes from the prime were allowed as candidates [5]; for instance, the note from the song's prime sequence with the highest score would become the next prediction.
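A generic sketch of beam search with an optional prime-note constraint is shown below; step_probs stands in for whichever trained model supplies next-event probabilities, and the interface is hypothetical.

```python
import math

def beam_search(step_probs, prime, n_events, k=3, allowed=None):
    """Beam search over next-event probabilities.

    step_probs(sequence) returns a dict {candidate_event: probability}.
    If `allowed` is given (constrained inference), e.g. allowed=set(prime),
    candidates outside that set are discarded.
    """
    beams = [(0.0, list(prime))]                     # (log-probability, sequence)
    for _ in range(n_events):
        expanded = []
        for score, seq in beams:
            for event, p in step_probs(seq).items():
                if allowed is not None and event not in allowed:
                    continue
                if p > 0:
                    expanded.append((score + math.log(p), seq + [event]))
        if not expanded:                             # every candidate filtered out
            break
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:k]
    best_seq = max(beams, key=lambda b: b[0])[1]
    return best_seq[len(prime):]                     # predicted continuation only
```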

4.5 Other Techniques

Transposition was used to make the input data more uniform. Transposition in music refers to the process of raising or lowering notes and chords by a fixed distance; the intervals between successive notes remain unchanged. Western music is built on 12 pitch classes, and most harmonic music is written in a major or minor key based on one of them. For instance, changing a song from the key of C major to the key of E major requires adding four semitones to each pitch. Depending on whether the music was originally in a major or minor key, each input song was transposed to either C major or C minor, and training then took place on the transposed pitches. At test time, the prime sequence was transposed into C major or C minor before the prediction notes were generated. The predicted target notes were then transposed back into their original key before being compared with the actual target notes.
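The key-normalization step can be sketched as below; the tonic pitch class (0 = C, ..., 11 = B) is assumed to come from an external key-detection step, which is not shown.

```python
def transpose(pitches, semitones):
    """Raise or lower every pitch by a fixed number of semitones."""
    return [p + semitones for p in pitches]

def to_c(pitches, tonic_pitch_class):
    """Transpose a song so that its tonic becomes C (or C minor for minor keys)."""
    return transpose(pitches, -tonic_pitch_class)

def from_c(pitches, tonic_pitch_class):
    """Transpose predictions back into the song's original key."""
    return transpose(pitches, tonic_pitch_class)

# Example: a fragment in E (tonic pitch class 4) moved to C and back.
fragment = [64, 68, 71, 76]
normalized = to_c(fragment, 4)      # [60, 64, 67, 72]
restored = from_c(normalized, 4)    # original pitches again
```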

5 Results and Observations

Both the training times and the outcomes were adversely affected by the exponential growth of the input data for the polyphonic music. Scoring was based on a cardinality score and a pitch score.

Cardinality Score: The cardinality score is based on minimizing a general supermodular, monotone non-increasing function \(f(\cdot)\) subject to a cardinality constraint. Let \(x^*_1 \in \arg \max _x \{f(\emptyset ) - f(\{x\})\}\) and use the greedy algorithm on the set function

$$ g(S) := f(\{x^*_1\}) - f(\{x^*_1\} \cup S), $$

which is a monotone non-decreasing submodular set function with \(g(\emptyset ) = 0\). Thus, the greedy algorithm maximizes this function with respect to a cardinality constraint within an approximation factor of \(1 - 1/e\). However, there are two caveats. First, the greedy algorithm uses a budget of \(k - 1\) instead of k (a budget of one is spent on identifying \(x^*_1\)). Second, and most importantly, the approximation factor is obtained on the function g(S) and not f(S); one obtains a set S of size \(k - 1\) such that

$$ f(\{x^*_1\}) - f(\{x^*_1\} \cup S) \ge \left( 1 - \frac{k}{(k - 1)e}\right) \cdot \bigl(f(\{x^*_1\}) - f(\{x^*_1\} \cup S^*)\bigr), $$

where \(S^*\) is the optimal set of size \(k - 1\) to be added to \(\{x^*_1\}\) with the goal of minimizing \(f(\cdot)\).

Pitch Score: The cardinality score rewards a continuation that has the right “shape”, i.e., the correct separation between notes. However, it would also reward, for example, a continuation transposed one semitone lower. Because this is not entirely accurate, the pitches are scored as well. To do so, two normalized histograms of the pitches are built, one from the prediction and one from the ground truth, and the score is determined by how much the histograms overlap. The same operation is also carried out ignoring octaves. The final score is a linear combination of the cardinality score and the pitch score.
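One plausible reading of the histogram-overlap computation is sketched below; the exact normalization and octave handling used for the reported scores may differ.

```python
from collections import Counter

def pitch_score(predicted, truth, ignore_octaves=True):
    """Overlap of normalized pitch histograms; 1.0 means identical distributions."""
    if ignore_octaves:
        predicted = [p % 12 for p in predicted]   # fold pitches onto pitch classes
        truth = [p % 12 for p in truth]
    h_pred, h_true = Counter(predicted), Counter(truth)
    n_pred, n_true = sum(h_pred.values()), sum(h_true.values())
    return sum(min(h_pred[p] / n_pred, h_true[p] / n_true)
               for p in set(h_pred) | set(h_true))
```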

Baseline. The first baseline involved selecting notes from the training set at random. The second involved selecting notes from the training set according to their frequency. The last approach followed the same steps as the first, but added the restriction that the sampled note must appear in the song's prime sequence. The third method produced the highest baseline for both the polyphonic and monophonic music sets.
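The three baselines can be sketched as follows; the uniform-versus-weighted sampling details are assumptions based on the description above.

```python
import random
from collections import Counter

def baseline_random(train_notes, n):
    """Baseline 1: sample uniformly from the distinct notes in the training set."""
    return random.choices(sorted(set(train_notes)), k=n)

def baseline_frequency(train_notes, n):
    """Baseline 2: sample notes in proportion to their training-set frequency."""
    notes, weights = zip(*Counter(train_notes).items())
    return random.choices(notes, weights=weights, k=n)

def baseline_constrained(train_notes, prime, n):
    """Baseline 3: like baseline 1, but restricted to notes that occur in the prime."""
    pool = sorted(set(train_notes) & set(prime)) or sorted(set(train_notes))
    return random.choices(pool, k=n)
```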

Discussions. The polyphonic and monophonic datasets were evaluated individually. Each dataset was split into training and testing groups, with 90% of the songs used for training and 10% for testing. Figures 3 and 4 display the outcomes of numerous training and testing runs for the monophonic and polyphonic datasets. The LSTM model produced the best results when it used constrained inference, transposition of the input data, and a longer sequence length of 64. The model's accuracy was more than double the best baseline score. The accuracy of the LSTM model was improved by transposing the input and lengthening the sequence. Transposition improved the accuracy of all the models, suggesting that more uniform input data improves learning. The beam search method of inference brought no appreciable improvement in accuracy. Constrained inference also improved prediction accuracy, particularly for the Markov model.

The note and duration feature representations were far more accurate than the interval feature representation, suggesting that intervals may have less structure and greater unpredictability than pitches. Unsurprisingly, the results were significantly worse for the input with larger dimensions (i.e., the tuple combinations of pitches and durations). There were some minor differences between employing a multivariate LSTM model and training two distinct models for notes and durations. When all other variables remained constant, the multivariate model consistently beat the two independent models by a small margin, suggesting that there is some relationship between the predicted notes and durations. In addition to having slightly higher accuracy, the multivariate models also required less time to train than two individual models.

Fig. 3. Results - Model Predictions for Polyphonic Music Data

Across all models, the pitch score was far more variable than the cardinality score. Closer examination of the results showed that many of the predicted sequences contained only two or three different pitches. In other words, the model could determine which pitch or pitches appeared most frequently in the target sequence, but not the order in which they were played. Additionally, no predictions were ever made for notes in the target sequence that seemed more “random” (i.e., not present in the prime sequence). While music may have a strong structural element, it also has randomness, or surprises, that provide its appeal.

Overall, the LSTM model performed better than every other model. The Markov models had the highest variance of all the models, indicating that they would benefit from further restrictions or domain expertise [6,7,8].

The baseline values for the polyphonic music were significantly higher than those for the monophonic music, so neither set of results should be used to judge the other. The difference in baseline scores between the two forms of music is due to the method used to compute the score.

Fig. 4. Results - Model Predictions on Monophonic Datasets

6 Conclusion and Future Work

Because music has both structure and randomness, it presents an intriguing and difficult prediction problem. Although feature representation is significant, feature size is even more crucial. Training is substantially more challenging for inputs with very high dimensions. Constraints can significantly shrink the search space and improve model accuracy when used in inference. Meaningful learning and inference depend on limiting the input space and the search space in ways that nonetheless capture pertinent relationships and significant features.

Random pop tracks were employed in this research, but pop music is just one of many musical styles. Jazz is more free-form, while classical music has much more structure. A fascinating extension of this research would be to compare the results across other musical genres. Models such as Hidden Markov Models and LSTMs with attention mechanisms also remain to be explored [9, 10].