1 Introduction

Even though the Western music scale has only twelve notes per octave, it encompasses the vast majority of the music one is familiar with, from Bach and Beethoven to Elton John and Beyonce. The past ten years have seen growing interest in, and a diversity of methods for, algorithmically creating music with a variety of learning models [1]. Can deep learning models be used to study and utilize the genius of Mozart or Beethoven? How easy is it to recreate the particular styles of music that today's listeners are accustomed to hearing?

While music generation has attracted considerable attention, music prediction has received far less. In this research, given a musical snippet, predictions are made for the following N musical events (for instance, 10 quarter-note beats). The actual musical events are then compared with the predicted music, and the prediction is graded using a scoring system. The experiments leave one wondering whether music prediction or music generation is the trickier task.

2 Data

The Patterns for Prediction Development Dataset was used. This data was produced by analysing a randomly chosen portion of the Lakh MIDI Dataset, a dataset made up of one million popular music tracks. The symbolic MIDI format was used for both monophonic and polyphonic songs. Each input, or prime, corresponds to about 35 s of music, and each output, or continuation, comprises the following 10 quarter-note beats.

3 Feature Representation

3.1 Basic Music Notation

Notes are the fundamental components of music and the foundation from which all chords and melodies are built. Each note has a pitch and a duration. Pitch is essentially the note's sound frequency; the MIDI music file system can store 128 different pitches. Duration, the length of the note, is measured in quarter-note lengths. When several notes are played simultaneously, they form a chord. Monophonic music lacks chord structure, since only one note is played at a time; polyphonic music has chords.

3.2 Note and Chord Representation

Because notes have 128 pitches in the MIDI file system, each note was encoded as an integer ranging from 0 to 128. The number 129 was used to symbolize a rest, i.e., the amount of time between notes. A chord was represented in string format as a collection of pitches; for instance, “68.70.75” denotes the chord with the pitches 68, 70, and 75. All chords in the training set were sampled, and a dictionary was built that mapped each distinct combination to an integer value. For the medium-sized dataset, this corresponded to 101,039 distinct combinations, and consequently to an input dimensionality of that size.
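As a concrete illustration, a minimal sketch of such a chord dictionary is given below; the chord strings and the helper name build_chord_vocab are assumptions for illustration, not the exact code used in this work.

```python
# Minimal sketch: map each distinct chord string (e.g. "68.70.75") to an
# integer id, as described above. Names and data are illustrative only.

def build_chord_vocab(training_chords):
    """Assign a unique integer to every distinct chord string."""
    vocab = {}
    for chord in training_chords:
        if chord not in vocab:
            vocab[chord] = len(vocab)
    return vocab

# Example usage ("129" denotes a rest, as in the encoding above)
chords = ["60", "68.70.75", "129", "68.70.75"]
chord_to_int = build_chord_vocab(chords)
encoded = [chord_to_int[c] for c in chords]   # [0, 1, 2, 1]
```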

One of the difficulties in music prediction is how to handle inputs (chords) that are present in the testing set but absent from the training set. This issue does not arise when generating music, since the model only produces chords that are present in the training set. When making a prediction, however, there will invariably be note combinations in the testing data that the model has not “seen” before. The initial strategy was to replace the unfamiliar chord with the most prevalent chord in the training set.

A more sophisticated approach is to substitute the most frequent chord found in the prime sequence for the unknown chord. The combinatorial proliferation of chord possibilities in polyphonic music makes this process more challenging.
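A minimal sketch of both fallback strategies follows, assuming a chord_to_int vocabulary like the one above and a Counter of training-set chord frequencies; the function name and signature are hypothetical.

```python
from collections import Counter

def encode_with_fallback(chord, chord_to_int, training_counts, prime_chords):
    """Encode a chord, substituting a known chord when it was never seen in training.

    First try the most frequent known chord inside the prime sequence;
    otherwise fall back to the most frequent chord in the training set.
    """
    if chord in chord_to_int:
        return chord_to_int[chord]
    known_in_prime = [c for c in prime_chords if c in chord_to_int]
    if known_in_prime:
        substitute = Counter(known_in_prime).most_common(1)[0][0]
    else:
        substitute = training_counts.most_common(1)[0][0]
    return chord_to_int[substitute]
```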

Duration Representation: Durations are represented in quarter-note lengths. In principle, the number of distinct durations could be very large, but in practice durations are drawn from a limited set of values. All durations in the training data were sampled, and a dictionary was built that mapped each distinct duration to an integer. The dictionary's length varied depending on the dataset, although it typically contained fewer than 100 entries.

Combination Representation: As a pre-processing step, the combination of note pitch and duration was represented as a tuple. For instance, the tuple (60, 1) was used to represent a note with a pitch of 60 and a duration of one quarter-note length. Each distinct tuple was mapped to an integer value, just as was done for the chords and durations.
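Durations and (pitch, duration) tuples were encoded with the same dictionary pattern as the chords; a brief, hypothetical sketch for the tuple case is shown below.

```python
def build_tuple_vocab(note_events):
    """Map each distinct (pitch, duration) tuple to an integer id."""
    vocab = {}
    for event in note_events:
        if event not in vocab:
            vocab[event] = len(vocab)
    return vocab

events = [(60, 1.0), (62, 0.5), (60, 1.0)]
tuple_to_int = build_tuple_vocab(events)
sequence = [tuple_to_int[e] for e in events]   # [0, 1, 0]
```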

Interval Representation: Note intervals were used in the final input representation that was assessed. An interval is the pitch difference between two notes; each interval denotes the difference in semitones between the pitches of two adjacent notes. For instance, the pitch difference between the notes 60 and 65 is 5 semitones. Negative intervals represent drops in pitch. There are 256 total possible interval values because there are 128 total pitches. For this representation, interval sequences rather than pitch sequences were used for training and prediction. For instance, the sequence of intervals for a prime consisting of the four notes 65, 70, 45, and 60 would be 5, -25, 15. To recover note predictions from the list of predicted intervals, the last pitch of the prime was preserved and used as the starting point.
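The interval representation can be sketched as follows; the helper names are illustrative, and the reconstruction step mirrors the description above of keeping the prime's last pitch as the starting point.

```python
def pitches_to_intervals(pitches):
    """Convert a pitch sequence into semitone differences between adjacent notes."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def intervals_to_pitches(start_pitch, intervals):
    """Rebuild a pitch sequence from a starting pitch and predicted intervals."""
    pitches = [start_pitch]
    for step in intervals:
        pitches.append(pitches[-1] + step)
    return pitches[1:]   # the predicted notes only

prime = [65, 70, 45, 60]
print(pitches_to_intervals(prime))          # [5, -25, 15]
print(intervals_to_pitches(60, [2, 2, 1]))  # [62, 64, 65]
```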

4 Methods and Modeling

4.1 Markov Chain Model

One of the earliest models used for music generation was the Markov chain. A first-order and a second-order Markov chain were trained and evaluated, with notes and durations trained separately. This model can be combined with inference methods such as constrained inference and beam search [2,3,4].
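A minimal sketch of a first-order Markov chain over encoded notes is shown below (a second-order chain would use pairs of notes as states); the greedy decoding and the fallback for unseen states are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def train_first_order(sequences):
    """Count note-to-note transitions over all training sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict(transitions, prime, n_events):
    """Extend the prime by repeatedly taking the most likely next note."""
    predicted, current = [], prime[-1]
    for _ in range(n_events):
        if transitions[current]:
            current = transitions[current].most_common(1)[0][0]
        else:
            current = random.choice(list(transitions))  # unseen state fallback
        predicted.append(current)
    return predicted
```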

4.2 LSTM Model

Long Short-Term Memory (LSTM) networks were used for the next set of experiments. One model consisted of a single network with multivariate input (both notes and durations). The other model used two separate LSTM networks, one for the notes and one for the durations. It was established that notes and durations have a significant relationship. Both models contained an input layer, a 64-dimensional embedding layer, an LSTM layer with 512 hidden units, a batch-normalization layer, and a dropout layer with a dropout rate of 0.3. After concatenating the two inputs, the multivariate model has two dense output layers and is trained with a categorical cross-entropy loss and the RMSprop optimizer for gradient descent. The batch size was 64, and there were typically 100 epochs. Experiments were performed with more epochs, but the outcomes did not change, and over-fitting was observed in some cases. Figure 1 below displays the graph of the multivariate LSTM model.

Fig. 1. LSTM Multivariate Model
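A hedged sketch of such a multivariate model is given below using the Keras functional API; the vocabulary sizes, sequence length, and layer names are placeholder assumptions rather than the exact configuration behind Fig. 1.

```python
from tensorflow.keras import layers, models

SEQ_LEN, NOTE_VOCAB, DUR_VOCAB = 32, 130, 100   # assumed sizes

def branch(vocab_size, name):
    """Input -> 64-dim embedding -> 512-unit LSTM -> batch norm -> dropout 0.3."""
    inp = layers.Input(shape=(SEQ_LEN,), name=f"{name}_input")
    x = layers.Embedding(vocab_size, 64)(inp)
    x = layers.LSTM(512)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    return inp, x

note_in, note_feat = branch(NOTE_VOCAB, "note")
dur_in, dur_feat = branch(DUR_VOCAB, "duration")

# Concatenate the two branches, then predict the next note and duration.
merged = layers.concatenate([note_feat, dur_feat])
note_out = layers.Dense(NOTE_VOCAB, activation="softmax", name="note_out")(merged)
dur_out = layers.Dense(DUR_VOCAB, activation="softmax", name="dur_out")(merged)

model = models.Model([note_in, dur_in], [note_out, dur_out])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit([note_x, dur_x], [note_y, dur_y], batch_size=64, epochs=100)
```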

4.3 LSTM Encoder-Decoder Model

Two separate models for the notes and durations, as well as a single multivariate model, were trained. The parameters matched those of the LSTM model described above. The longest prime sequence in the training set served as the input sequence length, and the longest target sequence in the training set served as the output sequence length. The graph of the multivariate encoder-decoder model is shown in Fig. 2.

Fig. 2. LSTM Encoder-Decoder Multivariate Model
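For reference, a compact sketch of a univariate LSTM encoder-decoder for note sequences follows, in the spirit of Fig. 2; the dimensions and the teacher-forcing decoder input are assumptions, not the paper's exact architecture.

```python
from tensorflow.keras import layers, models

NOTE_VOCAB, IN_LEN, OUT_LEN = 130, 64, 16   # assumed sizes

# Encoder: embed the prime and keep only the final LSTM states.
enc_in = layers.Input(shape=(IN_LEN,))
enc_emb = layers.Embedding(NOTE_VOCAB, 64)(enc_in)
_, state_h, state_c = layers.LSTM(512, return_state=True)(enc_emb)

# Decoder: generate the continuation, initialized with the encoder states.
dec_in = layers.Input(shape=(OUT_LEN,))
dec_emb = layers.Embedding(NOTE_VOCAB, 64)(dec_in)
dec_seq = layers.LSTM(512, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_out = layers.Dense(NOTE_VOCAB, activation="softmax")(dec_seq)

model = models.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy")   # targets assumed one-hot encoded
```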

4.4 Inference Modeling

Prediction was performed using the standard beam search algorithm with beam sizes ranging from k = 2 to k = 4. For constrained inference, only the notes from the prime were allowed as candidates [5]; for instance, the note from the song's prime sequence with the highest score would become the next prediction.
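A generic sketch of beam search with an optional prime-note constraint is shown below; step_probs stands in for whichever trained model supplies next-event probabilities, and the interface is hypothetical.

```python
import math

def beam_search(step_probs, prime, n_events, k=3, allowed=None):
    """Beam search over next-event probabilities.

    step_probs(sequence) returns a dict {candidate_event: probability}.
    If `allowed` is given (constrained inference), e.g. allowed=set(prime),
    candidates outside that set are discarded.
    """
    beams = [(0.0, list(prime))]                     # (log-probability, sequence)
    for _ in range(n_events):
        expanded = []
        for score, seq in beams:
            for event, p in step_probs(seq).items():
                if allowed is not None and event not in allowed:
                    continue
                if p > 0:
                    expanded.append((score + math.log(p), seq + [event]))
        if not expanded:                             # every candidate filtered out
            break
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:k]
    best_seq = max(beams, key=lambda b: b[0])[1]
    return best_seq[len(prime):]                     # predicted continuation only
```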

4.5 Other Techniques

Transposition was used to make the input data more uniform. Transposition in music refers to the process of raising or lowering notes and chords by a fixed distance; the intervals between successive notes remain unchanged. Western music is built on 12 pitch classes, and most harmonic music is written in a major or minor key based on one of them. For instance, changing a song from the key of C major to the key of E major requires adding four semitones to each pitch. Depending on whether the music was originally in a major or minor key, each input song was transposed to either C major or C minor, and training then took place on the transposed pitches. At test time, the prime sequence was transposed into C major or C minor before the prediction notes were generated. The predicted target notes were then transposed back into their original key before being compared with the actual target notes.
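The key-normalization step can be sketched as below; the tonic pitch class (0 = C, ..., 11 = B) is assumed to come from an external key-detection step, which is not shown.

```python
def transpose(pitches, semitones):
    """Raise or lower every pitch by a fixed number of semitones."""
    return [p + semitones for p in pitches]

def to_c(pitches, tonic_pitch_class):
    """Transpose a song so that its tonic becomes C (or C minor for minor keys)."""
    return transpose(pitches, -tonic_pitch_class)

def from_c(pitches, tonic_pitch_class):
    """Transpose predictions back into the song's original key."""
    return transpose(pitches, tonic_pitch_class)

# Example: a fragment in E (tonic pitch class 4) moved to C and back.
fragment = [64, 68, 71, 76]
normalized = to_c(fragment, 4)      # [60, 64, 67, 72]
restored = from_c(normalized, 4)    # original pitches again
```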

5 Results and Observations

Both the training times and the outcomes were adversely affected by the exponential growth of the input data for the polyphonic music. Scoring was based on a cardinality score and a pitch score.

Cardinality Score: The cardinality score is based on minimizing a general supermodular, monotone non-increasing function \(f(\cdot)\) subject to a cardinality constraint. Let \(x^*_1 \in \arg \max _x \{f(\emptyset ) - f(\{x\})\}\) and use the greedy algorithm on the set function

$$ g(S) := f(\{x^*_1\}) - f(\{x^*_1\} \cup S), $$

which is a monotone non-decreasing submodular set function with \(g(\emptyset ) = 0\). Thus, the greedy algorithm maximizes this function with respect to a cardinality constraint within an approximation factor of \(1 - 1/e\). However, there are two caveats. First, the greedy algorithm uses a budget of \(k - 1\) instead of k (a budget of one is spent on identifying \(x^*_1\)). Second, and most importantly, the approximation factor is obtained on the function g(S) and not f(S); one obtains a set S of size \(k - 1\) such that

$$ f(\{x^*_1\}) - f(\{x^*_1\} \cup S) \ge \left( 1 - \frac{k}{(k - 1)e}\right) \cdot \bigl(f(\{x^*_1\}) - f(\{x^*_1\} \cup S^*)\bigr), $$

where \(S^*\) is the optimal set of size \(k - 1\) to be added to \(\{x^*_1\}\) with the goal of minimizing \(f(\cdot)\).

Pitch Score: The cardinality score rewards a continuation that has the right “shape”, i.e., the correct separation between notes. However, it would also reward, for example, a continuation transposed one semitone lower. Because this is not entirely accurate, the pitches are scored as well. To do so, two normalized histograms of the pitches are built, one from the prediction and one from the ground truth, and the score is determined by how much the histograms overlap. The same operation is also carried out ignoring octaves. The final score is a linear combination of the cardinality score and the pitch score.
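One plausible reading of the histogram-overlap computation is sketched below; the exact normalization and octave handling used for the reported scores may differ.

```python
from collections import Counter

def pitch_score(predicted, truth, ignore_octaves=True):
    """Overlap of normalized pitch histograms; 1.0 means identical distributions."""
    if ignore_octaves:
        predicted = [p % 12 for p in predicted]   # fold pitches onto pitch classes
        truth = [p % 12 for p in truth]
    h_pred, h_true = Counter(predicted), Counter(truth)
    n_pred, n_true = sum(h_pred.values()), sum(h_true.values())
    return sum(min(h_pred[p] / n_pred, h_true[p] / n_true)
               for p in set(h_pred) | set(h_true))
```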

Baseline. The first baseline involved selecting notes from the training set at random. The second involved selecting notes from the training set according to their frequency. The last approach followed the same steps as the first, but added the restriction that the sampled note must appear in the song's prime sequence. The third method produced the highest baseline for both the polyphonic and monophonic music sets.
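The three baselines can be sketched as follows; the uniform-versus-weighted sampling details are assumptions based on the description above.

```python
import random
from collections import Counter

def baseline_random(train_notes, n):
    """Baseline 1: sample uniformly from the distinct notes in the training set."""
    return random.choices(sorted(set(train_notes)), k=n)

def baseline_frequency(train_notes, n):
    """Baseline 2: sample notes in proportion to their training-set frequency."""
    notes, weights = zip(*Counter(train_notes).items())
    return random.choices(notes, weights=weights, k=n)

def baseline_constrained(train_notes, prime, n):
    """Baseline 3: like baseline 1, but restricted to notes that occur in the prime."""
    pool = sorted(set(train_notes) & set(prime)) or sorted(set(train_notes))
    return random.choices(pool, k=n)
```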

Discussions. The polyphonic and monophonic datasets were evaluated individually. Each dataset was split into training and testing groups, with 90% of the songs used for training and 10% for testing. Figures 3 and 4 display the outcomes of numerous training and testing runs for the monophonic and polyphonic datasets. The LSTM model produced the best results when it used constrained inference, transposition of the input data, and a longer sequence length of 64. The model's accuracy was more than double the best baseline score. The accuracy of the LSTM model was improved by transposing the input and lengthening the sequence. Transposition improved the accuracy of all the models, suggesting that more uniform input data improves learning. The beam search method of inference brought no appreciable improvement in accuracy. Constrained inference also improved prediction accuracy, particularly for the Markov model.

The note and duration feature representations were far more accurate than the interval feature representation, suggesting that intervals may have less structure and greater unpredictability than pitches. Unsurprisingly, the results were significantly worse for the input with larger dimensions (i.e., the tuple combinations of pitches and durations). There were some minor differences between employing a multivariate LSTM model and training two distinct models for notes and durations. When all other variables remained constant, the multivariate model consistently beat the two independent models by a small margin, suggesting that there is some relationship between the predicted notes and durations. In addition to having slightly higher accuracy, the multivariate models also required less time to train than two individual models.

Fig. 3. Results - Model Predictions for Polyphonic Music Data

Across all models, the pitch score was far more variable than the cardinality score. Closer examination of the results showed that many of the predicted sequences contained only two or three different pitches. In other words, the model could determine which pitch or pitches appeared most frequently in the target sequence, but not the order in which they were played. Additionally, no predictions were ever made for notes in the target sequence that seemed more “random” (i.e., not present in the prime sequence). While music may have a strong structural element, it also has randomness, or surprises, that provide its appeal.

Overall, the LSTM model performed better than every other model. The Markov models had the highest variance of all the models, indicating that they would benefit from further restrictions or domain expertise [6,7,8].

The baseline values for the polyphonic music were significantly higher than those for the monophonic music, so neither set of results should be used to judge the other. The difference in baseline scores between the two forms of music is due to the method used to compute the score.

Fig. 4. Results - Model Predictions on Monophonic Datasets

6 Conclusion and Future Work

Because music has both structure and randomness, it presents an intriguing and difficult prediction problem. Although feature representation is significant, feature size is even more crucial. Training is substantially more challenging for inputs with very high dimensions. Constraints can significantly shrink the search space and improve model accuracy when used in inference. Meaningful learning and inference depend on limiting the input space and the search space in ways that nonetheless capture pertinent relationships and significant features.

Random pop tracks were employed in this research, but pop music is just one of many musical styles. Jazz is more free-form, while classical music has much more structure. A fascinating extension of this research would be to compare the results across other musical genres. Models such as Hidden Markov Models and LSTMs with attention mechanisms also remain to be explored [9, 10].