1 Introduction

The main melody carries essential information about a piece of music. It can be used in various applications, including music retrieval [1, 2], accompaniment generation [3, 4], melody plagiarism identification [5, 6], cover song recognition [7, 8], and new melody generation [9,10,11]. Consequently, automatic extraction of the main melody has become increasingly important [12, 13], and it has captivated the attention of many researchers. Humans possess cognitive attributes, including auditory perception and deliberate attention [14], which enable us to subjectively identify the main melody through loudness changes, pitch combinations, and so forth. However, computers lack such perceptual capacity and subjective judgment. Therefore, it is arduous for a computing system to automatically extract the main melody [1, 13, 15].

Music is usually recorded in audio or symbolic files [2, 16], which are completely different. On the one hand, audio files (e.g., WAV) encode the signal information of music; they are realistic recordings of natural sounds, including noise and perceptual information such as loudness and intensity. On the other hand, a symbolic file (e.g., MIDI—Musical Instrument Digital Interface) is essentially a sequence of note messages, including velocity, pitch, duration, and track [2, 17, 18]. The symbolic file can better reflect the content of music in comparison to the audio file [16]. Hence, the focus of this work is on extracting melody representations from a symbolic file, namely MIDI.

Conventional methods use the most distinctive feature of MIDI files (the track message) to obtain the main melody [12, 16, 19,20,21,22]. These methods assume that the main melody notes are located on a single track, and therefore, they directly identify the main melody track using statistical characteristics of the track of interest. However, such an assumption cannot be made in real-world applications because it represents an ideal condition that rarely occurs.

Fig. 1

An example to illustrate a limitation of assuming that the main melody is unique. The Choral music from the Bach10 dataset [23] (06-DieSonne) contains at least two main melodies, namely Melody 1 and 2. If the main melody of other music (e.g., Music A) is similar to Melody 2, the choral music and Music A should be similar from the human perspective. However, if the only extracted main melody of the choral music is Melody 1, matching these two pieces of music using the main melody may fail

Aside from using the track feature of a MIDI file, data-driven approaches based on deep learning can be used to extract the main melody notes. Such techniques include both classification and prediction approaches. Specifically, classification approaches such as MIDIBERT [24], CNN [13], and LStoM [25] can only output one main melody. In other words, they assume by default that the main melody of a piece of music is unique. Under this assumption, these methods classify all notes as either main melody or accompaniment. These classification methods give better results than algorithmic methods such as Skyline [15] or methods that filter out accompaniment notes using pitch-interval thresholds [26]. Unlike the classification approaches, prediction approaches are more flexible and can predict multiple melodies, such as MusicFrameworks [10]. Specifically, Dai et al.’s method [10] uses a basic melody and other musical information to generate new melodies, where the basic melody can be understood as the main melody. Although MusicFrameworks [10] can predict multiple basic melodies, the predicted results have low similarity with the corresponding music; in other words, the results are not the main melodies. Hence, no existing method can successfully extract multiple main melodies.

Although most conventional methods assume the uniqueness of the main melody, this assumption is not reasonable. Specifically, music often contains multiple main melodies that are equally important [27], and hence, it is challenging to determine a single or essential main melody [19]. For example, in Fig. 1, the choral music in the Bach10 dataset [23] contains at least two melodies, i.e., Melody 1 and Melody 2, and both could be used to identify the music. If the MIDIBERT [24] method, which assumes main melody uniqueness by default, is used to extract the choral music’s main melody, the extracted result is highly similar to Melody 1 according to our experimental results. In this case, if the main melody of another piece of music is similar to the choral music’s Melody 1, that piece and the choral music will be matched using the main melodies. However, the main melody of Music A in Fig. 1 is similar to the choral music’s Melody 2, so Music A and the choral music cannot be matched if the only extracted main melody of the choral music is Melody 1. Consequently, assuming that music has only one main melody has limitations and contradicts the complexity of the main melody. If both Melody 1 and Melody 2 of the choral music could be identified as main melodies, the matching in Fig. 1 would succeed. This problem typically arises in applications where the main melodies are used for retrieval, such as music retrieval [1, 2], melody plagiarism identification [5, 6], and cover song recognition [7, 8]. Hence, an approach that can identify multiple main melodies within a piece of music is required.

If all extracted main melodies were highly similar to each other and to the corresponding music, predicting multiple main melodies would be meaningless. Conversely, if the extracted melodies were entirely different from or irrelevant to the music of interest, the prediction would have failed, because such melodies may actually correspond to different music. Therefore, only melodies that differ from each other but retain moderate similarity to the music of interest, and remain viable for identifying it, are the main melodies. As a result, to extract multiple main melodies, we need to address the following difficulties: (1) finding an approach to predict multiple main melodies and (2) finding a strategy to control the similarity of the final predicted results to ensure that they are the main melodies.

According to the above analysis, main melody classification methods assume the uniqueness of the main melody and only output one result, so they cannot accommodate the complexity and diversity of main melodies. Their advantage is that the classified main melody has high accuracy. Prediction approaches, in turn, have the advantage of predicting multiple melodies, but they are usually used for new melody generation rather than main melody extraction because their predicted results are hard to control. In conclusion, neither existing melody classification methods nor prediction methods can address the multiple main melody prediction problem independently, yet both have merits. Hence, we propose a novel framework combining the merits of classification and prediction approaches. We name our framework Multi-MMLG (Multiple Main Melodies Generator), and Fig. 4 shows its structure. Note that Multi-MMLG is a 2-stage framework, where

  • Stage 1: A main melody classified with high accuracy is a good condition for predicting main melodies. Hence, we set the first stage of the Multi-MMLG framework as a note-level classification stage, thus providing a high-accuracy classified main melody as the next stage’s input. To obtain higher accuracy than existing methods, we implement a MIDIXLNet model and use it in the framework’s first stage.

  • Stage 2: Since prediction methods are suited to outputting multiple melodies, the second stage of the Multi-MMLG framework is a conditional prediction stage. Considering that a predicted melody normally has low similarity with the corresponding music, we use two conditions: Stage 1’s classified main melody and a notes’ relationship matrix from the original MuseBERT approach [28]. These two conditions are used as the input of a modified MuseBERT model, and they constrain the predictions to ensure that the predicted melodies are main melodies. Moreover, because the main melodies should be similar and relevant to the corresponding music at a moderate level, neither too high nor too low, we apply a masking strategy to the prediction conditions to control the predicted results (a minimal sketch of the overall pipeline follows this list).
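The sketch below illustrates, in Python, how the two stages are orchestrated. The function names, the simplified note triple, the stand-in classification rule, and the mask rates are illustrative assumptions rather than the actual Multi-MMLG implementation.

```python
import random
from typing import List, Tuple

Note = Tuple[int, int, int]  # simplified (sub_beat, pitch, duration) triple

def stage1_classify(notes: List[Note]) -> List[Note]:
    """Stage 1: note-level main melody classification (MIDIXLNet in this paper).
    A trivial pitch rule stands in for the trained model."""
    return [n for n in notes if n[1] >= 60]

def mask_condition(melody: List[Note], rate: float) -> List[Note]:
    """Randomly drop part of the classified melody so that the prediction is
    only moderately constrained (the masking strategy of Stage 2)."""
    return [n for n in melody if random.random() > rate]

def stage2_predict(condition: List[Note], note_relations) -> List[Note]:
    """Stage 2: conditional prediction (modified MuseBERT in this paper).
    This placeholder simply echoes the condition."""
    return list(condition)

def multi_mmlg(notes: List[Note], note_relations, mask_rates=(0.15, 0.30)):
    classified = stage1_classify(notes)                      # Stage 1
    return [stage2_predict(mask_condition(classified, r), note_relations)
            for r in mask_rates]                             # one melody per rate

print(multi_mmlg([(0, 64, 4), (0, 48, 8), (4, 67, 4)], note_relations=None))
```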

The evaluation results of the Multi-MMLG framework are analyzed quantitatively and qualitatively. Our results demonstrate that the framework could extract potential main melodies, and the similarity of the predicted results can be controlled within a reasonable range. In addition, we also conduct ablation experiments and compare our framework with other approaches. The ablation results verify the framework’s effectiveness and rationality.

To the best of our knowledge, this is the first work that details and clarifies the definition of multiple main melodies. The contributions of this paper include:

  1. Putting forward a new definition for the main melody, which is a set of similar but non-identical melodies that can be analyzed by entities (humans and computers) to identify a piece of music. This definition of the main melody then serves as the backbone of this work;

  2. Implementing a MIDIXLNet model, which increases the accuracy of the main melody classification task to 97.37% with relatively fewer parameters than existing methods. Meanwhile, we verify the necessity of designing a model specifically for the MIDI-based main melody classification task;

  3. Proposing a two-stage Multi-MMLG framework that can predict multiple main melodies. It combines MIDIXLNet and a modified MuseBERT model. Furthermore, the framework avoids randomness in the predicted melodies by modifying the prediction strategy and the notes’ representation and by implementing a masking strategy. Ablation experiments demonstrate that the combination of MIDIXLNet and a modified MuseBERT model is optimal.

The structure of the paper is as follows: In Sect. 2, the preliminary knowledge referenced in this article is presented, and this section reviews the conventional methods designed for main melody extraction. Section 3 puts forward the Multi-MMLG framework. Experiment results are presented and discussed in Sect. 4. Finally, Sect. 5 concludes this paper.

2 Preliminary and related work

This section first introduces two types of digital music files: audio and symbolic. Next, some existing works, both MIDI-based and audio-based, are discussed. Finally, a new main melody definition is clarified. This definition is the cornerstone of our proposed framework.

2.1 Digital music files

Digital music files can be broadly divided into two categories, namely audio files and symbolic files. In the former case, an audio file encodes the signal information of a piece of music, namely the physical information [29]. Audio files can present the sound of the natural world to the greatest extent, including noise. Unlike audio files, symbolic files (e.g., MIDI files) directly store the music as a sequence of messages rather than as signal information. Hence, they usually exclude noise and can reflect the content of music better than audio files [16]. In this work, we focus on symbolic files, in particular MIDI files. Figure 2 shows a snippet of a MIDI file that describes one bar/measure of a music score with a \(\frac{4}{4}\) time signature. In a MIDI file, each note is represented as eight messages (viz., an eight-tuple), which include Track, Channel, Pitch, Velocity, and so forth [2, 17, 18]. A MIDI file is usually visualized as a piano roll based on the Track, Pitch, and Time messages [30].
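As an illustration, the following minimal sketch reads these per-note messages from a MIDI file using the mido library; the file path is a placeholder, and the snippet is not part of the original work.

```python
import mido

mid = mido.MidiFile("song.mid")  # placeholder path
for track_idx, track in enumerate(mid.tracks):
    abs_ticks = 0
    for msg in track:
        abs_ticks += msg.time  # delta time in ticks -> absolute onset time
        if msg.type == "note_on" and msg.velocity > 0:
            # Track, Channel, Pitch, Velocity, and onset of one note message
            print(track_idx, msg.channel, msg.note, msg.velocity, abs_ticks)
```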

Fig. 2

Notes’ messages in MIDI file and the visualization of MIDI file (piano roll)

In an ideally structured MIDI file, the notes are grouped by instrument or by main/non-main melody, and each group is located on a separate MIDI track [15, 22]. However, a common scenario is that all the notes are located on the same MIDI track. Extracting the main melody from such MIDI files is more complex than from an ideally structured file, and this paper focuses on this complex scenario.

2.2 Related works

Existing main melody extraction methods are mainly based on audio files or MIDI files. For the purposes of this work, we focus on MIDI-based methods, which can be broadly categorized into two groups, namely track/bar-level methods and note-level methods. Subsequently, we briefly discuss audio-based methods.

2.2.1 MIDI-based methods

Track and measure level methods Early MIDI-based main melody extraction methods essentially extract a MIDI track. Specifically, handcrafted (HC) features of the notes’ attributes are adopted as metrics to select a track containing the main melody notes [12, 15, 16, 19,20,21, 31]. Traditional metrics utilized to filter out non-main melody notes include the entropy and average of pitch and the standard deviation of duration [32]. In addition to setting ranges or threshold values of the metrics directly [32], the HC features can be used by machine learning classifiers, including random forests [19, 33] and Naive Bayes classifiers [22, 34]. Unlike the methods using HC features, Li et al. [35] calculate the similarity between tracks using different versions of a piece of music. Furthermore, similarity neighborhoods [36] can be utilized to extract bars that contain main melody notes. However, the aforementioned track-level methods require that the notes are located on separate MIDI tracks. If all the notes appear on the same track or the main melody notes are scattered across different tracks, these methods fail [37].

Note level methods Instead of using track-level features of the MIDI file, Uitdenbogerd et al. put forward a method named Skyline [15], which assumes that all notes with the highest pitch at a given time are main melody notes. Subsequently, other researchers [13, 34] have used the Skyline algorithm to preprocess MIDI files and filter out part of the accompaniment, improving main melody extraction based on the preprocessed MIDI file. Later, some non-machine learning methods identify the main melody by extracting repeated segments [38] or by directly specifying a threshold on note intervals to filter out accompaniment notes [26]. Such methods are both inefficient and error prone (e.g., extracting the accompaniment part by mistake). Hence, in recent years researchers have put forward intelligent methods using machine learning techniques, for example approaches using LSTM [39], CNN models [13, 40], or Markov chains [31]. Most recently, MIDIBert [24] and LStoM [25], which are a large-parameter model and a small-parameter model, respectively, have obtained state-of-the-art results for main melody note classification.
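The Skyline heuristic mentioned above is simple enough to sketch directly; the note triple below and its field order are assumptions for illustration, not the original implementation.

```python
from collections import defaultdict
from typing import List, Tuple

Note = Tuple[int, int, int]  # (onset, pitch, duration)

def skyline(notes: List[Note]) -> List[Note]:
    """At every onset time, keep only the highest-pitched note."""
    by_onset = defaultdict(list)
    for note in notes:
        by_onset[note[0]].append(note)
    return [max(group, key=lambda n: n[1])
            for _, group in sorted(by_onset.items())]

print(skyline([(0, 60, 4), (0, 72, 4), (4, 48, 4), (4, 67, 4)]))
# -> [(0, 72, 4), (4, 67, 4)]
```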

However, all the aforementioned methods assume that there is only one main melody. In many cases, it is difficult to determine a single main melody for a piece of music (see, e.g., Fig. 3). Therefore, these methods have significant limitations. Taking a different approach, Dai et al. [10] propose a framework based on Transformer-LSTM to generate a new melody using a basic melody and other musical information, where the basic melody can be understood as the main melody. Because Dai et al. [10] regard basic melody extraction as a prediction task, their method can predict multiple melodies. However, the accuracy of the predicted main melody is around 39%, which means that the identified melody is barely related to the corresponding music and hence is not the main melody.

2.2.2 Audio-based methods

Some audio files are performance recordings, while others are converted from MIDI files and are usually clean, without background noise. Audio-based main melody extraction focuses only on the former. Many conventional approaches estimate the pitch of main melody notes based on pitch salience and spectrograms [41,42,43,44,45]. These conventional methods are often complex and involve many steps. For example, to extract the main melody, [42] first uses the short-time Fourier transform (STFT) to obtain spectral peaks and corrects the frequency and amplitude of the music signal. Next, it obtains a group of melody candidates via a computed salience function. Finally, it selects a melody using some musical features (e.g., pitch mean and deviation). This complicated method achieves relatively good results, with 70% accuracy on average. In recent years, some methods have started to use machine learning techniques, such as CNN [46, 47] or SVM [48]. Such machine learning-based methods are more accurate than traditional methods. However, their accuracy results are erratic: the best-performing method [47] ranges from an average of 75% to a maximum of 93%.

Table 1 summarizes the conventional methods designed for main melody extraction, covering both MIDI-based and audio-based approaches. Although most existing methods are aware of the ambiguity in the main melody definition, they choose to simplify the problem, i.e., they work with existing datasets that provide a single main melody label. In addition, for both audio-based and MIDI-based approaches, machine learning techniques can usually improve the accuracy of main melody extraction significantly. Motivated by the aforementioned analysis, in this paper we put forward a new definition of the main melody and design a deep learning method for main melody extraction.

Table 1 Summary of existing main melody extraction works, including MIDI-based and audio-based

3 Methodology

Firstly, we put forward a new definition of the main melody for a piece of music. Subsequently, we describe the compound word data representation used for further analysis, and finally, we detail the proposed Multi-MMLG (Multiple Main Melodies Generator) framework.

3.1 Main melody

When referring to the main melody of a piece of music, the definition agreed upon by most articles is that the melody is a unique sequence of notes used to identify the music [15, 34, 49]. In addition, the main melody has also been defined as the part hummed and remembered by people, and as the most attractive part of the music [49]. However, the above definitions are vague because different people have different perceptions and judgments [50, 51]. Hence, current definitions of the main melody are limited, and no single definition is adequate [49, 52]. For this reason, many studies specifically clarify the definition of the main melody to fit their research field. Bittner et al. [52] use three main melody definitions in their research, arguing that the multiple main melodies definition is the most complex but the most general. We therefore discuss this complex case.

Fig. 3

Two examples of the music contain multiple main melodies: a is a segment of choral music (the 10th file in the Bach10 dataset [23]). b is a segment of pop music (the 874th file in POP909 dataset [4])

Main melodies in music are diverse, complex, and variable. Specifically, music usually contains multiple main melodies that are equally important [27]. For example, Fig. 3a shows a segment of choral music with two main melodies, namely Melody 1 and Melody 2. These main melodies have similar pitch contours in different keys, and Melody 1 contains more notes. It is difficult to decide which melody line is not the main melody because both melodies could be used to identify the music. Another example (pop music) is shown in Fig. 3b. Here, the two main melodies are similar but differ in some notes and in the number of notes. Therefore, the assumption of main melody uniqueness adopted by existing research cannot be established, and it oversimplifies the problem of main melody extraction [49, 51].

Therefore, in this work, we put forward a new definition of the main melody. For a piece of music, its main melodies should be relevant to the music so that they can be used to identify it. Furthermore, a piece of music’s main melodies are not entirely identical, as in Fig. 3. If the main melodies were identical, they would be treated as a single main melody rather than multiple main melodies. However, if the main melodies were substantially different, they might belong to different music. Hence, the main melody can be defined as a set of similar but non-identical melodies that can be utilized by entities (humans and computers) to identify the music. Subsequently, we predict multiple main melodies based on this new definition.

3.2 Data representation

We use Compound Word (CP) [53] to represent notes as the input of MIDIXLNet, which is used in the framework’s Stage 1. Each compound word contains five tokens: Sub-Beat, Pitch, Duration, Chord, and Bar. Specifically, Sub-Beat is the note’s position in a bar, and the Pitch event is the absolute pitch value. For a piece of piano music, the corresponding MIDI numbers of the absolute pitch values range from 21 to 108. To represent the chord events, we use a chord recognition method proposed by Huang and Yang [54], which uses a sliding window and a rule for calculating chord likelihood. After extracting the chords in a bar, we assign the recognized chord’s name to the notes that make up the chord. In addition, the Bar event is described as a binary number, namely 0 for a new bar and 1 for a continuing bar. With this representation, each compound word has exactly five tokens. An example of the MIDIXLNet input is shown in Stage 1 of Fig. 4.
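A minimal sketch of this five-token representation is given below; the field names, value ranges, and the chord labels in the example are illustrative assumptions rather than the exact vocabulary used in this work.

```python
from dataclasses import dataclass

@dataclass
class CompoundWord:
    sub_beat: int   # position of the note within the bar (e.g., 0-15)
    pitch: int      # absolute MIDI pitch, 21-108 for piano
    duration: int   # quantized note length in sub-beats
    chord: str      # chord label assigned by the chord-recognition step
    bar: int        # 0 = new bar, 1 = continuing bar

# One bar of a C-major arpeggio expressed as compound words:
example = [
    CompoundWord(sub_beat=0,  pitch=60, duration=4, chord="C:maj", bar=0),
    CompoundWord(sub_beat=4,  pitch=64, duration=4, chord="C:maj", bar=1),
    CompoundWord(sub_beat=8,  pitch=67, duration=4, chord="C:maj", bar=1),
    CompoundWord(sub_beat=12, pitch=72, duration=4, chord="C:maj", bar=1),
]
```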

On the other hand, the original MuseBERT [28] uses three note events, namely Onset, Pitch, and Duration, to describe a melody. The Onset event is equivalent to Sub-Beat; both denote the note’s position in a bar. In the modified MuseBERT model used in the framework’s Stage 2, we additionally use the notes’ track label (0/1), which specifies whether a note belongs to the main or non-main melody track. Interested readers may refer to [28] for detailed information on the MuseBERT data representation.

3.3 Multi-MMLG framework

Our proposed Multi-MMLG (Multiple Main Melodies Generator) framework consists of 2 stages, namely a classification stage (Multi-MMLG\(_{s1}\)) and a conditional prediction stage (Multi-MMLG\(_{s2}\)). Figure 4 shows the structure of the Multi-MMLG framework. Both stages are detailed in the following subsections.

Fig. 4

Multi-MMLG Architecture: A pipeline architecture with two stages that could predict potential main melodies

3.3.1 Stage 1: Multi-MMLG\(_{s1}\) & MIDIXLNet

The first stage of the proposed Multi-MMLG framework is the note-level classification task. Using high-accuracy classification results as Stage 2’s condition guarantees that Stage 2 can learn suitable features from the condition, thus generating potential main melodies. Hence, we implement a MIDIXLNet model to improve the accuracy of classifying main/non-main melody notes.

Figure 5 shows the pre-training structure of the MIDIXLNet model. Pre-training is an unsupervised learning task, so we do not use the notes’ binary labels (i.e., main melody note or accompaniment) provided in the dataset. After representing the notes, the prepared compound words are embedded by the word embedding layers. Considering that each note is represented by five tokens, the embedding layer has five parallel parts, one per token type. All the embedded tokens of a note are then merged into a single vector via a linear layer. At the same time, the words’ positions are embedded following the standard relative positional encoding strategy used in the Transformer-XL model [55, 56].
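A minimal PyTorch sketch of this input embedding is given below: one embedding table per compound-word token, concatenated and merged by a linear layer. The vocabulary sizes and dimensions are illustrative assumptions, not the actual MIDIXLNet configuration.

```python
import torch
import torch.nn as nn

class CompoundWordEmbedding(nn.Module):
    def __init__(self, vocab_sizes=(16, 88, 64, 133, 2), token_dim=64, model_dim=384):
        super().__init__()
        # One embedding table per token type: Sub-Beat, Pitch, Duration, Chord, Bar
        self.tables = nn.ModuleList(nn.Embedding(v, token_dim) for v in vocab_sizes)
        self.merge = nn.Linear(token_dim * len(vocab_sizes), model_dim)

    def forward(self, tokens):  # tokens: (batch, seq_len, 5) integer indices
        parts = [table(tokens[..., i]) for i, table in enumerate(self.tables)]
        return self.merge(torch.cat(parts, dim=-1))  # (batch, seq_len, model_dim)

emb = CompoundWordEmbedding()
dummy = torch.zeros(1, 512, 5, dtype=torch.long)  # a dummy sequence of 512 notes
print(emb(dummy).shape)                           # torch.Size([1, 512, 384])
```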

Fig. 5

Pre-training procedure of MIDIXLNet

Next, these embeddings are fed to the self-attention layers. The XLNet-based model uses a particular self-attention strategy, namely Two-Stream self-attention, comprising content stream attention and query stream attention. This strategy avoids a significant weakness of prevailing masked language modeling, namely that directly corrupting the input data causes some contextual features of the music to be lost. Because the contextual information of music is important [30, 57, 58], avoiding this loss of information is essential and is the reason why we implement an XLNet-based model. In addition, permuting the likelihood factorization order allows the XLNet-based model to capture bidirectional context. Hence, we adopt the permutation mechanism and the two-stream self-attention strategy.

Specifically, the permutation changes the factorization order of the likelihood rather than the sequence order of the notes. Since the permutation generates many possible orders (i.e., n!), Eq. (1) is optimized by sampling a permutation z from \(Z_n\).

$$\begin{aligned} \max _\theta \; \mathbb {E}_{z\sim Z_n} \Big [ \, \sum _{p=1}^{n} \log p_\theta (N_{zp}\mid M_{z<p})\Big ]. \end{aligned}$$
(1)

In Eq. (1), \(Z_n\) represents all permutations of the likelihood factorization order for the melody sequence M with length n. In addition, \(N_{zp}\) represents the word of the note at position p in the sampled permutation z, and \(M_{z<p}\) refers to the notes before position p in z. The likelihood \(p_\theta\) of the sampled order is then calculated.

On the other hand, the content stream attention (a.k.a. standard self-attention) could provide information of the current token and the preceding tokens. It is expressed as follows:

$$\begin{aligned} h_{zp}^m = \textrm{Attention}\left( Q = h_{zp}^{m-1}, \textrm{KV} = h_{z\le p}^{m-1}\right) , \end{aligned}$$
(2)

where Q, K, and V are the Query, Key, and Value vectors, respectively, and zp is the target token’s position following the likelihood factorization order of the set z. In layer \(m-1\), Q knows the target’s content and position (\(h_{zp}^{m-1}\)), while KV knows the content and position information of the target token and of the tokens before position \(p\) (\(h_{z\le p}\)) in z. Note that the query stream attention uses a binary indicator to encode whether the target’s position can be queried, and likewise whether the position and content of the tokens before the target can be queried [55]. This mechanism avoids accessing the target’s content during querying. The expression is written as follows:

$$\begin{aligned} g_{zp}^m = \textrm{Attention}\left( Q = g_{zp}^{m-1}, \textrm{KV} = h_{z<p}^{m-1}\right) . \end{aligned}$$
(3)

Equation (3) is similar to Eq. (2), but the difference is that the query stream attention only accesses the tokens before the target (\(h_{z<p}\)). Furthermore, Q only knows the target token’s position (\(g_{zp}\)) rather than both its position and content (\(h_{zp}\)).

Figure 6a shows a melody segment, and Fig. 6b shows the melody’s corresponding Two-Stream self-attention matrices before permuting the likelihood factorization order. The melody sequence can be written as \(M = (N_1,\ldots ,N_n)\), where \(N_1\) refers to the first note and the length \(n = 5\). To predict \(N_2\) following the original factorization order, only \(N_1\) can be accessed before the permutation, i.e., \(P(N_2\mid N_1)\), which is uni-directional learning.

Fig. 6

A melody example (a) and its corresponding attention matrices (b)

Figure 7 shows the Two-Stream self-attention matrices after sampling a permutation order. In this case, when predicting \(N_2\), \(h_{z\le 4}\) includes the position and content of \(N_5, N_4\), and \(N_1\) rather than only \(N_1\).
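To make the mechanism concrete, the sketch below builds the two attention masks implied by a sampled factorization order, using the five-note example above; it illustrates the general XLNet mechanism under assumed indexing, not the authors’ implementation.

```python
import numpy as np

def two_stream_masks(rank):
    """rank[i] is the factorization position of sequence position i.
    Content stream: attend to itself and everything earlier in the order.
    Query stream: attend only to positions earlier in the order."""
    n = len(rank)
    content = np.zeros((n, n), dtype=bool)
    query = np.zeros((n, n), dtype=bool)
    for i in range(n):          # query position
        for j in range(n):      # key/value position
            content[i, j] = rank[j] <= rank[i]
            query[i, j] = rank[j] < rank[i]
    return content, query

# Factorization order (N5, N4, N1, N2, N3) for the five-note melody:
rank = [2, 3, 4, 1, 0]          # rank[i] = order position of note i+1
content_mask, query_mask = two_stream_masks(rank)
print(query_mask[1])            # N2 may now see N1, N4, N5 (not only N1)
```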

Fig. 7

Two-Stream self-attention matrices after sampling permutation order

Furthermore, our architecture contains six self-attention layers. After mapping the hidden vectors from the attention layers to fully connected layers, the prediction results are calculated by applying a standard softmax function to the normalized values.

When using the MIDIXLNet model for fine-tuning, standard self-attention is used rather than Two-Stream self-attention. The fine-tuning framework is shown in Stage 1 of Fig. 4. In addition, the binary classifier used to classify main/non-main melody notes is similar to the classifier in the MIDIBERT method, which uses ReLU as the activation function [24].
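For concreteness, a classifier head of this kind could look like the PyTorch sketch below; the layer sizes and dropout rate are assumptions for illustration, not the exact configuration used here.

```python
import torch
import torch.nn as nn

# A note-level binary classification head (main melody vs. non-main melody).
classifier = nn.Sequential(
    nn.Dropout(0.1),
    nn.Linear(384, 256),
    nn.ReLU(),
    nn.Linear(256, 2),
)

hidden = torch.randn(1, 512, 384)   # per-note hidden states from the encoder
logits = classifier(hidden)         # (1, 512, 2): one prediction per note
```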

3.3.2 Stage 2: Multi-MMLG\(_{s2}\)

The second stage of the Multi-MMLG framework predicts multiple main melodies. Recall that the potential main melodies of a piece of music are similar but non-identical (see Sect. 3.1). The MuseBERT model proposed by Wang and Xia [28] aims to predict new melodies, which does not conform to our main melody definition. Hence, we modify the model to make it more suitable for main melody extraction.

The structure of the modified MuseBERT model is shown (enclosed in a box with dotted lines) in Fig. 4. First, we make a replica of the classification results from Stage 1. Next, all notes in the replica are masked and will ultimately be predicted. In this case, there are two conditions for prediction: Stage 1’s output and the notes’ relationship matrices. In other words, the masked notes are predicted based on these two conditions. This prediction strategy differs from that of the original MuseBERT [28], which directly masks all music notes and only uses one condition (i.e., the relationship among notes) to predict the masked notes. After preparing the data, all inputs are mapped to twelve linear embedding layers with eight attention heads. Finally, the last hidden vector is normalized to generate the results via the softmax layer.

The pre-training stage of the modified MuseBERT model is similar to the prediction stage. The only difference is that we do not make a replica of the input, and we mask part of the notes for training. Hence, in the pre-training stage, our conditions are the relationship among the notes and the unmasked notes.

Before producing the predicted results, there is still a problem with the original MuseBERT [28] model. It is a BERT-based model, and it cannot predict the next notes freely, meaning that it can only predict the masked notes. However, the predicted masked notes may not belong to the main melody. Hence, as mentioned in Sect. 3.2, we add a track label to every note. Under this setting, the modified MuseBERT model can extract the notes whose track label is predicted as 0 (0: main melody notes), thus filtering out the non-main melody notes.
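Putting these pieces together, the prediction step of Stage 2 can be sketched as follows: mask a replica of Stage 1’s output, predict it from the two conditions, and keep only notes with a predicted track label of 0. The model class and its predict signature are hypothetical stand-ins, not the actual modified MuseBERT API.

```python
import copy

MASK = None  # stand-in for the model's mask token

class ModifiedMuseBERTStub:
    """Hypothetical stand-in; a real model would reconstruct the masked notes."""
    def predict(self, masked_notes, condition_notes, condition_relations):
        return [dict(note, track=0) for note in condition_notes]

def predict_main_melody(model, stage1_notes, relation_matrices):
    replica = copy.deepcopy(stage1_notes)
    masked_replica = [MASK for _ in replica]       # mask every note in the replica
    predicted = model.predict(masked_replica,
                              condition_notes=stage1_notes,
                              condition_relations=relation_matrices)
    # Keep only notes whose predicted track label is 0 (main melody notes).
    return [n for n in predicted if n["track"] == 0]

melody = predict_main_melody(ModifiedMuseBERTStub(),
                             [{"onset": 0, "pitch": 67, "duration": 4}], None)
print(melody)
```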

In summary, Multi-MMLG\(_{s2}\) adopts the modified MuseBERT model, which uses a different prediction strategy and improves the flexibility of predicting the main melody by adding a track label to the notes.

4 Experiments

4.1 Experiment settings

For evaluation purposes, we use the Mixed POP909 dataset, which includes POP909\(_m\) (MELODY track’s notes are the main melody) and POP909\(_{mb}\) (MELODY track and BRIDGE track’s notes are the main melody). An example that illustrates the Mixed POP909 dataset is shown in Fig. 8.

Fig. 8

Main melody of 808th file in Mixed POP909 dataset (POP909\(_m\): regarding the MELODY track as the main melody. POP909\(_{mb}\): regarding the MELODY and BRIDGE tracks’ notes as the main melody)

Fig. 9

Examples illustrating the different data pre-processing strategies during training and prediction. The corrupting strategy used in the training stage is shown in (a), and the random masking strategy used on the final prediction’s conditions of Multi-MMLG\(_{s2}\) is shown in (b). The part covered by the twill-filled rectangle represents the randomly selected masked notes

For Stage 1 (i.e., Multi-MMLG\(_{s1}\)), which uses the MIDIXLNet model to classify main melody notes, we split the Mixed POP909 dataset 8:1:1 for training, validation, and testing. Our model is trained on an RTX3060 GPU with 6 GB of memory. Since this GPU has limited memory, we set the batch size to 1 and the maximum sequence length to 512. We train the model for 45 epochs, and each epoch takes about 10 min. During the fine-tuning process, where only standard self-attention is used, each epoch takes about 5 min, and we also train for 45 epochs.

For the second stage (i.e., Multi-MMLG\(_{s2}\)), which uses the modified MuseBERT model for conditional prediction, we also split Mixed POP909 8:1:1 for training, validation, and testing. As discussed in Sect. 3.3.2, the training stage does not need to make a replica of Stage 1’s output as the condition, so the data pre-processing strategies of the final prediction and training stages differ slightly, as shown in Fig. 9. During training, 15% of the data is corrupted following the strategy in Fig. 9a. In the final prediction stage, we apply a random masking strategy (0%–45%) to Stage 1’s output. As shown in Fig. 9b, a 0% masking rate of the Stage 1 (Multi-MMLG\(_{s1}\)) output means that the modified MuseBERT model learns from the whole output to predict the main melodies. Generally, using conditions with more notes (e.g., a 0% masking rate) generates melodies that are highly similar to the corresponding music. Conversely, as the masking rate increases, fewer and fewer notes remain in the condition, and the similarity between the predicted melody and the corresponding music decreases. Hence, the 0%–45% masking rate aims to ensure that the similarity between the predicted main melodies and the corresponding music is neither too high nor too low, thus conforming to our new main melody definition. Essentially, the masking rate is varied to manage the condition and thus control the generated melodies. This condition-managing strategy is similar to the method of Hadjeres and Nielsen [59], which directly specifies notes as the condition to create new melodies.
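The two pre-processing strategies can be sketched as follows; the fixed 15% training corruption rate and the 0%–45% prediction-time range come from the description above, while the function names and mask token are illustrative assumptions.

```python
import random

def corrupt_for_training(notes, rate=0.15, mask_token=None):
    """Training: corrupt a fixed fraction of the notes (Fig. 9a)."""
    return [mask_token if random.random() < rate else n for n in notes]

def mask_condition_for_prediction(stage1_notes, low=0.0, high=0.45, mask_token=None):
    """Prediction: mask a randomly chosen 0%-45% of Stage 1's output (Fig. 9b)."""
    rate = random.uniform(low, high)
    masked = [mask_token if random.random() < rate else n for n in stage1_notes]
    return masked, rate

masked_condition, used_rate = mask_condition_for_prediction(list(range(20)))
```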

4.2 Results and analysis

This section conducts a comprehensive evaluation of the proposed Multi-MMLG framework. We first evaluate the effectiveness of Stage 1 of the Multi-MMLG framework using the proposed MIDIXLNet model and compare its performance with four other models. Subsequently, we perform an ablation experiment on MIDIXLNet to explore the influence of its individual components. In addition, we verify the necessity of designing a model specifically for the MIDI-based main melody classification task. Next, we conduct an ablation experiment on the Multi-MMLG framework. These experiments justify the models selected within the framework by verifying the influence of using different models in Stage 1 on the final predicted results and by comparing the prediction model in Stage 2 with four other models. Finally, we evaluate the potential main melodies predicted by the proposed Multi-MMLG framework under different masking rates.

4.2.1 Stage 1: Multi-MMLG\(_{s1}\) & MIDIXLNet

Recall that MIDIXLNet is the model we proposed and used in Stage 1 of the Multi-MMLG framework (i.e., Multi-MMLG\(_{s1}\)). Firstly, we adopt Accuracy and Parameter Number (PN) to evaluate the effectiveness and efficiency of MIDIXLNet on the Mixed POP909 dataset (POP909\(_{mix}\)). Table 2 shows the results of comparing the MIDIXLNet model with conventional models. Among the compared models, the RNN-based model with one classifying layer (RNN\(_{cl}\)) is set as the baseline. In addition, we compare the proposed MIDIXLNet model to MIDIBert [24] and LStoM [25], a large-parameter model and a small-parameter model, respectively, that have obtained state-of-the-art results in recent years. Note that the models in each group in Table 2 use the same data description approach.

It can be seen from Table 2 that, for the complex POP909\(_\textrm{mix}\) dataset, MIDIBert [24] and LStoM [25] achieve no more than 90% accuracy, suggesting some room for improvement. Furthermore, the result of RNN\(_{cl}\), which uses the same data description method as MIDIBert [24], is lower. However, based on our proposed 5-dimensional data input (comprising Bar, Sub-Beat, Pitch, Duration, and Chord), the restructured MIDIBert [24] (MIDIBert\(_5\)) shows a significant improvement, i.e., 96.84% accuracy. Under the same data description method as MIDIBert\(_5\), the proposed MIDIXLNet achieves a higher accuracy of 97.37%, while its parameter count is approximately 64% of that of MIDIBert\(_5\).

Table 2 Comparison of using different main melody classification models

Next, we perform an ablation experiment to explore the influence of the individual components within MIDIXLNet, and the results are recorded in Table 3. Specifically, when the masking strategy used in the XLNet model is made similar to that of the BERT model (rather than using the XLNet model’s two-stream self-attention mechanism), the performance of main melody classification drops to 96.81%. In other words, with fewer parameters than MIDIBert\(_5\), the results obtained without the two-stream self-attention mechanism are very close to those of MIDIBert\(_5\) in Table 2. Also, reducing the number of hidden layers has the most pronounced negative impact on performance. The other contributing factor to the performance deterioration is changing the attention type from bi-directional to uni-directional.

Table 3 Ablation experiment for the proposed MIDIXLNet model using the POP909\(_{mix}\) dataset

Comparison with non-main melody classification approach

In this section, we conduct two experiments to justify the necessity of designing a new model for the main melody classification task based on MIDI files.

First, recall from earlier that music is usually recorded as an audio file (e.g., WAV) or a symbolic file (e.g., MIDI). To verify the importance of designing a main melody classification method for MIDI files, we evaluate the result of a main melody extraction method for audio. To be more specific, we use the midi2audio package to convert the MIDI files in the Mixed POP909 dataset to audio files. The algorithm for extracting the main melody from audio and outputting it to MIDI files is based on the Melodia algorithm [42]. To evaluate and compare the results, we use edit_distance [36, 60,61,62] and Dynamic Time Warping (DTW) [1, 60]. The edit_distance can calculate the similarity between sequences, and it is robust against different lengths [60]. Hence, the edit_distance metric is used to calculate the pitch similarity (PS) and chord similarity (CS) between the extracted main melody and the ground truth. On the other hand, DTW, also referred to as melody distance (MD) by Tsai et al. [1] and Ju et al. [60], is an indicator based on dynamic programming that takes the pitch and duration values into account together. Table 4 records the results of the aforementioned metrics for both Melodia [42] and our proposed MIDI-based method (i.e., MIDIXLNet). The results suggest that the melodies extracted from the audio files by Melodia [42] are not the main melody, due to the low similarity and large MD values. The main reason for these non-ideal results is that the audio files converted from MIDI files lose an important perceptual feature, namely loudness. In other words, the fact that the main melody is usually louder than the other melodies in the music is the key feature used to extract the main melody from an audio file, but it is not the main feature in a MIDI file. As a result, the main melody of a MIDI file cannot be extracted accurately using the audio-based method, due to the differences between MIDI and audio files. In short, the results collected for MIDIXLNet make it apparent that MIDIXLNet outperforms Melodia [42] for main melody classification on all metrics considered in this work. Therefore, it is necessary to design a MIDI-based approach.
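For reference, the sketch below shows textbook versions of the two kinds of metric used here: an edit-distance-based similarity over pitch (or chord) sequences and a DTW-style distance over (pitch, duration) pairs. These are simplified formulations assumed for illustration, not necessarily the exact definitions in the cited works.

```python
import numpy as np

def edit_similarity(a, b):
    """1 - normalized Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1))
    d[:, 0], d[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return 1.0 - d[m, n] / max(m, n, 1)

def melody_distance(a, b):
    """DTW distance between two sequences of (pitch, duration) pairs."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1][0] - b[j - 1][0]) + abs(a[i - 1][1] - b[j - 1][1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

print(edit_similarity([60, 62, 64], [60, 62, 65]))                      # PS-style similarity
print(melody_distance([(60, 4), (62, 4)], [(60, 4), (62, 2), (64, 2)]))  # MD-style distance
```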

Table 4 Comparison of the proposed MIDI-based main melody classification method (i.e., MIDIXLNet) with the audio-based method (i.e., Melodia)

Secondly, Table 5 records the performance achieved by cross-domain transfer learning, which further justifies the need to design a model specifically for the main melody classification task. This experiment is inspired by Bukhsh et al. [63]’s promising results with a cross-domain transfer learning strategy. Specifically, considering that Wu et al. [64] recently obtained strong results in the field of few-shot object detection, we adopt their base detector to conduct the cross-domain transfer learning, namely the Faster R-CNN model [65]. In addition, the backbone of Faster R-CNN is ResNet-101 [66], a popular architecture used with Faster R-CNN [64, 67]. Faster R-CNN [65] aims to detect objects in an image. Hence, in our input matrix, we set the ground truth of each object region as 1 (width) \(\times\) 5 (height), which means there are 512 regions in total in the input \(512 \times 5\) matrix. Based on the results recorded in Table 5, cross-domain transfer learning, in its current form, does not achieve promising results. Specifically, the accuracy is low (around 44%), which is significantly lower than the accuracy obtained by our method designed for main melody extraction.

Overall, neither cross-domain transfer learning nor audio-based melody extraction achieves good results for MIDI-based main melody classification. Therefore, it is necessary to design a model specifically for the main melody classification task based on MIDI files.

Table 5 Performances attained by Transfer learning based on ResNet-101 using Faster R-CNN model

4.2.2 Stage 2: Multi-MMLG\(_{s2}\)

As per our newly proposed definition of the main melody, it should be relevant to the corresponding music and similar to it at a moderate level. Considering that the Mixed POP909 dataset contains the main melody ground truth, melodies that are moderately similar to the ground truth are the potential main melodies. Therefore, we use the three metrics described in Sect. 4.2.1, namely PS, CS, and MD, to evaluate the similarity between the final predicted results and the music’s main melody ground truth.

Table 6 Ablation experiments’ results of the Multi-MMLG framework

Table 6 shows the evaluation results of the ablation experiments on the Multi-MMLG framework. In this experiment, we do not evaluate Stage 1 or Stage 2 of our framework separately because these two stages cannot work independently to predict potential main melodies. In other words, the classification stage (i.e., Stage 1 of the Multi-MMLG framework) can only classify one main melody rather than multiple main melodies, and the prediction stage (i.e., Stage 2 of the Multi-MMLG framework) can only predict new melodies rather than main melodies. Hence, neither stage can predict multiple main melodies independently, as anticipated. Instead, we use different models to replace the two stages of the Multi-MMLG framework, thus justifying the models selected within the framework: we verify the influence of using different models in Stage 1 on the final predicted results and compare the prediction model in Stage 2 with four other models.

For one of the control groups in Table 6, we use MuseBert [28], XLNet, RNN\(_{pre}\), and MusicTransformer [68] to replace the modified MuseBERT model used in Multi-MMLG\(_{s2}\), i.e., MM\(_{s1}\) + MuseBert/XLNet/RNN\(_{pre}\)/MusicTransformer. These experiments compare the predicted results of the modified MuseBERT with four other music prediction methods, thus demonstrating the necessity of using the two prediction conditions (i.e., MIDIXLNet’s classified results from Multi-MMLG\(_{s1}\) and the music notes’ relationship). Specifically, the MM\(_{s1}\) + XLNet/RNN\(_{pre}\)/MusicTransformer [68] control groups use one condition, i.e., the classified results from Multi-MMLG\(_{s1}\), and directly predict the next notes. Table 6 suggests that their predicted results show considerably low similarity with the corresponding music (both PS and CS are below 30%, and MD is larger than that of our proposed Multi-MMLG framework using MIDIXLNet and the modified MuseBERT models). Hence, these three methods cannot predict the potential main melodies. Similarly, although MM\(_{s1}\) + MuseBert [28] uses the other condition, i.e., the notes’ relationship, its predicted results are still not ideal. Overall, the proposed Multi-MMLG framework, which uses MIDIXLNet and a modified MuseBERT model in sequence and combines the above two conditions, significantly improves all three similarity metrics, i.e., PS, CS, and MD.

For the other control group, we use RNN\(_{cl}\) and MIDIBert [24] to replace the MIDIXLNet used in Multi-MMLG\(_{s1}\), i.e., RNN\(_{cl}\)/MIDIBert [24] + MM\(_{s2}\). In comparison with the Multi-MMLG framework, this control group justifies using the MIDIXLNet model in Multi-MMLG\(_{s1}\) and demonstrates the importance of a highly accurate classified main melody. This control group also uses two conditions, i.e., the classified melody and the music notes’ relationship. However, the predicted results are still worse than those of Multi-MMLG. From these results, we confirm that, in addition to using the above two conditions, the accuracy of one condition, namely the classified melody, is the second factor that affects the final predicted results. Multi-MMLG addresses both factors simultaneously, thus achieving improved results.

Table 7 Similarity between the generated and ground truth melodies under different masking rates (MR)

In addition, as introduced in Sect. 4.1, we use the random masking rate to control the prediction’s condition, thus controlling the predicted main melodies. Therefore, we also evaluate the results of the Multi-MMLG framework under different masking rates.

To be more specific, the potential main melody should be similar to the ground truth within a suitable range, neither too high nor too low. To control the final results, we apply a masking strategy (0%–45%) to Stage 1’s output, as introduced in Sect. 4.1. The results are shown in Table 7. At a 0% masking rate, the results are closer to the ground truth, while at a higher masking rate (e.g., 45%), the results are less similar to the music. Hence, as the masking rate increases, the model predicts main melodies from fewer notes in the condition, and the predicted melodies become increasingly irrelevant to the corresponding music. However, Table 7 shows that the PS of the prediction results at a 30% masking rate on the POP909\(_{mb}\) dataset is significantly better than at the 15% masking rate. After a comprehensive observation of the evaluation results, 15%–30% is a reasonable masking range for predicting the potential main melodies. At the same time, the results obtained with different masking rates also demonstrate the robustness of Multi-MMLG: even when the masking rate reaches 30% and fewer notes remain in the condition, the similarity between the main melodies predicted by the model and the corresponding music is still better than that of the other methods in Table 7.

Fig. 10

Results of using the Multi-MMLG framework to predict the potential main melodies of the 874th file (in the POP909\(_m\) dataset). Under a 15%–30% masking rate of the condition, the predicted results are potential main melodies. In particular, the first result under the 15% masking rate is similar to the BRIDGE melody, which is a typical potential main melody ignored by POP909\(_m\)

In conclusion, the structure and the models selected in the Multi-MMLG framework are reasonable, and the framework can output multiple main melodies.

4.2.3 Qualitative analysis

We perform a qualitative analysis for a more intuitive view of the predicted melodies. Figure 11 presents the results of embedding other models into our framework. Consistent with the quantitative analysis, the predicted results are different from both the MELODY track and the BRIDGE track.

Fig. 11

Results of the control group using other models to replace the two stages of the Multi-MMLG framework to predict the potential main melodies of the 874th file (in the POP909\(_m\) dataset), including MIDIBert [24], XLNet and MuseBert\(_o\)

Hence, the methods shown in Fig. 11 do not work. Figure 10 shows some results achieved by the Multi-MMLG framework when the masking rate ranges between 0% and 45%. In Figs. 10 and 11, the piano rolls containing the ground-truth main melody are derived from the POP909\(_m\) dataset. This dataset only regards the MELODY track as the main melody. If the predicted results are similar to but non-identical with the MELODY track, these results are potential main melodies. In addition, the BRIDGE melody is also a main melody under our new main melody definition. Since POP909\(_m\) ignores this potential main melody, predicted results that are similar to the BRIDGE melody further justify that our framework successfully extracts potential main melodies.

According to the above analysis, in Fig. 10, the first result at the 15% masking rate is more similar to the BRIDGE melody, which is a typical example demonstrating our framework’s ability to predict the main melody. Consistent with the quantitative analysis, the predicted results at the 0% masking rate differ little from the ground truth, and the 45% result is irrelevant to the corresponding music. Therefore, when operating in the range between 15% and 30% masking rate, the Multi-MMLG framework can predict the potential main melodies of a piece of music.

5 Conclusion

This paper first puts forward a new definition of the main melody. It acknowledges that the main melody is not unique; instead, it is a set of similar but non-identical melodies. This definition is complex but more suitable for applications such as music information retrieval. We propose a framework that addresses the problem of automatically predicting multiple main melodies and caters for the complexity and diversity of the main melody. The two-stage pipeline framework comprises a main melody classification stage and a conditional prediction stage. Specifically, MIDIXLNet, used in Stage 1 of the Multi-MMLG framework, is proposed to efficiently provide main melody classification results with high accuracy. High-accuracy classification results preserve more context features, which then become the condition for prediction in Stage 2. To predict potential main melodies that are similar within a reasonable range, we apply a masking strategy to Stage 2’s condition. Experiment results suggest that Multi-MMLG can efficiently obtain high-accuracy classification results of the main melody and automatically extract multiple potential main melodies when the masking rate falls in the range between 15% and 30%.

Moreover, the experimental results show that the chord feature significantly impacts the melody classification results, increasing the accuracy by around 6%. In addition, we observe that when the masking rate of the prediction condition varies, PS, CS, and MD do not decrease monotonically. In other words, as the conditions for predicting the main melody become more relaxed, the similarity between the predicted main melody and the corresponding music may even improve. This characteristic benefits the application of the framework in the field of main melody completion.

Currently, this framework controls the predicted results by randomly masking the prediction conditions. However, random masking prevents the prediction results from adapting precisely to the application environment. In other words, the predicted results may randomly and irregularly yield main melodies with different keys, similar strong beats, or similar weak beats. In this case, when people expect the predicted main melodies to have a particular structure, they need to run the prediction multiple times. Therefore, in the future, we will consider adjusting the framework’s Stage 2. We would use multiple relationship matrices, such as a strong/weak beat relationship matrix and a chord relationship matrix, as the input of the modified MuseBERT model, replacing random masking. After specifying the required relationship matrix, our framework would be able to predict main melodies that meet human requirements more directly. This adjustment involves two parts: designing the representation of the different relationship matrices and the embedding methods for dealing with the different inputs.