
1 Introduction

A music piece is composed of musical notes which occur in different combinations and timings, making one melody distinct from another. Notating or transcribing such compositions is very important: it not only helps in understanding them better but also in communicating with other musicians. The background music (BGM) of a composition is as important as the lead melody. It is the BGM which makes a piece sound complete, and it runs for almost the entire span of the piece. A change in the BGM can alter the mood of a composition and at times disrupt it completely. It is therefore very important to play the BGM flawlessly during performances to uphold the essence of a composition. One of the most important facets of the BGM is the chord, which is composed of two or more musical notes played simultaneously. Every composition has an associated chord chart whose transcription is essential.

Rajpurkar et al. [15] distinguished chords in real time. They used a hidden Markov model (HMM) and Gaussian discriminant analysis along with chroma-based features and obtained an accuracy of 99.19%. Zhou and Lerch [18] used deep learning for distinguishing chords. They worked with 317 music pieces and obtained a recall of 0.916 using max-pooling. Cheng et al. [4] distinguished chords for music classification and retrieval with the aid of an N-gram technique and HMM. Different chord-based features such as the chord histogram and LCS were also involved in their experiments, and a highest overall accuracy of 67.3% was obtained. Quenneville [14] discussed numerous aspects of automatic music transcription. He highlighted the basics of making music as well as of transcription, and covered different pitch-detection techniques including Fourier transform-based, fundamental frequency-based and harmonicity-based approaches, to name a few.

Berket and Shi [3] presented a two-phase model for music transcription. In the first phase, they used acoustic modelling to detect pitches, which were transcribed in the second phase. They worked with 138 MIDI files which were converted to audio; the training set consisted of 110 songs while the remaining were used for testing, and results as high as 99.81% were reported. Wats and Patra [17] used a non-negative matrix factorization-based technique for automatic music transcription. They worked on the Disklavier dataset and obtained good results. Benetos et al. [1] presented an overview of automatic music transcription, touching on its various applications and challenges as well as several transcription techniques. Muludi et al. [12] used frequency domain information and pitch class profiles for chord identification. Their experiments involved 432 guitar chords and yielded an accuracy of 70.06%.

Osmalskyj et al. [13] used a neural network and pitch class profiles for guitar chord distinction. Their study involved other instruments as well, including accordion, violin and piano. They also performed instrument identification and obtained an error rate of 6.5% for chord identification. Benetos et al. [2] laid out the different techniques and challenges involved in automatic music transcription. They discussed various pitch tracking methods, including feature-based, statistical and spectrogram factorization-based approaches, among others, as well as several types of transcription, including instrument- and genre-based transcription and informed transcription. Kroher and Gomez [7] attempted to automatically transcribe flamenco singing from polyphonic tracks. They extracted the predominant melody and eliminated the contours of the accompaniments; the vocal contour was then discretized into notes, followed by the assignment of a quantized pitch level. They experimented with three datasets totaling more than 100 tracks and obtained results that were better than state-of-the-art singing transcribers in terms of overall performance, onset detection and voicing accuracy. Costantini and Casali [5] used frequency analysis for chord identification. Experiments were performed with up to 4-note chords, and highest accuracies of 98%, 97% and 95% were obtained for the 2-, 3- and 4-note chords respectively.

Here, a system is proposed to distinguish chords from clips of very short duration. It works with LSTM-RNN based classification and has the potential of aiding automatic transcription of background music, which is very vital. The system is illustrated in Fig. 1.

Fig. 1. Pictorial representation of the proposed system.

The rest of the paper is organized as follows: the details of the dataset are presented in Sect. 2, Sections 3 and 4 describe the proposed method and its results respectively, and finally we conclude in Sect. 5.

2 Dataset

Data is a very important aspect of any experiment, and its quality plays a crucial part in the development of robust systems. To the best of our knowledge, there is no publicly available dataset of chords, and hence we put together a dataset of our own. In the present experiment, we consider two of the most popular chords from the major family (C and G) and two of the most popular chords from the minor family, namely A minor (Am) and E minor (Em) [16]. The constituent notes of the scales of the considered chords, along with the notes of the chords, are presented in Table 1. The chord pairs (G-Em) and (C-Am) have common notes, which makes them difficult to distinguish.

Table 1. Notes involved in the chords.

Volunteers were provided a Hertz acoustic guitar (HZR3801E) for playing the chords. They played different rhythm patterns, and no metronome was used so as to allow relaxation with respect to tempo. The volunteers further used different types of plectrums, which slightly change the sound, thereby encompassing more variation. The audio was recorded through the primary line port of a computer with a Gigabyte B150M-D3H motherboard. Studio ambience and the use of preamplifiers were avoided to reflect a real-world scenario. The audio clips were recorded in .wav format at a bitrate of 1411 kbps.

Four datasets (D1-D4) having clips of lengths 0.25, 0.5, 1 and 2 s respectively were put together from the recorded data; their details are presented in Table 2. We worked with clips of such durations to test the efficiency of our system on short clips, which are common in the real world.
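The exact clip-cutting procedure is not described here; purely as an illustration, the following Python sketch assumes non-overlapping slicing of each recorded .wav file into fixed-length clips. The soundfile reader and the function name are our own choices, not part of the original setup.

import soundfile as sf  # assumed WAV reader; any audio I/O library would do

def slice_into_clips(wav_path, clip_seconds):
    """Cut one recording into non-overlapping clips of a fixed duration
    (0.25, 0.5, 1 or 2 s for datasets D1-D4); trailing samples that do
    not fill a whole clip are discarded."""
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:                 # mix down to mono if the file is stereo
        audio = audio.mean(axis=1)
    clip_len = int(clip_seconds * sr)
    return [audio[i * clip_len:(i + 1) * clip_len]
            for i in range(len(audio) // clip_len)]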

Table 2. Details of the generated datasets with number of clips per chord.

3 Proposed Method

3.1 Preprocessing

Framing. The clips were first subdivided into smaller segments called frames. This was mainly done so that the spectral content within a segment is approximately stationary; over a whole clip it shows high deviations, which makes analysis very difficult. The clips were divided into 256-point frames in overlapping mode, with 100 common points (overlap) between two consecutive frames [11].
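A minimal sketch of this framing step (256-point frames with a 100-point overlap, i.e. a hop of 156 samples); the function name is ours.

import numpy as np

def frame_signal(signal, frame_len=256, overlap=100):
    """Split a 1-D signal into overlapping frames of frame_len samples
    with `overlap` common points between consecutive frames."""
    hop = frame_len - overlap                      # 256 - 100 = 156 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])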

Windowing. Jitters are often observed in the frames due to the loss of continuity at the boundaries. These disrupt frequency-based analysis in the form of spectral leakage. To tackle this, the frames are windowed with a windowing function. Here we used the Hamming window [11], which is presented in Eq. (1):

$$\begin{aligned} w(n)=0.54-0.46 \cos \Bigg ( \frac{2 \pi n}{N-1}\Bigg ), \end{aligned}$$
(1)

where n is a sample point within an N-sized frame.
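The window of Eq. (1) can be applied per frame as in the sketch below; np.hamming(N) produces the same coefficients, but the window is written out explicitly to mirror the equation.

import numpy as np

def apply_hamming(frames):
    """Multiply every frame by the Hamming window of Eq. (1) to
    reduce spectral leakage at the frame boundaries."""
    N = frames.shape[1]
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))
    return frames * w                  # broadcast over the frame axis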

3.2 Feature Extraction

Each of the clips was used for extraction of the standard line spectral frequency (LSF) features at the frame level. LSF [11] was chosen due to its higher quantization power [10]. Here, a sound signal is represented as the output of a filter H(z) whose inverse is G(z), where g\(_{1 \ldots m}\) are the predictive coefficients:

$$\begin{aligned} G(z)=1+g_1z^{-1}+\cdots + g_mz^{-m} \end{aligned}$$
(2)

The LSFs are derived by decomposing G(z) into G\(_x\)(z) and G\(_y\)(z), which are detailed below:

$$\begin{aligned} G_x(z)= G(z)+z^{-(m+1)}G(z^{-1}) \end{aligned}$$
(3)
$$\begin{aligned} G_y(z)= G(z)-z^{-(m+1)}G(z^{-1}) \end{aligned}$$
(4)

We extracted 5, 10, 15, 20 and 25 dimensional features for the frames. Each of these dimensions corresponds to a band, that is, 5 dimensional LSFs have 5 bands and so on. Next, these bands were graded in accordance with the total value of their coefficients, and this band sequence was used as the feature. It depicts the energy distribution pattern across the bands. Along with this, the mean and standard deviation of the per-frame spectral centroids were also appended. When 5 dimensional LSFs were extracted, a total of \(5 \times 440=2200\) coefficients were obtained for a clip of only 1 s (a 1 s clip produced 440 frames), and this dimension varied with the length of the clips. The band grades along with the mean and standard deviation of the centroids produced a \(5+2=7\) dimensional feature when 5 dimensional LSFs were extracted, and these features are independent of the clip length. So finally we obtained features of 7, 12, 17, 22 and 27 dimensions.
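The sketch below illustrates one possible reading of this feature construction: per-frame LSFs are obtained from the LPC polynomial via the sum/difference polynomials of Eqs. (3)-(4), the bands are ranked by their total coefficient value, and the mean and standard deviation of the per-frame spectral centroids are appended. librosa.lpc is assumed only for estimating the predictive coefficients, and the exact grading rule is our interpretation, not the authors' exact procedure.

import numpy as np
import librosa  # assumed, only for LPC estimation

def lsf_per_frame(frame, order=5):
    """LSFs of one windowed frame: build G_x(z) and G_y(z) of
    Eqs. (3)-(4) from the LPC polynomial G(z) and take the angles
    of their unit-circle roots in the upper half-plane."""
    g = librosa.lpc(np.asarray(frame, dtype=float), order=order)  # [1, g1, ..., gm]
    gx = np.append(g, 0.0) + np.append(0.0, g[::-1])
    gy = np.append(g, 0.0) - np.append(0.0, g[::-1])
    roots = np.concatenate([np.roots(gx), np.roots(gy)])
    roots = roots[np.imag(roots) > 1e-7]           # drop the trivial roots at z = +/-1
    return np.sort(np.angle(roots))[:order]

def clip_features(frames, order=5):
    """Band grades plus mean/std of the spectral centroid: an
    (order + 2)-dimensional feature independent of clip length."""
    lsfs = np.stack([lsf_per_frame(f, order) for f in frames])
    grades = np.argsort(np.abs(lsfs).sum(axis=0))  # rank bands by total coefficient value
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    bins = np.arange(spectra.shape[1])
    centroids = (spectra * bins).sum(axis=1) / (spectra.sum(axis=1) + 1e-12)
    return np.concatenate([grades, [centroids.mean(), centroids.std()]])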

3.3 Long Short Term Memory-Recurrent Neural Network (LSTM-RNN) Based Classification

LSTM-RNN can preserve state, unlike standard neural networks [9], which makes it suitable for sequences. It further solves the vanishing gradient problem of simple RNNs [8]. An LSTM block comprises a cell state and three gates, namely the forget gate, input gate and output gate. The input gate (\(i_n\)) helps to generate the new state:

$$\begin{aligned} i_n=\sigma (Wt_iS_{n-1}+Wt_iX_n), \end{aligned}$$
(5)

where \(Wt_i\) is the associated weight. The forget gate discards values from the previous state when forming the present state:

$$\begin{aligned} f_n= \sigma (Wt_fS_{n-1}+Wt_fX_n), \end{aligned}$$
(6)

where \(Wt_f\) is the associated weight. The output gate determines the next state as shown below:

$$\begin{aligned} o_n=\sigma (Wt_oS_{n-1}+Wt_oX_n), \end{aligned}$$
(7)

where \(Wt_o\) is the associated weight. Our network comprised a 100 dimensional LSTM layer. The output of this layer was passed through three fully connected layers of dimensions 100, 50 and 25 respectively, each with ReLU activation. The final layer was a 4 dimensional fully connected layer with softmax activation. We initially used 5 fold cross validation with 100 training epochs, and the network parameters were set after trials.
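A minimal Keras sketch of this architecture (100-unit LSTM, dense layers of 100, 50 and 25 units with ReLU, and a 4-way softmax output) is given below. The optimizer, loss and the treatment of each clip-level feature vector as a length-1 sequence are our assumptions, as they are not specified above.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(feat_dim, timesteps=1, n_classes=4):
    """100-dimensional LSTM layer followed by 100/50/25 ReLU dense
    layers and a 4-class softmax output, as described in the text."""
    model = keras.Sequential([
        layers.Input(shape=(timesteps, feat_dim)),
        layers.LSTM(100),
        layers.Dense(100, activation="relu"),
        layers.Dense(50, activation="relu"),
        layers.Dense(25, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",                    # assumed optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model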

4 Result and Analysis

Each of the feature sets for the datasets D1-D4 was fed to the recurrent neural network; the results are summarized in Table 3. It is observed that the best result was obtained for the 22 dimensional features on D3. To obtain better results, the training epochs were varied with 5 fold cross validation for the 22 dimensional features of D3, as shown in Table 4. The best performance was obtained for 300 epochs. Increasing the training epochs further led to overfitting and thus produced lower results. The confusions among the different classes for 300 epochs are presented in Table 5(a). It is observed that the highest confusion was among the minor chords. The clips were analyzed and it was found that the volunteers at times accidentally muted strings, which interfered with the chord textures in the barred shapes. This could be one probable reason for such confusions.
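The evaluation protocol (k-fold cross-validation with varied epoch counts and a class-wise confusion matrix) can be sketched as below, reusing the build_model sketch from Sect. 3.3. The use of scikit-learn's StratifiedKFold and the hyper-parameter names are our assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def cross_validate(X, y, folds=5, epochs=100, n_classes=4):
    """Train/evaluate the network with k-fold cross-validation and
    accumulate the confusion matrix over all folds.  X has shape
    (clips, 1, feat_dim) and y holds integer chord labels 0-3."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        model = build_model(X.shape[-1], timesteps=X.shape[1], n_classes=n_classes)  # sketch from Sect. 3.3
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        pred = np.argmax(model.predict(X[test_idx], verbose=0), axis=1)
        cm += confusion_matrix(y[test_idx], pred, labels=range(n_classes))
    return np.trace(cm) / cm.sum(), cm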

Table 3. Results for the different feature dimensions on the disparate datasets.
Table 4. Accuracy for different training epochs on D3 with 22 dimensional features.
Table 5. (a) Confusion matrix for 300 epochs. (b) Confusion matrix for 20 fold cross validation. (c) Confusion matrix for 300 epochs with 20 fold cross validation.

In order to obtain further improvements, we varied the cross validation folds with 100 epochs for the 22 dimensional features of D3. The obtained results are presented in Table 6. 20 folds produced the best result, wherein the variation of the dataset was evenly distributed; the performance decreased on further increasing the number of folds. The interclass confusions are presented in Table 5(b), wherein it is observed that the chords C and Em were recognized with 100% accuracy. The confusions among the minor chords were also overcome in this setup. Finally, the best fold value (20 folds) and the best training epoch count (300 epochs) were combined, which produced an accuracy of 99.91% (the overall highest); the corresponding confusions are presented in Table 5(c). These confusions were almost identical to those of the 20 fold cross validation setup, with only one more instance of the G chord identified correctly. Some other popular classifiers, including BayesNet (BN), naïve Bayes (NB), multi layer perceptron (MLP), random forest (RF) and the radial basis functional classifier (RBF) from [6], were also evaluated on D3; their results are summarized in Table 7.

Table 6. Accuracy for different folds of cross validation on D3 with 22 dimensional features.
Table 7. Performance of different classification techniques on D3 with 22 dimensional features.

5 Conclusion

Here, a system is presented to distinguish chords from clips of short duration. The system works with an LSTM-RNN based classification technique and produced encouraging results. In the future, we will experiment with a larger set of chords and involve other instruments as well. We will introduce other tracks along with the chords to observe the system's performance. We also plan to identify and discard silent sections in the clips to obtain better results. Finally, we will make use of other acoustic features coupled with different modern machine learning techniques to obtain further improvements in our results.