1 Introduction

A special issue on automatic processing of music information, edited by Herrera-Boyer et al. (2013), describes how the past of MIR (Music Information Retrieval), combined with realistic perspectives on the future of specific topics, shapes the future of this area. One may argue that, given the amount of available music-related information, expert-based knowledge may become obsolete, as it is already preserved in numerous data records. Thus, future research in the MIR domain may eventually turn to deep machine learning (Humprey et al. 2013). On the other hand, Lee and Cunningham (2013) show that understanding users’ needs, behavior and requirements may have a great impact on developing a system that addresses critical concepts of MIR. Accordingly, they recommend increasing the visibility and impact of user-based studies in the field. A very comprehensive review of topics related to MIR was prepared by Schedl et al. (2014). The study contains over 300 references; however, this is only a small fraction of the literature devoted to Music Information Retrieval. An even more recent survey was prepared by Burgoyne et al. (2016), in which MIR research is presented as an important part of the rapidly evolving area called digital humanities.

Automatic music genre classification (AMGC) has been explored quite thoroughly in recent years by the research community (ISMIR conferences, ISMIR (2016)) and is one of the most popular search query choices within the MIR domain (Bergstra et al. 2006; Burred 2014; Kostek 2005; Ntalampiras 2013; Schedl et al. 2014; Silla et al. 2007; Sturm 2013; Tzanetakis et al. 2002). On a smaller scale, a survey focusing on AMGC was presented by Silla et al. (2007). They observed that the typical approach to AMGC is based on feature space decomposition and machine learning used to assign music genre labels. Some non-conventional machine learning strategies for AMGC also exist, based on both space and time decomposition schemes; an example is again the work of Silla et al. (2007). Features employed in their work were selected from several parts of a music excerpt, as well as from the entire music signal. They used a combination of binary classifiers, the results of which were merged to produce the final music genre labeling (Silla et al. 2007). Another non-conventional approach was shown in the work by Sturm (2014), as well as by Bergstra et al. (2006): the AdaBoost algorithm, which performs the classification iteratively by combining the weighted votes of several weak learners, was utilized. Furthermore, a novel data selection strategy based on Gaussian mixture model clustering for the creation of the Universal Modeling (UM) was introduced by Ntalampiras (2013). The scheme considered the dataset characteristics, adapted itself to them and achieved increased recognition rates in comparison to the conventional approach. Very recently, a Special Issue on Intelligent Audio Processing, Semantics, and Interaction was prepared, in which it was pointed out that semantic audio incorporates the processes of intelligent audio processing and augmented (semantic) interaction, thus broadening the area of music information retrieval (Kalliris et al. 2016).

Research on AMGC is still ongoing, especially in the context of scalability, as most studies were carried out on databases delivered either by the ISMIR, MIREX, and ISMIS conferences or those available on the Internet, e.g. GTZAN, RWC-MDB (Real World Computing Music Database) (Goto et al. 2002), Magnatune, etc., typically including approx. 1,000–2,000 music pieces assigned to a few popular music genres (see e.g. Bergstra et al. (2006)). There are some larger collections of music excerpts, e.g. the Latin Music Database containing 3,160 music pieces categorized into 10 musical genres (Silla et al. 2007). In many cases, such databases are labeled manually, which means that audio files are correctly assigned to the corresponding music genre; however, the assignment is carried out on a subjective basis. This aspect may have a very positive impact on the effectiveness of a classification experiment. Still, as reported in the literature, for the low-level-feature-based approach and multi-class recognition, the effectiveness of music genre classification is in the range of 60–80% (Bergstra et al. 2006; Tzanetakis et al. 2002; Holzapfel and Stylianou 2008; Kostek et al. 2011), with some exceptions (see e.g. Ntalampiras (2013)). It is worth mentioning that the above-mentioned collections are not consistent with one another, as they differ in the number of musical pieces, file format, bit resolution, number of genres, etc.; hence, a full comparison across these databases is justified only to some extent.

As pointed out by Silla et al. (2007), music genres are categorical labels created by human experts in order to identify the style of the music and organize music collections. The notion of “music genre” may not be precisely defined; however, research comprising music categorization, as stated by Tekman and Hortacsu, still plays an essential role in music appreciation and cognition (Tekman and Hortacsu 2002). One may argue that the expanding consumer market for social music network services has brought new ways of searching and analyzing musical information and of examining their effectiveness and quality, i.e. based on collaborative filtering and similarity measures retrieved from large music archives (Schedl et al. 2014; Ness et al. 2009). However, deeper content exploration, i.e. considering sound source separation in the context of music recognition, starts to be useful in improving genre classification (McKay and Fujinaga 2004; Pérez-García et al. 2010; Zhu et al. 2004). This is also visible in some new applications (e.g. AudioScore Ultimate 7 (2016)).

Finally, music genre is related to the thematic identity of radio broadcasting shows, and therefore to the underlying (semantic) relations between radio producers, content and consumers (Fu et al. 2011; Kotsakis et al. 2012; Romain et al. 2012), with many practical uses in media analytics and broadcast programming. Similar audio-driven semantic analysis approaches (including Music Genre Recognition) can also be considered for video content, thus leading to various semantic conceptualization outcomes (i.e. related to activities: dancing, singing, jogging, skiing, etc., occasions: birthday, graduation, etc., and others) (Lee and Ellis Daniel 2010). These are examples of intelligent information systems that will dominate the upcoming (fully deployed) Semantic Web in the near future.

The presented work is part of a larger framework carried out over the past several years. The authors and their collaborators performed several studies devoted to AMGC (Kostek 2013; Kostek et al. 2011; Kostek et al. 2014; Plewa and Kostek 2015; Rosner et al. 2014; Rosner and Kostek 2015), in which decision algorithms such as kNN (k Nearest Neighbors), SVM trained by the Sequential Minimal Optimization (SMO) algorithm, Rough Sets and Bayesian Networks were used. Recently, a paper was published by one of the authors and her Ph.D. student (Hoffmann and Kostek 2015), which presents a novel approach to Virtual Bass Synthesis (VBS) applied to mobile devices, called Smart VBS (SVBS). Improving the low-frequency sound of mobile devices is a problem addressed in many studies (Hill and Hawksford 2010; Mu and Gan 2012; 2015; Oo et al. 2000). The proposed algorithm uses rule-based settings of bass synthesis parameters adjusted according to the recognized music genre. To perform harmonic generation based on a nonlinear device (NLD) method, an intelligent controlling system, automatically adapting to the recognized music genre, was proposed (Hoffmann and Kostek 2015). Lately, a patent application was prepared in which the above-described approach has been extended to separating music tracks before the NLD settings are adjusted. Thus, the motivation behind the presented study is to provide answers with regard to the content of the feature vector derived from separated tracks, to what extent separating tracks helps to distinguish between genres, and which genres benefit most from track separation. The last question has already been asked by Wieczorkowska et al. (2011) with regard to recognizing the dominating musical instrument in sound mixes.

The aim of this research study is two-fold. The first goal is to propose a feature vector created from separated audio tracks while retaining parameters derived from the original excerpt. This may be important in the context of the nature of musical genres; for example, it is well known that some genres (e.g. rock, hard rock, techno, etc.) are characterized by rich rhythmic patterns that possibly translate, among others, into the values of energy and temporal descriptors. The authors’ approach differs from other studies reported in the literature, in which track separation typically splits a music excerpt into harmonic and drum tracks only. We have expanded this by extracting features related to individual music instruments that may be characteristic of a specific genre. Then, having several individual tracks, we checked whether it is sufficient to build a feature vector based on descriptors derived only from individual tracks or whether those from the whole music excerpt should be included as well, assuming the separation is not perfect because of estimation inaccuracies. The second goal is to check whether the feature vector derived from this study enables effective classification of musical genres in the case of a database containing thousands of records and a dozen musical genres, i.e. with correctness similar to earlier experiments carried out on much smaller music databases.

The paper is organized as follows. Section 2 presents the experimental setup, starting with a concise description of the database employed and the parameters utilized. Research studies devoted to the music separation process are then recalled, and the music track separation methodology utilized by the authors is explained. Finally, the pre-processing stage with regard to building a feature vector for the genre classification process is shown. Section 3 contains a short description of the classification algorithm used in this study, focusing on the so-called co-training mechanism. Section 4 discusses the results of experiments carried out to optimize feature vectors. Overall comments are included in the Summary.

2 Experimental setup

2.1 Music database

For the purpose of the experiments, a subset of audio excerpts belonging to 13 popular music genres was extracted from the Synat database (SYNAT 2016). In addition, a dataset of musical instrument samples was collected from the Sampleswap music service (Sampleswap 2016). As it contains samples of various instrument sounds, as well as examples of instruments playing in a loop, this dataset also provided longer sections of particular instruments. Samples of three musical instruments were collected for the experiment: piano, trumpet and saxophone.

The Synat database (Kostek et al. 2014; SYNAT 2016) stores over 50,000 music tracks, in the form of 30-second song excerpts in mp3 format, representing the following 22 genres: Alternative Rock, Blues, Broadway & Vocalists, Children’s Music, Christian & Gospel, Classic Rock, Classical, Country, Dance & DJ, Folk, Hard Rock & Metal, International, Jazz, Latin Music, Miscellaneous, New Age, Opera & Vocal, Pop, Rap & Hip-Hop, Rock, R&B, and Soundtracks. The whole database is parameterized employing the feature vector described in the subsequent section. For the experiments carried out within this study, over 8,000 music excerpts representing 13 music genres were selected: Alternative Rock, Blues, Classical, Country, Dance & DJ, Hard Rock & Metal, Jazz, Latin Music, New Age, Pop, R&B, Rap & Hip-Hop, and Rock. The music genres chosen for the analysis represent sufficiently diverse, yet related music material. They were also utilized in other research works; this way, the obtained results could be indirectly compared with findings reported in the literature.

It should be pointed out that the constructed music robot assigned songs to the genres (i.e. classes in the Synat database) according to their ID3 tags. These tags were saved in a fully automatic way without human control. It should be recalled that the ability of humans to distinguish between complex music genres relies strongly on context-dependent inferences and is far from perfect (Tzanetakis et al. 2002). Hence, decision systems trying to mimic the human way of analyzing music may not be capable of doing so with very high effectiveness.

2.2 Parametrization

Feature extraction plays a crucial part in the genre recognition process; thus, this stage should be carefully controlled and optimized. Feature vectors (FVs) for music genre classification are usually based on low-level descriptors from the MPEG-7 standard (Lindsay and Herre 2001; Hyoung-Gook et al. 2005), Mel-Frequency Cepstral Coefficients (MFCCs) (Tzanetakis et al. 2002) or, finally, dedicated parameters suggested by researchers (Kostek 1999; Kostek et al. 2011; Liu et al. 2007; Nayak and Bhutani 2011; Salamon et al. 2012; Silla et al. 2007). Table 1 presents a list of parameters contained in the Synat database (Kostek et al. 2014). Most parameters are based on the MPEG-7 standard, and the remaining ones are the MFCC descriptors and time-related dedicated parameters (Kostek et al. 2011). Since the definitions of these parameters are well known or easily found in the literature, they are not recalled here. It is interesting, however, that the same set of parameters was used in a study on music mood classification and brought sufficient effectiveness (Plewa and Kostek 2015).

Table 1 Audio features: identifier (ID) and description per type
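To give a concrete, if simplified, picture of this parameterization stage, the sketch below computes a small MFCC-and-spectral feature vector for a single 30-second excerpt in Python using the librosa library. It is only an illustration of the kind of per-excerpt descriptors listed in Table 1, not the actual Synat extraction code; the function name and the descriptor subset are assumptions made for the example.

```python
# Illustrative per-excerpt feature extraction (not the Synat parameterization);
# assumes the librosa library is available.
import numpy as np
import librosa

def extract_basic_features(path, sr=22050, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # MFCC matrix (n_mfcc x frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # rough stand-in for a spectral centroid descriptor
    zcr = librosa.feature.zero_crossing_rate(y)                  # simple time-domain descriptor
    # Per-excerpt statistics (means and variances) form one fixed-length vector
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.var(axis=1),
        [centroid.mean(), centroid.var()],
        [zcr.mean(), zcr.var()],
    ])
```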

2.3 Music track separation

In recent years, extensive research has also been conducted on the subject of audio sound separation, resulting in interesting ideas and solutions. Among the most promising approaches is sinusoidal modeling (SM) (Serra and Smith 1990), which has been extensively exploited over the last two decades. There are also many examples of algorithms implemented within various research studies (Bregman 1990; Casey and Westner 2000; de Cheveigne 1993; Dziubiński et al. 2005; Eweret et al. 2014; Gerber et al. 2012; Gillet and Richard 2008; Herrera et al. 2000).

Uhle et al. (2003) designed a system for drum beat separation based on Independent Component Analysis. In contrast, Smaragdis and Brown (2003) applied Non-Negative Matrix Factorization (NMF) to create a system for transcription of polyphonic music, with a special focus on piano music. In the study by Helen and Virtanen (2005), NMF is combined with a feature extraction and classification process; they obtained good results in drum beat separation from pop music. The same methodology was used by Paulus and Virtanen (2005) for drum transcription.

In this study, a semi-supervised instrument separation based on NMF is adapted to the authors’ needs. The main principles of the NMF-based methodology are first recalled, with a focus on cost function minimization.

The main principle of the drum separation algorithm is a semi-supervised approach based on non-negative matrix factorization (NMF). The aim of unsupervised learning algorithms such as vector quantization is to factorize a data matrix according to different constraints (Lee and Seung 1999), which results in clustering the data into mutually exclusive prototypes. The general idea of NMF-based separation is to split the input audio track into several isolated audio tracks representing specified components, such as the rhythmic or melodic part.

NMF is an efficient method used in the blind separation of drums and melodic parts of music recordings. NMF performs a decomposition of the magnitude spectrogram V (\(\mathbf{V} \approx \mathbf{WH}\)), obtained by the Short-Time Fourier Transform (STFT) with spectral observations in columns, into two non-negative matrices W and H, where \(\mathbf{W}\in R_{\ge 0}^{m \times r}\), \(\mathbf{H}\in R_{\ge 0}^{r\times n}\) and \(r \in \mathbb{N}\) is a constant. The columns of matrix W resemble characteristic spectra of the audio events occurring in the signal (such as notes played by an instrument), and the rows of matrix H contain their time-varying gains. The columns of W are not required to be orthogonal, in contrast to Principal Component Analysis (PCA).

For \(r \ll n, m\), generally only an approximate solution exists. The factorization is achieved by iterative algorithms minimizing cost functions, as presented in (1) (Schuller et al. 2009):

$$ \begin{array}{ll} (\textbf{V} - \textbf{WH})^{2} &\text{Squared error} \\ \left\|\textbf{V} - \textbf{WH} \right\|_{F} &\text{Frobenius norm} \\ \sum\limits_{ij} \left( V_{ij} \log\frac{V_{ij}}{(\textbf{WH})_{ij}} - V_{ij} + (\textbf{WH})_{ij}\right)\quad &\text{Modified KL divergence} \end{array} $$
(1)

The first two cost functions are closely related to each other, both minimizing a form of quadratic error, whereas the Modified KL divergence interprets the matrices V and (WH) as probability distributions and minimizes their divergence. The modification of the Kullback-Leibler (KL) divergence lies in the additional terms \(-V_{ij} + (\mathbf{WH})_{ij}\), added not only to introduce a measure of the absolute error, but also to ensure non-negativity.

In the experiments carried out, an iterative algorithm computing the two factors based on the Modified Kullback-Leibler divergence of V given W and H was used. A pre-trained SVM (Support Vector Machine) classifier was applied to each NMF component (a column of W and the corresponding row of H) to distinguish between percussive and non-percussive components, based on features such as the harmonicity of the spectrum and the periodicity of the gains. By selecting the columns of W that are classified as percussive and multiplying them by their estimated gains in H, an estimate of the contribution of percussive instruments to each time-frequency bin of V is obtained. Thus, a soft mask can be constructed and applied to V to obtain an estimated spectrogram of the drum part, which is transferred back to the time domain through the inverse STFT using the overlap-add (OLA) operation between the short-time sections in the inversion process. It should be recalled that the redundancy within overlapping segments and the averaging of the redundant samples average out the effect of the analysis window (windowing). More details on the drum separation procedure can be found in the introductory paper by Schuller et al. (2009).
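A minimal NumPy sketch of this drum-separation pipeline is given below. It is not the openBliSSART implementation: the multiplicative updates follow the standard Lee-Seung rules for the (modified) KL divergence, and the `is_percussive` predicate is a stand-in for the pre-trained SVM component classifier, here replaced by a crude spectral-flatness heuristic purely for illustration.

```python
# Sketch of NMF-based drum separation: KL-divergence NMF, per-component
# percussive/harmonic decision, and a soft mask on the mixture spectrogram.
import numpy as np

def nmf_kl(V, r, n_iter=200, eps=1e-10):
    """Factorize V (m x n, non-negative) as W @ H using multiplicative updates
    that minimize the (modified) KL divergence."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ np.ones_like(V) + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

def separate_drums(V, r=20, is_percussive=None):
    """V: magnitude STFT of the mixture. is_percussive(w, h) -> bool is a
    placeholder for the pre-trained SVM described in the text."""
    W, H = nmf_kl(V, r)
    if is_percussive is None:
        # Crude heuristic: flat (noise-like) component spectra count as percussive
        flatness = lambda w: np.exp(np.mean(np.log(w + 1e-10))) / (np.mean(w) + 1e-10)
        is_percussive = lambda w, h: flatness(w) > 0.5
    idx = [k for k in range(W.shape[1]) if is_percussive(W[:, k], H[k, :])]
    V_drums = W[:, idx] @ H[idx, :] if idx else np.zeros_like(V)
    mask = V_drums / (W @ H + 1e-10)   # soft mask, roughly in [0, 1]
    return mask * V                    # estimated drum spectrogram (invert via ISTFT with OLA)
```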

2.3.1 OpenBliSSART

The openBliSSART application is a C++ toolbox providing Blind Source Separation for Audio Recognition Tasks (Weninger and Lehmann 2011). Besides basic blind (unsupervised) source separation, classification by Support Vector Machines (SVM) using common acoustic features from speech and music processing is implemented. A GUI based on the cross-platform application framework Qt (Qt 2016) is available for source component playback and data set creation. The toolbox includes various source separation algorithms, with a strong focus on variants of Non-Negative Matrix Factorization (NMF). Furthermore, supervised NMF can be performed for source separation as well as audio feature extraction (Weninger et al. 2017). It should be noted that openBliSSART has built-in components to separate the HARMONIC and DRUM instruments. However, the toolkit also enables importing audio files (in order to define new instrument components), creating labels (to define a new instrument’s name), and creating responses (to define which instruments should be considered in the separation process). In our study, we introduced samples of new instruments (piano, trumpet, saxophone) to train the built-in SVM classifier. The musical instrument samples were collected from the Sampleswap music service (Sampleswap 2016).

2.3.2 Feature vectors built on separated music tracks

This part of the experiment involves separating the input signal in order to obtain the signal of a specific instrument or component: the harmonic part of the input audio track, the drum (percussion) signal, piano, trumpet, or saxophone. The same parameters as presented in Table 1 were therefore additionally calculated for the separated music tracks. In that way, vectors of parameters (VoPs) were obtained, i.e. the FV of the original track was extended by new parameters derived from the separated signals. Feature vectors were thus created from the original and harmonic signals (denoted as OH), original and drum (OD), original and piano (OP), etc., as well as from mixtures of more than two signals (e.g. original + harmonic + drum, resulting in OHD FVs). This strategy assumes that the separation process may not be perfect.
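The construction of such an expanded VoP can be summarized by the short sketch below; `extract_features` is a hypothetical stand-in for the Table 1 parameterization applied to a waveform, introduced only for illustration.

```python
# Sketch of building an expanded vector of parameters (VoP) by concatenating
# the FV of the original excerpt with the FVs of its separated tracks.
import numpy as np

def build_vop(original, separated_tracks, extract_features):
    """original: waveform of the full mix; separated_tracks: dict such as
    {'harmonic': y_h, 'drum': y_d} (any subset, e.g. OH, OD, OHD)."""
    parts = [extract_features(original)]                  # parameters of the original track (O)
    for name in sorted(separated_tracks):
        parts.append(extract_features(separated_tracks[name]))
    return np.concatenate(parts)
```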

2.3.3 Normalization methods

Data normalization is a scaling of the original data to a specified range, e.g. [−1, 1] or [0, 1], which is useful in data exploration and specifically for neural networks. In this study, the most popular normalization methods, i.e. Min-Max and Zero-Mean (Z-Score), were applied and tested in the pre-study. Min-Max normalization is a linear transformation of the original data, usually to the range [0, 1]. Zero-Mean normalization ensures that the mean value equals zero after the normalization process. For both methods, the normalization of the training and test datasets designated for the decision algorithms is handled in the same way: the statistics (minimum and maximum for Min-Max, mean and standard deviation for Zero-Mean) are calculated only on the training dataset and are then used to normalize both the training and the test datasets.
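A minimal sketch of the two schemes, assuming (as described above) that the statistics are computed on the training set only, might look as follows; this is illustrative code, not the preprocessing implementation used in the study.

```python
# Min-Max and Zero-Mean (Z-Score) normalization; statistics are estimated on
# the training data and re-used for the test data.
import numpy as np

def minmax_fit_transform(X_train, X_test):
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)                # guard against constant attributes
    return (X_train - lo) / scale, (X_test - lo) / scale   # mapped (approximately) to [0, 1]

def zscore_fit_transform(X_train, X_test):
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma   # zero mean, unit variance on the training set
```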

2.4 Experimental setup

As described above, the experimental setup consists of several steps (see Fig. 1). As shown in Fig. 1, feature extraction is performed on the original (O) audio and on the separated signals: harmonic (H), drum (D), trumpet (T), piano (P), and saxophone (S). All feature vectors (FVs) are then normalized and optimized. For the Non-Negative Matrix Factorization-based separation, the following configuration was used: cost function (Modified KL divergence), window size (20 ms, 30 ms or 40 ms), window function (square root of the Hann window), window overlap (0.5), and number of components (5, 10, 20 or 30). After including parameters derived from the separated signals in the original FV, an expanded feature vector is formed, which is also optimized (based on the reduction of the number of attributes, i.e. Best First, Greedy Stepwise, Ranker). This feature vector is called the vector of parameters (VoP). Finally, the derived VoPs are employed in the classification process by means of the co-training mechanism applied to SVM (described in the next Section). The last step involves the selection of optimum classification algorithm parameters and settings.

Fig. 1
figure 1

Experimental setup

3 Classification process

As already mentioned, there were two stages of experiments: one focused on FV optimization and the other devoted to evaluating music genre classification effectiveness. Several algorithms were employed in the pre-study phase, namely the k Nearest Neighbors (kNN) algorithm and the Support Vector Machine (SVM), both with and without the co-training mechanism, as well as Random Forests. The results achieved for music genre classification using these algorithms fall approximately within the same range of accuracy. As the best effectiveness was obtained using the Support Vector Machine algorithm with the co-training mechanism, only the results for SVM (co-training) are presented. First, some basic information concerning SVM is recalled below.

SVM uses a nonlinear mapping to transform the original training data into a new space. Within this new space, it searches for the optimal separating hyperplane (i.e., a “decision boundary” separating the instances of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane, which is found using support vectors (“essential” training data elements) and margins (defined by the support vectors). The SVM method is accurate thanks to its ability to model complex, nonlinear decision boundaries. It is much less prone to overfitting than other methods, especially when the cross-validation procedure is utilized (Hsu et al. 2003), and can also provide a compact description of the learned model. The Weka implementation of the SVM algorithm is the SMO function (Weka library 2016), which allows normalization or standardization of the input data as a preprocessing step and additionally enables different kernel functions, such as linear, polynomial or RBF (Gaussian radial basis function). Details of the decision-making stage involving machine learning with the cross-validation approach are presented in Fig. 2.

Fig. 2
figure 2

Classification process using the cross-validation approach
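The study itself used the Weka SMO implementation; purely as an illustration of the same setup described above Fig. 2 (a kernel SVM with input standardization, evaluated by cross-validation), a rough Python/scikit-learn counterpart could look as follows. The parameter values are placeholders, not the settings tuned in the experiments.

```python
# Illustrative kernel-SVM evaluation with cross-validation (scikit-learn),
# an analogue of the Weka SMO configuration described in the text.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def evaluate_svm(X, y, kernel="rbf", C=1.0, gamma="scale", folds=3):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C, gamma=gamma))
    scores = cross_val_score(model, X, y, cv=folds)   # accuracy per fold
    return scores.mean(), scores.std()
```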

3.1 Co-training method

Co-training (Blum and Mitchell 1998) is an example of a semi-supervised machine learning technique that uses labeled and unlabeled data to build a classifier. It initially learns on a small training set; then, during the classification of unlabeled data, the elements with the most confident predictions are used to iteratively extend the original training set (Xiaojin and Goldberg 2009). This is done by adding a threshold criterion to the process of classifying the data from the test set. If the confidence of the prediction for an unlabeled element is sufficiently high (i.e. higher than the threshold criterion), the element is marked as classified and added to the training set. This is repeated until all the elements from the test set are classified.

The main advantage of such an approach is that in each iteration the training set is extended by new information based on the classification of new elements from the test set, which can improve the learning process. Conversely, the disadvantage of the method is that if elements are classified incorrectly, misleading information is introduced into the training set. Regardless, co-training is a common approach in machine learning-based problem solving and usually gives much better results than the standard methods. Since the co-training method enhances classification performance, it was decided to apply it in the experiments (Rosner et al. 2013; Rosner et al. 2014; Rosner and Kostek 2015).
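The thresholded loop described above can be sketched as follows; this is an illustration only (the experiments used a Weka-based implementation), with an arbitrary confidence threshold and a probabilistic SVM standing in for the base classifier.

```python
# Sketch of the confidence-thresholded co-training loop: the most confident
# predictions on unlabeled data are moved into the training set and the
# classifier is retrained until no unlabeled elements remain.
import numpy as np
from sklearn.svm import SVC

def co_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=50):
    X_lab, y_lab, remaining = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = SVC(probability=True)
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(remaining)
        conf = proba.max(axis=1)
        pick = conf >= threshold
        if not pick.any():                  # force progress on the single most confident example
            pick = conf == conf.max()
        X_lab = np.vstack([X_lab, remaining[pick]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[pick].argmax(axis=1)]])
        remaining = remaining[~pick]
    clf.fit(X_lab, y_lab)
    return clf
```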

3.2 Effectiveness measures

Sturm (2013) indicated that presenting accuracy alone is not sufficient for an accurate interpretation of results obtained in the evaluation of music recognition. Therefore, in this study the following measures were used: True Positive (TP) Rate, Precision, Recall, Accuracy and F1:

$$ TPR(Id) = \frac{CCP(Id)}{TNE(Id)}\cdot 100\% $$
(2)

where: Id – class identifier, TPR – percentage of true positives of class Id, CCP – correctly classified positives of class Id, TNE – total number of elements in class Id; this means that an element of class Id will be classified correctly with probability equal to TPR(Id).

$$ \textit{Precision}(Id) = \frac{CCP(Id)}{TCP(Id)}\cdot 100\% $$
(3)

where: Id, CCP – as above, Precision – proportion of the examples that truly belong to class Id among all those classified as class Id, in [%], TCP – total number of objects classified as class Id (including FP(Id) – false positives); this means that if an instance X is classified as belonging to class Id, then with probability equal to the Precision value it truly belongs to class Id.

$$ \textit{Recall}(Id) = \frac{CCP(Id)}{TCN(Id)}\cdot 100\% $$
(4)

where: Id, CCP(Id) – as above, Recall(Id) – equivalent to the true positive rate (or sensitivity), TCN – total number of elements belonging to class Id (i.e. including FN(Id) – false negatives).

$$ \textit{Accuracy}(Id) = \frac{TP(Id)+TN(Id)}{TP(Id)+TN(Id)+FP(Id)+FN(Id)}\cdot 100\% $$
(5)

where: Id, TP, FP, FN – as above, TN(Id) – true negatives of class Id.

$$ F1 = 2\cdot \frac{\textit{Precision}(Id)\cdot \textit{Recall}(Id)}{\textit{Precision}(Id)+\textit{Recall}(Id)} $$
(6)

F1 – a combined measure for precision and recall (harmonic mean).
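For reference, the sketch below computes these per-class measures from a multi-class confusion matrix using the standard definitions (note that Recall then coincides with the TPR, as stated above); the code is illustrative rather than the evaluation script used in the study.

```python
# Per-class effectiveness measures computed from a confusion matrix C, where
# C[i, j] counts elements of true class i classified as class j.
import numpy as np

def per_class_measures(C, class_id):
    tp = C[class_id, class_id]
    fn = C[class_id, :].sum() - tp
    fp = C[:, class_id].sum() - tp
    tn = C.sum() - tp - fn - fp
    tpr = recall = 100.0 * tp / (tp + fn)                  # TPR / Recall, cf. (2) and (4)
    precision = 100.0 * tp / (tp + fp)                     # cf. (3)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)     # cf. (5)
    f1 = 2 * precision * recall / (precision + recall)     # cf. (6)
    return {"TPR": tpr, "Precision": precision, "Recall": recall,
            "Accuracy": accuracy, "F1": f1}
```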

3.3 Feature vector optimization

The optimization of FVs focuses on selecting optimum parameters for the process of music genre classification. The first step is to reduce the original vector of parameters by eliminating strongly correlated parameters and replacing each such group with a single parameter, so that the derived feature vector consists only of uncorrelated features. The second step is to add new parameters representing a specific instrument typical for a given music genre.

Several optimization methods were applied using the Weka implementation (Weka library), which resulted in a new vector of parameters (VoP). As mentioned before, a VoP is understood here as an optimized FV.
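As an illustration of the first step (removing redundant, strongly correlated attributes), a simple correlation-based filter is sketched below; the threshold is arbitrary, and the Weka attribute-selection methods actually used in the study (Section 4.1) work differently.

```python
# Naive correlation filter: keep an attribute only if it is not strongly
# correlated with an already kept one.
import numpy as np

def drop_correlated(X, threshold=0.95):
    corr = np.abs(np.corrcoef(X, rowvar=False))   # attribute-by-attribute correlation
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep
```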

4 Experiments

4.1 Reducing feature vector

4.1.1 Attribute subset selection

Among the methods of feature vector reduction one may distinguish those based on attribute subset selection, such as Best First, Greedy Stepwise or Ranker. They were all tested on music excerpts extracted from the Synat database in the pre-study phase. For each experiment, different settings were considered (direction of search, number of non-improving nodes to consider before search termination). It turned out that the Best First method (Weka library), based on 5-node test results in the Forward search direction, returned the best results for reducing the FV of 173 parameters. The algorithm, as implemented in the Weka environment, searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility. Setting the number of consecutive non-improving nodes allows controlling the level of backtracking. Best First search uses the node depth as its cost.

As a result of such an optimization, a VoP_59 was obtained, containing the following 59 descriptors: (VoP_59={TC,ASE1,ASE4,ASE5,ASE21,ASE23,ASE25–29,ASEV1,ASEV29,ASE_MV,ASC,ASC_V,ASS_V,SFM1,SFMV1–3,SFMV5,SFMV6,SFMV8–12,SFMV14,MFCC2–7,MFCC9–13,MFCC17,MFCC20,MFCCV1–6,MFCCV8,MFCCV19,THR_2RMS_TOT,THR_3RMS_TOT,THR_1RMS_10FR_MEAN,THR_2RMS_10FR_MEAN,THR_3RMS_10FR_MEAN,1RMS_TCD,2RMS_TCD,ZCD_10FR_VAR,1RMS_TCD_10FR_MEAN}).

Principal Component Analysis (PCA) was also used to reduce the data dimensionality. PCA resulted in 74 new components that retain the information previously contained in the FV of 173 attributes. For the set of 74 components, a correctness similar to that of the Best First algorithm was obtained. Since these two data reduction approaches returned similar results, the smaller VoP_59 was employed for further analysis.
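A minimal sketch of such a PCA reduction is given below; the variance cut-off is an assumption made for illustration (in the study, 74 components were retained for the 173-attribute FV).

```python
# PCA-based dimensionality reduction: keep enough components to explain the
# requested fraction of the total variance (statistics fitted on training data).
from sklearn.decomposition import PCA

def pca_reduce(X_train, X_test, variance=0.99):
    pca = PCA(n_components=variance)   # float in (0, 1): fraction of variance to retain
    return pca.fit_transform(X_train), pca.transform(X_test), pca.n_components_
```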

4.1.2 Adding parameters extracted from separated tracks

The parameters selected for a specific instrument may play quite a significant part in the classification process. Thus, the next series of experiments involved adding parameters related to the separated music tracks to the feature vector. It was also checked that mixtures of more than three signals, in most cases, returned less promising results. Further analysis involved optimization of the VoPs using the Best First method.

The optimization process was performed according to the scheme presented below:

  1. a)

    The reduced VoP_p59 (as shown above in Section 4.1.1) was applied to each separated signal (the same attributes for the original and separated signals). In that way, new VoPs of 118 attributes were created for each mixture of two signals and then subjected to the Best First method.

  2. b)

    The Best First method was applied to each single FV (173 attributes) of a separated signal, and the result was added to VoP_p59 (original signal). As a result, new VoPs of different lengths (OH_p125, OD_p90, OP_p105, OT_p102, OS_p74) were created.

  3. c)

    The Best First method was applied to the VoP of a mixture of two FVs (173 (original) + 173 (separated) attributes). This way the following VoPs were extracted: OH_p79, OD_p65, OP_p61, OT_p60, OS_p60.

Even though not every one of the optimized VoPs listed below gave the best overall correctness of classification, they were chosen for further experiments:

  1. a)

    In the case of the OH signal, the correctness of classification for the following genres: Classical, Pop, Latin Music and New Age, was taken into consideration, with regard to the importance of the harmonic part for those genres. Based on those criteria, VoP_p59 (as shown above) was chosen for the OH signal.

  2. b)

    In the case of the Drum mixture, the Rock, Hard Rock & Metal and Alternative Rock genres were taken into consideration. Based on those criteria, VoP_p90 was chosen for the OD signal: (VoP_p90={VoP_59+drum_SC_v,drum_ASE1,7,16–19,29,drum_ASEv25,29,drum_ASE_MV,drum_SFM12–15,17,19,drum_SFMv18,drum_MFCCV1–4,drum_THR_3RMS_TOT,drum_THR_1RMS_10FR_MEAN,drum_THR_2RMS_10FR_MEAN,drum_THR_3RMS_10FR_MEAN,drum_THR_3RMS_10FR_VAR,drum_PEAK_RMS_TOT,drum_1RMS_TCD,drum_ZCD_10FR_VAR,drum_1RMS_TCD_10FR_VAR}).

  3. c)

    In the case of Piano, the Classical, Blues, Jazz and New Age genres were taken into consideration. Based on those criteria, VoP_p61 was chosen for the OP signal (OP VoP_p61={TC,ASE1,4,5,21,23,25–29,ASEV1,29,ASE_MV,ASC,ASC_V,ASS_V,SFM1,SFMV1–5,8–12,14,MFCC2–7,9–13,17,20,MFCCV1–6,8,19,THR_2RMS_TOT,THR_3RMS_TOT,THR_1RMS_10FR_MEAN,THR_2RMS_10FR_MEAN,THR_3RMS_10FR_MEAN,1RMS_TCD,2RMS_TCD,ZCD_10FR_VAR,1RMS_TCD_10FR_MEAN,piano_ASE_MV,piano_1RMS_TCD_10FR_MEAN}).

  4. d)

    In the case of Trumpet and Saxophone, the Blues, Jazz and New Age genres were taken into consideration. Therefore, OT VoP_p60 was chosen for OT and OS VoP_p60 for OS, correspondingly. Their structures are as follows: OT VoP_p60={OP VoP_p61+Trumpet_MFCC1} and OS VoP_p60={OP VoP_p61+sax_MFCC2}.

4.1.3 Results and discussion

All results presented further on refer to the Co-SVM-based classification method. Cross-validation with 3 folds was used at this stage of the experiments. Three iterations of cross-validation were performed, and the individual performance values were then aggregated by calculating the mean over the three rounds. Tables 2 and 3 show confusion matrices along with the effectiveness measures, i.e. Precision, Recall (True Positive Rate), F1 and Accuracy, obtained for VoP_59 for the original audio and for the OH (original + harmonic) mixture, respectively. As seen from Tables 2 and 3, there is a high degree of misclassification between Alternative Rock and Rock. It should be emphasized that a song in the Synat database was assigned to a particular music genre automatically, by its label; however, these two genres may be very similar both in perception and in automatic evaluation by a decision algorithm, and hence difficult to distinguish from each other.

Table 2 Confusion matrix for the Original signal based on VoP_p59 using Co-SVM classification method along with the effectiveness measures
Table 3 Confusion matrix for the OH (VoP_p59) for the Co-SVM classification method along with the effectiveness measures

Overall, an improvement in classification results occurred for almost all music genres when expanding the original audio-based feature vector by parameters derived from the harmonic, drum, or combined harmonic and drum signals. For example, for the OH signal, Alternative Rock was less often confused than in the case of the Original signal, Pop was less often confused with New Age than for the original signal, and similarly Blues with Country, Country with Rock, Latin with Country and with Jazz, Rock with New Age, etc. However, these results were statistically significant only in the case of the Alternative Rock, Blues, Classical, DanceDJ, Hard Rock & Metal, Latin Music, Rap & Hip Hop, and R&B genres (see Table 8). Even though the difference in classification accuracy for the Country, Jazz, New Age, Pop, and Rock genres is still visible when using the expanded VoPs, the results obtained are not statistically significant.

The next step of the experiments was to use different VoPs in the classification process depending on the type of music genre, taking into account the pre-study classification results. Therefore, experiments were designed according to Section 4.1.2, i.e. by selecting the most effective VoPs and Co-SVM settings for the particular mixture of separated signals. Tables 4, 5, 6 and 7 show the effectiveness measures, i.e. Precision, Recall, F1 and Accuracy, obtained for the OH, OD, OP, OT and OS signals. It should be emphasized that both the Precision and Recall measures have high values for most music classes, except for Alternative Rock, Blues and Pop. Even though the parameters selected for OH, OD, OP, OT and OS seem to return quite similar results, the list of attributes related to the original signal differs between these mixtures; this especially concerns the OH and OD VoPs. It was also observed that for the VoPs of the OP, OT and OS mixtures, the difference lies only in the attributes related to the instrument part, i.e. two parameters for piano (piano_ASE_MV and piano_1rms_TCD_10FR_MEAN), one selected for trumpet (trum_mfcc1), and one for saxophone (sax_MFCC2).

Table 4 Effectiveness measures for the OD signal based on VoP_p90 using Co-SVM classification method [%]
Table 5 Effectiveness measures for the OP signal based on VoP_p61 using Co-SVM classification method [%]
Table 6 Effectiveness measures for the OT signal based on VoP_p60 using Co-SVM classification method [%]
Table 7 Effectiveness measures for the OS signal based on VoP_p60 using Co-SVM classification method [%]

To confirm the statistical significance of the results, Student’s t-test was carried out. A value of the t statistic above 2.201 indicates that the null hypothesis can be rejected; thus, all results with values above this threshold are statistically significant. The statistical significance level was set at 0.05. Table 8 contains Student’s t-test values for all combinations of VoPs. As observed from Table 8, the statistical analysis returned values above the threshold for the Alternative Rock, Blues, Classical, DanceDJ, Hard Rock & Metal, Latin, Rap & Hip Hop and R&B genres. In the case of Pop and New Age, using the mixtures of OP and OT signals, respectively, makes the difference statistically significant.

Table 8 Student’s T values under the null hypothesis for independent samples (statistical significance threshold set at 0.05)
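As an illustration of the significance testing described above, an independent-samples Student’s t-test on repeated accuracy measurements for two configurations can be computed as sketched below (SciPy-based, with the same 0.05 significance level; the input arrays are assumed, for the example, to hold per-run accuracies).

```python
# Independent-samples Student's t-test comparing two classification configurations.
from scipy import stats

def compare_configurations(acc_original, acc_expanded, alpha=0.05):
    t_stat, p_value = stats.ttest_ind(acc_expanded, acc_original)
    return t_stat, p_value, p_value < alpha   # True: the difference is statistically significant
```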

The empirical study performed by the authors brought several findings:

  • In most cases of the signal mixtures, an improvement of the effectiveness measures was observed in comparison to the original signal.

  • For each of the genres in which the harmonic part plays an important role (Classical, Latin Music, New Age and Pop), an improvement of the TPR values is observed for the OH signal. For three of the four selected genres (Classical vs. Latin Music, vs. New Age and vs. Pop), an improvement of Precision is also observed; in particular, an increase of 12.53 percentage points in Precision is achieved for New Age. The Jazz genre deserves special attention, as its Precision was higher by over 4.27 percentage points and its TPR by over 3.34 percentage points; likewise, for DanceDJ the TPR was over 3.1 percentage points higher. The improvement for Jazz should be stressed especially in the context of the lower rate of misclassification between Pop and Jazz. It was also shown that for genres such as Rock and Hard Rock & Metal a decrease in correctness was observed for OH, which confirms that the harmonic part does not play an important role for those genres. Surprisingly, Alternative Rock gained over 6 percentage points in TPR. The behavior of Blues is also interesting: its TPR improved by over 4.5 percentage points, while its Precision decreased by almost 5 percentage points.

  • An improvement in the Recall (TPR) value for Alternative Rock was obtained in the case of the OD signal. Surprisingly, higher Precision for New Age (over 7 percentage points) and higher Recall (TPR) for Dance & DJ (almost 4 percentage points) were also obtained. In the case of classes such as Latin Music, New Age, Pop and Classical, a slight improvement in classification was also observed. This indicates that the lack of a drum element (the percussion signal was present in only 89.6% of the elements of the input audio dataset) is itself a piece of information for the classifier that is significant in the training process.

  • The improvement in classification for the genres in which the piano plays an important role was not so visible for the OP signal. Precision was improved for several genres (e.g. Classical, Dance & DJ, Jazz and Latin, but also Blues, Alternative Rock, etc.), along with the Recall (TPR) values. An improvement of over 3 percentage points was obtained for the DanceDJ genre.

  • A slight improvement in Precision is observed for OT in the case of Hard Rock & Metal, Latin and New Age, and for OS in the case of New Age, Rap & Hip Hop and R&B. This is also visible in the Recall (TPR) values (e.g. DanceDJ, R&B).

5 Summary

The article focuses on automatic music genre classification using the original and separated tracks. The instrument separation approach was selected to improve the results of music genre classification, and in particular to decrease the misclassification between selected genres in the context of the influence of a specific instrument on selected genres. For that purpose, a Non-Negative Matrix Factorization (NMF) method was adapted from the literature, and a new way of using the separated and original signals for parameterization was proposed. Since other researchers (Lampropoulos et al. 2005; Rump et al. 2010) applied just one specific signal in the process of music classification, which did not result in high accuracy, the authors’ approach was based on creating a new VoP, i.e. extending the original FV by new attributes representing a specific instrument. In this way, five different separated audio signals were obtained: harmonic, drum, piano, trumpet and saxophone. It should also be noted that this is a multi-instrument separation process, as the “drum” signal consists of several instruments: snare drum, bass drum, tom-tom, timpani, crash cymbals, etc. With regard to the piano, we have to keep in mind that a classical piano has a quite different kind of sound than, e.g., a jazz piano. The VoP consisting of the original audio and separated piano (OP) did not improve the results for classical music but decreased the misclassification of Jazz.

In the analysis performed, the overall correctness of classification was higher in almost every case of a mixed VoP in comparison to the Original signal. It was also observed that a specific mix of signals improved the correctness of classification for the genres in which that signal played an important part. This means that for genres in which harmonic instruments play an important part, e.g. New Age, Pop and Latin Music, the correctness of classification increased. The same tendency was observed for the other mixed VoPs: the OD signal for Alternative Rock, Hard Rock & Metal, as well as DanceDJ and New Age. In the case of the OP signal, an improvement in the classification of Blues, Classical and New Age was also visible. Overall, a decrease in misclassification between similar, as well as dissimilar, genres was obtained.

In the process of the analysis, over 8,000 music tracks representing 13 music genres were extracted from the Synat database. Although many research works have been published in the area of music genre classification, most of them, with some exceptions, analyze only a few genres represented by ∼1,000 songs in total. The results show that the overall classification accuracy obtained by the authors reaches ∼72%, which is ∼10 percentage points better than the results reported in the literature, i.e. ∼60% for 10 musical genres (Tzanetakis et al. 2002) and ∼57.8% for 13 music genres (Burred 2014).