1 Introduction

The problem of identifying, discriminating between, and learning the criteria of music genres or styles—music genre recognition (MGR)—has motivated much work since 1995 (Matityaho and Furst 1995), and even earlier, e.g., Porter and Neuringer (1984). Indeed, a recent review of MGR by Fu et al. (2011) writes, “Genre classification is the most widely studied area in [music information retrieval].” MGR research is now making an appearance in textbooks (Lerch 2012). Most published studies of MGR systems report classification performance significantly better than chance, and sometimes as well as or better than humans. For a benchmark dataset of music excerpts singly-labeled in ten genres (GTZAN, Tzanetakis and Cook 2002; Sturm 2013b), reported classification accuracies have risen from 61 % (Tzanetakis and Cook 2002) to above 90 %, e.g., Guaus (2009), Panagakis et al. (2009a, b), Panagakis and Kotropoulos (2010) and Chang et al. (2010). Indeed, as Bergstra et al. (2006a) write, “Given the steady and significant improvement in classification performance since 1997, we wonder if automatic methods are not already more efficient at learning genres than some people.” This performance increase merits a closer look at what is working in these systems, and motivates re-evaluating the argument that genre exists to a large extent outside of the acoustic signal itself (Fabbri 1982; McKay and Fujinaga 2006; Wiggins 2009). Most exciting, it might also illuminate how people hear and conceptualize the complex phenomenon of “music” (Aucouturier and Bigand 2013). It might be too soon to ask such questions, however.

Recent work (Sturm 2012b; Marques et al. 2010, 2011a) shows that an MGR system can act as if genre is not what it is recognizing, even if it shows high classification accuracy. In a comprehensive review of the MGR literature (Sturm 2012a), we find that over 91 % of papers with an experimental component (397 of 435 papers) evaluate MGR systems by classifying music excerpts and comparing the labels to the “ground truth,” and over 82 % of 467 published works cite classification accuracy as a figure of merit (FoM). Of those that employ this approach to evaluation, 47 % employ only this approach. Furthermore, we find several cases of methodological errors leading to inflated accuracies: those of Panagakis et al. (2009a, b) and Panagakis and Kotropoulos (2010) come from accidentally using the true labels in classification (private correspondence with Y. Panagakis) (Sturm and Noorzad 2012); those of Chang et al. (2010) are irreproducible, and contradict results seen in other areas applying the same technique (Sturm 2013a); and those of Bağci and Erzin (2007) appear highly unlikely given an analysis of their approach (Sturm and Gouyon 2013, unpublished). One must wonder if the “progress” in MGR seen since 1995 is not due to solving the problem: can a system have a high classification accuracy in some datasets yet not even address the problem at all?

We show here that classification accuracy does not reliably reflect the capacity of an MGR system to recognize genre. Furthermore, recall, precision and confusion tables are still not enough. We show these FoMs—all of which have been used in the past to rank MGR systems, e.g., Chai and Vercoe (2001), Tzanetakis and Cook (2002), Aucouturier and Pachet (2003), Burred and Lerch (2004), Turnbull and Elkan (2005), Flexer (2006), DeCoro et al. (2007), Benetos and Kotropoulos (2008), Panagakis et al. (2009b), Bergstra et al. (2010), Fu et al. (2011) and Ren and Jang (2012) citing one work from each year since 2001—do not reliably reflect the capacity of an MGR system to recognize genre. While these claims have not been made overt in any of the 467 references we survey (Sturm 2012a), shades of them have appeared before (Craft et al. 2007; Craft 2007; Lippens et al. 2004; Wiggins 2009; Seyerlehner et al. 2010; Sturm 2012b), which argue for evaluating performance in ways that account for the ambiguity of genre being in large part a subjective construction (Fabbri 1982; Frow 2005). We go further and argue that the evaluation of MGR systems—the experimental designs, the datasets, and the FoMs—and indeed, the development of future systems, must embrace the fact that the recognition of genre is to a large extent a musical problem, and must be evaluated as such. In short, classification accuracy is not enough to evaluate the extent to which an MGR system addresses what appears to be one of its principal goals: to produce genre labels indistinguishable from those humans would produce.

1.1 Arguments

Some argue that since MGR is now replaced by, or is a subproblem of, the more general problem of automatic tagging (Aucouturier and Pampalk 2008; Bertin-Mahieux et al. 2010), work in MGR is irrelevant. However, genre is one of the most used descriptors of music (Aucouturier and Pachet 2003; Scaringella et al. 2006; McKay and Fujinaga 2006): in 2007, nearly 70 % of the tags on last.fm were genre labels (Bertin-Mahieux et al. 2010); and a not insignificant portion of the tags in the Million Song Dataset are genre labels (Bertin-Mahieux et al. 2011; Schindler et al. 2012). Some argue that automatic tagging is more realistic than MGR because multiple tags can be given rather than the single one in MGR, e.g., Panagakis et al. (2010b), Marques et al. (2011a), Fu et al. (2011) and Seyerlehner et al. (2012). This claim and its origins are mysterious because nothing about MGR—the problem of identifying, discriminating between, and learning the criteria of music genres or styles—naturally restricts the number of genre labels people use to describe a piece of music. Perhaps this imagined limitation of MGR comes from the fact that of 435 works with an experimental component we survey (Sturm 2012a), we find only ten that use a multilabel approach (Barbedo and Lopes 2008; Lukashevich et al. 2009; Mace et al. 2011; McKay 2004; Sanden 2010; Sanden and Zhang 2011a, b; Scaringella et al. 2006; Tacchini and Damiani 2011; Wang et al. 2009). Perhaps it comes from the fact that most of the private and public datasets so far used in MGR assume a model of one genre per musical excerpt (Sturm 2012a). Perhaps it comes from the assumption that genre works in such a way that an object belongs to a genre, rather than uses a genre (Frow 2005).

Some argue that, given the ambiguity of genre and the observed lack of human consensus about such matters, MGR is an ill-posed problem (McKay and Fujinaga 2006). However, people often do agree, even under surprising constraints (Gjerdingen and Perrott 2008; Krumhansl 2010; Mace et al. 2011). Researchers have compiled MGR datasets with validation from listening tests, e.g., Lippens et al. (2004) and Meng et al. (2005); and very few researchers have overtly argued against any of the genre assignments of the most-used public dataset for MGR (Sturm 2012a, 2013b). Hence, MGR does not always appear to be an ill-posed problem, since people often use genre to describe and discuss music in consistent ways; and, not to forget, MGR makes no restriction on the number of genres relevant for describing a particular piece of music. Some argue that though people show some consistency in using genre, they are making decisions based on information not present in the audio signal, such as composer intention or marketing strategies (McKay and Fujinaga 2006; Bergstra et al. 2006b; Wiggins 2009). However, there exist some genres or styles that appear distinguishable and identifiable from the sound alone, e.g., by musicological criteria like tempo (Gouyon and Dixon 2004), chord progressions (Anglade et al. 2010), instrumentation (McKay and Fujinaga 2005), lyrics (Li and Ogihara 2004), and so on.

Some argue that MGR is really just a proxy problem that has little value in and of itself; and that the purpose of MGR is really to provide an efficient means to gauge the performance of features and algorithms solving the problem of measuring music similarity (Pampalk 2006; Schedl and Flexer 2012). This point of view, however, is not evident in much of the MGR literature, e.g., the three reviews devoted specifically to MGR (Aucouturier and Pachet 2003; Scaringella et al. 2006; Fu et al. 2011), the work of Tzanetakis and Cook (2002), Barbedo and Lopes (2008), Bergstra et al. (2006a), Holzapfel and Stylianou (2008), Marques et al. (2011b), Panagakis et al. (2010a), Benetos and Kotropoulos (2010), and so on. It is thus not idiosyncratic to claim that one purpose of MGR could be to identify, discriminate between, and learn the criteria of music genres in order to produce genre labels that are indistinguishable from those humans would produce. One might argue, “MGR does not have much value since most tracks today are already annotated with genre.” However, genre is not a fixed attribute like artist or instrumentation (Fabbri 1982; Frow 2005); and it is certainly not an attribute of only commercial music infallibly ordained by composers, producers, and/or consumers using perfect historical and musicological reflection. One cannot assume such metadata are static and unquestionable, or even that such information is useful, e.g., for computational musicology (Collins 2012).

Some might argue that the reasons MGR work is still published are that: 1) it provides a way to evaluate new features; and 2) it provides a way to evaluate new approaches to machine learning. While such a claim about publication is tenuous, we argue that it makes little sense to evaluate features or machine learning approaches without considering for what they are to be used, and then designing and using appropriate procedures for evaluation. We show in this paper that the typical ways in which new features and machine learning methods are evaluated for MGR provide little information about the extent to which they address the fundamental problem of recognizing music genre.

1.2 Organization and conventions

We organize this article as follows. Section 2 distills along three dimensions the variety of approaches that have been used to evaluate MGR: experimental design, datasets, and FoMs. We delimit our study to work specifically addressing the recognition of music genre and style, and not tags in general, i.e., the 467 works we survey (Sturm 2012a). We show most work in MGR reports classification accuracy from a comparison of predicted labels to “ground truths” of private datasets. The third section reviews three state-of-the-art MGR systems that show high classification accuracy in the most-used public music genre dataset GTZAN (Tzanetakis and Cook 2002; Sturm 2013b). In the fourth section, we evaluate the performance statistics of these three systems, starting from high-level FoMs such as classification accuracy, recall and precision, continuing to mid-level class confusions. In the fifth section, we evaluate the behaviors of these systems by inspecting low-level excerpt misclassifications, and performing a listening test showing that the behaviors of all three systems are highly distinguishable from those of humans. We conclude by discussing our results and further criticisms, and by looking forward to the development and practice of better means for evaluation, not only in MGR, but also for the more general problem of music description.

We use the following conventions throughout. When we refer to Disco, we are referring to those 100 excerpts in the GTZAN category named “Disco” without advocating that they are exemplary of the genre disco. The same applies for the excerpts of the other nine categories of GTZAN. We capitalize the categories of GTZAN, e.g., Disco, capitalize and quote labels, e.g., “Disco,” but do not capitalize genres, e.g., disco. A number following a category in GTZAN refers to the identifying number of its excerpt filename. Altogether, for example: “it appears this system does not recognize disco because it classifies Disco 72 as ‘Metal’.”

2 Evaluation in music genre recognition research

Surprisingly little has been written about evaluation, i.e., experimental design, data, and FoMs, with respect to MGR (Sturm 2012a). An experimental design is a method for testing a hypothesis. Data is the material on which a system is tested. A FoM reflects the confidence in the hypothesis after conducting an experiment. Of three review articles devoted in large part to MGR (Aucouturier and Pachet 2003; Scaringella et al. 2006; Fu et al. 2011), only Aucouturier and Pachet (2003) give a brief paragraph on evaluation. The work by Vatolkin (2012) provides a comparison of various performance statistics for music classification. Other works (Berenzweig et al. 2004; Craft et al. 2007; Craft 2007; Lippens et al. 2004; Wiggins 2009; Seyerlehner et al. 2010; Sturm 2012b) argue for measuring performance in ways that take into account the natural ambiguity of music genre and similarity. For instance, we (Sturm 2012b), Craft et al. (2007), and Craft (2007) argue for richer experimental designs than having a system apply a single label to music with a possibly problematic “ground truth.” Flexer (2006) criticizes the absence of formal statistical testing in music information research, and provides an excellent tutorial based upon MGR for how to apply statistical tests. Derived from our survey (Sturm 2012a), Fig. 1 shows the annual number of publications in MGR, and the proportion that use formal statistical testing in comparing MGR systems.

Fig. 1 Annual numbers of references in MGR, divided into those that do and do not use formal statistical tests for making comparisons (Sturm 2012a). Only about 12 % of references in MGR employ formal statistical testing; and only 19.4 % of the work (91 papers) appears at the Conference of the International Society for Music Information Retrieval

Table 1 summarizes ten experimental designs we find in our survey (Sturm 2012a). Here we see that the most widely used design by far is Classify. The least-used experimental design is Compose, which appears in only three works (Cruz and Vidal 2003; Cruz and Vidal 2008; Sturm 2012b). Almost half of the works we survey (213 references) use only one experimental design; and of these, 47 % employ Classify. We find only 36 works explicitly mention evaluating with an artist or album filter (Pampalk et al. 2005; Flexer 2007; Flexer and Schnitzer 2009; Flexer and Schnitzer 2010). We find only 12 works using human evaluation for gauging the success of a system.

Table 1 Ten experimental designs of MGR, and the percentage of references having an experimental component (435 references) in our survey (Sturm 2012a) that employ them

Formally justifying a misclassification as an error is a task that MGR research typically defers to the “ground truth” of a dataset, whether created by a listener (Tzanetakis and Cook 2002), the artist (Seyerlehner et al. 2010), music vendors (Gjerdingen and Perrott 2008; Ariyaratne and Zhang 2012), the collective agreement of several listeners (Lippens et al. 2004; García et al. 2007), professional musicologists (Abeßer et al. 2012), or multiple tags given by an online community (Law 2011). Table 2 shows the datasets used by references in our survey (Sturm 2012a). Overall, 79 % of this work uses audio data or features derived from audio data, about 19 % uses symbolic music data, and 6 % uses features derived from other sources, e.g., lyrics, the WWW, and album art. (Some works use more than one type of data.) About 27 % of the work evaluates MGR systems using two or more datasets. While more than 58 % of the works use datasets that are not publicly available, the most-used public dataset is GTZAN (Tzanetakis and Cook 2002; Sturm 2013b).

Table 2 Datasets used in MGR, the type of data they contain, and the percentage of experimental work (435 references) in our survey (Sturm 2012a) that use them

Table 3 shows the FoMs used in the works we survey (Sturm 2012a). Given that Classify is the most-used design, it is not surprising that mean accuracy appears most often. When it appears, only about 25 % of the time is it accompanied by a standard deviation (or equivalent). We find 6 % of the references report mean accuracy as well as recall and precision. Confusion tables are the next most prevalent FoM; and when one appears, it is not accompanied by any kind of musicological reflection about half the time. Of the works that use Classify, we find about 44 % of them report one FoM only, and about 53 % report more than one FoM. At least six works report human-weighted ratings of classification and/or clustering results.

Table 3 Figures of merit (FoMs) of MGR, their description, and the percentage of work (467 references) in our survey (Sturm 2012a) that use them

One might argue that the evaluation above does not clearly reflect that most papers on automatic music tagging report recall, precision, and F-measures, and not mean accuracy. However, in our survey we do not consider work in automatic tagging unless part of the evaluation specifically considers the resulting genre tags. Hence, we see that most work in MGR uses classification accuracy (the experimental design Classify with mean accuracy as a FoM) in private datasets, or GTZAN (Tzanetakis and Cook 2002; Sturm 2013b).

3 Three state-of-the-art systems for music genre recognition

We now discuss three MGR systems that appear to perform well with respect to state-of-the-art classification accuracy in GTZAN (Tzanetakis and Cook 2002; Sturm 2013b), and which we evaluate in later sections.

3.1 AdaBoost with decision trees and bags of frames of features (AdaBFFs)

AdaBFFs was proposed by Bergstra et al. (2006a), and performed the best in the 2005 MIREX MGR task (MIREX 2005). It combines weak classifiers trained by multiclass AdaBoost (Freund and Schapire 1997; Schapire and Singer 1999), which creates a strong classifier by counting “votes” of weak classifiers given an observation \(\mathbf{x}\). With the features in \(\mathbb{R}^M\) of a training set labeled in K classes, iteration l adds a weak classifier \(\mathbf{v}_l(\mathbf{x}): \mathbb{R}^M \to \{-1,1\}^K\) and weight \(w_l \in [0,1]\) to minimize the total prediction error. A positive element of a vote vector favors the corresponding class; a negative element disfavors it. After L training iterations, the classifier is the function \(\mathbf{f}(\mathbf{x}): \mathbb{R}^M \to [-1,1]^K\) defined

$$ \mathbf{f}(\mathbf{x}) := \sum_{l=1}^L \mathit{w}_l \mathbf{v}_l(\mathbf{x})\Bigr/\sum_{l=1}^L \mathit{w}_l. $$
(1)

For an excerpt of recorded music consisting of a set of features \(\mathcal{X} := \{\mathbf{x}_i\}\), AdaBFFs picks the class \(k \in \{1, \ldots, K\}\) associated with the maximum element in the sum of weighted votes:

$$ f_k(\mathcal{X}) := \sum_{i=1}^{|\mathcal{X}|} [\mathbf{f}(\mathbf{x}_i)]_k \label{eq:Adaexcerptprob} $$
(2)

where \([\mathbf{a}]_k\) is the kth element of the vector \(\mathbf{a}\).
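To make this decision rule concrete, here is a minimal numpy sketch of (1) and (2); it assumes the weak-learner votes and weights have already been obtained (in our experiments they come from the multiboost package described below), and the function name and array layout are ours, for illustration only.

```python
import numpy as np

def adabffs_decision(votes, weights):
    """Aggregate weighted weak-learner votes over the frames of one excerpt.

    votes   : array, shape (L, n_frames, K), vote of weak learner l for
              frame i and class k, with entries in [-1, 1]
    weights : array, shape (L,), the AdaBoost weights w_l
    Returns the index of the class selected by (2).
    """
    # f(x_i) for every frame: weighted sum of votes over the L weak
    # learners, normalized by the sum of the weights, as in (1)
    f = np.tensordot(weights, votes, axes=(0, 0)) / weights.sum()  # (n_frames, K)
    # f_k(X): sum the per-frame scores over the excerpt, as in (2)
    return int(np.argmax(f.sum(axis=0)))
```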

We use the “multiboost package” (Benbouzid et al. 2012) with decision trees as the weak learners, and AdaBoost.MH (Schapire and Singer 1999) as the strong learner. The features we use are computed from a sliding Hann window of 46.4 ms with 50 % overlap: 40 Mel-frequency cepstral coefficients (MFCCs) (Slaney 1998), zero crossings, mean and variance of the magnitude Fourier transform (centroid and spread), 16 quantiles of the magnitude Fourier transform (rolloff), and the error of a 32nd-order linear predictor. We disjointly partition the set of features into groups of 130 consecutive frames, and then compute for each group the means and variances of each dimension. For a 30-s music excerpt, this produces 9 feature vectors of 120 dimensions. Bergstra et al. (2006a) report this approach obtains a classification accuracy of up to 83 % in GTZAN. In our reproduction of the approach (Sturm 2012b), using stumps (single-node decision trees) as weak classifiers, we achieve a classification accuracy of up to 77.6 % in GTZAN. We increase this to about 80 % by using two-node decision trees.
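As a rough sketch of the bag-of-frames aggregation just described (the frame-level feature extraction itself is not shown, and the frame dimensions follow the description above), one might compute the texture windows as:

```python
import numpy as np

def texture_windows(frame_features, group_size=130):
    """Summarize disjoint groups of consecutive frames by their means and variances.

    frame_features : array, shape (n_frames, 60), one feature vector per
                     46.4 ms analysis frame (40 MFCCs, zero crossings,
                     centroid, spread, 16 rolloff quantiles, LPC error)
    Returns an array of shape (n_groups, 120).
    """
    n_groups = frame_features.shape[0] // group_size
    groups = frame_features[:n_groups * group_size].reshape(n_groups, group_size, -1)
    return np.concatenate([groups.mean(axis=1), groups.var(axis=1)], axis=1)

# A 30-s excerpt framed at 46.4 ms with 50 % overlap gives roughly 1290
# frames, hence about 9 texture vectors of 120 dimensions.
```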

3.2 Sparse representation classification with auditory temporal modulations (SRCAM)

SRCAM (Panagakis et al. 2009b; Sturm and Noorzad 2012) uses sparse representation classification (Wright et al. 2009) in a dictionary composed of auditory features. This approach is reported to have classification accuracies above 90 % (Panagakis et al. 2009a, b; Panagakis and Kotropoulos 2010), but those results arise from a flaw in the experiment that inflates accuracies from around 60 % (Sturm and Noorzad 2012) (private correspondence with Y. Panagakis). We modify the approach to produce classification accuracies above 80 % (Sturm 2012b). Each feature comes from a modulation analysis of a time-frequency representation; and for a 30-s sound excerpt with sampling rate 22,050 Hz, the feature dimensionality is 768. To create a dictionary, we either normalize the set of features (mapping all values in each dimension to [0,1] by subtracting the minimum value and dividing by the largest difference), or standardize them (making all dimensions have zero mean and unit variance).
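A minimal numpy sketch of these two scalings, with the transformation parameters estimated from the training features only (the function names and the placeholder F_train in the comments are ours):

```python
import numpy as np

def fit_normalizer(F):
    """Per-dimension [0,1] scaling parameters from training features
    (F has shape (768, N), one feature vector per column)."""
    lo = F.min(axis=1, keepdims=True)
    rng = F.max(axis=1, keepdims=True) - lo
    rng[rng == 0] = 1.0  # guard against constant dimensions
    return lo, rng

def fit_standardizer(F):
    """Per-dimension zero-mean, unit-variance scaling parameters."""
    mu = F.mean(axis=1, keepdims=True)
    sd = F.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    return mu, sd

# The dictionary is built from the scaled training features, e.g.,
#   lo, rng = fit_normalizer(F_train); D = (F_train - lo) / rng
# and every test feature is transformed with the same parameters.
```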

With the dictionary \(\mathbf{D} := [\mathbf{d}_1 | \mathbf{d}_2 | \cdots | \mathbf{d}_N]\), and a mapping of columns to class identities \(\bigcup_{k=1}^K \mathcal{I}_k = \{1,\dotsc,N\}\), where \(\mathcal{I}_k\) specifies the columns of \(\mathbf{D}\) belonging to class k, SRCAM finds for a feature vector \(\mathbf{x}^\prime\) (which is the feature \(\mathbf{x}\) transformed by the same normalization or standardization approach used to create the dictionary) a sparse representation \(\mathbf{s}\) by solving

$$ \min \|\mathbf{s}\|_1 \; \textrm{subject to} \; \| \mathbf{x}^\prime - \mathbf{Ds}\|_2^2 \le \varepsilon^2 \label{eq:BPDN} $$
(3)

for an \(\varepsilon^2 > 0\) we specify. SRCAM then defines the set of class-restricted weights \(\{\mathbf{s}_k \in \mathbb{R}^N\}_{k\in \{1, \ldots, K\}}\)

$$[\mathbf{s}_k]_n := \begin{cases} [\mathbf{s}]_n, & n \in \mathcal{I}_k \\ 0, & \textrm{else}. \end{cases} $$
(4)

Thus, \(\mathbf{s}_k\) are the weights in \(\mathbf{s}\) specific to class k. Finally, SRCAM classifies \(\mathbf{x}^\prime\) by finding the class-dependent weights giving the smallest error

$$ \hat{k}(\mathbf{x}^\prime) := \arg\min_{k\in \{1, \ldots, K\} } \|\mathbf{x}^\prime - \mathbf{Ds}_k\|_2^2. \label{eq:nnapprox} $$
(5)

We define the confidence of SRCAM in assigning class k to x by comparing the errors:

$$ C(k|\mathbf{x}) := \frac{\max_{k'} J_{k'} - J_k}{\sum_l [\max_{k'} J_{k'} - J_{l}]} \label{eq:SRCconfidence} $$
(6)

where \(J_k := \|\mathbf{x}^\prime - \mathbf{Ds}_k\|_2^2\). Thus, \(C(k|\mathbf{x}) \in [0,1]\), where 1 is certainty.
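Given a sparse code \(\mathbf{s}\), the classification and confidence steps (4)-(6) reduce to a few lines of numpy; the following is a hedged sketch (the solver for (3), SPGL1 in our experiments, is not shown, and the function name is ours):

```python
import numpy as np

def srcam_classify(x_prime, D, s, class_index):
    """Classify a transformed feature x' from its sparse code s (a solution
    of (3), e.g., computed with SPGL1) in the dictionary D.

    x_prime     : array, shape (M,)
    D           : array, shape (M, N), one training feature per column
    s           : array, shape (N,)
    class_index : array, shape (N,), class of each column, in {0, ..., K-1}
    Returns the selected class (5) and the confidence (6).
    """
    K = int(class_index.max()) + 1
    # class-restricted reconstruction errors J_k = ||x' - D s_k||_2^2, as in (4)-(5)
    J = np.array([np.sum((x_prime - D @ np.where(class_index == k, s, 0.0)) ** 2)
                  for k in range(K)])
    k_hat = int(np.argmin(J))
    gaps = J.max() - J                     # larger gap = smaller error
    confidence = gaps[k_hat] / gaps.sum()  # (6)
    return k_hat, confidence
```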

3.3 Maximum a posteriori classification of scattering coefficients (MAPsCAT)

MAPsCAT uses the novel features proposed by Mallat (2012), whose use for MGR was first proposed by Andén and Mallat (2011). MAPsCAT applies these features within a Bayesian framework, which seeks to choose the class with minimum expected risk given an observation \(\mathbf{x}\). Assuming the costs of all misclassifications are the same, and that all classes are equally likely, the Bayesian classifier becomes the maximum a posteriori (MAP) classifier (Theodoridis and Koutroumbas 2009):

$$ k^* = \arg\max_{k\in\{1, \ldots, K\}} P[\mathbf{x}|k] P(k) $$
(7)

where \(P[\mathbf{x}|k]\) is the conditional model of the observations for class k, and \(P(k)\) is a prior. MAPsCAT assumes \(P[\mathbf{x}|k] \sim \mathcal{N}(\boldsymbol{\mu}_k, \mathbf{C}_k)\), i.e., the observations from class k are distributed multivariate Gaussian with mean \(\boldsymbol{\mu}_k\) and covariance \(\mathbf{C}_k\). MAPsCAT estimates these parameters using unbiased minimum mean-squared error estimation and the training set. When a music excerpt produces several features \(\mathcal{X} := \{\mathbf{x}_i\}\), MAPsCAT assumes independence between them, and picks the class maximizing the sum of the log posteriors:

$$ p_k(\mathcal{X}) := \log P(k) + \sum_{i=1}^{|\mathcal{X}|} \log P[\mathbf{x}_i|k]. \label{eq:MAPsCATpost} $$
(8)
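A hedged sketch of this Gaussian MAP classification with scipy follows (a generic implementation of (7) and (8); the scattering features themselves come from the scatterbox implementation described below, and allow_singular is our own safeguard against poorly conditioned covariance estimates):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X_train, y_train, K):
    """Estimate (mu_k, C_k) for each class from the training features."""
    return [(X_train[y_train == k].mean(axis=0),
             np.cov(X_train[y_train == k], rowvar=False))
            for k in range(K)]

def map_classify(X_excerpt, models, priors):
    """Pick the class maximizing the summed log posterior, as in (8),
    treating the feature vectors of one excerpt as independent."""
    scores = [np.log(priors[k]) +
              multivariate_normal.logpdf(X_excerpt, mean=mu, cov=C,
                                         allow_singular=True).sum()
              for k, (mu, C) in enumerate(models)]
    return int(np.argmax(scores))
```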

Scattering coefficients are attractive features for classification because they are designed to be invariant to particular transformations, such as translation and rotation, to preserve distances between stationary processes, and to embody both large- and short-scale structures (Mallat 2012). One computes these features by convolving the modulus of successive wavelet decompositions with the scaling wavelet. We use the “scatterbox” implementation (Andén and Mallat 2012) with a second-order decomposition, a filter q-factor of 16, and a maximum scale of 160. For a 30-s sound excerpt with sampling rate 22,050 Hz, this produces 40 feature vectors of dimension 469. Andén and Mallat (2011) report these features used with a support vector machine obtain a classification accuracy of 82 % in GTZAN. We obtain comparable results.

4 Evaluating the performance statistics of MGR systems

We now evaluate the performance of AdaBFFs, SRCAM and MAPsCAT using Classify and mean accuracy in GTZAN (Tzanetakis and Cook 2002). Despite the fact that GTZAN is a problematic dataset—it has many repetitions, mislabelings, and distortions (Sturm 2013b)—we use it for four reasons: 1) it is the public benchmark dataset most used in MGR research (Table 2); 2) it was used in the initial evaluation of AdaBFFs (Bergstra et al. 2006a), SRCAM (Panagakis et al. 2009b), and the features of MAPsCAT (Andén and Mallat 2011); 3) evaluations of MGR systems using GTZAN and other datasets show comparable performance, e.g., Moerchen et al. (2006), Ren and Jang (2012), Dixon et al. (2010), Schindler and Rauber (2012); and 4) since its contents and faults are now well-studied (Sturm 2013b), we can appropriately handle its problems, and in fact use them to our advantage.

We test each system by 10 trials of stratified 10-fold cross-validation (10×10 fCV). For each fold, we test all systems using the same training and testing data. Every music excerpt is thus classified ten times by each system trained with the same data. For AdaBFFs, we run AdaBoost for 4000 iterations, and test decision trees of both two nodes and one node (stumps). For SRCAM, we test both standardized and normalized features, and solve its inequality-constrained optimization problem (3) for \(\varepsilon^2 = 0.01\) using SPGL1 (van den Berg and Friedlander 2008) with at most 200 iterations. For MAPsCAT, we test systems trained with class-dependent covariances (each \(\mathbf{C}_k\) can be different) or total covariance (all \(\mathbf{C}_k\) the same). We define all priors to be equal. It might be that the size of this dataset is too small for some approaches. For instance, since for SRCAM one excerpt produces a 768-dimensional feature, we might not expect it to learn a good model from only 90 excerpts. However, we start as many have before: assuming GTZAN is large enough and has enough integrity for evaluating an MGR system.
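As an illustration of this protocol, a minimal scikit-learn sketch (the classifier factory stands in for any of AdaBFFs, SRCAM, or MAPsCAT, whose actual implementations are described in Section 3; fixing the seed reuses the same folds across systems):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_10x10fcv(make_classifier, X, y, n_trials=10, n_folds=10, seed=0):
    """10 trials of stratified 10-fold CV: every excerpt is classified once
    per trial. Returns predictions with shape (n_trials, n_excerpts)."""
    preds = np.empty((n_trials, len(y)), dtype=int)
    for t in range(n_trials):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + t)
        for train_idx, test_idx in skf.split(X, y):
            clf = make_classifier()            # a fresh, untrained classifier
            clf.fit(X[train_idx], y[train_idx])
            preds[t, test_idx] = clf.predict(X[test_idx])
    return preds
```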

4.1 Evaluating classification accuracy

Table 4 shows classification accuracy statistics for two configurations of each system presented above. In their review of several MGR systems, Fu et al. (2011) compare the performance of several algorithms using only classification accuracy in GTZAN. The works proposing AdaBFFs (Bergstra et al. 2006a), SRCAM (Panagakis et al. 2009b), and the features of MAPsCAT (Andén and Mallat 2011) present only classification accuracy. Furthermore, based on classification accuracy, Seyerlehner et al. (2010) argue that the performance gap between MGR systems and humans is narrowing; and in this issue, Humphrey et al. conclude “progress in content-based music informatics is plateauing” (Humphrey et al. 2013). Figure 2 shows that, with respect to the classification accuracies in GTZAN reported in 83 published works (Sturm 2013b), those of AdaBFFs, SRCAM, and MAPsCAT lie above the best accuracies reported in half of those works. It is thus tempting to conclude from these that, with respect to the mean accuracy and its standard deviation, some configurations of these systems are better than others, that AdaBFFs is not as good as SRCAM and MAPsCAT, and that AdaBFFs, SRCAM, and MAPsCAT are recognizing genre better than at least half of the “competition”.

Table 4 Mean accuracies in GTZAN for each system, and the maximum \(\{p_i\}\) (9) over all 10 CV runs
Fig. 2 Highest reported classification accuracies in GTZAN (Sturm 2013b). The legend shows evaluation parameters. Top gray line is the estimated maximum accuracy possible in GTZAN given its repetitions and mislabelings. The five “x” are results that are disputed, or known to be invalid. The dashed gray line is the accuracy we observe for SRCAM with normalized features and 2 fCV using an artist-filtered GTZAN without repetitions

These conclusions are unwarranted for at least three reasons. First, we cannot compare mean classification accuracies computed from 10×10 fCV because the samples are highly dependent (Dietterich 1996; Salzberg 1997). Hence, we cannot test a hypothesis of one system being better than another by using, e.g., a t-test, as we have erroneously done before (Sturm 2012b). Second, Classify is answering the question, “How well does the system predict a label assigned to a piece of data?” Since many independent variables change between genre labels in GTZAN, and since Classify does nothing to deal with that, we cannot guard against confounds (Sturm 2012b; Urbano et al. 2013). This becomes clear when we see that artist filtering (Pampalk et al. 2005; Flexer 2007; Flexer and Schnitzer 2009, 2010) drops classification accuracy 20 points in GTZAN (Sturm 2013b). Thus, even if a system obtains the highest possible classification accuracy in GTZAN of 94.3 % (Sturm 2013b), we cannot reasonably say it is due to a capacity to recognize genre. Finally, since none of the results shown in Fig. 2 come from procedures that guard against confounds, we still cannot make meaningful comparisons between them.

We can, however, make some supportable conclusions about our systems and their configurations. For each of the systems in Table 4, we test for a significant difference in performance using a binomial hypothesis test (Salzberg 1997). For one run of 10 fCV, we define \(c_{h>l}\) as the number of times the system with the high mean accuracy is correct and the other wrong; and \(c_{h<l}\) as the number of times the system with the low mean accuracy is correct and the other wrong. We define the random variable \(C_{h>l}\) as that from which \(c_{h>l}\) is a sample; and similarly for \(C_{h<l}\). When the two systems perform equally well, we expect \(C_{h>l} = C_{h<l}\), and each to be distributed Binomial with parameters \(S = c_{h>l} + c_{h<l}\) and \(q = 0.5\), assuming each of the S trials is iid Bernoulli. Then, the probability that the system with the high mean accuracy performs better than the other, given that they actually perform the same, is

$$ p = P[C_{h>l} \ge c_{h>l} | q = 0.5] = \sum_{s=c_{h>l}}^{c_{h>l}+c_{h<l}}{c_{h>l}+c_{h<l} \choose s} (0.5)^{c_{h>l}+c_{h<l}}. \label{eq:binomp} $$
(9)

We define statistical significance as \(\alpha = 0.025\) (one-tailed test). With the Bonferroni correction, we consider a result statistically significant if, over all 10 CV runs, \(\max \{p_i\} < \alpha/10\).
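A brief scipy sketch of this test follows (the function name is ours; the counts are the disagreement counts defined above):

```python
from scipy.stats import binom

def binomial_comparison_pvalue(c_hl, c_lh):
    """p of (9): the probability that the higher-accuracy system wins at
    least c_hl of the S = c_hl + c_lh disagreements if both systems are
    in fact equally good (q = 0.5)."""
    return binom.sf(c_hl - 1, c_hl + c_lh, 0.5)  # P[C >= c_hl]

# Over the 10 CV runs we declare a significant difference only if
# max(p_i) < alpha / 10, with alpha = 0.025 (Bonferroni correction).
```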

The last column of Table 4 shows that we can only reject the null hypothesis for the two configurations of MAPsCAT. In the same way, we test for significant differences between pairs of the systems showing the highest mean accuracy, e.g., SRCAM with normalized features and MAPsCAT with total covariance. Since we find no significant differences between any of these pairs, we fail to reject the null hypothesis that any performs better than another.

4.2 Evaluating performance in particular classes

Figure 3 shows the recalls, precisions, and F-measures for AdaBFFs, SRCAM, and MAPsCAT. These FoMs, which appear infrequently in the MGR literature, can be more specific than mean accuracy, and provide a measure of how a system performs for specific classes. Wu et al. (2011) concludes on the relevance of their features to MGR by observing that the empirical recalls for Classical and Rock in GTZAN are above those expected by chance. With respect to precision, Lin et al. (2004) concludes their system is better than another. We see in Fig. 3 for Disco that MAPsCAT using total covariance shows the highest recall (0.76±0.01, std. dev.) of all systems. Since high recall can come at the price of many false positives, we also look at the precision. MAPsCAT displays this characteristic for Country. For Classical, we see MAPsCAT using class-dependent covariance has perfect recall and high precision (0.85±0.01). The F-measure combines recall and precision to reflect class accuracy. We see that AdaBFFs appears to be one of the most accurate for Classical, and one of the least accurate for Disco.
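For reference, such class-wise FoMs, and the confusion tables of Section 4.3, can be computed from one run's predictions with scikit-learn; a sketch on placeholder labels (note that, unlike Fig. 4, scikit-learn puts the true classes in the rows of the confusion matrix):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# toy placeholder labels for one CV run: 0..9 index the ten GTZAN categories
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(10), 100)                # 100 excerpts per category
y_pred = np.where(rng.random(1000) < 0.8,
                  y_true, rng.integers(0, 10, 1000))  # a fake, roughly 80 % accurate classifier

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None)
conf = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: predictions
```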

Fig. 3 Boxplots of recalls, precisions, and F-measures in 10×10-fold CV in GTZAN. Classes: Blues (bl), Classical (cl), Country (co), Disco (di), Hip hop (hi), Jazz (ja), Metal (me), Pop (po), Reggae (re), Rock (ro)

It is tempting to conclude that for these systems classical is quite easy to identify (high F-measure), and that this is reasonable since humans can easily identify such music as well. One might also be tempted to conclude, from comparing the F-measures of each system, that AdaBFFs is the worst at identifying rock but the best at identifying classical. All of these conclusions are unfounded, however. First, our implementation of Classify is not testing whether classical is identifiable by these systems, but rather the extent to which they identify excerpts labeled “Classical” among the others in the dataset. Second, we cannot implicitly assume that whatever aspects humans use to identify classical music are the same used by these systems. Finally, just as the statistics of the accuracy in Table 4 come from dependent samples, we cannot simply compare these statistics over 10×10 fCV.

We can, however, test the null hypothesis that two systems perform equally well for identifying a particular class. We compare the performance of pairs of systems in each of the classes in each run of 10 fCV, i.e., we compute (9) but restricted to each class. We find that MAPsCAT with total covariance performs significantly better than MAPsCAT with class-dependent covariances for Hip hop and Metal (\(p < 10^{-4}\)). We are unable, however, to reject the null hypothesis for any of the systems having the highest mean accuracy in Table 4.

4.3 Evaluating confusions

Figure 4 shows the mean confusions, recalls (diagonal), precisions (right), F-measures (bottom), and accuracy (bottom right), for three of our systems. (We herein only consider the configuration that shows the highest mean classification accuracy in Table 4.) We find confusion tables reported in 32 % of MGR work (Sturm 2012a). Confusion tables are sometimes accompanied by a discussion of how a system appears to perform in ways that make sense with respect to what experience and musicology say about the variety of influences and characteristics shared between particular genres, e.g., Tzanetakis et al. (2003), Ahrendt (2006), Rizzi et al. (2008), Sundaram and Narayanan (2007), Abeßer et al. (2012), Yao et al. (2010), Ren and Jang (2011, 2012), Umapathy et al. (2005), Homburg et al. (2005), Tzanetakis and Cook (2002), Holzapfel and Stylianou (2008), and Chen and Chen (2009). For instance, Tzanetakis and Cook (2002) write that the misclassifications of their system “... are similar to what a human would do. For example, classical music is misclassified as jazz music for pieces with strong rhythm from composers like Leonard Bernstein and George Gershwin. Rock music has the worst classification accuracy and is easily confused with other genres which is expected because of its broad nature.” Of their results, Holzapfel and Stylianou (2008) write, “In most cases, misclassifications have musical sense. For example, the genre Rock ... was confused most of the time with Country, while a Disco track is quite possible to be classified as a Pop music piece. ... [The] Rock/Pop genre was mostly misclassified as Metal/Punk. Genres which are assumed to be very different, like Metal and Classic, were never confused.” The human-like confusion tables found in MGR work, as well as the ambiguity between music genres, motivate evaluating MGR systems by considering as less troublesome the confusions we expect from humans (Craft et al. 2007; Craft 2007; Lippens et al. 2004; Seyerlehner et al. 2010).

Fig. 4 Mean confusions with standard deviations for each system (only one configuration each for lack of space). Columns are true genres, with mean precision (Pr × 100) shown in the last column. Classification accuracy is shown in the bottom right corner. Rows are predicted genres, with mean F-measure (F × 100) in the last row. Mean recalls × 100 are on the diagonal. Classes as in Fig. 3

In Fig. 4 we see in our systems the same kinds of behaviors mentioned above: Rock is often labeled “Metal,” and Metal is often labeled “Rock.” We also see that no system labels Blues, Disco, Hip hop, Pop or Reggae as “Classical.” It is thus tempting to claim that though these systems are sometimes not picking the correct labels, at least their mistakes make musical sense because even humans confuse, e.g., metal and rock, but never confuse, e.g., classical as blues or hip hop. This conclusion is unwarranted because it implicitly makes two assumptions: 1) the labels in GTZAN correspond to what we think they mean in a musical sense, e.g., that the Disco excerpts are exemplary of disco (Ammer 2004; Shapiro 2005); and 2) the systems are using cues similar, or equivalent in some way, to those used by humans when categorizing music. The first assumption is disputed by an analysis of the contents of GTZAN and how people describe them (Sturm 2013b). We see from this that the meanings of its categories are broader than what its labels imply, e.g., many Blues excerpts are tagged on last.fm as “zydeco,” “cajun,” and “swing”; many Disco excerpts are tagged “80s,” “pop,” and “funk”; and many Metal excerpts are tagged “rock,” “hard rock,” and “classic rock.” The second assumption is disputed by work showing the inability of low-level features to capture musical information (Aucouturier and Pachet 2004; Aucouturier 2009; Marques et al. 2010), e.g., what instruments are present, and how they are being played; what rhythms are used, and what is the tempo; whether the music is for dancing or listening; whether a person is singing or rapping, and the subject matter.

4.4 Summary

After evaluating the performance statistics of our systems using Classify in GTZAN, we are able to answer, in terms of classification accuracy, recall, precision, F-measure, and confusions, as well as our formal hypothesis testing, the question of how well each system can identify the labels assigned to excerpts in GTZAN. We are, however, no closer to determining the extent to which they can recognize from audio signals the music genres on which they are supposedly trained. One might argue, “the goal of any machine learning system is to achieve good results for a majority of cases. Furthermore, since machine learning is a methodology for building computational systems that improve their performance—whether with respect to classification accuracy, precision, or reasonable confusions—by training on examples provided by experts, then accuracy, precision, and confusion tables are reasonable indicators of the extent to which a system improves in this task. Therefore, no one can argue that an MGR system with a classification accuracy of 0.5—even for a small and problematic dataset like GTZAN—could be better than one of 0.8.” Even if “better” simply means, “with respect to classification accuracy in GTZAN,” there can still be a question.

We find (Sturm 2013b) that the classification accuracy of MAPsCAT in GTZAN using 2 fCV is more than 4 points above that of SRCAM. Using the binomial hypothesis test above, we can reject the null hypothesis that MAPsCAT performs no better than SRCAM. Testing the same systems but with artist filtering, we find the classification accuracies of SRCAM are higher than those of MAPsCAT. Hence, even for this restricted and ultimately useless definition of “better”—for what meaning does it have to an actual user? (Schedl and Flexer 2012; Schedl et al. 2013)—we can still question the ranking of MGR systems in terms of classification accuracy. With doubt even for this limited sense of “better,” how could there be less doubt with a broader sense? For example, it can only be speculative to claim that a high-accuracy system is “better” than another in the sense that it achieves its accuracy by recognizing genre (not confounds), or that a system will have similar performance in the real world, and so on.

A vital point to make clear is that our aim here is not to test whether an MGR system understands music, or whether it is selecting labels in a way indistinguishable from the way used by humans (Guaus 2009), or even koi (Chase 2001), primates (McDermott and Hauser 2007), pigeons (Porter and Neuringer 1984), or sparrows (Watanabe and Sato 1999). It is unnecessary to require such things of a machine before we can accept that it is capable of merely selecting labels that are indistinguishable from those humans would choose. Our only concern here is how to reliably measure the extent to which an MGR system is recognizing genre and not confounds, such as bandwidth, dynamic range, etc. We take up this challenge in the next section by evaluating the behaviors of the systems.

5 Evaluating the behaviors of MGR systems

Figure 5 shows how the Disco excerpts are classified by AdaBFFs, SRCAM, and MAPsCAT. (For lack of space, we only look at excerpts labeled Disco, though all categories show similar behaviors.) Some evaluation in MGR describes particular misclassifications, and sometimes authors describe listening to confused excerpts to determine what is occurring, e.g., Deshpande et al. (2001), Lee et al. (2006), Langlois and Marques (2009), Scaringella et al. (2006). Of their experiments, Deshpande et al. (2001) writes “... at least in some cases, the classifiers seemed to be making the right mistakes. There was a [classical] song clip that was classified by all classifiers as rock ... When we listened to it, we realized that the clip was the final part of an opera with a significant element of rock in it. As such, even a normal person would also have made such an erroneous classification.” Of the confusion table in their review of MGR research, Scaringella et al. (2006) finds “... it is noticeable that classification errors make sense. For example, 29.41 % of the ambient songs were misclassified as new-age, and these two classes seem to clearly overlap when listening to the audio files.”

Fig. 5 Excerpt-specific confusions for Disco for each system (same configurations as Fig. 4) in 10×10 fCV, with the number of classifications in each genre labeled at right. Classes as in Fig. 3. Darkness of a square indicates the number of times an excerpt is labeled a genre (left), with black as all 10 times

Unlike the confusion tables in Fig. 4, Fig. 5 shows the specific Disco excerpts that AdaBFFs most often classifies as “Pop” and “Rock,” that SRCAM most often classifies as “Reggae” and “Rock,” and that MAPsCAT most often classifies as “Rock” and “Hip hop.” We also see that some excerpts are never classified “Disco,” and/or are always classified with the same class; and that five excerpts are always classified the same way by all three systems. We might see such pathological behavior as bias in a system, or perhaps an indication of an error in GTZAN; nonetheless, the pathological behaviors of our systems present themselves as teachable moments (Sturm 2012b): by considering the specific excerpts producing these consistent errors, we might learn what causes problems for some or all of the systems.

We find very few works discussing and using such behaviors of an MGR system. Lopes et al. (2010) propose training an MGR system using only the training data that is easily separable. In a similar vein, Bağci and Erzin (2007) refine their class models by only using perfectly separable training data (Sturm and Gouyon 2013, unpublished). Related is experimental work (Pampalk et al. 2005; Flexer 2007; Flexer and Schnitzer 2009) showing inflated performance of systems for MGR or music similarity when the same artists and/or albums occur in training and testing sets. Within music similarity and retrieval, a related topic is hubs (Aucouturier and Pachet 2004; Gasser et al. 2010; Flexer and Schnitzer 2010; Schnitzer et al. 2012), where some music signals appear equally close to many other music signals regardless of similarity.

We define a consistent misclassification (CM) as when, in all CV runs, a system selects the same “wrong” class for an excerpt. When in all CV runs a system selects different “wrong” classes for an excerpt, we call it a persistent misclassification (PM). When in all CV runs a system selects the “correct” class for an excerpt, we call it a consistently correct classification (C3). Table 5 summarizes these types of classification in our 10×10 fCV for AdaBFFs, SRCAM, and MAPsCAT. We see that of the Disco excerpts, AdaBFFs produces 55 C3s, 15 CMs, and 5 PMs, and it consistently misclassifies 11 excerpts as “Disco.” In total, MAPsCAT appears to have the highest number of C3s and CMs, and AdaBFFs the lowest.
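A minimal sketch that tallies these classification types from a matrix of per-run predictions (as produced, e.g., by the cross-validation sketch in Section 4); here we read “different ‘wrong’ classes” loosely as “not always the same ‘wrong’ class”:

```python
import numpy as np

def classification_types(preds, y):
    """Tally C3s, CMs and PMs from the 10x10 fCV predictions.

    preds : array, shape (n_trials, n_excerpts), predicted classes
    y     : array, shape (n_excerpts,), the dataset's "ground truth"
    Returns one label per excerpt: "C3", "CM", "PM", or "mixed".
    """
    types = []
    for i in range(len(y)):
        p = preds[:, i]
        if np.all(p == y[i]):
            types.append("C3")              # consistently correct
        elif np.all(p != y[i]):             # wrong in every run ...
            types.append("CM" if len(np.unique(p)) == 1 else "PM")
        else:
            types.append("mixed")           # sometimes right, sometimes wrong
    return types
```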

Table 5 Classification type results for each system (same configurations as Fig. 4) on GTZAN

We can consider three different possibilities with such errors. The first is that the GTZAN label is “wrong,” and the system is “right”—which can occur since GTZAN has mislabelings (Sturm 2013b). Such a result provides strong evidence that the system is recognizing genre, and means that the classification accuracy of the system we compute above is too low. The second possibility is that the label is “right,” but the system almost had it “right.” This suggests it is unfair to judge a system by considering only its top choice for each excerpt. For instance, perhaps the decision statistic of the system in the “correct” class is close enough to the one it picked that we can award “partial credit.” The third possibility is that the label is “right,” but the class selected by the system would have been selected by a human. Hence, this suggests the error is “acceptable” because the selected class is indistinguishable from what humans would choose. Such a result provides evidence that the system is recognizing genre. We now deal with each of these possibilities in order.

5.1 Evaluating behavior through the mislabelings in GTZAN

So far, our evaluation has employed descriptive and inferential statistics, looked at how specific excerpt numbers are classified, and noted pathological behaviors, but we have yet to make use of the actual music embodied by any excerpts. It is quite rare to find in the MGR literature any identification of the music behind problematic classifications (Sturm 2012a). Langlois and Marques (2009) notice in their evaluation that all tracks from an album by João Gilberto are PMs for their system. They attribute this to the tracks coming from a live recording with speaking and applause. In their results, we see that the system by Lee et al. (2006) classifies as “Techno” John Denver’s “Rocky Mountain High,” but they do not discuss this result.

The first three columns of Table 6 list for only Disco (we see similar results for all other categories) the specific excerpts of each pathological classification of our systems (same configurations as in Fig. 4). We know that in Disco, GTZAN has six repeated excerpts, two excerpts from the same recording, seven mislabeled excerpts, and one with a problematic label (Sturm 2013b). Of the excerpts consistently misclassified “Disco” by our systems, six are mislabeled in GTZAN. The last column lists those excerpts in other categories that each system consistently misclassifies “Disco.” A number struck-through in Table 6 is a mislabeled excerpt; and a circled number is an excerpt for which the system selects the “correct” class. It is important to note that the categories of GTZAN are actually broader than their titles suggest (Sturm 2013b), i.e., Disco includes music that is not considered exemplary of disco as described: “A style of dance music of the late 1970s and early 1980s ... It is characterized by a relentless 4/4 beat, instrumental breaks, and erotic lyrics or rhythmic chants” (Ammer 2004).

Table 6 Classification type results for each system (same configurations as Fig. 4) for GTZAN Disco excerpts considering the replicas and mislabelings in GTZAN (Sturm 2013b)

We see from Table 6 and Fig. 5 that all seven Disco mislabelings in GTZAN appear in the pathological behaviors of the three systems. AdaBFFs consistently “correctly” classifies four of the six mislabelings it finds, SRCAM two of the five it finds, and MAPsCAT two of the six it finds. All systems are firm in classifying as “Pop” Disco 23 (“Playboy,” Latoya Jackson). Both AdaBFFs and MAPsCAT are firm in classifying as “Hip hop” Disco 27 (“Rapper’s Delight,” The Sugarhill Gang), but SRCAM most often selects “Reggae.” Both AdaBFFs and SRCAM are firm in classifying as “Pop” Disco 29 (“Heartless,” Evelyn Thomas). While AdaBFFs is adamant in its classification as “Pop” Disco 26 (“(Baby) Do The Salsa,” Latoya Jackson), MAPsCAT is “incorrect” in its firm classification as “Hip hop.” Finally, both SRCAM and MAPsCAT are firm in classifying as “Rock” Disco 11 (“Can You Feel It,” Billy Ocean). AdaBFFs classifies as only “Country,” and MAPsCAT as only “Rock,” Disco 20 (“Patches,” Clarence Carter), but never as the “correct” label “Blues” (Sturm 2013b).

We now consider the excerpts consistently misclassified as “Disco” by the systems. First, AdaBFFs “correctly” classifies Pop 65 (“The Beautiful Ones,” Prince), and SRCAM “correctly” classifies Pop 63 (“Ain’t No Mountain High Enough,” Diana Ross). However, AdaBFFs insists Country 39 (“Johnnie Can’t Dance,” Wayne Toups & Zydecajun) is “Disco” though the “correct” label is “Blues” (Sturm 2013b), and insists Rock 40 (“The Crunge,” Led Zeppelin) is “Disco” though the “correct” labels are “Metal” and/or “Rock” (Sturm 2013b). SRCAM insists Rock 77 (“Freedom,” Simply Red) is “Disco,” though it should be “Pop” (Sturm 2013b). Finally, both AdaBFFs and SRCAM insist Rock 38 (“Knockin’ On Heaven’s Door,” Guns N’ Roses) is “Disco,” though it should be “Metal” and/or “Rock” (Sturm 2013b).

In summary, we see that each system finds some of the mislabelings in Country, Disco, Pop, and Rock. When a system “correctly” classifies such a mislabeled excerpt, it provides strong evidence that it could be recognizing genre, e.g., AdaBFFs finds and “correctly” classifies as “Pop” both excerpts of Latoya Jackson. When a system “incorrectly” classifies a mislabeling, however, it provides evidence that it is not recognizing genre, e.g., all three systems consistently misclassify as “Country” Disco 41 (“Always and Forever,” Heatwave). We now consider the remaining Disco excerpts in Table 6, and investigate the decision confidences.

5.2 Evaluating behavior through the decision statistics

We now explore the second possibility of these pathological behaviors: whether for these kinds of misclassifications a system deserves credit because its decision statistics for the correct and selected classes are close, or because at least the “correct” label is ranked second. Figure 6 shows a boxplot of the decision statistics for AdaBFFs (2), for SRCAM (6), and for MAPsCAT (8). We only consider Disco for lack of space. Figure 6a shows for AdaBFFs that its decision statistics for most Disco CMs and CMs as “Disco” lie within two standard deviations of the mean of the C3s decision statistics. This suggests AdaBFFs is as “confident” in its pathological misclassifications as it is in its C3s. We reach the same conclusion for SRCAM and MAPsCAT. We next consider each excerpt individually.

Fig. 6 Boxplot of decision statistics for Disco excerpts: CMs labeled on left; C3s on right; CMs as “Disco” labeled in center. Excerpt numbers are shown. Mean of C3s shown as gray line with one standard deviation above and below. Classes as in Fig. 3

Table 7 lists the music artist and title of the CMs from Table 6 that are not struck-through, as well as statistics of the decision confidence, the rank of “Disco,” and the top three last.fm tags (from the “count” parameter) of each song or artist (retrieved Oct. 15, 2012). We do not include tags that duplicate the artist or song name; and when a song has no tags associated with it, we take the top tags for the artist. We define the “confidence” of a classification as the difference between the decision statistic of the selected class and that of “Disco.” This differs from Fig. 6, which shows the decision statistics themselves rather than a difference. Table 8 lists in the same way the details of the CMs as “Disco.”

Table 7 Details of Disco CMs from Table 6
Table 8 Details of CMs as “Disco” from Table 6

In Table 7 we see that AdaBFFs, SRCAM, and MAPsCAT all appear quite confident in many of their CMs; however, “Disco” appears in the penultimate position for many by AdaBFFs and MAPsCAT. Only two excerpts are CMs for all three systems: Disco 41 is always classified “Country,” even though its top tags have no overlap with the top tags in Country (“country,” “classic country,” “oldies,” etc.) (Sturm 2013b). For AdaBFFs and SRCAM, “Disco” appears around rank 5; and rank 8 for MAPsCAT (above only “Metal” and “Classical”). The other common CM is Disco 86, which is consistently misclassified by each system in radically different ways. Its tags have little overlap with the top tags of Rock (i.e., “70s”), but none with those of Blues or Hip hop (Sturm 2013b).

Table 8 shows for all three systems the details of the CMs as “Disco” in Table 6, as well as their confidences, and the rank of the GTZAN category to which each belongs. Some of these appear to be satisfactory, e.g., the excerpts with top tags “pop,” “70s,” “80s,” “soul,” “funk” and “dance” have overlap with the top tags of Disco (Sturm 2013b). Others, however, appear quite unsatisfactory, such as “Knockin’ on Heaven’s Door” by Guns N’ Roses, for which AdaBFFs ranks the GTZAN category penultimate, placing “Pop” and “Hip hop” before “Rock” or “Metal.”

In summary, from our evaluation of the decision statistics involved in their pathological errors, it is hard to find evidence that any of these systems works with a decision confidence that is meaningful with respect to music genre. The question thus arises: to what extent do humans show the same kinds of classification behavior as AdaBFFs, SRCAM, and MAPsCAT? We test this next using a listening experiment.

5.3 Evaluating behavior by listening tests

Listening tests for the evaluation of systems are by and large absent from the MGR literature (Sturm 2012a). The work of Gjerdingen and Perrott (2008) is a widely cited study of human music genre classification (Aucouturier and Pampalk 2008), and Krumhansl (2010) and Mace et al. (2011) extend this work. Both Ahrendt (2006) and Meng and Shawe-Taylor (2008) use listening tests to gauge the difficulty of discriminating the genres of their datasets, and to compare with the performance of their system. Lippens et al. (2004) use listening tests to produce a dataset that has excerpts more exemplary of single genres; and Craft et al. (2007) and Craft (2007) propose that evaluating MGR systems makes sense only with respect to the generic ambiguity of music. Seyerlehner et al. (2010) expands upon these works. Guaus (2009) conducts listening tests to determine the relative importance of timbre or rhythm in genre recognition. Finally, Cruz and Vidal (2003, 2008) and we (Sturm 2012b) conduct listening tests to determine if an MGR system can create music representative of a genre it has learned.

These latter works motivate the use of listening tests to circumvent the need to compare tags, or demarcate elements, necessary to argue that it is more acceptable to label an excerpt, e.g., “Pop” or “Rock,” neither or both. Our hypothesis is that the difference between the label given by a human and that given by a system will be large enough that it is extremely clear which label is given by a human. Hence, we wish to determine the extent to which the CMs of AdaBFFs, SRCAM or MAPsCAT—of which we see the systems are confident—are acceptable in the sense that they are indistinguishable from those humans would produce.

We conduct a listening test in which a human subject must choose for each 12-s excerpt which label of two was given by a human (i.e., G. Tzanetakis); the other label was given by AdaBFFs, SRCAM or MAPsCAT. The experiment has two parts, both facilitated by GUIs built in MATLAB. In the first part, we screen subjects for their ability to distinguish between the ten genres in GTZAN. (We pick “representative” excerpts by listening: Blues 05, John Lee Hooker, “Sugar Mama”; Classical 96, Vivaldi, “The Four Seasons, Summer, Presto”; Country 12, Billy Joe Shaver, “Music City”; Disco 66, Peaches and Herb, “Shake Your Groove Thing”; Hip hop 47, A Tribe Called Quest, “Award Tour”; Jazz 19, Joe Lovano, “Birds Of Springtimes Gone By”; Metal 11, unknown; Pop 95, Mandy Moore, “Love you for always”; Reggae 71, Dennis Brown, “Big Ships”; Rock 37, The Rolling Stones, “Brown Sugar.”) A subject correctly identifying the labels of all excerpts continues to the second part of the test, where s/he must discriminate between the human- and system-given genres for each music excerpt. For instance, the test GUI presents “Back Off Boogaloo” by Donna Summer along with the labels “Disco” and “Pop.” The subject selects the one s/he thinks is given by a human before proceeding. We record the time each subject uses to listen to an excerpt before proceeding. We test all Disco CMs and CMs as “Disco” in Tables 7 and 8. Although all other classes have CMs in all three systems (Table 5), we test only these because of the effort required. In total, 24 test subjects completed the second part.

Figure 7 shows the distribution of choices made by the test subjects. Figure 7a shows that of the 24 times Disco 10 was presented, 6 subjects selected “Disco” (the human-given label), and 18 selected “Pop” (the class selected by MAPsCAT). Some excerpts are presented more than 24 times because they are misclassified by more than one system. We see that of the Disco CMs of the systems, for only two of nine by AdaBFFs, one of 11 by MAPsCAT, and none of 10 by SRCAM did a majority of subjects side with the non-human class. For some excerpts, agreement with the human label is unanimous. Figure 7b shows that for Hip hop 00, 15 subjects selected “Hip hop” (the class selected by AdaBFFs), and 9 selected “Disco” (the human-given label). We see that of the ten CMs as “Disco” of AdaBFFs, and of the twelve of MAPsCAT, in no case did a majority of subjects select “Disco.” Of the five CMs as “Disco” of SRCAM, we see for two excerpts—Reggae 88 and Rock 77—that a majority of subjects chose “Disco.”

Fig. 7 Distribution of choices from listening tests. a Disco CMs in Table 7. The class selected by a system is marked using a symbol (legend). b CMs as “Disco” in Table 8. Classes as in Fig. 3

We now test the null hypothesis that the subjects are unable to recognize the difference between the label given by a human and the class selected by a system. We can consider the outcome of each trial as a Bernoulli random variable with parameter x (the probability of a subject selecting the label given by a human). For a given excerpt for which the human label is selected h times by N independent subjects, we can estimate the Bernoulli parameter x using the minimum mean-squared error estimator, assuming x is distributed uniformly on [0, 1]: \(\hat x(h) = (h+1)/(N+2) \) (Song et al. 2009). The variance of this estimate is given by

$$ \hat \sigma^2(\hat x) = \frac{\hat x (1-\hat x)}{(N-1) + \frac{N+1}{N\hat x (1 - \hat x)}}. $$
(10)

We test the null hypothesis by computing \(P[T > |\hat x - 0.5|/\hat \sigma(\hat x)]\) where T is distributed Student’s t with N − 2 degrees of freedom (two degrees lost in the estimation of the Bernoulli parameter and its variance). For only four Disco CM excerpts—11, 13, 15, and 18—do we find that we cannot reject the null hypothesis (p > 0.1). Furthermore, in the case of excerpts 10 and 34, we can reject the null hypothesis in favor of the misclassification of MAPsCAT and AdaBFFs, respectively (p < 0.012). For all other 21 Disco excerpts, we reject the null hypothesis in favor of the human-given labels (p < 0.008). For only two CMs as “Disco” excerpts (Hip hop 00 and Rock 77) do we find that we cannot reject the null hypothesis (p > 0.1). Furthermore, only in the case of Reggae 88 can we reject the null hypothesis in favor of SRCAM (\(p < 4 \cdot 10^{-7}\)). For all other 20 excerpts, we can reject the null hypothesis in favor of the human-given labels (p < 0.012).
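To make the test concrete, the following is a minimal sketch (not the code used in this study) of the estimate \(\hat x(h)\), the variance in (10), and the one-sided tail probability; it assumes SciPy for the Student’s t distribution, and the helper name human_label_test is hypothetical. The example counts are those reported above for Disco 10 (6 of 24 subjects choosing the human-given label).

```python
from scipy import stats

def human_label_test(h, N):
    """Hypothetical helper: h of N independent subjects chose the human-given label."""
    # MMSE estimate of the Bernoulli parameter under a uniform prior on [0, 1]
    x_hat = (h + 1) / (N + 2)
    # Variance of the estimate, following (10)
    var_hat = x_hat * (1 - x_hat) / ((N - 1) + (N + 1) / (N * x_hat * (1 - x_hat)))
    # One-sided tail probability P[T > |x_hat - 0.5| / sigma_hat], T ~ Student's t, N - 2 dof
    t_stat = abs(x_hat - 0.5) / var_hat ** 0.5
    return x_hat, stats.t.sf(t_stat, df=N - 2)

# e.g., Disco 10: 6 of 24 subjects chose the human-given label "Disco"
x_hat, p = human_label_test(h=6, N=24)
print(f"x_hat = {x_hat:.2f}, p = {p:.4f}")
```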

So, what is it about Disco 13, 15 and 18 that leaves subjects divided between the labels “Disco” and “Pop,” and what makes them choose “Pop” more often than “Disco” for Disco 10 and 34? Many subjects who passed the screening mentioned in post-test interviews that the most challenging pair of tags was “Disco” and “Pop.” When asked what cues they used to make the choice, many could not state specifics, referring instead to the “feel” of the music. Some said they decided based upon whether the excerpt sounded “old” or “produced.” In these five cases, then, we can argue that AdaBFFs and MAPsCAT are classifying acceptably. (Some subjects were also dissatisfied with some label pairs, e.g., “Metal” and “Disco” for ABBA’s “Mamma Mia,” when they wished to select “Pop” instead.)

In the case of Disco 11, subjects were divided between “Disco” and “Rock.” When asked in the post-test interview about how quickly they made each selection, many subjects said they were quite quick, e.g., within the first few seconds. Some mentioned that they changed their answers after listening to some of the excerpts longer; and a few subjects said that they made sure to listen beyond what sounded like the introduction. After inspecting the duration each subject listened to Disco 11 before proceeding, we find that the difference in listening time between subjects who selected “Rock” (8.5 ±1.2 s, with 95 % confidence interval) and those who selected “Disco” (7.9 ±1.1 s) is not statistically significant (p > 0.48). However, for Hip hop 00, the difference in mean listening duration between subjects who selected “Disco” (4.9 ±1.1 s) and those who selected “Hip hop” (9.5 ±1.6 s) is significant (\(p < 6 \cdot 10^{-5}\)). Apparently, many subjects hastily chose the label “Disco.” In these two cases, then, we can argue that SRCAM and MAPsCAT are classifying acceptably.
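As an illustration of this comparison of group means, the following sketch runs a Welch two-sample t-test on listening durations. The per-subject durations are not reproduced here, so the arrays below are synthetic placeholders with roughly the means reported for Disco 11, and the choice of Welch’s test is our assumption (the exact test is not specified above).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-subject listening times (s), for illustration only; the means
# roughly follow the values reported above for Disco 11
dur_rock = rng.normal(loc=8.5, scale=2.0, size=14)   # subjects who chose "Rock"
dur_disco = rng.normal(loc=7.9, scale=2.0, size=10)  # subjects who chose "Disco"

# Welch's t-test (unequal variances) on the mean listening durations
t_stat, p_value = stats.ttest_ind(dur_rock, dur_disco, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```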

5.4 Summary

In Section 4, we were concerned with quantitatively measuring the extent to which an MGR system predicts the genre labels of GTZAN. This presents a rather rosy picture of performance: all of our systems have high classification accuracies, precisions and F-measures in many categories, and confusion behaviors that appear to make musical sense. Though their classification accuracies in GTZAN drop significantly when using an artist filter (Sturm 2013b), they still remain well above that of chance. Due to the nature of Classify, however, we cannot reasonably argue that this means they are recognizing the genres in GTZAN, or more broadly that they will perform well in the real world recognizing the same genres (Urbano et al. 2013). In this section, we have thus been concerned with evaluating the extent to which an MGR system displays the kinds of behavior we expect of a system that has the capacity to recognize genre.

By inspecting the pathological errors of the systems, and taking into consideration the mislabelings in GTZAN (Sturm 2013b), we find evidence both for and against the claim that any of them can recognize genre, or that any of them is better than the others. We see that MAPsCAT has over one hundred more C3s than SRCAM and AdaBFFs, but that AdaBFFs “correctly” classifies more of the mislabeled Disco excerpts than the other two. All three systems, however, make errors that are difficult to explain if genre is what each is recognizing. We see that the confidence of these systems in their pathological errors is for the most part indistinguishable from their confidence in their C3s. While the rank of the “correct” class is often second only to the “wrong” one they select, there are rankings that are difficult to explain if genre is what each is recognizing. Finally, our listening test reveals that for the most part the pathological errors of these systems are readily distinguishable from the errors humans would commit. Their performance in that respect is quite poor.

6 On evaluation

While genre is an inescapable result of human communication (Frow 2005), it can also sometimes be ambiguous and subjective, e.g., Lippens et al. (2004), Ahrendt (2006), Craft et al. (2007), Craft (2007), Meng and Shawe-Taylor (2008), Gjerdingen and Perrott (2008) and Seyerlehner et al. (2010). A major conundrum in the evaluation of MGR systems is thus the formal justification of why particular labels are better than others. For instance, while we deride it above, an argument might be made that ABBA’s “Mamma Mia” employs some of the same stylistic elements of metal used by Motörhead in “Ace Of Spades”—though it is difficult to imagine the audiences of the two would agree. The matter of evaluating MGR systems would be quite simple if only we had a checklist of essential, or at least important, attributes for each genre. Barbedo and Lopes (2007) provide a long list of such attributes for each of several genres and sub-genres, e.g., Light Orchestra Instrument Classical is marked by “light and slow songs ... played by an orchestra” with no vocal element (like J. S. Bach’s “Air on the G String”); and Soft Country Organic Pop/Rock is marked by “slow and soft songs ... typical of southern United States [with] elements both from rock and blues [and where] electric guitars and vocals are [strongly] predominant [but there is little if any] electronic elements” (like “Your Cheating Heart” by Hank Williams Sr.). Some of these attributes are clear and actionable, like “slow,” but others are not, like “[with] elements both from rock and blues.” Such an approach to evaluation might thus be a poor match with the nature of genre (Frow 2005).

We have shown how evaluating the performance statistics of MGR systems using Classify in GTZAN is inadequate to meaningfully measure the extent to which a system is recognizing genre, or even whether it addresses the fundamental problem of MGR. Indeed, replacing GTZAN with another dataset, e.g., ISMIR2004 (ISMIR 2004), or expanding it, does not help as long as we do not control for all independent variables in a dataset. On the other hand, there is no doubt that we see systems performing with classification accuracies significantly above random in GTZAN and other datasets. Hence, something is working in the prediction of the labels in these datasets, but is that “something” genre recognition? One might argue, “The answer to this question is irrelevant. The ‘engineering approach’—assemble a set of labeled data, extract features, and let the pattern recognition machinery learn the relevant characteristics and discriminating rules—results in performance significantly better than random. Furthermore, with a set of benchmark datasets and standard performance measures, we are able to make meaningful comparisons between systems.” This might be agreeable insofar as one restricts the application domain of MGR to predicting the single labels of the music recording excerpts in the handful of datasets on which systems are trained and tested. When it comes to ascertaining their success in the real world, to deciding which of several MGR systems is best and which is worst, which has promise and which does not, Classify and classification accuracy provide no reliable or even relevant gauge.

One might argue, “Accuracy, recall, precision, and F-measures are standard performance measures, and this is the way it has always been done for recognition systems in machine learning.” We do not advocate eliminating such measures, abandoning Classify, or avoiding or somehow “sanitizing” GTZAN. We build all of Section 5 upon the outcome of Classify in GTZAN, but with a major methodological difference from Section 4: we consider the contents of the categories. We use the faults of GTZAN, the decision statistics, and a listening test to illuminate the pathological behaviors of each system. As we look more closely at their behaviors, the rosy picture of the systems evaporates, as does our confidence that any of them is addressing the problem, that any one of them is better than the others, or even that one of them will be successful in a real-world context.

One might argue that confusion tables provide a realistic picture of system performance. However, in claiming that the confusion behavior of a system “makes musical sense,” one implicitly makes two critical assumptions: 1) that the dataset being used has integrity for MGR; and 2) that the system is using cues similar to those used by humans when categorizing music, e.g., what instruments are playing, and how are they being played? what is the rhythm, and how fast is the tempo? is it for dancing, moshing, protesting or listening? is someone singing, and if so what is the subject? The faults of GTZAN, and the wide composition of its categories, obviously do not bode well for the first assumption (Sturm 2013b). The second assumption is difficult to justify, and requires one to dig deeper than the confusion behaviors, to determine how the system is encoding and using such relevant features.

Analyzing the pathological behaviors of an MGR system provides insight into whether its internal models of genres make sense with respect to the ambiguous nature of genre. Comparing the classification results with the tags given by a community of listeners shows that some behaviors do “make musical sense,” but others appear less acceptable. In the case of using tags, the implicit assumption is that the tags given by an unspecified population to make their music more useful to them are to be trusted in describing the elements of music that characterize its genre(s)—whether users base these upon genre (“funk” and “soul”), style (“melodic” and “classic”), form (“ballad”), function (“dance”), history (“70s” and “old school”), geography (“jamaican” and “brit pop”), or other criteria (“romantic”). This assumption is thus quite unsatisfying, and one wonders whether tags present a good way to formally evaluate MGR systems.

Analyzing the same pathological behaviors of an MGR system, but by a listening test designed specifically to test the acceptability of its choices, circumvents the need to compare tags, and gets to the heart of whether a system is producing genre labels indistinguishable from those humans would produce. Hence, we see that though our systems have classification accuracies and other statistics significantly higher than chance, and though each system has confusion tables that appear reasonable, a closer analysis of their confusions at the level of the music, and a listening test measuring the acceptability of their classifications, reveal that they are likely not recognizing genre at all.

If performance statistics better than random do not reflect the extent to which a system is solving a problem, then what can? The answer to this has import not just for MGR, but for music information research in general. To this end, consider a man claiming his horse “Clever Hans” can add and subtract integers. We watch the owner ask Hans, “What is 2 and 3?” Then Hans taps his hoof, raising his ears after the fifth tap, at which point he is rewarded by the owner. To measure the extent to which Hans understands the addition and subtraction of integers, having the owner ask more questions in an uncontrolled environment does not add evidence. We can instead perform a variety of experiments that do. For instance, with the owner present and handling Hans, two people can whisper separate questions to Hans and the owner, with the ones whispering not knowing whether the same question is given or not. In place of real questions, we might ask Hans nonsensical questions, such as, “What is Bert and Ernie?” Then we can compare his answers with each of the questions. If this demonstrates that something other than an understanding of basic mathematics might be at play, then we must search for the mechanism by which Hans is able to correctly answer the owner’s questions in an uncontrolled environment. We can, for instance, blindfold Hans to determine whether it is vision, or isolate him in a soundproof room with the owner outside to determine whether it is sound. Such a historical case is well documented by Pfungst (1911).

Using Classify with datasets in which many independent variables change between classes is akin to asking Hans to answer more questions in an uncontrolled environment. What is needed is a richer and more powerful toolbox for evaluation (Urbano et al. 2013). One must search for the mechanism of correct response, which can be evaluated by, e.g., Rules and Robust. Dixon et al. (2010) use Rules to inspect the sanity of what their system discovers to be useful for discriminating different genres. We show using Robust (Sturm 2012b) that two high-accuracy MGR systems can classify the same excerpt of music in radically different ways when we make minor adjustments by filtering that do not affect its musical content. Akin to asking nonsense questions, Matityaho and Furst (1995) notice that their system classifies a zero-amplitude signal as “Classical,” and white noise as “Pop.” Porter and Neuringer (1984), investigating the training and generalization capabilities of pigeons in discriminating between two genres, test whether responses are due to the music itself, or to confounds such as characteristics of the playback mechanisms, and the lengths and loudness of excerpts. Chase (2001) does the same for koi, and looks at the effect of timbre as well.

Since the claim that an artificial system “recognizes genre with 85 % accuracy” is as remarkable as the claim that a horse can perform mathematics, we advocate approaching an MGR system—or autotagger, or any music information system—as if it were “Clever Hans.” This of course necessitates creativity in experimental design, and requires much more effort than comparing selected tags to a “ground truth.” One might argue, “One of the reasons MGR is so popular is because evaluation is straightforward and easy. Your approach is less straightforward, and certainly unscalable, e.g., using the million song dataset (Bertin-Mahieux et al. 2011; Hu and Ogihara 2012; Schindler et al. 2012).” To this we can only ask: why attempt to solve very big problems with a demonstrably weak approach to evaluation, when the smaller problems have yet to be indisputably solved?

7 Conclusion

In this work, we have evaluated the performance statistics and behaviors of three MGR systems. Table 4 shows their classification accuracies are significantly higher than chance, and are among the best observed (and reproduced) for the GTZAN dataset. Figure 3 shows their recalls, precisions, and F-measures to be similarly high. Finally, Fig. 4 shows their confusions “make musical sense.” Thus, one might take these as evidence that the systems are capable of recognizing some of the genres in GTZAN. The veracity of this claim is considerably challenged when we evaluate the behaviors of the systems. We see that SRCAM has just as high confidence in its consistent misclassifications as in its consistently correct classifications. We see MAPsCAT—a system with a high F-measure in Metal—always misclassifies the excerpt of “Mamma Mia” by ABBA as “Metal” first, “Rock” second, and “Reggae” or “Country” third. We see that all subjects of our listening test have little trouble discriminating between a label given by a human and that given by these systems. In short, though these systems have superb classification accuracies, recalls, etc., in GTZAN, they do not reliably produce genre labels indistinguishable from those humans produce.

From the very nature of Classify in GTZAN, we are unable to reject the hypothesis that none of these systems is able to recognize genre, no matter the accuracy we observe. In essence, “genre” is not the only independent variable changing between the excerpts of particular genres in our dataset, and Classify does not account for the others. There is also, to name just a few, instrumentation (disco and classical may or may not use strings), loudness (metal and classical can be played at high or low volumes), tempo (blues and country can be played fast or slow), dynamics (classical and jazz can have few or several large changes in dynamics), reverberation (reggae can involve spring reverberation, and classical can be performed in small or large halls), production (hip hop and rock can be produced in a studio or recorded in concert), channel bandwidth (country and classical can be heard on AM or FM radio), noise (blues and jazz can be heard from an old record or a new CD), etc. Hence, to determine whether an MGR system has a capacity to recognize any genre, one must look deeper than classification accuracy and related statistics, and from many more perspectives than just Classify.