Classification Accuracy Is Not Enough

A recent review of the research literature evaluating music genre recognition (MGR) systems over the past two decades shows that most works (81%) measure the capacity of a system to recognize genre by its classification accuracy. We show here, by implementing and testing three categorically different state-of-the-art MGR systems, that classification accuracy does not necessarily reflect the capacity of a system to recognize genre in musical signals. We argue that a more comprehensive analysis of behavior at the level of the music is needed to address the problem of MGR, and that measuring classification accuracy obscures the aim of MGR: to select labels indistinguishable from those a person would choose.


Introduction
For over fifty years, research in information technology has advanced the field of machine learning to reach almost human-level performance in discriminating and categorizing the content of text, images, sounds, movies, and other media. For music in particular, the problem of identifying, discriminating between, and learning the criteria of music genres or styles, known as music genre recognition (MGR), has motivated much work over the past 28 years [83]. Indeed, a recent review of MGR [34] writes, "Genre classification is the most widely studied area in MIR." There are a few reviews of the variety of features and approaches to MGR by machine listening [7,34,73]. MGR research is also making its appearance in textbooks [48].
Most published studies of MGR systems report classification performance significantly better than chance, and sometimes as well as or better than humans. For a benchmark dataset of music excerpts singly-labeled in ten genres (GTZAN [82,89]), classification accuracies are now reported above 90%, e.g., [19,38,65-67]. Indeed, as [14] writes, "Given the steady and significant improvement in [genre] classification performance since 1997, we wonder if automatic methods are not already more efficient at learning genres than some people." This increase in performance not only merits a closer look at what works so well in these particular systems, but also motivates a re-evaluation of the argument that music genre exists to a large extent outside of the acoustic signal itself [28,58,93]. It might also, most excitingly, reveal fundamental aspects of how people hear and conceptualize the complex and mysterious phenomenon of "music." We might be getting ahead of ourselves, however.
The work in [85] casts doubt on the high classification accuracies reported in [65-67], results that actually stem from a flaw in the simulations (private correspondence with Y. Panagakis). Another work [83] provides a comprehensive review of the approaches so far used for evaluating MGR systems. We see that over 92% of 375 papers approach evaluation of MGR systems by classifying several music excerpts and comparing the labels to the "true" ones. Nearly all of this work (334 papers) uses the classification accuracy as a figure of merit. Also shown is that the most used publicly available benchmark dataset is GTZAN, a dataset that has integrity problems for genre recognition [82]. And the work in [84] shows that, even with high classification accuracy, an MGR system can act as if music genre is not what it is recognizing. Thus, the advances we see in MGR might be misleading: a system with high classification accuracy might not be addressing the problem at all.
In this paper, we show that classification accuracy does not reliably reflect the capacity of an MGR system to recognize music genre. Indeed, recall, precision and confusion tables are still not enough. We claim that these figures of merit, which have been used in the past decade to rank MGR systems, e.g., [7,11,15,17,18,25,29,34,67,71,88,89] (citing one publication from each year since 2001), do not reliably rank MGR systems. While this claim has not been made in any work surveyed in [83], shades of it appear in [22,23,52,77,84,93]. Those works argue for measuring performance in ways that take into account the ambiguity of genre being in part a cultural and subjective construction. We, however, argue that the evaluation of MGR systems (the experimental designs, the datasets, and the figures of merit), and indeed the development of future systems, must embrace the fact that the problem of recognizing genre is a musical one, and must be evaluated as such. In short, classification accuracy is not enough to gauge the success of any MGR system.
In the next section, we distill the variety of MGR evaluation approaches used over the past two decades along three dimensions: experimental design, datasets, and figures of merit. This shows how most work reports classification accuracy of supervised approaches to machine learning using private datasets. The third section reviews three state-of-the-art MGR systems that show high classification accuracy in the most-used publicly-available music genre dataset GTZAN. In the fourth section, we analyze the behaviors of these three systems, from the high-level figures of merit classification accuracy, recall and precision, to mid-level class confusions, and finally to low-level excerpt misclassifications. At this lowest level, we show that the pathological misclassifications of these systems argue against the claim that any of them have a capacity to discriminate between and recognize genre based upon musicological principles.

Fig. 1 Annual numbers of publications in MGR separated by which use any form of statistical testing for making comparisons [83]. Overall, about 12% of the MGR literature uses a statistical test

2 Evaluation in Music Genre Recognition Research
Over the past 23 years of MGR research, surprisingly little has been written about evaluation, i.e., experimental design, data, and figures of merit. An experimental design is a method for testing a hypothesis. Data is the material on which a system is tested. A figure of merit describes the confidence in the hypothesis after conducting an experiment. Of three review articles devoted in large part to MGR [7,34,73], only [7] contains a brief paragraph on evaluation. The work in [92] provides a comparison of various figures of merit for music classification. Other works [12,22,23,52,77,93] argue for measuring performance in ways that take into account the natural ambiguity of music genre and similarity.
The work in [22,23,84] argues for richer experimental designs than having a system apply a single label to music with a possibly problematic "ground truth." And Flexer [29] notes and criticizes the absence of formal statistical testing in music information retrieval research, and provides an excellent tutorial based upon MGR for how to apply statistical tests. The review in [83] compiles a near-complete bibliography of MGR (surveying over 400 published works), and focuses specifically upon MGR evaluation. Derived from this review, Fig. 1 shows the annual number of publications concerning MGR, and that formal statistical testing in comparing MGR systems remains rare [29]. Table 1 summarizes the ten experimental designs in the MGR literature, all of which address shades of the hypothesis, "system A recognizes genre X." (Some works use more than one experimental design.) Here we see that the most widely used design by far, Classify, is that of comparing to a "ground truth" the class(es) selected by a system for particular instances of music.

Table 1 Experimental designs of the music genre recognition literature [83]

Design      Description                                                                    % Work
Classify    system classifies music; researcher compares against "ground truth"           92
Generalize  Classify with two or more datasets, and/or various amounts of training data   20
Features    system ranks and/or selects features; researcher inspects features            18
Cluster     system creates clusters or trees of dataset; researcher inspects these        6
Eyeball     system derives parameters from music; researcher visually compares            3
Robust      system classifies music that researcher modifies or transforms in ways
            that do not harm its genre identification by a human                          3
Scale       Classify with varying numbers of genres                                       3
Retrieve    system retrieves music similar to query; researcher compares against query    2
Rules       researcher inspects rules used by a system to identify genres                 1
Compose     system creates music in specific genres; researcher analyzes
            representativeness                                                            0.4
The next most-used experimental design is Generalize. The least-used experimental design, appearing in only two papers [24,84], is having a system compose music that is exemplary of the genres in which it is trained, and testing the representativeness.

Table 2 shows the most used datasets. Overall, 78% of the papers use audio data or features derived from audio data, and 14% use symbolic data. About 20% of work tests MGR systems with two or more datasets (which is the experimental design Generalize). Only 9% of work makes use of an artist or album filter [30,31,43,63]. While more than 50% of the papers use datasets that are not publicly available, the most used public dataset is GTZAN [89], which has recently been formally shown to have replicas, mislabelings, and distortions [82].

Table 3 Figures of merit of the music genre recognition literature [83]. For a single-label system of M classes, Y is the M × M confusion matrix, and N is the number of observations. For a multilabel system, $Z_n$ is the set of true labels of the nth observation, and $Y_n$ is the set of applied labels

Table 3 shows the figures of merit appearing most in the MGR literature. Consider a single-label classifier trained on M classes, and define the M × M confusion matrix Y produced from N observations. Its (i, j)th element $Y_{ij}$ is the number of observations with true label i that the system assigns label j. For a multilabel system [54], define L as the set of all possible labels, and so the ith element of N observations has labels $Y_i \subseteq L$, whereas the system applies $Z_i \subseteq L$. When accuracy appears as a figure of merit, only 22% of the time is it accompanied by variance, standard deviation, or the standard error of the mean. When a confusion table appears as a figure of merit, about 60% of the time it is not accompanied by any kind of musicological reflection.
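The single-label figures of merit discussed above all derive from the confusion matrix. The following is a minimal sketch using the standard textbook definitions of accuracy, recall, precision, and F-measure; the function name and the example matrix are illustrative assumptions, not values from the literature.

```python
def figures_of_merit(Y):
    """Accuracy and per-class recall, precision, F-measure from a confusion matrix.

    Y[i][j] counts observations with true label i that were assigned label j.
    """
    M = len(Y)
    N = sum(sum(row) for row in Y)
    accuracy = sum(Y[i][i] for i in range(M)) / N
    recall, precision, f_measure = [], [], []
    for i in range(M):
        true_i = sum(Y[i])                           # observations truly in class i
        assigned_i = sum(Y[j][i] for j in range(M))  # observations assigned to class i
        r = Y[i][i] / true_i if true_i else 0.0
        p = Y[i][i] / assigned_i if assigned_i else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        recall.append(r)
        precision.append(p)
        f_measure.append(f)
    return accuracy, recall, precision, f_measure


# Hypothetical 3-class confusion matrix (rows: true label, columns: assigned label).
Y = [[8, 1, 1],
     [2, 6, 2],
     [0, 0, 10]]
accuracy, recall, precision, f_measure = figures_of_merit(Y)
```

In this toy example the third class has perfect recall but a precision of only 10/13, which a single accuracy figure would hide.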

3 Three State-of-the-art Systems for Music Genre Recognition
We now present three MGR systems. Two of these (AdaBFFs and SRCAM) are used in [84], but we adjust each one here. We also introduce a new approach (MAPsCAT).

AdaBFFs
AdaBoost with decision trees and bags of frames of features (AdaBFFs) [14,84] combines weak classifiers trained by multiclass AdaBoost [32,74] on bags of frames of features. This approach performed the best in the 2005 MIREX music genre classification task [62]. Multiclass AdaBoost [32,74] creates a strong classifier by counting "votes" cast by weak classifiers given an observation x. Its use for MGR is detailed in [14,84]. Given the labeled features in a training set, iteration l adds a new weak classifier $v_l(x)$ and weight $w_l \in [0, 1]$ to minimize the total prediction error. The weak classifier $v_l(x)$ produces a length-K vector with elements in $\{\pm w_l\}$. A positive element means it favors a class, whereas a negative element means the opposite. After L training steps, our classifier produces a vote vector from its L weak classifiers. For an excerpt of recorded music consisting of a set of features $X := \{x_i\}$, we pick the class associated with the maximum element in the sum of weighted votes (2). We use the "multiboost package" [10] with decision trees as the weak learners and AdaBoost.MH [74] as the strong learner.

The features we use are computed using a sliding Hann window of 46.4 ms and 50% overlap: 40 Mel-frequency cepstral coefficients (MFCCs) [80], zero crossings, mean and variance of the magnitude Fourier transform, 16 quantiles of the magnitude Fourier transform, and the error of a 32nd-order linear predictor. We disjointly partition the set of features into groups of 130 consecutive frames, and then compute for each the means and variances of each dimension. For a 30-s music excerpt, this produces 9 feature vectors of 120 dimensions.
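The bagging step and the excerpt-level decision can be sketched as follows. This is an illustration under stated assumptions, not the implementation used here: the function names are hypothetical, the trailing partial group of frames is assumed to be discarded, the population variance is assumed (the text does not say which estimator is used), and the frames are synthetic placeholders.

```python
from statistics import fmean, pvariance


def bag_frames(frames, bag_size=130):
    """Split frames into disjoint bags; return [means..., variances...] per bag."""
    bags = []
    for start in range(0, len(frames) - bag_size + 1, bag_size):
        bag = frames[start:start + bag_size]
        columns = list(zip(*bag))                  # one tuple per feature dimension
        means = [fmean(c) for c in columns]
        variances = [pvariance(c) for c in columns]  # assumption: population variance
        bags.append(means + variances)
    return bags


def pick_class(vote_vectors):
    """Sum length-K vote vectors over an excerpt's features; return the argmax index."""
    totals = [sum(votes) for votes in zip(*vote_vectors)]
    return max(range(len(totals)), key=totals.__getitem__)


# A 46.4-ms window with 50% overlap gives a 23.2-ms hop over a 30-s excerpt.
n_frames = int((30.0 - 0.0464) / 0.0232) + 1
# 40 MFCCs + zero crossings + 2 spectral moments + 16 quantiles + LP error = 60 dims.
frames = [[float((t * (d + 1)) % 7) for d in range(60)] for t in range(n_frames)]
bags = bag_frames(frames)

# Hypothetical vote vectors for three bags and K = 3 classes.
chosen = pick_class([[0.5, -0.5, -0.5], [-0.2, 0.2, -0.2], [0.4, -0.4, -0.4]])
```

Under these assumptions the arithmetic agrees with the text: 60 per-frame dimensions become 120 after taking means and variances, and a 30-s excerpt yields 9 complete bags of 130 frames.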

SRCAM
Sparse representation classification with auditory temporal modulations (SRCAM) [67,84,85] uses sparse representation classification of long-duration auditory features. This approach is reported to have mean accuracies above 90% [65-67], but those results arise from a flaw in the experiment (private correspondence with Y. Panagakis). Here, as in [84], we modify the approach to produce classification accuracies above 80%. Each feature comes from a modulation analysis of a time-frequency representation, and for a 30-s sound excerpt with sampling rate 22,050 Hz, the feature dimensionality is 768. One can create dictionary atoms by normalizing each feature (mapping all values in each dimension to [0, 1] by subtracting the minimum value and dividing by the largest difference). One can also standardize them, i.e., make all dimensions have zero mean and unit variance.
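The two ways of preparing dictionary atoms described above can be sketched in a few lines. The function names and the toy data are assumptions for illustration; the guard for constant dimensions is an added safeguard not mentioned in the text.

```python
from statistics import fmean, pstdev


def normalize(features):
    """Map each dimension to [0, 1]: subtract the minimum, divide by the range."""
    columns = list(zip(*features))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # guard constant dimensions
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]
            for row in features]


def standardize(features):
    """Give each dimension zero mean and unit variance."""
    columns = list(zip(*features))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]           # guard constant dimensions
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in features]


# Toy feature matrix: three features (rows) of two dimensions (columns).
features = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
normalized = normalize(features)
standardized = standardize(features)
```

In practice the minima, ranges, means, and deviations would be estimated on the training features and then applied unchanged to each test feature.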
Given a matrix of "feature atoms" $D := [d_1 | d_2 | \cdots | d_N]$, and the set of class identities $\cup_{k=1}^{K} I_k = \{1, \ldots, N\}$, where $I_k$ specifies the columns of D belonging to class k, sparse representation classification (SRC) [94] first finds for a feature vector $x'$ (which is the feature x transformed by the same normalization or standardization approach used to create the dictionary) a sparse representation s by solving the optimization problem (3) for $\varepsilon^2 > 0$. SRC then defines the set of weights $S := \{s_k \in \mathbb{R}^N : \forall n \in I_k\ ([s_k]_n = a_n),\ \forall n \notin I_k\ ([s_k]_n = 0),\ k \in \{1, \ldots, K\}\}$, where $a_n = [s]_n$, the nth row of s. Thus, $s_k$ are the weights in s specific to class k.
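For reference, a standard statement of an inequality-constrained sparse coding problem of this kind is basis pursuit denoising. The expression below is offered as an assumption about the form of (3), chosen to be consistent with the symbols D, $x'$, s, and $\varepsilon^2$ used here; the dummy variable z is introduced only for the minimization.

```latex
\begin{equation}
  s := \arg\min_{z \in \mathbb{R}^N} \|z\|_1
  \quad \text{subject to} \quad \|x' - D z\|_2^2 \le \varepsilon^2 .
\end{equation}
```

Smaller values of $\varepsilon^2$ force a closer reconstruction of $x'$ at the cost of a less sparse s.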
We gauge the confidence of SRC by comparing the class-dependent errors. To this end, we define the "confidence" of SRCAM for assigning class k to $x'$ as in (5), where $J_k := \|x' - D s_k\|_2^2$. Thus, $C(k|x') \in [0, 1]$ where 1 is certainty.
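A sketch of how class-dependent errors can be turned into a confidence in [0, 1] follows. The error $J_k$ is computed as defined above; the normalization in `confidences` is an assumption (one form that satisfies the stated range and gives 1 when only one class reconstructs the feature better than the worst class), and the dictionary, weights, and function names are hypothetical.

```python
def class_errors(x, atoms, s, class_columns):
    """J_k = ||x - D s_k||^2, where s_k keeps only the weights of class k's atoms.

    atoms[n] is the n-th column of D; class_columns[k] lists the atom indices
    that belong to class k.
    """
    errors = []
    for cols in class_columns:
        approx = [sum(s[n] * atoms[n][d] for n in cols) for d in range(len(x))]
        errors.append(sum((xd - ad) ** 2 for xd, ad in zip(x, approx)))
    return errors


def confidences(errors):
    """ASSUMED form: (max_j J_j - J_k) / sum_l (max_j J_j - J_l)."""
    worst = max(errors)
    gaps = [worst - j for j in errors]
    total = sum(gaps)
    if total == 0:                      # all classes reconstruct equally well
        return [1.0 / len(errors)] * len(errors)
    return [g / total for g in gaps]


# Hypothetical dictionary of three atoms, one per class, and a sparse weight vector.
atoms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
s = [0.9, 0.0, 0.1]
J = class_errors([1.0, 0.0], atoms, s, class_columns=[[0], [1], [2]])
C = confidences(J)
```

With this normalization the confidences over all classes sum to one, so they can be compared across excerpts.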

MAPsCAT
Maximum a posteriori classification of scattering coefficients (MAPsCAT) uses the novel features proposed in [56]. The use of these features for MGR is first proposed in [4]. We use scattering coefficients within a Bayesian framework, and achieve accuracies on par with those reported in [4], and quite close to those of SRCAM. Bayesian classification seeks to minimize expected risk given the observation x. Assuming the costs of all misclassifications are the same, and that all classes are equally likely, the Bayesian classifier becomes the maximum a posteriori (MAP) classifier [87], which selects the class k maximizing $P[x|k]P(k)$, where $P[x|k]$ models the observations for class k, and $P(k)$ is the prior of class k. We assume $P[x|k] \sim \mathcal{N}(\mu_k, C_k)$, i.e., the observations from class k are distributed multivariate Gaussian with mean $\mu_k$ and covariance $C_k$. We may also assume every class is distributed with the same covariance, i.e., $P[x|k] \sim \mathcal{N}(\mu_k, C)$. With several features from a music excerpt $X := \{x_i\}$, we assume independence between the features, and pick the class of X that maximizes the log posterior (7).

Scattering coefficients are attractive features because they are designed to be invariant to particular transformations, such as translation and rotation [56]. They also preserve distances between stationary processes, and embody both large- and short-scale structures. One computes these features by convolving the modulus of successive wavelet decompositions with the scaling wavelet. We use the scatterbox implementation [5] with a second-order decomposition, filter q-factor of 16, and a maximum scale of 160. For a 30-s sound excerpt with sampling rate 22,050 Hz, this produces 40 feature vectors of dimension 469. We estimate each class mean and covariance using unbiased minimum mean-squared error estimators on the training set.
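The MAP decision over the features of an excerpt can be illustrated with a small sketch. It makes one loud simplification: each class is modeled with a diagonal covariance (per-dimension variances), whereas the text uses full class-dependent or total covariance matrices. Function names and the toy training data are hypothetical.

```python
import math


def fit_class(features):
    """Per-dimension mean and unbiased variance (DIAGONAL-covariance simplification)."""
    n = len(features)
    columns = list(zip(*features))
    mean = [sum(c) / n for c in columns]
    var = [sum((v - m) ** 2 for v in c) / (n - 1) for c, m in zip(columns, mean)]
    return mean, var


def log_likelihood(x, mean, var):
    """log N(x; mean, diag(var))."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xd - m) ** 2 / v)
               for xd, m, v in zip(x, mean, var))


def map_classify(excerpt, models, priors):
    """Pick the class maximizing log P(k) + sum_i log P[x_i | k] over an excerpt."""
    scores = {k: math.log(priors[k])
                 + sum(log_likelihood(x, *models[k]) for x in excerpt)
              for k in models}
    return max(scores, key=scores.get), scores


# Toy two-class training set with two-dimensional features.
training = {
    "A": [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
    "B": [[5.0, 5.0], [6.0, 6.0], [5.0, 6.0], [6.0, 5.0]],
}
models = {k: fit_class(v) for k, v in training.items()}
priors = {"A": 0.5, "B": 0.5}
label, scores = map_classify([[0.4, 0.6], [0.7, 0.2]], models, priors)
```

Summing the log likelihoods over the features of an excerpt is what the independence assumption buys: the excerpt-level score is a sum of feature-level scores plus the log prior.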

4 Analyzing the Behaviors of MGR Systems from High to Low Specificities
As seen in Tables 1 and 3, at least 92% of the published evaluations of MGR systems use Classify as the experimental design, and at least 81% use accuracy as the figure of merit. With MGR system accuracies reportedly above 80%, and some over 90% [19,38,65-68], it appears that something must be working, but is that "something" genre recognition? In this section, we evaluate each system above using the Classify experimental design, but unlike most work we analyze the behaviors of the three systems down to the music excerpts themselves. We use the GTZAN dataset [82,89] for three reasons: 1) it is the publicly available dataset most used in MGR research [83]; 2) it is used in the works proposing AdaBFFs [14], SRCAM [67,84], and the features of MAPsCAT [4]; and 3) because its contents and faults are now known [82], we can address its problems on a case-by-case basis.
We test each system with stratified 10-fold cross-validation (equal priors), but conduct 10 independent trials to measure the variability of results due to random partitioning of the dataset. For each test fold, we test the systems using the same training and testing data. Every music excerpt is thus classified ten times by each system trained on the same data. For AdaBFFs, we run AdaBoost for 4000 iterations, and test both decision trees of 1 node or no node (stumps). For SRCAM, we test both standardized and normalized features, and solve its inequality-constrained optimization problem (3) for $\varepsilon^2 = 0.01$ using SPGL1 [13] with at most 200 iterations. For MAPsCAT, we test systems trained with either class-dependent covariances or total covariance (covariance of the training data).
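The partitioning protocol can be sketched as below: stratified 10-fold cross-validation, repeated over 10 independent random partitions. The function name, the seeding scheme, and the placeholder labels are assumptions; the layout of 10 genres with 100 excerpts each matches GTZAN.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each index to a fold so every fold holds an equal share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# GTZAN-like layout: 10 genres with 100 excerpts each (genre ids are placeholders).
labels = [genre for genre in range(10) for _ in range(100)]
# One independent random partition per trial; all systems would share these folds.
trials = [stratified_folds(labels, seed=trial) for trial in range(10)]
```

Each fold serves once as the test set while the other nine form the training set, so every excerpt is classified exactly once per trial and ten times overall.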

Analyzing Classification Accuracy
As discussed in Section 2, classification accuracy appears in 81% of the MGR literature as a figure of merit for the performance of systems in recognizing genres. In their review of several MGR systems, Fu et al. [34] compare performance using only classification accuracy. The works proposing AdaBFFs [14], SRCAM [67], and the features of MAPsCAT [4] present only classification accuracy. Furthermore, Seyerlehner et al. [77] argue that the gap between classification by MGR systems and humans is narrowing based only on classification accuracy.
For each of these systems (reviewed in Section 3), Table 4 shows the mean classification accuracies with 95% confidence intervals, and the p-values of paired t-tests between the two settings of each system. For each system, we can reject the null hypothesis that the difference between the mean accuracies of its two settings is due to chance. The difference in mean accuracies between SRCAM with normalized features and MAPsCAT with total covariance is also statistically significant (p < 0.001). The low mean accuracy for MAPsCAT with class-dependent covariance is due to a lack of training data for some classes for estimating covariance matrices from high-dimensional features.
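A mean accuracy with a 95% confidence interval, and a paired t-test between two settings, can be computed as in the sketch below. It assumes one accuracy per trial for each of 10 trials (hence 9 degrees of freedom and the tabulated critical value 2.262); the accuracies are invented for illustration and are not the results of Table 4.

```python
import math
from statistics import fmean, stdev

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom


def mean_with_ci(values, t_crit=T_CRIT_95_DF9):
    """Mean and half-width of the 95% confidence interval (t_crit assumes 10 values)."""
    half_width = t_crit * stdev(values) / math.sqrt(len(values))
    return fmean(values), half_width


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return fmean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-trial accuracies for two settings of one system.
setting_a = [0.80, 0.79, 0.81, 0.80, 0.78, 0.80, 0.81, 0.79, 0.80, 0.82]
setting_b = [0.77, 0.78, 0.78, 0.77, 0.76, 0.78, 0.77, 0.77, 0.78, 0.78]
mean_a, ci_a = mean_with_ci(setting_a)
t_stat = paired_t_statistic(setting_a, setting_b)
significant = abs(t_stat) > T_CRIT_95_DF9
```

The test is paired because both settings are evaluated on the same partitions in each trial, so the per-trial differences remove the variability due to partitioning.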

Analyzing Recall, Precision, and F-measure
The figures of merit recall, precision and the F-measure (see Table 3) are more specific than accuracy, and appear infrequently in the MGR literature. From the observation that their experimental recalls for the Classical- and Rock-labeled excerpts of GTZAN are above that expected from guessing randomly, Wu et al. [95] conclude on the relevance of their features to MGR. With respect to precision, Lin et al. [51] conclude their system is better than another. When it is reported, the F-measure often only accompanies other figures of merit, e.g., recall, precision and accuracy [50]. Figure 2 shows the recalls, precisions, and F-measures for AdaBFFs, SRCAM, and MAPsCAT. We see for the GTZAN [...] highest mean recall (0.76 ± 0.01, standard deviation) of all systems ($p < 2 \cdot 10^{-5}$). Since high recall can come at the price of many false positives, we can look at the precision. This is the case for GTZAN Country excerpts with MAPsCAT using class-dependent covariance, which shows a high recall but very low precision. However, we see that for GTZAN Disco excerpts, MAPsCAT has the two highest mean precisions: 0.96 ± 0.01 for class-dependent covariance ($p < 3 \cdot 10^{-9}$), and 0.80 ± 0.01 with total covariance (p < 0.01). When it comes to GTZAN Classical excerpts, MAPsCAT using class-dependent covariance has perfect recall; and using class-dependent covariance it shows quite high mean precision (0.85 ± 0.01). The F-measure combines recall and precision to reflect class accuracy, where 1 is perfect. We see that AdaBFFs is the most accurate at classifying GTZAN Classical ($p < 8 \cdot 10^{-7}$), and one of the least accurate at classifying GTZAN Disco excerpts.

Analyzing Class-specific Confusions
Confusion tables are reported in 31% of MGR work, of which only 40% discuss them in ways other than repeating what is shown by the table [83]. Sometimes, a confusion table is accompanied by a discussion of how a system appears to perform in ways that make sense with respect to what experience and musicology say about the variety of influences and commonalities between particular genres, e.g., [1, 2, 21, 39, 40, 70-72, 86, 89-91, 96]. For instance, Tzanetakis and Cook [89] write that the misclassifications of their system "... are similar to what a human would do. For example, classical music is misclassified as jazz music for pieces with strong rhythm from composers like Leonard Bernstein and George Gershwin. Rock music has the worst classification accuracy and is easily confused with other genres which is expected because of its broad nature." Of their confusion results, Holzapfel and Stylianou [39] write, "In most cases, misclassifications have musical sense. For example, the genre Rock ... was confused most of the time with Country, while a Disco track is quite possible to be classified as a Pop music piece. ... [The] Rock/Pop genre was mostly misclassified as Metal/Punk. Genres which are assumed to be very different, like Metal and Classic, were never confused."

Figure 3 shows the mean confusions, recalls (diagonal), precisions (right), and F-measures (bottom), all with 95% confidence intervals, for AdaBFFs, SRCAM, and MAPsCAT. We see that all systems confuse the GTZAN Rock excerpts most with other genres: with Country and Disco using AdaBFFs, with Metal using SRCAM, and with Blues using MAPsCAT. MAPsCAT with total covariance confuses no pair of classes by more than 9 ± 0.51% other than Rock as Blues and Disco as Rock, while the largest confusion for SRCAM with normalized features is 15.3 ± 1.28% for Rock as Metal, and for AdaBFFs with one-node trees is 12.20 ± 0.82% for Hip hop as Reggae.

4.4 Analyzing Excerpt-Specific Confusions
Some MGR evaluations describe particular misclassifications, e.g., [26,45,47,73]. Of their experiments, Deshpande et al. [26] write "... at least in some cases, the classifiers seemed to be making the right mistakes. There was a [classical] song clip that was classified by all classifiers as rock ... When we listened to it, we realized that the clip was the final part of an opera with a significant element of rock in it. As such, even a normal person would also have made such an erroneous classification." Of the confusion [...] "... when listening to the audio files. In the same way, 14.71% of the blues examples were considered as rock by the algorithm."

Fig. 4 GTZAN Disco excerpt confusions for each system, with number of classifications in each genre labeled at right. Panel (c): MAPsCAT with total covariance. Classes as in Fig. 2

Figure 4 shows how the GTZAN Disco excerpts are classified by AdaBFFs, SRCAM, and MAPsCAT over all trials.
For lack of space, we only look at these excerpts, and herein only consider the setting that shows the best classification accuracy (Table 4): AdaBFFs with one-node decision tree; SRCAM with normalized features; and MAPsCAT with total covariance. Unlike in Fig. 3, we can see here the specific excerpts that AdaBFFs most often misclassifies as Pop and Rock, that SRCAM most often misclassifies as Reggae and Rock, and that MAPsCAT most often misclassifies as Rock and Hip hop. We can also see particular excerpts that are misclassified by the systems in all trials, i.e., GTZAN Disco excerpts 20, 27, 41, 47, and 85.
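Finding the excerpts that a system misclassifies in every trial is a simple bookkeeping task, sketched below. The data structure, the function name, and the example labels are hypothetical; the split between "always the same wrong label" and "varying wrong labels" is an illustrative distinction drawn here.

```python
from collections import Counter


def misclassified_in_all_trials(predictions, true_label):
    """Map excerpt id -> Counter of assigned labels, for excerpts never labeled correctly.

    predictions[excerpt_id] is the list of labels assigned over the trials.
    """
    return {excerpt: Counter(labels)
            for excerpt, labels in predictions.items()
            if true_label not in labels}


# Hypothetical labels over 10 trials for three Disco excerpts (not the paper's results).
predictions = {
    20: ["Pop"] * 10,                 # always wrong, always the same label
    27: ["Rock"] * 6 + ["Pop"] * 4,   # always wrong, but with varying labels
    30: ["Disco"] * 9 + ["Pop"],      # wrong in only one trial
}
always_wrong = misclassified_in_all_trials(predictions, "Disco")
same_label_every_time = [e for e, counts in always_wrong.items() if len(counts) == 1]
```

Keeping the per-excerpt label counts, rather than only a class-level confusion table, is what makes it possible to go back and listen to the specific recordings involved.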
Here we see that of the GTZAN Disco excerpts, AdaBFFs produced 55 C3s, 15 CMs, 5 PMs, and it consistently misclassified 11 excerpts as Disco. We see that, in total, MAPsCAT has the highest and AdaBFFs the lowest number of C3s and CMs.
We can ask about the relative confidence of a system between a CM and a C3. For instance, is the value (2) for AdaBFFs larger for its CMs than for its C3s? This amounts to comparing the votes (2) for the CMs and C3s in the GTZAN excerpts. We plot in Fig. 5 the statistics of (2) for AdaBFFs, (5) for SRCAM, and (7) for MAPsCAT, for only the GTZAN Disco excerpts. The left-most portion of each subfigure is of the CMs of Table 5; and the right-most portion is from the C3s. The middle portion is of those GTZAN excerpts not labeled Disco, but that each system consistently misclassifies as Disco (CMs as Disco). The gray horizontal line is the mean value of the C3s for a system; and the vertical gray line marks one standard deviation above and below the mean. Figure 5(a) shows that for AdaBFFs the votes (2) of most Disco CMs and CMs as Disco are indistinguishable from those of the Disco C3s, even though the majority of them lie under the mean of the C3s. They all are within two standard deviations of the mean. Figure 5(b) shows that the mean confidences (5) of all Disco CMs and CMs as Disco are indistinguishable from those of the Disco C3s. Most lie below the mean, but all are well within one standard deviation. Figure 5(c) shows that the mean log posteriors (7) of most Disco CMs and CMs as Disco are indistinguishable from those of the Disco C3s. About as many lie above the mean as below it, and all but one are within two standard deviations. These results point to the idea that AdaBFFs, SRCAM, and MAPsCAT are as confident in their C3s as they are in their pathological misclassifications.
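The comparison read off Fig. 5 amounts to expressing each decision statistic of a misclassified excerpt in standard deviations from the mean of the C3 statistics. A minimal sketch, with invented numbers and a hypothetical function name:

```python
from statistics import fmean, stdev


def deviations_from_reference(reference_stats, other_stats):
    """Express each value in other_stats in standard deviations from the reference mean."""
    mu, sigma = fmean(reference_stats), stdev(reference_stats)
    return [(value - mu) / sigma for value in other_stats]


# Hypothetical decision statistics (not the paper's data).
c3_stats = [0.62, 0.70, 0.58, 0.66, 0.74, 0.60, 0.68, 0.64]
cm_stats = [0.59, 0.63, 0.71]
z = deviations_from_reference(c3_stats, cm_stats)
within_two_sd = all(abs(v) <= 2 for v in z)
below_mean = sum(v < 0 for v in z)
```

If the statistics of the misclassified excerpts fall well inside the spread of the C3 statistics, as in this toy example, the decision statistic alone cannot separate correct from pathological decisions.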

Analyzing Consistently Misclassified Excerpts
So far, our evaluation has used statistical tests and rough discussions of genre labels, and has mentioned specific excerpt numbers, but has yet to mention and use the actual music embodied by any excerpt. It is in fact quite extraordinary to find in the MGR literature any identification of the music behind problematic classifications. Langlois and Marques [45] notice in their system evaluation that all tracks from an album by Bossa Nova artist João Gilberto are PMs. They attribute this to the tracks coming from a live recording with speaking and applause. We see that the system by Lee et al. [47] misclassifies John Denver's "Rocky Mountain High" as Techno, but they do not discuss this problematic result.

Fig. 5 Mean for Disco C3s shown as gray line with one standard deviation above and below. Classes as in Fig. 2

In Table 6, we list the specific excerpts of each pathological classification of AdaBFFs, SRCAM, and MAPsCAT for only the Disco class. We take into account that GTZAN has among its Disco excerpts: six replicas, two coming from the same recording, and seven conspicuous and three contentious mislabelings [84]. Furthermore, the genre of Country excerpt 39 is none of the 10 labels in GTZAN, and so we do not consider it here [84]. We strike through these problematic excerpts. We see that, as in Table 5, MAPsCAT has the highest number of C3s and CMs, and AdaBFFs has the lowest. All systems share a common CM and CM as Disco.

Table 6 Classification type results for each system: C3s, CMs and PMs for GTZAN Disco excerpts; and CMs as Disco. We take into account the problems of GTZAN [84], and strike through particular excerpts
From the GTZAN track listing produced in [84], Tables 7 and 8 list the music artist and title of the GTZAN Disco excerpt CMs and the CMs as Disco of AdaBFFs, SRCAM, and MAPsCAT. Even though each system commits CMs in almost all the other genres of GTZAN (Table 5), we show only the CMs of the GTZAN Disco excerpts, and CMs as Disco, for lack of space. We list the statistics associated with each system decision, and up to three top last.fm tags (ranked by the "count" parameter) of each song or artist (retrieved from last.fm on Oct. 15, 2012). We do not include tags that are the artist or song name; and when a song has no tags associated with it, we take the top tags for the artist. We define the "confidence" of a classification as the difference between the score ((2) for AdaBFFs, (5) for SRCAM, and (7) for MAPsCAT) of the selected class and that of the "correct" Disco class. This differs from Fig. 5, where we show the decision statistic but not a difference.

Table 7 Details of Disco CMs from Table 6. An excerpt number marked by * means the classifier "confidence" is larger than that of the Disco CM marked by † (p < 0.041)

Table 7 lists the class consistently selected by each system. We see the tag "disco" as a top tag for all eight identified Disco CMs of AdaBFFs. We see AdaBFFs consistently misclassifies as Pop excerpts 25 and 34, both by Evelyn Thomas. The mean of the differences between the votes for Pop and those for Disco is larger for excerpt 25 than it is for 34, but we find from a paired t-test that we cannot reject the null hypothesis that one is not larger than another (p > 0.07). However, with respect to excerpt 34, we can reject such a null hypothesis for excerpts 13, 15, 18, 64, 83 and 86 (p < 0.041), and thus consider these classifications to be "confident." In other words, if we assume excerpt 34 is a borderline classification in all trials, then the other six having larger votes are not borderline.
Of those six, only one (13) shares a tag matching the class given by AdaBFFs. The consistent misclassifications of "Funkytown" by Lipps, Inc. as Reggae, and of Alicia Bridges' "I Love the Night Life" as Blues, seem quite odd, as they do not sound similar to the GTZAN excerpts exemplifying each category, i.e., Bob Marley, Dennis Brown, Burning Spear, and Gregory Isaacs making more than 50% of the Reggae excerpts; and Robert Johnson, John Lee Hooker, Stevie Ray Vaughan, and Magic Slim making more than 50% of the Blues excerpts [82].
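The tag selection rule used for Tables 7 and 8 (up to three top tags by count, excluding the artist or song name, falling back to artist tags when the song has none) can be written down compactly. The function name, the data layout, and the tag counts below are hypothetical; nothing here queries last.fm.

```python
def top_tags(song_tags, artist_tags, artist, title, limit=3):
    """Up to `limit` tags by descending count, excluding the artist or song name.

    song_tags and artist_tags are lists of (tag, count) pairs; artist tags are
    used only when the song has no tags at all.
    """
    candidates = song_tags if song_tags else artist_tags
    excluded = {artist.lower(), title.lower()}
    kept = [(tag, count) for tag, count in candidates if tag.lower() not in excluded]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept[:limit]]


# Hypothetical tag counts, invented for illustration.
song_tags = [("disco", 100), ("Lipps, Inc.", 40), ("pop", 35), ("dance", 60), ("funk", 20)]
tags = top_tags(song_tags, [], artist="Lipps, Inc.", title="Funkytown")
fallback = top_tags([], [("pop", 9), ("dance", 12)],
                    artist="Evelyn Thomas", title="High Energy")
```

Comparing such tags with the label a system selects gives a rough, crowd-sourced check on whether a misclassification is musically defensible.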
Of the nine identified Disco CMs of SRCAM, the tag "disco" exists for seven. Excerpt 25 has the smallest mean confidence difference among the CMs, and we find the mean differences for excerpts 12, 79 and 91 are not larger (p > 0.09), but that of excerpt 86 is (p < 0.005). With respect to excerpt 86 then, we can reject the null hypothesis that its mean difference is not smaller than those of excerpts 2, 11, 39, 64, and 84 (p < 0.013), and thus consider these classifications to be confident. Of those five, only one (11) shares a tag matching the class given by SRCAM. SRCAM, like AdaBFFs, consistently misclassifies as Reggae "Funkytown" by Lipps, Inc.
Of the nine identified Disco CMs of MAPsCAT, the tag "disco" exists for six. The smallest log posterior difference occurs for excerpt 87, which we find is not smaller than those of excerpts 16 and 73 (p > 0.1). With respect to excerpt 15, we can reject the null hypothesis that its mean log posterior difference is not smaller than those of excerpts 00, 11, 67, 72, and 86 (p < 0.016), and thus consider these classifications as confident. Of these five, only one (11) shares a tag matching the class given. MAPsCAT consistently misclassifies "Lowdown" by Boz Scaggs as Hip hop, and ABBA's "Mamma Mia" as Metal. These are odd considering the composition of the majority of each class in GTZAN, i.e., excerpts by Beastie Boys, A Tribe Called Quest, and Public Enemy make more than 56% of the Hip hop excerpts; and Metallica, Dark Tranquillity, Iron Maiden, Black Sabbath, Anthrax, Dio, Motörhead, Rage Against The Machine, and New Bomb Turks make more than 50% of the Metal excerpts [82].
Of the CMs as Disco by all three systems, Table 8 shows none of the tags of the music or artist contains "disco," and only a few contain Disco-relatable tags, such as "dance" and "70s." For those CMs as Disco of AdaBFFs, we see that the mean vote difference for Rock excerpt 37 is the smallest, and all others are significantly larger (p < 0.015). For SRCAM, the mean differences in confidence for Pop excerpt 86 and Rock excerpts 38 and 77 are the smallest, but those for Hip hop 00 and Reggae 88 are significantly larger ($p < 2 \cdot 10^{-6}$). For MAPsCAT, the mean log posterior difference for Pop excerpt 02 is the smallest, and all others are significantly larger (p < 0.013).
Common to all three systems are one CM and one CM as Disco. All systems consistently misclassify Disco excerpt 86, "I Love the Night Life" by Alicia Bridges from 1978: AdaBFFs confidently labels it Blues, SRCAM labels it Hip hop, and MAPsCAT confidently labels it Rock. These labels do not agree well with the tags for the song or the artist. The common CM as Disco is Hip hop 00, which is Afrika Bambaataa's "Looking for the Perfect Beat" from 1982. The insistence of all systems that this excerpt is Disco might be seen as forgivable, since early Hip hop used Disco records as background for rapping [78], but such a claim assumes these systems learn that fact from the 90 GTZAN Hip hop excerpts in every cross-validation fold.
We also see more generally in Tables 7 and 8 that all systems misclassify several GTZAN Disco excerpts as Pop, and GTZAN Pop excerpts as Disco. We might also see these as forgivable, since much music we now call "disco" was in fact part of the "popular" charts in the late 1970s [78]. Furthermore, the two excerpts of "High Energy" and "Reflections" by Evelyn Thomas come from 1984 and 1985, respectively, which is at least five years after "disco died" in 1979 in the USA [78]. Hence, a better single label for these particular excerpts is Pop. Aside from these, some consistent misclassifications appear quite unsatisfactory. For instance, AdaBFFs and SRCAM consistently misclassify as Disco "Knockin' on Heaven's Door" by Guns N' Roses; and MAPsCAT consistently misclassifies as Disco "Sally Let Your Bangs Hang Down" sung by Merle Haggard, and misclassifies as Metal "Mamma Mia" by ABBA.

Analyzing by Listening Tests
The question thus arises: to what extent do humans show the same kinds of classification behavior as AdaBFFs, SRCAM, and MAPsCAT? By and large, the MGR literature is concerned with the performance of algorithms, pigeons [69], and fish [20]. There have, however, been a few studies made of human music genre classification. One of the most widely cited [8] is the work of Gjerdingen and Perrott [36], which studies human genre recognition capacity as a function of excerpt length. Krumhansl [44] and Mace et al. [55] expand upon this in several directions. Ahrendt et al. [2,3] and Meng et al. [60,61] both use listening tests to gauge the difficulty of discriminating the genres of their genre datasets, and to evaluate their systems' performance. Lippens et al. [52] use listening tests to produce a music genre dataset that has excerpts more exemplary of single genres; and Craft et al. [22,23] address a fundamental problem of that work, proposing that evaluating MGR systems makes sense only with respect to the generic ambiguity of music. Seyerlehner et al. [77] reproduce and expand upon these works. In a novel direction, Guaus [38] conducts listening tests to determine the relative importance of timbre or rhythm in genre recognition. Finally, Cruz and Vidal [24] and Sturm [84] conduct listening tests to determine if a system can create music using the genres it has learned to identify.
These latter works motivate the use of listening tests to circumvent the need to demarcate the stylistic elements (assuming they exist to a large extent in the acoustic realm [93]) required to formally justify whether it is more appropriate, e.g., to label an excerpt Disco or Metal, neither or both. Our hypothesis is that for some excerpts the difference between the pair of genre labels given by a human and an MGR system will be large enough that it is extremely clear to listeners familiar with the genres which label is given by a human. For other pairs of genre labels for some excerpts, the difference will be small enough that listeners familiar with the genres can make no real distinction. This boils down to something like a Turing test to determine whether the consistent misclassifications of AdaBFFs, SRCAM, or MAPsCAT are appropriate. With this approach, we can circumvent the comparisons between tags, classes, and "ground truth" labels, and also the task of having to define genres in ways that allow us to, e.g., challenge the labeling of Alicia Bridges' "I Love the Night Life" as using the Blues, Hip hop, and/or Rock genres.
To approach our hypothesis, we conduct a listening test in which a subject must choose for each 12 second excerpt which label of two was given by a human (i.e., Tzanetakis); the other label was given by the computer. The experiment has two parts, both facilitated by GUIs built in MATLAB. In the first part, we screen subjects for their ability to distinguish between the ten genres in GTZAN. (The representative GTZAN excerpts we use are: Blues 05, John Lee Hooker, "Sugar Mama"; Classical 96, Vivaldi, "The Four Seasons, Summer, Presto"; Country 12, Billy Joe Shaver, "Music City"; Disco 66, Peaches and Herb, "Shake Your Groove Thing"; Hip hop 47, A Tribe Called Quest, "Award Tour"; Jazz 19, Joe Lovano, "Birds Of Springtimes Gone By"; Metal 11, unknown; Pop 95, Mandy Moore, "Love you for always"; Reggae 71, Dennis Brown, "Big Ships"; Rock 37, The Rolling Stones, "Brown Sugar.") A subject correctly identifying the genre of all excerpts continues to the second part of the test, where s/he must discriminate between the human- and algorithm-given genres for each music excerpt. For instance, the test application presents the excerpt of Donna Summer's "Back Off Boogaloo" along with the two labels "Disco" and "Pop." The subject must select the one s/he thinks is given by a human before proceeding to the next excerpt. We also record the time a subject spends listening to an excerpt before proceeding to the next one. We test all unique Disco CMs and CMs as Disco in Tables 7 and 8. In total, 24 test subjects completed the second part.
Fig. 6 (a) Disco CMs in Table 7. The class selected by each system is marked by the symbol shown in the legend. (b) CMs as Disco in Table 8. Classes as in Fig. 2
Figure 6 shows the choices made by subjects for all Disco CMs, and CMs as Disco for each MGR system. Figure 6(a) shows that for the nine Disco CMs of AdaBFFs, a majority of subjects sided with the non-human class in two cases: excerpts 13 and 34. In one case, excerpt 39, no subject chose the class given by AdaBFFs. In no case for the 10 Disco CMs of SRCAM did a majority of subjects pick the non-human class; and for four excerpts (2, 12, 86, and 91) no subject chose the class given by SRCAM. For the eleven Disco CMs of MAPsCAT, only for excerpt 10 did a majority of subjects choose the non-human class; and no subject chose the MAPsCAT class for three excerpts: 67, 72, and 73. In Fig. 6(b), we see that of the ten CMs as Disco of AdaBFFs, and of the twelve of MAPsCAT, in no case did a majority of subjects select "Disco." Of the five CMs as Disco of SRCAM, we see for two excerpts (Reggae 88 and Rock 77) a majority of subjects chose "Disco."
Now we test the null hypothesis that the subjects are unable to recognize the difference between the genre label given by a human and the class selected by AdaBFFs, SRCAM, or MAPsCAT. We can consider the outcome of each trial as a Bernoulli random variable with parameter x (the probability of a subject selecting the label given by a human). For a given excerpt for which the human label is selected h times by N independent subjects, we can estimate the Bernoulli parameter x using the minimum mean-squared error estimator, assuming x is uniformly distributed on [0, 1]: x̂(h) = (h + 1)/(N + 2) [81]. The variance of this estimate is given in [81].
So, what is it about Disco excerpts 13, 15 and 18 that made subjects divided between the labels "Disco" and "Pop," and choose more often "Pop" for Disco excerpts 10 and 34? Many subjects who passed the screening mentioned in post-test interviews that the most challenging pair of tags was "Disco" and "Pop." When asked what cues they used to make the choice, many could not state specifics, referring instead to the "feel" of the music. Some said they decided based upon whether the excerpt sounded "old" or "more produced."
Hence, it is reasonable to believe that whatever makes something Disco but not Pop is unclear without further specification, e.g., "Pop like Britney Spears" and "70s Disco." In these cases then, we might as well conclude that AdaBFFs and MAPsCAT are classifying appropriately. Some subjects were also dissatisfied with some label pairs, e.g., "Metal" and "Disco" for ABBA's "Mamma Mia," because in their opinion ABBA is Pop.
In the case of Disco excerpt 11, subjects were divided between "Disco" and "Rock." When asked in the post-test interview about how quickly they made each selection, many subjects said they were quite quick, e.g., within the first few seconds. Some mentioned that they changed their answers after listening to some of the excerpts longer; and a few subjects said that they made sure to listen beyond what sounded like the introduction. We thus look at the duration each subject spent listening to Disco excerpt 11 before proceeding. We find that the difference in listening time between subjects who selected "Rock" (8.5 ± 1.2 s, with 95% confidence interval) and those who selected "Disco" (7.9 ± 1.1 s) is not statistically significant (p > 0.48). However, for Hip hop excerpt 00, the difference between the mean listening durations of subjects who selected "Disco" (4.9 ± 1.1 s) and those who selected "Hip hop" (9.5 ± 1.6 s) is significant (p < 6 · 10⁻⁵). Apparently, many subjects hastily chose the label "Disco," which brings up the question of whether the genres used by an entire piece of music apply to its parts [54]. In these cases, then, we can conclude that SRCAM and MAPsCAT are classifying appropriately.
Finally, there are Reggae 88 and Rock 77, for which subjects selected significantly more often the non-human class. In the first case, it is clear that people do not agree with the Tzanetakis label. "Electric Boogie" by Marcia Griffiths is quite unlike the majority of the GTZAN Reggae excerpts, even though one of its top tags is "reggae." Hence, we can consider as appropriate this misclassification by SRCAM. For the Rock 77 excerpt, since we do not find a significant difference in listening times (p > 0.6), we can regard SRCAM as classifying appropriately. All other CMs and CMs as Disco, however, are not appropriate: for AdaBFFs, 5 of its 9 CMs and 9 of its 10 CMs as Disco; for SRCAM, 9 of its 10 CMs and 3 of its 6 CMs as Disco; and for MAPsCAT, 8 of its 11 CMs and 13 of its 14 CMs as Disco.
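As an illustration of the estimator used in the listening test analysis, the following is a minimal sketch, not the analysis code behind the reported results. It computes the minimum mean-squared error estimate x̂(h) = (h + 1)/(N + 2) under a uniform prior, together with the variance of the corresponding Beta(h + 1, N − h + 1) posterior, x̂(1 − x̂)/(N + 3), which is a standard result assumed here rather than taken from the text. The function names, the counts h in the usage loop, and the one-sided exact binomial test against x = 0.5 are all assumptions made for illustration; only N = 24 comes from the text.

```python
from fractions import Fraction
from math import comb


def mmse_estimate(h, n):
    """Posterior-mean (MMSE) estimate of x under a uniform prior on [0, 1]."""
    return Fraction(h + 1, n + 2)


def posterior_variance(h, n):
    """Variance of the Beta(h + 1, n - h + 1) posterior (standard result, assumed here)."""
    x_hat = mmse_estimate(h, n)
    return x_hat * (1 - x_hat) / (n + 3)


def binomial_tail(h, n, x0=0.5):
    """P(H >= h) for H ~ Binomial(n, x0).

    A one-sided exact test of "subjects cannot tell the labels apart";
    this choice of test is an assumption, not necessarily the one used in the paper.
    """
    return sum(comb(n, k) * x0**k * (1 - x0) ** (n - k) for k in range(h, n + 1))


if __name__ == "__main__":
    n_subjects = 24  # number of subjects completing the second part
    for h in (12, 18, 24):  # hypothetical counts of "human label" selections
        x_hat = mmse_estimate(h, n_subjects)
        var = posterior_variance(h, n_subjects)
        p = binomial_tail(h, n_subjects)
        print(f"h={h}: x_hat={float(x_hat):.3f}, var={float(var):.5f}, p={p:.3g}")
```

Using exact fractions for the estimate and its variance avoids rounding in the small-sample case; the tail probability is computed in floating point, which is adequate for N = 24.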

Conclusion
While genre is an inescapable result of human communication [33], it is also ambiguous [22,23,93] as humans do not always agree, e.g., [2,3,22,23,36,52,61,77]. A major conundrum in the evaluation of MGR systems is thus the formal justification of why a particular label is better than another. For instance, while I deride the misclassification above, an argument might be made that ABBA's "Mamma Mia" employs some of the same stylistic elements used by Motörhead in "Ace Of Spades," though it is difficult to imagine the audiences of the two would perceive that to be the case. The matter of evaluating MGR systems would be quite simple if only we had a checklist of essential, or at least important, attributes for each genre. Barbedo and Lopes [9] provide a long list of such attributes in each of several genres and sub-genres, e.g., Light Orchestra Instrument Classical is marked by "light and slow songs ... played by an orchestra" and has no vocal element (like J. S. Bach's "Air on the G String"); and Soft Country Organic Pop/Rock is marked by "slow and soft songs ... typical of southern United States [with] elements both from rock and blues [and where] electric guitars and vocals are [strongly] predominant [but there are few if any] electronic elements" (like "Your Cheating Heart" by Hank Williams Sr.). Some of these attributes are clear and actionable, like "slow," but others are not, like "[with] elements both from rock and blues." Categorically different from this is the expert system devised by Dixon et al. [27], where temporal characteristics of music, e.g., tempo and meter, can sometimes restrict its membership to particular dance styles.
In this work, we have analyzed from multiple perspectives the performance of three MGR systems to measure the extent to which they recognize music genre. From Table 4, we see the classification accuracies of AdaBFFs, SRCAM, and MAPsCAT are significantly higher than chance, and are among the best observed (and reproduced) for the GTZAN dataset. Thus, one might take such a high classification accuracy as evidence that a system is capable of recognizing the genres in a test set. However, from the nature of the Classify experimental design, we are not able to reject the null hypothesis that one of these systems is not able to recognize genre, no matter the accuracy observed. In essence, "genre" is not the only independent variable that changes between the excerpts of particular genres. There is also, just to name a few, instrumentation (Disco and Classical may or may not use strings), loudness (Metal and Classical can be listened to at high or low volume), tempo (Blues and Country can be played fast or slow), dynamics (Classical and Jazz can have few or several large changes in dynamics), reverberation (Reggae can involve spring reverberation, and Classical can be performed in small or large halls), production (Hip hop and Rock can be produced in a studio or in a concert), channel bandwidth (Country and Classical can be heard on AM or FM radio), noise (Blues and Jazz can be heard from an old record or a new CD), and so on. To determine if an MGR system has a capacity to recognize any genre, we must look deeper than classification accuracy.
In Fig. 2, we see the recalls, precisions, and F-measures for AdaBFFs, SRCAM, and MAPsCAT. With these figures of merit then, one might be inclined to claim that we can reject the null hypothesis that MAPsCAT cannot recognize Disco, or that AdaBFFs cannot recognize Classical. However, "to recognize" is not equivalent to having high recall, precision, or F-measure; and "recognize Disco" is not equivalent to "recognize as Disco an excerpt labeled Disco," especially with the problems inherent to the GTZAN dataset [82]. Thus, we still cannot reject the null hypothesis that MAPsCAT cannot recognize Disco, even with perfect accuracy, and thus perfect precision, recall, and F-measure. To answer whether any of these MGR systems has a capacity to recognize Disco, we must dig deeper than these figures of merit.
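To make concrete what these figures of merit do and do not capture, the following minimal sketch computes recall, precision, and F-measure for one class from a confusion table. The function name, the nested-dictionary layout, and all counts are hypothetical and are not taken from the tables of this work. The computation uses only dataset labels and system outputs, which is why these numbers cannot by themselves show which cues a system relies on.

```python
def figures_of_merit(confusion, label):
    """Return (recall, precision, F-measure) for one class.

    confusion[true_label][predicted_label] holds excerpt counts
    (a hypothetical layout chosen for this sketch).
    """
    true_pos = confusion[label].get(label, 0)
    n_labeled = sum(confusion[label].values())  # excerpts labeled `label`
    n_selected = sum(row.get(label, 0) for row in confusion.values())  # classified as `label`
    recall = true_pos / n_labeled if n_labeled else 0.0
    precision = true_pos / n_selected if n_selected else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure


# Hypothetical counts for three classes of 100 excerpts each.
confusion = {
    "Disco": {"Disco": 80, "Pop": 15, "Rock": 5},
    "Pop": {"Disco": 10, "Pop": 85, "Rock": 5},
    "Rock": {"Disco": 10, "Pop": 5, "Rock": 85},
}

if __name__ == "__main__":
    recall, precision, f_measure = figures_of_merit(confusion, "Disco")
    print(f"Disco: recall={recall:.2f}, precision={precision:.2f}, F={f_measure:.2f}")
```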
We might claim that the confusion behavior of AdaBFFs, SRCAM, and MAPsCAT "makes musical sense;" but by doing so we implicitly make two critical assumptions: 1) that the dataset being used has integrity for MGR; and 2) that the system is using cues similar to those used by humans when categorizing music, e.g., what instruments are playing, and how are they being played? what is the rhythm, and how fast is the tempo? is it for dancing, moshing, protesting or listening? is someone singing, and if so what is the subject? For the first assumption, though it is the most used dataset in MGR (appearing in 23% of MGR research since 2002), GTZAN has numerous problems, including repetitions of excerpts and artists, many mislabelings, and distortions [84]. Hence, GTZAN is not a dataset with high integrity. The second assumption is much harder to justify, and requires us again to dig deeper than the confusion behaviors. We thus have to look at the level of the music itself to answer these questions.
Analyzing the pathological behaviors of an MGR system provides insight into whether its internal models of genres make sense with respect to the ambiguous nature of genre. Tables 5-8 provide details on persistent kinds of confusions that appear for AdaBFFs, SRCAM, and MAPsCAT. Comparing the classification results with the tags given by a community of listeners shows that some behaviors do indeed "make musical sense," but others appear less rational. In the case of using tags, the implicit assumption is that the tags given by an unspecified population to make their music more useful to them are to be trusted in describing the elements of music that characterize the genre(s) it uses, whether users found these upon genre ("funk" and "soul"), style ("melodic" and "classic"), form ("ballad"), function ("dance"), history ("70s" and "old school"), geography ("jamaican" and "brit pop"), or others ("romantic"). This assumption is thus quite unsatisfying, and one wonders whether tags present a good way to formally evaluate MGR systems.
Analyzing the same pathological behaviors of an MGR system, but by a listening test designed specifically to test the sensibility of its choices, circumvents the need to compare tags, and gets to the heart of whether the system is recognizing and comparing salient characteristics typical to genres, e.g., instrumentation, rhythm, form, and so on. Hence, we finally see that though AdaBFFs, SRCAM, and MAPsCAT have classification accuracies that are significantly higher than chance, and though each system has confusion tables that appear reasonable, a closer analysis of their confusions at the level of the music and a listening test measuring the appropriateness of their classifications reveal that they are not recognizing genre, since a large majority of their consistent misclassifications are easily detected as artificial.
Formally justifying a misclassification as an error is a task that MGR research typically defers to the "ground truth" of a dataset, whether created by a listener [89], the artist [77], music vendors [6,36], the collective agreement of several listeners [35,52], professional musicologists [1], or multiple tags given by an online community [46]. However, the focus of developing an algorithm to pick the correct or best label actually obscures what should be the goal of any MGR system: to produce labels that are indistinguishable from those humans would produce. Hence, to this end, classification accuracy is not enough.