1 Introduction

Retail and online music stores usually index their collections by artist or album name. However, people often need to search for music by content. For example, emerging music-oriented recommendation services, such as last.fm (http://www.last.fm/) and Pandora (http://www.pandora.com/), offer search facilities where social tags are employed as semantic descriptors of the music content. Social tags are text-based labels, provided by either human experts or amateur users, that categorize music with respect to genre, mood, and other semantic categories. The major drawbacks of this approach to the semantic annotation of music content are that (1) a newly added music recording must be tagged manually before it can be retrieved [1], which is a time-consuming and expensive process, and (2) unpopular music recordings may not be tagged at all [2]. Consequently, accurate content-based automatic music classification should be exploited to mitigate these drawbacks, allowing the deployment of robust music browsing and recommendation engines.

A considerable volume of research in content-based music classification has been conducted so far. The interested reader may refer to [2–5] for a comprehensive survey. Most music classification methods focus on music categorization with respect to genre, mood, or multiple semantic tags. They consist mainly of two stages, namely a music representation stage and a machine learning one. In the first stage, the various aspects of music (i.e., the timbral, harmonic, rhythmic content, etc.) are captured by extracting either low- or mid-level features from the audio signal. Such features include timbral texture features, rhythmic features, pitch content, or their combinations, yielding a bag-of-features (BOF) representation [1, 2, 6–18]. Furthermore, spectral, cepstral, and auditory modulation-based features have recently been employed either in BOF approaches or as autonomous music representations in order to capture both the timbral and the temporal structure of music [19–22]. At the machine learning stage, music genre and mood classification are treated as single-label multi-class classification problems. To this end, support vector machines (SVMs) [23], nearest-neighbor (NN) classifiers, Gaussian mixture model-based classifiers [3], and classifiers relying on sparse and low-rank representations [24] have been employed to classify the audio features into genre or mood classes. In contrast, automatic music tagging (or autotagging) is treated as a multi-label, multi-class classification problem. A variety of algorithms have been exploited in order to associate the tags with the audio features. For instance, music tag prediction may be treated as a set of binary classification problems, where standard classifiers, such as SVMs [12, 14] or AdaBoost [25], can be applied. Furthermore, probabilistic autotagging systems have been proposed, attempting to infer the correlations or joint probabilities between the tags and the audio features [1, 9, 26].

Despite the existence of many well-performing music classification methods, it is still unclear which music representation (i.e., which audio features) and which machine learning algorithm are appropriate for a specific music classification task. A possible explanation for this open question is that the classes (e.g., genre, mood, or other semantic classes) in music classification problems are related to and built on common, unknown latent variables, which differ from problem to problem. For instance, many different songs, although they share instrumentation (i.e., have similar timbral characteristics), convey different emotions and belong to different genres. Furthermore, cover songs, which have the same harmonic content as the originals, may differ in instrumentation and possibly evoke a different mood, so they are classified into different genres. Therefore, the challenge is to reveal the common latent features based on given music representations (e.g., timbral, auditory) and to simultaneously learn the models that are appropriate for each specific classification task.

In this paper, a novel, robust, general-purpose music classification method is proposed to address the aforementioned challenge. It is suitable for both single-label (i.e., genre or mood classification) and multi-label (i.e., music tagging) multi-class classification problems, providing a systematic way to handle multiple audio features that capture the different aspects of music. In particular, given a number of audio feature vectors for each training music recording, the goal is to find a set of linear mappings from the feature spaces to the semantic space defined by the class indicator vectors. Furthermore, these mappings should reveal the common latent variables that characterize a given set of classes and simultaneously define a multi-class linear classifier operating on the extracted latent common features. Such a model can be derived by building on the notion of maximum margin matrix factorization [27]. That is, in the training phase, the set of mappings is found by minimizing a weighted sum of nuclear norms plus a least squares labeling error. To this end, an algorithm that resorts to the alternating direction augmented Lagrange multiplier method [28] is derived. In the test phase, the class indicator vector labeling any test music recording is obtained by multiplying each mapping matrix with the corresponding feature vector and then summing the resulting vectors. Since the nuclear norm imposes rank constraints on the learnt mappings, the proposed classification method is referred to as low-rank semantic mappings (LRSMs).

The motivation behind the LRSMs arises from the fact that uncovering hidden variables shared among the classes facilitates the learning process [29]. To this end, various formulations for common latent variable extraction have been proposed for multi-task learning [30], multi-class classification [31], collaborative prediction [32], and multi-label classification [33]. The LRSMs differ significantly from the aforementioned methods [29–31, 33] in that the extracted common latent variables come from many different (vector) feature spaces.

The performance of the LRSMs in music genre, mood, and multi-label classification is assessed by conducting experiments on seven manually annotated benchmark datasets. Both the standard evaluation protocols for each dataset and a small-sample size setting are employed. The auditory cortical representations [34, 35], the mel-frequency cepstral coefficients [36], and the chroma features [37] were used for music representation. In the single-label case (i.e., genre or mood classification), the LRSMs are compared against three well-known classifiers, namely the sparse representation-based classifier (SRC) [38], the linear SVMs, and the NN classifier with a cosine distance metric. Multi-label extensions of the aforementioned classifiers, namely the multi-label sparse representation-based classifier (MLSRC) [39], the Rank-SVMs [40], and the multi-label k-nearest neighbor (MLkNN) [41], as well as the parallel factor analysis 2 (PARAFAC2)-based autotagging method [42], are compared with the LRSMs in music tagging. The reported experimental results demonstrate the superiority of the LRSMs over the classifiers they are compared against. Moreover, the best classification results reported are comparable with or slightly superior to those obtained by state-of-the-art music classification systems.

To summarize, the contributions of the paper are as follows:

  •  A novel method for music classification (i.e., the LRSMs) is proposed that is able to extract the common latent variables that are shared among all the classes and simultaneously learn the models that are appropriate for each specific classification task.

  •  An efficient algorithm for the LRSMs is derived by resorting to the alternating direction augmented Lagrange multiplier method, which is suitable for large-scale data.

  •  The LRSMs provide a systematic way to handle multiple audio features for music classification.

  •  Extensive experiments on seven datasets demonstrate the effectiveness of the LRSMs in music genre, mood, and multi-label classification when the mel-frequency cepstral coefficients (MFCCs), the chroma, and the auditory cortical representations are employed for music representation.

The paper is organized as follows: In Section 2, basic notation conventions are introduced. The audio feature extraction process is briefly described in Section 3. In Section 4, the LRSMs are detailed. Datasets and experimental results are presented in Section 5. Conclusions are drawn in Section 6.

2 Notations

Throughout the paper, matrices are denoted by uppercase boldface letters (e.g., $\mathbf{X}$, $\mathbf{L}$), vectors are denoted by lowercase boldface letters (e.g., $\mathbf{x}$), and scalars appear as either uppercase or lowercase letters (e.g., $N$, $K$, $i$, $\mu$, $\epsilon$). $\mathbf{I}$ denotes the identity matrix of compatible dimensions. The $i$th column of a matrix $\mathbf{X}$ is denoted by $\mathbf{x}_i$. The set of real numbers is denoted by $\mathbb{R}$, while the set of nonnegative real numbers is denoted by $\mathbb{R}_+$.

A variety of norms on real-valued vectors and matrices will be used. For example, $\|\mathbf{x}\|_0$ is the $\ell_0$ quasi-norm counting the number of nonzero entries in $\mathbf{x}$. The matrix $\ell_1$ norm is denoted by $\|\mathbf{X}\|_1 = \sum_i \sum_j |x_{ij}|$. $\|\mathbf{X}\|_F = \sqrt{\sum_i \sum_j x_{ij}^2} = \sqrt{\operatorname{tr}(\mathbf{X}^T \mathbf{X})}$ is the Frobenius norm, where $\operatorname{tr}(\cdot)$ denotes the trace of a square matrix. The nuclear norm of $\mathbf{X}$ (i.e., the sum of the singular values of the matrix) is denoted by $\|\mathbf{X}\|_*$. The $\ell_\infty$ norm of $\mathbf{X}$, denoted by $\|\mathbf{X}\|_\infty$, is defined as the maximum absolute value of the entries of $\mathbf{X}$.

3 Audio feature extraction

Each music recording is represented by three song-level feature vectors, namely the auditory cortical representations [34, 35], the MFCCs [36], and the chroma features [37]. Although more elaborate music representations have been proposed in the literature, the just mentioned features perform quite well in practice [14, 22–24]. Most importantly, song-level representations are suitable for large-scale music classification problems since the space complexity for audio processing and analysis is reduced and database overflow is prevented [3].

3.1 Auditory cortical representations

The auditory cortex plays a crucial role in the hearing process since auditory sensations turn into perception and cognition only when they are processed by the cortical area. Therefore, one should focus on how audio information is encoded in the human primary auditory cortex in order to represent music signals in a psycho-physiologically consistent manner [43]. The mechanical and neural processing in the early and central stages of the auditory system can be modeled as a two-stage process. At the first stage, which models the cochlea, the audio signal is converted into an auditory representation by employing the constant-Q transform (CQT). The CQT is a time-frequency representation, where the frequency bins are geometrically spaced and the Q-factors (i.e., the ratios of the center frequencies to the bandwidths) of all bins are equal [44]. The neurons in the primary auditory cortex are organized according to their selectivity in different spectral and temporal stimuli [43]. To this end, in the second stage, the spectral and temporal modulation content of the CQT is estimated by two-dimensional (2D) multi-resolution wavelet analysis, ranging from slow to fast temporal rates and from narrow to broad spectral scales. The analysis yields a four-dimensional (4D) representation of time, frequency, rate, and scale that captures the slow spectral and temporal modulation content of audio, which is referred to as the auditory cortical representation [34]. Details on the mathematical formulation of the auditory cortical representations can be found in [34, 35].

In this paper, the CQT is computed efficiently by employing the fast implementation scheme proposed in [44]. The audio signal is analyzed by employing 128 constant-Q filters covering eight octaves from 44.9 Hz to 11 kHz (i.e., 16 filters per octave). The magnitude of the CQT is compressed by raising each element of the CQT matrix to the power of 0.1. At the second stage, the 2D multi-resolution wavelet analysis is implemented via a bank of 2D Gaussian filters with scales $\in \{0.25, 0.5, 1, 2, 4, 8\}$ (cycles/octave) and (both positive and negative) rates $\in \{\pm 2, \pm 4, \pm 8, \pm 16, \pm 32\}$ (Hz). The choice of the just mentioned parameters is based on psychophysiological evidence [34]. For each music recording, the extracted 4D cortical representation is time-averaged, and the 3D rate-scale-frequency cortical representation is obtained. The overall procedure is depicted in Figure 1. Accordingly, each music recording can be represented by a vector $\mathbf{x} \in \mathbb{R}_+^{7{,}680}$ by stacking the elements of the 3D cortical representation into a vector. The dimension of the vectorized cortical representation comes from the product of 128 frequency channels, 6 scales, and 10 rates. An ensemble of music recordings is represented by the data matrix $\mathbf{X} \in \mathbb{R}_+^{7{,}680 \times S}$, where $S$ is the number of available recordings in each dataset. Finally, the entries of $\mathbf{X}$ are post-processed as follows: each row of $\mathbf{X}$ is normalized to the range $[0,1]$ by subtracting from each entry the row minimum and then dividing by the row range (i.e., the difference between the row maximum and the row minimum).
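For concreteness, the following NumPy sketch illustrates the vectorization and the per-row min-max normalization described above; the function and array names are hypothetical and assume the 10 (rates) × 6 (scales) × 128 (frequency channels) arrangement stated in the text.

```python
import numpy as np

def cortical_to_vector(cortical_3d):
    """Stack a 10 x 6 x 128 rate-scale-frequency cortical representation
    into a single 7,680-dimensional feature vector."""
    return cortical_3d.reshape(-1)

def minmax_rows(X):
    """Normalize each row of the data matrix X (features x recordings) to [0, 1]
    by subtracting the row minimum and dividing by the row range."""
    lo = X.min(axis=1, keepdims=True)
    rng = X.max(axis=1, keepdims=True) - lo
    rng[rng == 0] = 1.0  # guard against constant rows
    return (X - lo) / rng
```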

Figure 1. Flow chart of auditory cortical representation extraction.

3.2 Mel-frequency cepstral coefficients

The MFCCs encode the timbral properties of the music signal by capturing the rough shape of the log-power spectrum on the mel-frequency scale [36]. They exhibit the desirable property that a numerical change in the MFCCs corresponds to a perceptual change. In this paper, MFCC extraction employs frames of 92.9-ms duration with a hop size of 46.45 ms and a bank of 42 band-pass filters. The filters are uniformly spaced on the mel-frequency scale. The correlation between the frequency bands is reduced by applying the discrete cosine transform along the log-energies of the bands, yielding a sequence of 20-dimensional MFCC vectors. By averaging the MFCCs along the time axis, each music recording is represented by a 20-dimensional MFCC vector.
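As an illustration (the authors' actual tooling is not specified in the text), the song-level MFCC vector could be computed with librosa, whose defaults at 22,050 Hz give frame and hop lengths close to the 92.9 ms and 46.45 ms quoted above; the parameter values below follow the text, everything else is an assumption.

```python
import librosa

def song_level_mfcc(path):
    """20-dimensional song-level MFCC vector: ~92.9-ms frames, ~46.45-ms hop,
    a 42-band mel filter bank, and averaging over time."""
    y, sr = librosa.load(path, sr=22050)          # 2048 samples ~ 92.9 ms at 22.05 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=2048, hop_length=1024, n_mels=42)
    return mfcc.mean(axis=1)
```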

3.3 Chroma features

The chroma features [37] are adept at characterizing the harmonic content of the music signal by projecting the entire spectrum onto 12 bins representing the 12 distinct semitones (or chroma) of a musical octave. They are calculated by employing 92.9-ms frames with a hop size of 23.22 ms as follows: First, the salience of different fundamental frequencies in the range 80 to 640 Hz is calculated. The linear frequency scale is transformed into a musical one by selecting the maximum salience value in each frequency range corresponding to one semitone. Finally, the octave equivalence classes are summed over the whole pitch range to yield a sequence of 12-dimensional chroma vectors.
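The chroma extraction in [37] relies on a salience-based pitch analysis that is not reproduced here. Purely as a rough stand-in for obtaining a 12-dimensional song-level chroma vector, one could use librosa's STFT-based chroma with the frame and hop sizes quoted above (an approximation, not the method of [37]).

```python
import librosa

def song_level_chroma(path):
    """12-dimensional song-level chroma vector (STFT-based approximation,
    ~92.9-ms frames with a ~23.2-ms hop, averaged over time)."""
    y, sr = librosa.load(path, sr=22050)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=2048, hop_length=512)
    return chroma.mean(axis=1)
```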

The chroma features, as well as the MFCCs, extracted from an ensemble of music recordings are post-processed as described in subsection 3.1.

4 Classification by low-rank semantic mappings

Let each music recording be represented by $R$ types of feature vectors $\mathbf{x}^{(r)} \in \mathbb{R}^{d_r}$ of size $d_r$, $r = 1, 2, \ldots, R$. Consequently, an ensemble of $N$ training music recordings is represented by the set $\{\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(R)}\}$, where $\mathbf{X}^{(r)} = [\mathbf{x}^{(r)}_1, \mathbf{x}^{(r)}_2, \ldots, \mathbf{x}^{(r)}_N] \in \mathbb{R}^{d_r \times N}$, $r = 1, 2, \ldots, R$. The class labels of the $N$ training samples are represented as indicator vectors forming the matrix $\mathbf{L} \in \{0,1\}^{K \times N}$, where $K$ denotes the number of classes. Clearly, $l_{kn} = 1$ if the $n$th training sample belongs to the $k$th class. In a multi-label setting, more than one nonzero element may appear in the class indicator vector $\mathbf{l}_n \in \{0,1\}^K$.
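A minimal sketch of how such an indicator matrix could be assembled (the function name and label format are illustrative, not taken from the paper):

```python
import numpy as np

def indicator_matrix(labels, K):
    """Build the K x N class indicator matrix L from per-recording label lists
    (one class index per recording for genre/mood, several indices for tagging)."""
    L = np.zeros((K, len(labels)))
    for n, classes in enumerate(labels):
        L[list(classes), n] = 1.0
    return L

# e.g., three recordings and K = 4 classes; the third one carries two tags
L = indicator_matrix([[0], [2], [1, 3]], K=4)
```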

These $R$ different feature vectors characterize different aspects of music (i.e., timbre, rhythm, harmony, etc.) and have different properties; thus, they live in different (vector) feature spaces. Since different feature vectors have different intrinsic discriminative power, an intuitive idea is to combine them in order to improve the classification performance. However, in practice, most machine learning algorithms can handle only a single type of feature vector and thus cannot be naturally applied to multiple features. A straightforward strategy to handle multiple features is to concatenate all the feature vectors into a single feature vector. However, the resulting feature space is rather ad hoc and lacks physical interpretation. It is more reasonable to assume that the multiple feature vectors live in a union of feature spaces, and this is the assumption the proposed method exploits in a principled way. Leveraging the information contained in multiple features can dramatically improve the learning performance, as indicated by recent results in multi-view learning [30, 45].

Given a set of (possibly few) training samples along with the associated class indicator vectors, the goal is to learn $R$ mappings $\mathbf{M}^{(r)} \in \mathbb{R}^{K \times d_r}$ from the feature spaces $\mathbb{R}^{d_r}$, $r = 1, 2, \ldots, R$, to the label space $\{0,1\}^K$ that generalize well and appropriately exploit the cross-feature information, so that

$$\mathbf{L} = \sum_{r=1}^{R} \mathbf{M}^{(r)} \mathbf{X}^{(r)}. \tag{1}$$

As discussed in Section 1, the mappings $\mathbf{M}^{(r)} \in \mathbb{R}^{K \times d_r}$, $r = 1, 2, \ldots, R$, should be able to (1) reveal the common latent variables across the classes and (2) simultaneously predict the class memberships based on these latent variables. To do this, we seek $\mathbf{C}^{(r)} \in \mathbb{R}^{K \times p_r}$ and $\mathbf{F}^{(r)} \in \mathbb{R}^{p_r \times d_r}$ such that $\mathbf{M}^{(r)} = \mathbf{C}^{(r)} \mathbf{F}^{(r)} \in \mathbb{R}^{K \times d_r}$, $r = 1, 2, \ldots, R$. In this formulation, the rows of $\mathbf{F}^{(r)}$ reveal the $p_r$ latent features (variables), and the rows of $\mathbf{C}^{(r)}$ are the weights predicting the classes. Clearly, the numbers of common latent variables $p_r$ and the matrices $\mathbf{C}^{(r)}$, $\mathbf{F}^{(r)}$ are unknown and need to be jointly estimated.

Since the dimensionality of the R latent feature spaces (i.e., p r ) is unknown, inspired by maximum margin matrix factorization [27], we can allow the unknown matrices C(r) to have an unbounded number of columns and F(r), r=1,2,…,R to have an unbounded number of rows. Here, the matrices C(r) and F(r) are required to be low-norm. This constraint is mandatory because otherwise the resulting linear transform induced by applying first F(r) and then C(r) would degenerate to a single transform. Accordingly, the unknown matrices are obtained by solving the following minimization problem:

$$\operatorname*{arg\,min}_{\{\mathbf{C}^{(r)}, \mathbf{F}^{(r)}\}_{r=1}^{R}} \; \sum_{r=1}^{R} \frac{\lambda_r}{2} \left( \big\|\mathbf{C}^{(r)}\big\|_F^2 + \big\|\mathbf{F}^{(r)}\big\|_F^2 \right) + \frac{1}{2} \left\| \mathbf{L} - \sum_{r=1}^{R} \mathbf{C}^{(r)} \mathbf{F}^{(r)} \mathbf{X}^{(r)} \right\|_F^2, \tag{2}$$

where $\lambda_r$, $r = 1, 2, \ldots, R$, are regularization parameters and the least squares loss function $\frac{1}{2}\big\| \mathbf{L} - \sum_{r=1}^{R} \mathbf{C}^{(r)} \mathbf{F}^{(r)} \mathbf{X}^{(r)} \big\|_F^2$ measures the labeling approximation error. It is worth mentioning that the least squares loss function is comparable to other loss functions, such as the hinge loss employed in SVMs [46], since it has been proved to be (universally) Fisher consistent [47]. This property, along with the fact that it leads to a tractable optimization problem, motivated us to adopt the least squares loss here. By Lemma 1 in [27], it is known that

$$\lambda_r \big\|\mathbf{M}^{(r)}\big\|_* = \min_{\mathbf{M}^{(r)} = \mathbf{C}^{(r)} \mathbf{F}^{(r)}} \frac{\lambda_r}{2} \left( \big\|\mathbf{C}^{(r)}\big\|_F^2 + \big\|\mathbf{F}^{(r)}\big\|_F^2 \right). \tag{3}$$

Thus, based on (3), the optimization problem (2) can be rewritten as

$$\operatorname*{arg\,min}_{\{\mathbf{M}^{(r)}\}_{r=1}^{R}} \; \sum_{r=1}^{R} \lambda_r \big\|\mathbf{M}^{(r)}\big\|_* + \frac{1}{2} \left\| \mathbf{L} - \sum_{r=1}^{R} \mathbf{M}^{(r)} \mathbf{X}^{(r)} \right\|_F^2. \tag{4}$$

Therefore, the mappings $\mathbf{M}^{(r)}$, $r = 1, 2, \ldots, R$, are obtained by minimizing the weighted sum of their nuclear norms plus the labeling approximation error, that is, the nuclear norm-regularized least squares labeling approximation error. Since the nuclear norm is the convex envelope of the rank function [48], the derived mappings between the feature spaces and the semantic space spanned by the class indicator matrix $\mathbf{L}$ are low-rank as well. This justifies why the solution of (4) yields low-rank semantic mappings (LRSMs). The LRSMs are strongly related to and share the same motivation with the methods in [31] and [32], which have been proposed for multi-class classification and collaborative prediction, respectively. In both methods, a nuclear norm-regularized loss is minimized in order to infer relationships between the label vectors and the feature vectors. The two key differences between the methods in [31, 32] and the LRSMs are that (1) the LRSMs are able to adequately handle multiple features drawn from different feature spaces and (2) the least squares loss function is employed instead of the hinge loss, resulting in formulation (4), which can be efficiently solved for large-scale data.

Problem (4) is solved as follows: By introducing the auxiliary variables W(r), r=1,2,…,R, (4) is equivalent to

$$\operatorname*{arg\,min}_{\{\mathbf{M}^{(r)}, \mathbf{W}^{(r)}\}_{r=1}^{R}} \; \sum_{r=1}^{R} \lambda_r \big\|\mathbf{W}^{(r)}\big\|_* + \frac{1}{2} \left\| \mathbf{L} - \sum_{r=1}^{R} \mathbf{M}^{(r)} \mathbf{X}^{(r)} \right\|_F^2 \quad \text{s.t.} \quad \mathbf{M}^{(r)} = \mathbf{W}^{(r)}, \; r = 1, 2, \ldots, R, \tag{5}$$

which can be solved by employing the alternating direction augmented Lagrange multiplier (ADALM) method, a simple but powerful algorithm that is well suited to large-scale optimization problems [28, 49]. To this end, the augmented Lagrangian function [28] is formed:

$$\mathcal{L}\left( \mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(R)}, \mathbf{M}^{(1)}, \ldots, \mathbf{M}^{(R)}, \boldsymbol{\Xi}^{(1)}, \ldots, \boldsymbol{\Xi}^{(R)} \right) = \sum_{r=1}^{R} \lambda_r \big\|\mathbf{W}^{(r)}\big\|_* + \frac{1}{2} \left\| \mathbf{L} - \sum_{r=1}^{R} \mathbf{M}^{(r)} \mathbf{X}^{(r)} \right\|_F^2 + \sum_{r=1}^{R} \operatorname{tr}\!\left( \boldsymbol{\Xi}^{(r)T} \left( \mathbf{M}^{(r)} - \mathbf{W}^{(r)} \right) \right) + \frac{\zeta}{2} \sum_{r=1}^{R} \left\| \mathbf{M}^{(r)} - \mathbf{W}^{(r)} \right\|_F^2, \tag{6}$$

where $\boldsymbol{\Xi}^{(r)}$, $r = 1, 2, \ldots, R$, are the Lagrange multipliers and $\zeta > 0$ is a penalty parameter. The ADALM minimizes (6) with respect to each variable in an alternating fashion and updates the Lagrange multipliers at the end of each iteration. If only $\mathbf{W}^{(1)}$ varies and all the other variables are kept fixed, we abbreviate (6) by writing $\mathcal{L}(\mathbf{W}^{(1)})$ instead of $\mathcal{L}(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \ldots, \mathbf{W}^{(R)}, \mathbf{M}^{(1)}, \mathbf{M}^{(2)}, \ldots, \mathbf{M}^{(R)}, \boldsymbol{\Xi}^{(1)}, \boldsymbol{\Xi}^{(2)}, \ldots, \boldsymbol{\Xi}^{(R)})$. Let $t$ denote the iteration index. Given $\mathbf{W}^{(r)}_{[t]}$, $\mathbf{M}^{(r)}_{[t]}$, $r = 1, 2, \ldots, R$, and $\zeta_{[t]}$, the iterative scheme of the ADALM for (6) reads as follows:

$$\mathbf{W}^{(r)}_{[t+1]} = \operatorname*{arg\,min}_{\mathbf{W}^{(r)}} \; \mathcal{L}\left(\mathbf{W}^{(r)}\right) = \operatorname*{arg\,min}_{\mathbf{W}^{(r)}} \; \lambda_r \big\|\mathbf{W}^{(r)}\big\|_* + \operatorname{tr}\!\left( \boldsymbol{\Xi}^{(r)T}_{[t]} \left( \mathbf{M}^{(r)}_{[t]} - \mathbf{W}^{(r)} \right) \right) + \frac{\zeta_{[t]}}{2} \left\| \mathbf{M}^{(r)}_{[t]} - \mathbf{W}^{(r)} \right\|_F^2 = \operatorname*{arg\,min}_{\mathbf{W}^{(r)}} \; \frac{\lambda_r}{\zeta_{[t]}} \big\|\mathbf{W}^{(r)}\big\|_* + \frac{1}{2} \left\| \mathbf{W}^{(r)} - \left( \mathbf{M}^{(r)}_{[t]} + \frac{\boldsymbol{\Xi}^{(r)}_{[t]}}{\zeta_{[t]}} \right) \right\|_F^2. \tag{7}$$

$$\mathbf{M}^{(r)}_{[t+1]} = \operatorname*{arg\,min}_{\mathbf{M}^{(r)}} \; \mathcal{L}\left(\mathbf{M}^{(r)}\right) = \operatorname*{arg\,min}_{\mathbf{M}^{(r)}} \; \frac{1}{2} \left\| \mathbf{L} - \sum_{s=1}^{R} \mathbf{M}^{(s)} \mathbf{X}^{(s)} \right\|_F^2 + \operatorname{tr}\!\left( \boldsymbol{\Xi}^{(r)T}_{[t]} \left( \mathbf{M}^{(r)} - \mathbf{W}^{(r)}_{[t+1]} \right) \right) + \frac{\zeta_{[t]}}{2} \left\| \mathbf{M}^{(r)} - \mathbf{W}^{(r)}_{[t+1]} \right\|_F^2. \tag{8}$$

$$\boldsymbol{\Xi}^{(r)}_{[t+1]} = \boldsymbol{\Xi}^{(r)}_{[t]} + \zeta_{[t]} \left( \mathbf{M}^{(r)}_{[t+1]} - \mathbf{W}^{(r)}_{[t+1]} \right), \quad r = 1, 2, \ldots, R. \tag{9}$$

The solution of (7) is obtained in closed form via the singular value thresholding operator, defined for any matrix $\mathbf{Q}$ as [50] $\mathcal{D}_\tau[\mathbf{Q}] = \mathbf{U} \mathcal{S}_\tau[\boldsymbol{\Sigma}] \mathbf{V}^T$, with $\mathbf{Q} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$ being the singular value decomposition and $\mathcal{S}_\tau[q] = \operatorname{sgn}(q) \max(|q| - \tau, 0)$ being the shrinkage operator [51]. The shrinkage operator can be extended to matrices by applying it element-wise. Consequently, $\mathbf{W}^{(r)}_{[t+1]} = \mathcal{D}_{\lambda_r / \zeta_{[t]}}\!\left[ \mathbf{M}^{(r)}_{[t]} + \boldsymbol{\Xi}^{(r)}_{[t]} / \zeta_{[t]} \right]$. Problem (8) is an unconstrained least squares problem, which admits a unique closed-form solution, as indicated in Algorithm 1 summarizing the ADALM method for the minimization of (5). The convergence of Algorithm 1 is a special case of that of the generic ADALM [28, 49].
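Since Algorithm 1 itself is not reproduced in this extract, the following NumPy sketch illustrates one plausible implementation of the updates (7) to (9). The closed form used for the M-update and the geometrically increasing penalty $\zeta$ are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding operator D_tau[Q], used to solve (7)."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lrsm_train(X_list, L, lambdas, zeta=1.0, rho=1.1, n_iter=200):
    """Sketch of the ADALM iterations (7)-(9) for problem (5).
    X_list  : list of R feature matrices X^(r), each of shape (d_r, N)
    L       : K x N class indicator matrix
    lambdas : regularization parameters lambda_r
    Returns the low-rank semantic mappings M^(r), each of shape (K, d_r)."""
    K, R = L.shape[0], len(X_list)
    M = [np.zeros((K, X.shape[0])) for X in X_list]
    W = [np.zeros_like(Mr) for Mr in M]
    Xi = [np.zeros_like(Mr) for Mr in M]
    for _ in range(n_iter):
        for r in range(R):
            # Eq. (7): W-update via singular value thresholding
            W[r] = svt(M[r] + Xi[r] / zeta, lambdas[r] / zeta)
            # Eq. (8): M-update, a regularized least squares problem in closed form
            resid = L - sum(M[s] @ X_list[s] for s in range(R) if s != r)
            A = X_list[r] @ X_list[r].T + zeta * np.eye(X_list[r].shape[0])
            B = resid @ X_list[r].T - Xi[r] + zeta * W[r]
            M[r] = np.linalg.solve(A, B.T).T  # A is symmetric positive definite
        for r in range(R):
            # Eq. (9): Lagrange multiplier update
            Xi[r] = Xi[r] + zeta * (M[r] - W[r])
        zeta *= rho  # gradually increase the penalty (an implementation choice)
    return M
```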

The set of the low-rank semantic matrices $\{\mathbf{M}^{(1)}, \mathbf{M}^{(2)}, \ldots, \mathbf{M}^{(R)}\}$, obtained by Algorithm 1, captures the semantic relationships between the label space and the $R$ audio feature spaces. In music classification, the semantic relationships are expected to propagate from the $R$ feature spaces to the label vector space. Therefore, a test music recording can be labeled as follows: Let $\hat{\mathbf{x}}^{(r)} \in \mathbb{R}^{d_r}$, $r = 1, 2, \ldots, R$, be the set of feature vectors extracted from the test music recording and $\mathbf{l} \in \{0,1\}^K$ be the class indicator vector of this recording. First, the intermediate class indicator vector $\hat{\mathbf{l}} \in \mathbb{R}^K$ is obtained by

$$\hat{\mathbf{l}} = \sum_{r=1}^{R} \mathbf{M}^{(r)} \hat{\mathbf{x}}^{(r)}. \tag{10}$$

Algorithm 1 Solving (5) by the ADALM method

The (final) class indicator vector $\mathbf{l}$ has $\|\mathbf{l}\|_0 = v < K$, containing ones in the positions associated with the $v$ largest values of $\hat{\mathbf{l}}$. Clearly, for single-label multi-class classification, $v = 1$.
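A matching sketch of the test phase, i.e., Eq. (10) followed by keeping the $v$ largest scores (the names follow the hypothetical training sketch above):

```python
import numpy as np

def lrsm_predict(M_list, x_list, v=1):
    """Label a test recording: Eq. (10), then keep the v largest entries
    (v = 1 for genre/mood classification; e.g., v = 10 tags for CAL500)."""
    l_hat = sum(M @ x for M, x in zip(M_list, x_list))
    l = np.zeros_like(l_hat)
    l[np.argsort(l_hat)[-v:]] = 1.0
    return l
```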

4.1 Computational complexity

The dominant cost for each iteration in Algorithm 1 is the computation of the singular value thresholding operator (i.e., step 4), that is, the calculation of the singular vectors of $\mathbf{M}^{(r)}_{[t]} + \boldsymbol{\Xi}^{(r)}_{[t]} / \zeta_{[t]}$ whose corresponding singular values are larger than the threshold $\lambda_r / \zeta_{[t]}$. Thus, the complexity of each iteration is $O(R \cdot d \cdot N^2)$.

Since the computational cost of the LRSMs depends highly on the dimensionality of the feature spaces, dimensionality reduction methods can be applied. For computational tractability, dimensionality reduction via random projections is considered. Let the true low dimensionality of the data be denoted by $z$. Following [52], a random projection matrix, drawn from a zero-mean normal distribution, provides with high probability a stable embedding [53], with the dimensionality of the projection $d_r$ selected as the minimum value such that $d_r > 2z \log(7{,}680 / d_r)$. Roughly speaking, a stable embedding approximately preserves the Euclidean distances between all vectors in the original space in the feature space of reduced dimensions. In this paper, we propose to estimate $z$ by robust principal component analysis [51] on the high-dimensional training data (e.g., $\mathbf{X}^{(r)}$). That is, the principal component pursuit is solved:

$$\operatorname*{arg\,min}_{\boldsymbol{\Gamma}^{(r)}, \boldsymbol{\Delta}^{(r)}} \; \big\|\boldsymbol{\Gamma}^{(r)}\big\|_* + \lambda \big\|\boldsymbol{\Delta}^{(r)}\big\|_1 \quad \text{s.t.} \quad \mathbf{X}^{(r)} = \boldsymbol{\Gamma}^{(r)} + \boldsymbol{\Delta}^{(r)}. \tag{11}$$

Then, $z$ is the rank of the outlier-free data matrix $\boldsymbol{\Gamma}^{(r)}$ [51] and corresponds to the number of its non-zero singular values.
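The following sketch illustrates the dimension rule and the Gaussian random projection described above. The $1/\sqrt{d}$ scaling is a common convention and an assumption here, and $z$ would come from solving the principal component pursuit (11), which is not shown.

```python
import numpy as np

def embedding_dim(z, d_orig=7680):
    """Smallest d satisfying d > 2 * z * log(d_orig / d), cf. [52]."""
    d = 1
    while d <= 2 * z * np.log(d_orig / d):
        d += 1
    return d

def random_project(X, z, seed=0):
    """Project the d_orig x S feature matrix X to d dimensions with a
    zero-mean Gaussian random matrix."""
    d = embedding_dim(z, X.shape[0])
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((d, X.shape[0])) / np.sqrt(d)
    return P @ X
```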

5 Experimental evaluation

5.1 Datasets and evaluation procedure

The performance of the LRSMs in music genre, mood, and multi-label music classification is assessed by conducting experiments on seven manually annotated benchmark datasets for which the audio files are publicly available. In particular, the GTZAN [17], ISMIR, Homburg [54], Unique [16], and 1517-Artists [16] datasets are employed for music genre classification, the MTV dataset [15] for music mood classification, and the CAL500 dataset [1] for music tagging. Brief descriptions of these datasets are provided next.

The GTZAN (http://marsyas.info/download/data_sets) consists of 10 genre classes, namely blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. Each genre class contains 100 excerpts of 30-s duration.

The ISMIR (http://ismir2004.ismir.net/ISMIR_Contest.html) comes from the ISMIR 2004 Genre classification contest and contains 1,458 full music recordings distributed over six genre classes as follows: classical (640), electronic (229), jazz-blues (52), metal-punk (90), rock-pop (203), and world (244), where the number within parentheses refers to the number of recordings which belong to each genre class. Therefore, 43.9% of the music recordings belong to the classical genre.

The Homburg (http://www-ai.cs.uni-dortmund.de/audio.html) contains 1,886 music excerpts of 10-s length by 1,463 different artists. These excerpts are unequally distributed over nine genres, namely alternative, blues, electronic, folk-country, funk/soul/RnB, jazz, pop, rap/hip-hop, and rock. The largest class is the rap/hip-hop genre containing 26.72% of the music excerpts, while the funk/soul/RnB is the smallest one containing 2.49% of the music excerpts.

The 1517-Artists (http://www.seyerlehner.info/index.php?p=1_3_Download) consists of 3,180 full-length music recordings from 1,517 different artists, downloaded for free from download.com. For each of the 19 genres, i.e., alternative/punk, blues, children’s, classical, comedy/spoken, country, easy listening/vocal, electronic, folk, hip-hop, jazz, latin, new age, RnB/soul, reggae, religious, rock/pop, soundtracks, and world, the 190 most popular songs, according to the total number of listens, were selected. In this dataset, the music recordings are distributed almost uniformly over the genre classes.

The Unique (http://www.seyerlehner.info/index.php?p=1_3_Download) consists of 3,115 music excerpts of popular and well-known songs, distributed over 14 genres, namely blues, classic, country, dance, electronica, hip-hop, jazz, reggae, rock, schlager (i.e., music hits), soul/RnB, folk, world, and spoken. Each excerpt is 30 s long. The class distribution is skewed: the smallest class (i.e., spoken music) accounts for 0.83% and the largest class (i.e., classic) for 24.59% of the available music excerpts.

The MTV (http://www.openaudio.eu/) contains 195 full music recordings with a total duration of 14.2 h from the MTV Europe Most Wanted Top Ten of 20 years (1981 to 2000), covering a wide variety of popular music genres. The ground truth was obtained by five annotators (Rater A to Rater E, four males and one female), who were asked to make a forced binary decision along the two dimensions of Thayer’s mood plane [55] (i.e., assigning either +1 or −1 for arousal and valence, respectively) according to their mood perception.

The CAL500 (http://cosmal.ucsd.edu/cal/) is a corpus of 500 recordings of Western popular music, each of which has been manually annotated by at least three human annotators, who employ a vocabulary of 174 tags. The tags used in the CAL500 annotation span six semantic categories, namely instrumentation, vocal characteristics, genres, emotions, acoustic quality of the song, and usage terms (e.g., ‘I would like to listen to this song while driving’) [1].

Each music recording in the aforementioned datasets was represented by three song-level feature vectors, namely the 20-dimensional MFCCs, the 12-dimensional chroma features, and the auditory cortical representations of reduced dimensions. The dimensionality of the cortical features was reduced via random projections as described in Section 4. In particular, the dimensions of the cortical features after random projections are 1,570 for the GTZAN, 1,391 for the ISMIR, 2,261 for the Homburg, 2,842 for the 1517-Artists, 2,868 for the Unique, 518 for the MTV, and 935 for the CAL500 dataset, respectively.

Two sets of experiments in music classification were conducted. First, to be able to compare the performance of the LRSMs with that of the state-of-the-art music classification methods, standard evaluation protocols were applied to the seven datasets. In particular, following [16, 17, 20, 22, 56, 57], stratified 10-fold cross-validation was applied to the GTZAN dataset. According to [15, 16, 54], the same protocol was also applied to the Homburg, Unique, 1517-Artists, and MTV datasets. The experiments on the ISMIR 2004 Genre dataset were conducted according to the ISMIR 2004 Audio Description Contest protocol. The protocol defines training and evaluation sets, which consist of 729 audio files each. The experiments on music tagging were conducted following the experimental procedure defined in [26]. That is, 78 tags, which have been employed to annotate at least 50 music recordings in the CAL500 dataset, were used in the experiments by applying fivefold cross-validation.

Fu et al. [3] indicated that the main challenge for future music information retrieval systems is the ability to train music classification systems for large-scale datasets from few labeled data. This situation is very common in practice since the number of annotated music recordings per class is often limited [3]. To this end, the performance of the LRSMs given only a few training music recordings is investigated in the second set of experiments. In this small-sample size setting, only 10% of the available recordings were used as the training set and the remaining 90% for testing in all datasets but the CAL500. The experiments were repeated 10 times. In music tagging, 20% of the recordings in the CAL500 were used as the training set and the remaining 80% for testing. This experiment was repeated five times.

The LRSMs are compared against three well-known classifiers, namely the SRC [38], the linear SVMs^a, and the NN classifier with a cosine distance metric in music genre and mood classification, by applying the aforementioned experimental procedures. In music tagging, the LRSMs are compared against the multi-label variants of the aforementioned single-label classifiers, namely the MLSRC [39], the Rank-SVMs [40], and the MLkNN [41], as well as the well-performing PARAFAC2-based autotagging method [42]. The number of neighbors used in the MLkNN was set to 15. The sparse coefficients in the SRC and MLSRC are estimated by the LASSO^b [58].

The performance in music genre and mood classification is assessed by reporting the classification accuracy. Three metrics, namely the mean per-tag precision, the mean per-tag recall, and the F1 score, are used in order to assess the performance of autotagging. These metrics are defined as follows [1]: Per-tag precision is the fraction of music recordings annotated by a method with tag $w$ that are actually labeled with tag $w$. Per-tag recall is the fraction of music recordings actually labeled with tag $w$ that the method annotates with tag $w$. The F1 score is the harmonic mean of precision and recall, that is, $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, yielding a scalar measure of overall annotation performance. If a tag is never selected for annotation, then, following [1, 26], the corresponding precision (which would otherwise be undefined) is set to the tag prior estimated from the training set, which equals the performance of a random classifier. In the music tagging experiments, the length of the class indicator vector returned by the LRSMs as well as the MLSRC, the Rank-SVMs, the MLkNN, and the PARAFAC2-based autotagging method was set to 10, as in [1, 26]. That is, each test music recording is annotated with 10 tags. The parameters of the LRSMs were estimated by employing the method in [59]. That is, for each training set, a validation set (disjoint from the test set) was randomly selected and used for tuning the parameters (i.e., $\lambda_r$, $r = 1, 2, \ldots, R$).
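A sketch of the per-tag evaluation described above, for binary annotation matrices of shape (number of tags × number of recordings); the fallback to the tag prior follows the text, while computing F1 from the mean precision and recall is our reading of the definition and therefore an assumption.

```python
import numpy as np

def per_tag_scores(Y_true, Y_pred, priors):
    """Mean per-tag precision, recall, and F1 for binary annotation matrices.
    If a tag is never selected by the method, its precision falls back to the
    tag prior (priors[w]), as described in the text."""
    tp = np.sum((Y_pred == 1) & (Y_true == 1), axis=1)
    n_pred = Y_pred.sum(axis=1)
    n_true = Y_true.sum(axis=1)
    precision = np.where(n_pred > 0, tp / np.maximum(n_pred, 1), priors)
    recall = tp / np.maximum(n_true, 1)
    p, r = precision.mean(), recall.mean()
    f1 = 2 * p * r / (p + r)  # harmonic mean of the mean precision and recall
    return p, r, f1
```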

5.2 Experimental results

In Tables 1, 2, and 3, the experimental results in music genre, mood, and multi-label classification are summarized, respectively. These results have been obtained by applying the standard protocol defined for each dataset. In Tables 4 and 5, music classification results are reported when a small training set is employed. Each classifier is applied to the auditory cortical representations (cortical features) of reduced dimensions, the 20-dimensional MFCCs, the 12-dimensional chroma features, the combination of the cortical features and the MFCCs (fusion cm, i.e., R=2), and the combination of all the aforementioned features (fusion cmc, i.e., R=3). Apart from the proposed LRSMs, the competing classifiers handle the fusion of multiple audio features in an ad hoc manner. That is, an augmented feature vector is constructed by stacking the cortical features on top of the 20-dimensional MFCCs and the 12-dimensional chroma features. In the last rows of Tables 1, 2, and 3, the figures of merit of the top-performing music classification methods are included for comparison purposes.

Table 1 Music genre classification accuracies for the GTZAN, ISMIR, Homburg, 1517-Artists, and Unique datasets
Table 2 Music mood classification accuracies for the MTV dataset
Table 3 Music tagging performance on the CAL500 dataset by applying fivefold cross-validation
Table 4 Music classification results on various datasets obtained by employing a few labeled music recordings
Table 5 Music classification results on the CAL500 dataset obtained by employing a few labeled music recordings

By inspecting Table 1, the best music genre classification accuracy has been obtained by the LRSMs in four out of five datasets when all the features are exploited for music representation. Comparable performance has been achieved by the combination of the cortical features and the MFCCs. This is not the case for the Unique dataset, where the SVMs achieve the best classification accuracy when the fusion of the cortical features, the MFCCs, and the chroma features is employed. Furthermore, the LRSMs outperform all the classifiers they are compared to when applied to the cortical features. The MFCCs are classified more accurately by the SRC or the SVMs than by the LRSMs. This is because the MFCCs and the chroma features have a low dimensionality, and the LRSMs are not able to extract the appropriate common latent features the genre classes are built on. The best classification accuracy obtained by the LRSMs on all datasets ranks high compared with that obtained by the majority of music genre classification techniques listed in the last rows of Table 1. In particular, for the Homburg, 1517-Artists, and Unique datasets, the best accuracy achieved by the LRSMs exceeds that obtained by the state-of-the-art music classification methods. Regarding the GTZAN and ISMIR datasets, it is worth mentioning that the results reported in [20] were obtained by applying feature aggregation on the combination of four elaborate audio features.

Schuller et al. argued that the two dimensions in Thayer's mood model, namely arousal and valence, are independent of each other [15]. Therefore, mood classification can reasonably be performed independently in each dimension, as presented in Table 2. That is, each classifier makes a binary decision between excitation and calmness on the arousal scale and between negativity and positivity on the valence scale. Both overall and per-rater music mood classification accuracies are reported. The overall accuracies are the mean accuracies over all raters for all songs in the dataset. The LRSMs outperform the classifiers they are compared to when the cortical features and their fusion with the MFCCs and the chroma features are employed for music representation, yielding higher classification accuracies than those reported in the row entry NONLYR in Tables 12 and 13 of [15], where only audio features are employed. The inclusion of the chroma features does not alter the measured figures of merit; accordingly, the chroma features could be omitted without any performance deterioration. It is worth mentioning that substantial improvements in classification accuracy are reported when audio features are combined with lyric features [15]. The overall accuracy achieved by the LRSMs in valence and arousal is considered satisfactory, given the inherent ambiguity in the mood assignments and the realistic nature of the MTV dataset.

The results reported in Table 3 indicate that, in music tagging, the LRSMs outperform the MLSRC, the MLkNN, and the PARAFAC2-based method with respect to per-tag precision, per-tag recall, and F1 score for all the music representations employed. The Rank-SVMs yield the best tagging performance with respect to recall and the F1 score. The cortical features seem to be more appropriate for music annotation than the MFCCs, no matter which annotation method is employed. Although the LRSMs achieve top performance against the state-of-the-art methods with respect to per-tag precision, the reported recall is much smaller than that published for the majority of music tagging methods (last five rows in Table 3). This result is due to the song-level features employed here, which fail to capture the temporal information associated with some tags (e.g., instrumentation). In contrast, the well-performing autotagging method with respect to recall, which is reported in Table 3, employs sequences of audio features for music representation.

In Tables 4 and 5, the music classification results obtained under the small-sample size setting are summarized. These results have been obtained by employing either the fusion of the cortical features, the MFCCs, and the chroma features or the fusion of the former two audio representations. Clearly, the LRSMs outperform all the classifiers they are compared to in most music classification tasks. The only exceptions are the prediction of valence on the MTV dataset, where the best classification accuracy is achieved by the SRC, and music genre classification on the Unique dataset, where the top performance is achieved by the SVMs. Given the relatively small number of training music recordings, the results in Tables 4 and 5 are quite acceptable, indicating that the LRSMs are an appealing method for music classification in real-world conditions.

6 Conclusions

The LRSMs have been proposed as a general-purpose music classification method. Given a number of music representations, the LRSMs are able to extract the appropriate features for each specific music classification task, yielding higher performance than the methods they are compared to. Furthermore, the best classification results obtained by the LRSMs either match or slightly exceed those obtained by the state-of-the-art methods for music genre, mood, and multi-label music classification. The superiority of the auditory cortical representations over the conventional MFCCs and chroma features has also been demonstrated in the three music classification tasks studied. Finally, the LRSMs yield high music classification performance when a small number of training recordings is employed, which highlights the potential of the proposed method for practical music information retrieval systems.

Endnotes

a. The LIBSVM was used in the experiments (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).

b. The SPGL1 Matlab solver was used in the implementation of the SRC and the MLSRC (http://www.cs.ubc.ca/~mpf/spgl1/).