1 Introduction

Automatic classification of musical instruments in audio recordings is an example of research on music information retrieval. The huge repositories of audio data available to users are challenging from the point of view of content-based retrieval. Users may be interested in finding melodies sung into a microphone (query by humming), identifying the title and performer of a piece of music submitted as an audio input containing a short excerpt from the piece (query by example), or finding pieces played by their favorite instruments. Manual browsing of audio files is a tedious task, so any automation is very welcome. If the audio data are labeled, searching is easy, but usually the text information added to an audio file is limited to the title, performer, etc. In order to perform automatic content annotation, the sound is usually analyzed, sound features are extracted, and then the contents can be classified into various categories, so as to fulfill the user's query and find the specified contents.

The research presented in this paper is an extended version of an article presented at the ISMIS'09 conference (Wieczorkowska and Kubik-Komar 2009a), addressing the problem of instrument identification in sound mixes. The identification of instruments playing together can aid automatic music transcription by assigning recognized pitches to instrument voices (Klapuri 2004). Also, finding pieces of music with excerpts played by a specified instrument can be desirable for many users of audio repositories. Therefore, investigating the problem of automated identification of instruments in audio recordings is vital for music information retrieval tasks.

In our earlier research (Wieczorkowska et al. 2008), we performed automatic recognition of the predominant instrument in sound mixes using SVM (Support Vector Machines). The feature vector applied had been used before in research on automatic classification of instruments (NSF 2010; Zhang 2007), and it contains sound attributes commonly used for timbre identification purposes. Most of the attributes describe low-level sound properties, based on MPEG-7 audio descriptors (ISO/IEC JTC1/SC29/WG11 2004), and since many of them are multi-dimensional, derived features (minimum/maximum value, etc.) were used instead. Still, the feature vector is quite long, and it contains groups of attributes that could constitute descriptive feature sets by themselves. In this research, we decided to compare the descriptive power of these groups. An in-depth statistical analysis of the investigated sets of features for the selected instruments is presented in this paper.

Our paper is organized as follows. In Section 2 we briefly familiarize the reader with the task of, and problems related to, the automatic identification of musical instrument sounds in audio recordings. The feature groups used in our research are also presented there. In the next section, we describe the settings and methodology of our research, as well as the audio data used to produce the feature vectors. In Section 4 we describe in depth the results of the performed analyses. The last section concludes our paper.

2 Automatic identification of musical instruments based on audio descriptors

Since audio data basically represent sequences of samples encoding the shape of the sound wave, these data are usually processed in order to extract feature vectors, and then automatic classification of the audio data is performed. Sound features used for musical instrument identification include time-domain descriptors, spectral descriptors, and time-frequency descriptors, based e.g. on Fourier or wavelet analysis. Feature sets applied in research on instrument recognition include MFCC (Mel-Frequency Cepstral Coefficients), Multidimensional Scaling Analysis trajectories of various sound features, statistical properties of the spectrum, and so on; more details can be found in Herrera et al. (2000). Many sound descriptors were incorporated into the MPEG-7 standard for multimedia (including audio) content description (ISO/IEC JTC1/SC29/WG11 2004), as they are commonly used in audio research.

Various classifiers can be applied to the recognition of musical instruments. Research performed so far on isolated monophonic (monotimbral) sounds has shown successful application of k-nearest neighbors, artificial neural networks, rough-set based classifiers (Wieczorkowska and Czyzewski 2003), SVM, and so on (Herrera et al. 2000). Research was also performed on polyphonic (polytimbral) data, where more than one instrument sound is present at the same time. In this case, researchers may also try to separate these sounds from the audio source. Outcomes of research on polytimbral instrumental data can be found in Dziubinski et al. (2005), Itoyama et al. (2008), Little and Pardo (2008), Viste and Evangelista (2003), Wieczorkowska et al. (2008), Zhang (2007). The results of research in this area are rather difficult to compare, since different researchers use different data sets: with different numbers of classes (instruments and/or articulation), different numbers of objects/sounds in each class, and essentially different feature sets. The recognition of instruments for isolated sounds can reach 100% for a small number of classes, more than 90% if the instrument or articulation family is identified, and about 70% or less for the recognition of an instrument when there are more classes to recognize. The accuracy of instrument identification in polytimbral mixes is lower than that, even below 50% for same-pitch sounds. More details can be found in the previous paper focusing on this research (Wieczorkowska and Kubera 2009).

Recognition for monotimbral data is relatively easy, in particular for isolated sounds, and more challenging for polytimbral data. The research discussed in this paper aims at the identification of the predominant instrument in mixes of sounds of the same pitch, as this is the most difficult case (harmonic partials in the spectra overlap). An example mix of two sound waves of the same pitch is shown in Fig. 1. As we can see, the flute sound is much more difficult to recognize after adding another sound with an overlapping spectrum.

Fig. 1 Sounds of the same pitch and their mixes. On the left hand side, time domain representation of sound waves is shown; on the right hand side, spectrum of these sounds is plotted. Triangular wave and flute sound are shown, both of frequency 440 Hz (i.e., A4 in MIDI notation). After mixing, spectral components (harmonic partials) overlap. The diagrams were prepared using Adobe Audition (Adobe 2003)

2.1 Feature groups

In our previous research, we investigated automatic identification of the predominant instrument in same-pitch mixes (Wieczorkowska et al. 2008; Wieczorkowska and Kubik-Komar 2009b). The feature vector used in that research consisted of 219 features, based on MPEG-7 audio descriptors and other parameters used in automatic sound classification (ISO/IEC JTC1/SC29/WG11 2004; NSF 2010; Zhang 2007). Although these features had been used before in various configurations in similar research, this feature vector was chosen arbitrarily. Therefore, we decided to check whether it could be reduced. In fact, the feature vector contains, among others, a few groups of descriptors that can be applied to sound recognition on their own (Wieczorkowska and Kubera 2009):

  • Audio Spectrum Basis: basis1, ..., basis165, parameters of the spectrum basis functions, used to reduce dimensionality by projecting the spectrum (for each frame) from a high-dimensional space to a low-dimensional space with compact salient statistical information. The spectral basis descriptor is a series of basis functions derived from the Singular Value Decomposition (SVD) of a normalized power spectrum. The total number of sub-spaces in the basis functions in our case was 33, and for each sub-space, the minimum, maximum, mean, distance, and standard deviation were extracted, yielding 33 subgroups of five elements each. The obtained values were averaged over all analyzed frames of the sound;

  • MFCC: minimum, maximum, mean, distance, and standard deviation of the MFCC vector, averaged over the entire sound. In order to extract MFCC, the Fourier transform is calculated for the analyzed sound frames, and then the logarithm of the amplitude spectrum is taken. Next, the spectral coefficients are grouped into 40 bands according to the mel scale (a perceptually uniform frequency scale). The Discrete Cosine Transform is then applied to the 40 obtained coefficients, yielding 13 cepstral features per frame (Logan 2000). The distance of the vector is calculated as the sum of dissimilarities (absolute differences of values) over every pair of coordinates in the vector (a sketch of these summary statistics is given after this list);

  • energy—average energy of spectrum in the parameterized sound;

  • Tris: tris1, ..., tris9, various ratios of harmonics (i.e. harmonic partials) in the spectrum; tris1: ratio of the energy of the fundamental to the total energy of all harmonics, tris2: amplitude difference [dB] between the 1st and 2nd partials, tris3: ratio of the sum of the 3rd and 4th partials to the total energy of harmonics, tris4: ratio of partials no. 5–7 to all harmonics, tris5: ratio of partials no. 8–10 to all harmonics, tris6: ratio of the remaining harmonic partials to all harmonics, tris7: brightness (gravity center of the spectrum), tris8, tris9: contents of even and odd (without the fundamental) harmonics in the spectrum, respectively (tris8 and tris9 are also illustrated in the sketch after this list);

  • Audio Spectrum Flatness: flat1, ..., flat25, a vector describing the flatness property of the power spectrum within a frequency band, for selected bands, averaged over the entire sound; the flatness values describe 25 frequency bands. According to MPEG-7 recommendations, the audible range is divided into eight octaves covering 62.5 Hz–16 kHz with 1/4-octave resolution, plus two additional bands covering the lowest (below 62.5 Hz) and the highest (above 16 kHz) frequencies; our 25 bands cover 24 bands with 1/4-octave resolution, starting from approximately octave no. 4 (in MIDI notation), as usually implemented in MPEG-7 feature vectors, plus an additional band covering the highest frequency range.
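
To make the above descriptions more concrete, the sketch below shows how the MFCC summary statistics and the even/odd harmonic ratios (tris8, tris9) could be computed from precomputed per-frame MFCC vectors and harmonic amplitudes. This is a minimal illustration under stated assumptions (per-frame statistics averaged over frames, energy-based harmonic content), not the original implementation used to build the 219-element feature vector.

```python
import numpy as np

def mfcc_group_features(mfcc_frames):
    """Summarize a (num_frames x 13) MFCC matrix into MFCCmin, MFCCmax,
    MFCCmean, MFCCdist, and MFCCsd.

    Assumed reading of the description above: statistics are computed per
    frame over the 13 coefficients and then averaged over all frames;
    'distance' is the sum of absolute differences over every pair of
    coefficients of a frame.
    """
    mins, maxs, means, dists, sds = [], [], [], [], []
    for v in np.asarray(mfcc_frames, dtype=float):
        pair_diffs = np.abs(v[:, None] - v[None, :])
        mins.append(v.min())
        maxs.append(v.max())
        means.append(v.mean())
        dists.append(pair_diffs[np.triu_indices(len(v), k=1)].sum())
        sds.append(v.std())
    return {"MFCCmin": np.mean(mins), "MFCCmax": np.mean(maxs),
            "MFCCmean": np.mean(means), "MFCCdist": np.mean(dists),
            "MFCCsd": np.mean(sds)}

def tris_even_odd(harmonic_amplitudes):
    """tris8/tris9 sketch: content of even and odd (without the fundamental)
    harmonic partials, here taken as energy ratios (an assumption)."""
    energy = np.asarray(harmonic_amplitudes, dtype=float) ** 2
    total = energy.sum()
    even = energy[1::2].sum()   # partials 2, 4, 6, ... (index 0 is the fundamental)
    odd = energy[2::2].sum()    # partials 3, 5, 7, ...
    return even / total, odd / total
```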

When we investigated the significance of the set of 219 sound parameters used in our previous research (including the ones mentioned above), the attributes representing the above groups were often indicated as significant, i.e. as having high discriminant power (Wieczorkowska and Kubik-Komar 2009b). Therefore, it seemed promising to perform investigations for the groups mentioned above.

Since the Audio Spectrum Basis group constitutes a high-dimensional vector itself, and the first subgroup basis1, ..., basis5 turned out to have high discriminant power, we decided to limit the Audio Spectrum Basis group to basis1, ..., basis5. In the Audio Spectrum Flatness group, flat10, ..., flat25 had high discriminant power, whereas flat1, ..., flat9 did not, as we observed in our previous research (Wieczorkowska and Kubik-Komar 2009b). Also, we decided to investigate energy as a single conditional attribute, as well as the Tris group and the MFCC group. One could discuss whether such parameters as the minimum or maximum of MFCC (MFCC min, MFCC max) are meaningful, but since these parameters yielded high discriminant power, we decided to investigate them.

Altogether, the following groups (feature sets) were investigated in this paper:

  • Audio Spectrum Basis group: basis1, ..., basis5;

  • MFCC group: MFCCmin, MFCCmax, MFCCmean, MFCCdist, and MFCCsd parameters;

  • Energy: single parameter, energy;

  • Tris group: tris1, ..., tris9 parameters;

  • AudioSpectrumFlatness group: flat10, ..., flat25.

3 Research settings

In order to check whether particular groups of sound features can discriminate instruments, we performed multivariate analysis of variance (MANOVA). Next, we analyzed the parameters of each group using a univariate method (ANOVA). In case of rejecting the null hypothesis about the equality of means between instruments for a given feature group, we used post hoc comparisons to find out how particular instruments can be discriminated on the basis of the parameters included in this feature group, i.e. which sound attributes from this group are best suited to recognize a given instrument (discriminate it from other instruments).
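
The statistical calculations in this study were performed in STATISTICA (see Section 3.2); as an illustration, a minimal sketch of such a MANOVA, ANOVA, and Tukey HSD pipeline using scipy and statsmodels is given below. The data frame layout (one row per sound sample, an 'instrument' label column) and the feature names are assumptions for the example.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_group(df: pd.DataFrame, features: list, alpha: float = 0.01):
    """MANOVA for a feature group, then per-feature ANOVA and Tukey HSD."""
    # 1) MANOVA: do the vectors of feature means differ between instruments?
    formula = " + ".join(features) + " ~ instrument"
    print(MANOVA.from_formula(formula, data=df).mv_test())

    # 2) ANOVA per feature: which individual features differ significantly?
    for feat in features:
        samples = [g[feat].values for _, g in df.groupby("instrument")]
        f_stat, p_val = f_oneway(*samples)
        print(f"{feat}: F = {f_stat:.2f}, p = {p_val:.3g}")

        # 3) Post hoc Tukey HSD comparisons for this feature
        print(pairwise_tukeyhsd(df[feat], df["instrument"], alpha=alpha))

# Example call for the hypothetical MFCC group columns:
# analyze_group(df, ["MFCCmin", "MFCCmax", "MFCCmean", "MFCCdist", "MFCCsd"])
```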

3.1 Audio data

Our data represented sounds of 14 instruments from the MUMS CDs (Opolko and Wapnick 1987): B-flat clarinet, flute, oboe, English horn, trumpet, French horn, tenor trombone, violin (bowed vibrato), viola (bowed vibrato), cello (bowed vibrato), piano, marimba, vibraphone, and tubular bells. Twelve sounds per instrument, representing octave no. 4 (in MIDI notation), were used as the target sounds to be identified in classification. Additional sounds were mixed with the main sounds, both for training and for testing of the classifiers in further experiments with automatic classification of musical instruments (Kursa et al. 2009). The level of the added sounds was adjusted to 6.25%, \(12.5/\sqrt{2}\)%, 12.5%, \(25/\sqrt{2}\)%, 25%, \(50/\sqrt{2}\)%, and 50% of the level of the main sound, since our goal was to identify the predominant instrument. For each main instrumental sound, four additional mixes with artificial sounds were prepared for each level: with white noise, with pink noise, with a triangular wave, and with a saw-tooth wave (the latter two of harmonic spectrum), of the same pitch as the main sound. This set was prepared to be used as a training set for classifiers, subsequently tested on musical instrument sound mixes using same-pitch sounds. Each sound to be identified was mixed with 13 sounds of the same pitch, representing the remaining 13 instruments from this data set. Again, the sounds added in the mixes were adjusted in level, at the same levels as in training. Results of these experiments can be found in Kursa et al. (2009).
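
As an illustration, the sketch below prepares a mix of a main sound and an added sound at the relative levels listed above. The scaling convention (level applied relative to the peak amplitude of the main sound) is an assumption for the example; the source only states that the level of the added sound was set to a percentage of the level of the main sound.

```python
import numpy as np

# Relative levels of the added sounds used for the mixes.
LEVELS = [0.0625, 0.125 / np.sqrt(2), 0.125, 0.25 / np.sqrt(2),
          0.25, 0.5 / np.sqrt(2), 0.5]

def mix(main: np.ndarray, added: np.ndarray, level: float) -> np.ndarray:
    """Mix 'added' into 'main' so that its peak level is 'level' times
    the peak level of the main sound (assumed scaling convention)."""
    n = min(len(main), len(added))
    main, added = main[:n], added[:n]
    scale = level * np.max(np.abs(main)) / (np.max(np.abs(added)) + 1e-12)
    return main + scale * added
```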

In this research, we investigated data representing musical instrument sounds, as well as their mixes with artificial sounds. All these data were parameterized using the feature sets described in Section 2.

3.2 Materials and methods

In the described experiments, our target was to recognize the instrument as a class. For each instrument, the sound samples represented various pitches (12 notes from one octave) and various levels of added sounds. Altogether, each instrument was represented by 420 samples. Since our goal was to identify the instrument, we did not distinguish between particular levels or pitch values.

MANOVA was used in our research to verify the hypothesis about the lack of differences (between the instruments) between the vectors of mean values of the selected features. The test statistic based on Wilks' Λ (Morrison 1990) was applied, which can be transformed into a statistic having approximately an F distribution. A transformation for Wilks' lambda was given by Rao (Finn 1974; Rao 1951):

$$ F=\frac{1-\Lambda^{1/s}}{\Lambda^{1/s}}\cdot \frac{ms+1-d_{h}p/2}{d_{h}p} $$

where

$$ s=\sqrt{\frac{p^{2}d_{h}^{2}-4}{p^{2}+d_{h}^{2}-5}} $$
$$ m=d_{e}-(p+1-d_{h})/2 $$
  • p: number of variables,

  • d_h: number of degrees of freedom for the hypothesis,

  • d_e: number of degrees of freedom for the error.

Using Rao’s transformation, the hypothesis H_0 is rejected with confidence 1 − α if F exceeds the upper 100α percentage point of the F distribution with d_h·p and ms + 1 − d_h·p/2 degrees of freedom (Finn 1974). In our case, p = 5 + 5 + 1 + 9 + 16 = 36 parameters, d_h = k − 1, where k is the number of groups (instruments), here k = 14, and d_e = N − k, where N is the total sample size, so N = 14·420 = 5,880.
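
As a short numeric sketch of the quantities above (the analysis itself was carried out in STATISTICA), the following code computes s, m, the two degrees of freedom, and the critical F value for our case; scipy is used only to look up the F quantile.

```python
import numpy as np
from scipy.stats import f as f_dist

p = 36                    # number of variables (5 + 5 + 1 + 9 + 16)
k = 14                    # number of instruments
N = 14 * 420              # total sample size
d_h, d_e = k - 1, N - k   # degrees of freedom for the hypothesis and the error

s = np.sqrt((p**2 * d_h**2 - 4) / (p**2 + d_h**2 - 5))
m = d_e - (p + 1 - d_h) / 2
df1 = d_h * p
df2 = m * s + 1 - d_h * p / 2

def rao_F(wilks_lambda: float) -> float:
    """Rao's F approximation for a given value of Wilks' lambda (formula above)."""
    ratio = (1 - wilks_lambda ** (1 / s)) / wilks_lambda ** (1 / s)
    return ratio * df2 / df1

alpha = 0.01
critical_F = f_dist.ppf(1 - alpha, df1, df2)  # reject H_0 if F exceeds this value
```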

This form of presenting MANOVA results makes it easier to obtain the p-value and is definitely preferred (Bartlett et al. 2000). In case of rejecting this hypothesis, i.e. in case of finding out that there existed significant differences of means between the instruments, we applied post hoc comparisons between the average values of the studied features, based on Tukey's HSD (Honestly Significant Difference) test (Tukey 1993; Winer et al. 1991), preceded by univariate analysis of variance (ANOVA).

The analysis of variance assumes normality of the distribution of the studied features, as well as homogeneity of variance between groups (instruments in our case). However, if the number of observations per group is fairly large, then deviations from normality do not really matter (StatSoft, Inc. 2001). This is because of the central limit theorem, according to which the sampling distribution of the mean approximates the normal distribution, irrespective of the distribution of the variable in the population. A more detailed discussion of the robustness of the F statistic can be found in Box and Andersen (1955) or Lindman (1974). In our research, all features we used represented means calculated over a sequence of frames for a given sound, thus improving the normality of the distribution. As far as homogeneity of variance between groups is concerned, Lindman (1974, p. 33) shows that the F statistic is quite robust against violations of the homogeneity assumption, i.e. against heterogeneity of variances (Box 1954a, b; StatSoft, Inc. 2001). We might use two powerful and commonly applied tests of homogeneity of variance, namely the Levene test and the Brown–Forsythe modification of this test. However, as mentioned above, the assumption of homogeneity of variances is not crucial for the analysis of variance, in particular in the case of balanced designs (equal numbers of observations). Moreover, these tests are not necessarily very robust themselves; for instance, Glass and Hopkins (1995, p. 436) state that these tests are fatally flawed. Taking into consideration the above explanations, the large number of observations in our research (each one representing a mean of frame-based features), and the equal group sizes, we did not pay particular attention to the assumptions of the analysis of variance.
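
For completeness, a minimal sketch of how these homogeneity tests could be run with scipy is given below; as explained above, they were not relied upon in this study. The input layout (one array of feature values per instrument) is an assumption for the example.

```python
from scipy.stats import levene

def homogeneity_checks(groups):
    """Run Levene's test and its Brown-Forsythe modification on a list of
    1-D arrays, one per instrument, containing values of a single feature."""
    levene_stat, levene_p = levene(*groups, center="mean")    # classical Levene test
    bf_stat, bf_p = levene(*groups, center="median")          # Brown-Forsythe variant
    return (levene_stat, levene_p), (bf_stat, bf_p)
```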

The results of ANOVA tell us whether there are significant differences between the instruments' means, separately for each of the studied features in a given group; these calculations are performed before the post hoc comparisons are made. By changing the multivariate analysis into a univariate one, we sacrifice the information about the relationships among the studied features, but, on the other hand, we obtain very useful information about the discriminative power of each parameter.

The post hoc comparisons are presented in the form of homogeneous groups, defined by the mean values of a given feature and consisting of the instruments which are not significantly different with respect to this feature.

Therefore, the mean values of each feature defined homogeneous groups of instruments. If the differences between the means for some instruments were not statistically significant, these instruments constituted a group. The fewer instruments (sometimes even only one) in such a homogeneous group, the higher the discerning power of a given feature.
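
The homogeneous-group tables in Section 4 come from STATISTICA; as an illustration of how such groups can be read off pairwise post hoc results, a simplified sketch is given below. The inputs (instruments sorted by the feature mean, and a predicate reporting whether a Tukey HSD comparison is significant) are assumptions for the example.

```python
def homogeneous_groups(instruments_by_mean, differs):
    """Return maximal runs of consecutive instruments (sorted by the feature
    mean) in which no pair is significantly different; 'differs(a, b)' should
    return True when the post hoc comparison of a and b is significant."""
    n = len(instruments_by_mean)
    runs = []
    for start in range(n):
        end = start
        # extend the run while the next instrument does not differ from any member
        while end + 1 < n and not any(
            differs(instruments_by_mean[i], instruments_by_mean[end + 1])
            for i in range(start, end + 1)
        ):
            end += 1
        runs.append(tuple(instruments_by_mean[start:end + 1]))
    # keep only maximal runs (drop runs contained in a longer one)
    return [list(r) for r in runs if not any(set(r) < set(o) for o in runs)]
```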

All statistical calculations presented in this paper were obtained using STATISTICA software (StatSoft, Inc. 2001).

4 Results of analyses of feature groups

In this section, we present the results of the analyses of the feature groups defined in Section 2. Each group represents an approximately uniform set, meaningful from the point of view of identification of particular sound timbres, thus aiding the recognition of particular musical instruments.

4.1 Analysis of the Audio Spectrum Basis group

The results of MANOVA show that the vector of mean values of the analyzed Audio Spectrum Basis features significantly differed between the instruments (F = 188.0, p < 0.01). On the basis of the univariate results (ANOVA) we conclude that the means for the instruments differed significantly for each parameter separately. The F statistic, having a Fisher distribution with 13 and 5,866 degrees of freedom (i.e. F(13, 5,866)), was equal to 113.8, 115.0, 12.9, 371.78, and 405.7 for basis 1, ..., basis 5, respectively, and each of these values produced a p-value less than 0.01. These results allowed us to apply post hoc comparisons for each of the Audio Spectrum Basis parameters, presented in the form of tables consisting of homogeneous groups of instruments (Fig. 2).

Fig. 2 Homogenous groups of instruments for Audio Spectrum Basis parameters. The columns labeled as 1, ..., 9 represent homogenous groups with respect to a given parameter (feature)

The results of the post hoc analysis revealed that basis 4, basis 5 and basis 1 distinguish instruments to a large extent. The influence of basis 2 and basis 3 on the differentiation between instruments is rather small. Marimba, piano, vibraphone, and the pair of tubular bells and French horn often determine separate groups. Piano, vibraphone, marimba, cello and trombone are very well separated by basis 5, since each of these instruments constitutes a 1-element group. Piano, vibraphone, marimba, and cello are separated by basis 4, too. Also, basis 1 separates marimba and piano. The basis 3 parameter only discerns marimba from the other instruments (only two groups are produced); basis 2 does not separate any single instrument.

While looking at the means of the basis parameters (Fig. 3), we can indicate the parameters producing similar plots of means: these are basis 5, basis 4, and, to a lesser degree, basis 2. Similar values in the plots of means indicate that the parameters producing these plots have similar discriminative properties for these data. When the mean values of some parameters are similar for several classes, i.e. instruments (which means that they are almost aligned in the plot, and the distances between these values are short), then these instruments may be placed in the same group, i.e. a group homogeneous with respect to these parameters. When the mean value for a particular instrument is distant from the other mean values, then this instrument is very easy to distinguish from the others with respect to this feature. For example, marimba is well distinguished from the other instruments with respect to basis 3 (the mean value for marimba is distant from the others), whereas the mean values of basis 4 and basis 5 for English horn and trumpet are similar, and these instruments are placed in the same group, so they are difficult to differentiate on the basis of these features.

Fig. 3 Means for the Audio Spectrum Basis group

We can also see that, despite the low number of groups produced by basis 3, the difference of means between marimba and the other instruments is very high, so the influence of this parameter on such a distinction (between marimba and the others) is quite important.

The AudioSpectrumBasis group represents features extracted through SVD, so good differentiation of instruments was expected, because SVD should yield the most salient features of the spectrum. On the other hand, these features might not be sufficient for the discrimination of particular instruments. It is a satisfactory result that we can differentiate several instruments on the basis of three features (basis 1, basis 4, and basis 5). In particular, distinguishing between marimba, vibraphone and piano is a very good result, since these instruments pose difficulties in the automatic instrument classification task. Their sounds (particularly marimba and vibraphone) are similar and have no sustained part, thus no steady state, so the calculation of the spectrum is more difficult, but, as we can see, still useful for instrument discrimination purposes.

4.2 Analysis of the MFCC group

The results of MANOVA indicate that the mean values of MFCC features differ significantly between the studied instruments (F = 262.84, p < 0.01).

The univariate results show that the means of each parameter (Fig. 4) from this group were significantly different at the 0.01 significance level, with F(13, 5,866) values equal to 218.67, 329.92, 479.27, 698.8, and 550.9 for MFCC min, MFCC max, MFCC mean, MFCC dist and MFCC sd, respectively.

Fig. 4 Means for the MFCC group

The analysis of homogeneous groups (see Fig. 5) shows that MFCC sd and MFCC max yielded the highest differences of means, while MFCC min yielded the lowest. Each feature defined six to nine groups, homogeneous with respect to the mean value of a given feature.

Fig. 5 Homogenous groups of instruments for MFCC-based parameters

The piano determined a separate group for every parameter from our MFCC feature set. The conclusion is that this instrument is very well distinguished by MFCC. However, there were no parameters here capable of distinguishing between marimba and flute. These two instruments were always situated in the same group, since the average values of the studied parameters for these instruments do not differ much. Vibraphone and bells were in different groups only for MFCC mean.

Piano, cello, viola, violin, bells, English horn, oboe, French horn, and trombone constitute separate groups, so these instruments can be easily recognized on the basis of MFCC. On the other hand, some groups overlap, i.e. the same instrument may belong to two groups.

The shapes of the plots of the mean values of MFCC sd, MFCC dist and, to a lesser extent, MFCC max (Fig. 4) are very similar; however, the homogeneous groups, apart from piano, are different. As we mentioned before, piano is very well distinguished on the basis of all parameters from the MFCC group, since it always constitutes a separate, 1-element group. This is because the means for piano are in most cases very distant from those of the other instruments.

MFCC parameters described here represent general properties of the MFCC vector. We consider it a satisfactory result that such a compact representation turned out to be sufficient to discern between many instruments, mainly stringed instruments of sustained sounds, i.e. cello, viola, and violin, and wind instruments, both woodwinds (of very similar timbre, i.e. oboe and English horn, which can be considered as a type of oboe) and brass (French horn and trombone). Even non-sustained sounds can be distinguished, i.e. tubular bells and piano, separated by every feature from our MFCC group.

4.3 Analysis of energy

In our previous paper (Wieczorkowska and Kubik-Komar 2009a) we concluded that energy yielded different results than the Tris parameters. Therefore, in this paper we decided to analyze this parameter as a separate group.

The results of the analysis of variance indicated significant differences between the mean values of the studied instruments (F = 1,036.74, p < 0.01). In Fig. 6 we can notice the extremely low value for piano and the highest one for violin, so we can expect that these two instruments are well distinguished by this parameter.

Fig. 6 Means for Energy

Our presumption is confirmed by the post hoc results (Fig. 7). This parameter formed eight homogeneous groups, and four of them were determined by single instruments: piano, violin, vibraphone and trumpet.

Fig. 7 Homogenous groups of instruments for energy

Energy turned out to be quite discriminative as a single parameter. We are aware that if more input data are added (more recordings), our outcomes may need re-adjustment; still, discriminating four instruments here (piano, violin, vibraphone and trumpet) on the basis of one feature confirms the high discriminative power of this attribute.

4.4 Analysis of the tris group

The results of MANOVA show that the mean values of the tris parameters were significantly different for the studied set of instruments (F = 210.9, p < 0.01).

The univariate F(13, 5,866) test results are as follows: tris 1: 352.08, tris 2: 40.35, tris 3: 402.114, tris 4: 280.86, tris 5: 84.14, tris 6: 19.431, tris 7: 12.645, tris 8: 436.89, tris 9: 543.39; all these values indicated significant differences between the means of the studied instruments at the 0.01 significance level. For the tris feature set, consisting of the tris 1, ..., tris 9 parameters, each parameter defined from three to nine groups, as presented in Fig. 8.

Fig. 8 Homogenous groups of instruments for the tris parameters

As we can see, tris 3, tris 8, and tris 9 produced the highest numbers of homogeneous groups. Some wind instruments (trumpet, trombone, English horn, oboe), or their pairs, were distinguished most easily: they determined separate groups for the features forming eight to nine homogeneous groups.

Taking into consideration the plots of mean values (Fig. 9), we can add some more information. Namely, the tris 3 parameter, in spite of constituting the lowest number of homogeneous groups, distinguishes piano very well. In most cases, the means of vibraphone and marimba, and sometimes also piano, are similar, and when they are high, the means for oboe and trumpet are low, and vice versa.

Fig. 9 Means for the tris group

In the case of the Tris group, we expected good results, since these features were especially designed for the purpose of musical instrument identification. For instance, the clarinet shows a low content of even harmonic partials in its spectrum for lower sounds (tris 8). However, as we can see, other instruments (piano, vibraphone) also show a low content of even partials, and marimba an even lower one. On the other hand, the clarinet shows a very high tris 9 value, i.e. the amount of odd harmonic partials (excluding the fundamental, counted as partial no. 1) in the spectrum, which roughly corresponds to the small amount of even partials, and this feature discriminates the clarinet very well. The results for the Tris parameters, presented in Fig. 8, show that this set of features is quite well designed and can be applied as a helpful tool for musical instrument identification purposes.

4.5 Analysis of the Audio Spectrum Flatness group

The AudioSpectrumFlatness feature set consisted of the flat10, ..., flat25 parameters. The vector of means for these parameters, similarly to the other groups, significantly differed between the studied instruments (F = 94.00, p < 0.01).

All p-values for the univariate variance test results were less than 0.01, with the following values of F(13, 5,866): flat 10: 437.101, flat 11: 386.281, flat 12: 351.645, flat 13: 273.743, flat 14: 255.16, flat 15: 223.367, flat 16: 194.828, flat 17: 138.04, flat 18: 106.26, flat 19: 90.84, flat 20: 62.56, flat 21: 54.47, flat 22: 62.54, flat 23: 60.28, flat 24: 70.61, flat 25: 58.4. The plots of means for these parameters are presented in Fig. 10. As we can see, the higher the index of the element of this feature vector (i.e. the higher the frequency band), the higher the mean values of the flatness parameters.

Fig. 10 Means for the Audio Spectrum Flatness group

For the first four plots we can notice that most values are at a similar level, except for marimba and vibraphone, whose means are high compared to the other instruments. Then the values for the other instruments become higher and higher, except for clarinet, viola, oboe, English horn, cello and trumpet, whose means change to a lesser degree. We can also observe these changes in the results of the post hoc comparisons. They show the high discriminating power of flat 10, ..., flat 14, which distinguish marimba, vibraphone and French horn (these instruments constitute separate, 1-element groups), and, to a lesser degree, piano (Fig. 11). For these features (flat 10, ..., flat 14), some 1-element groups are produced, while for the subsequent features from the Audio Spectrum Flatness set the size of the homogeneous groups grows.

Fig. 11 Homogenous groups of instruments for Audio Spectrum Flatness parameters

To be more precise, with increasing i in flat i, the group consisting of marimba, vibraphone, and French horn was growing, as other instruments were added to it. At the same time, the homogeneous group determined by oboe, clarinet, trumpet, violin, and English horn was splitting into separate groups.

AudioSpectrumFlatness is the biggest feature set analyzed here. The high discriminative power of spectral flatness is confirmed by the results shown in Fig. 11, since in many cases 1-element groups are created with respect to particular elements of the flatness feature vector. This illustrates the high descriptive power of the shape of the spectrum, represented here by the spectral flatness.

5 Summary and conclusions

In this paper, we compared feature sets used for musical instrument sound classification. Mean values for data representing the given instruments, and statistical tests for these data, were presented and discussed. Also, for each feature, homogeneous groups were found, representing instruments which are similar with respect to this feature. Instruments for which the mean values of a given feature were significantly different were assigned to different groups, and instruments for which the mean values were not statistically different were assigned to the same group.

Sound features were grouped according to the parameterization method, including MFCC, proportions of harmonics in the sound spectrum, and MPEG-7 based parameters (Audio Spectrum Flatness, Audio Spectrum Basis). These groups were chosen as a conclusion of our previous research, which indicated the high discriminant power of particular features for instrument discrimination purposes.

Piano, vibraphone, marimba, cello, English horn, French horn, and trombone turned out to be the most discernible instruments. This is very encouraging, because marimba and vibraphone represent idiophones (a part of the percussion group), so their sound is produced by striking and is not sustained (similarly for the piano); there is no steady state, which makes parameterization more challenging. Also, since the investigations were performed for small groups of features (up to 16), we conclude that these groups constitute a good basis for instrument discernment.

The results enabled us to indicate, for each instrument, which parameters within a given group have the highest distinguishing power, i.e. which features are most suitable to distinguish this instrument. Following the earlier research based on the sound features described here and SVM classifiers (Wieczorkowska et al. 2008), experiments on automatic musical instrument identification were also performed using random forests as classifiers (Kursa et al. 2009). The obtained results confirmed the significance of particular features, and yielded very good accuracy.