Introduction

Blind audio source separation (BASS) has been receiving increasing attention in recent years. BASS techniques attempt to recover source signals from a mixture when the mixing process is unknown. "Blind" means that very little prior information is needed to carry out the separation, although in practice it is absolutely necessary to make assumptions about the statistical nature of the sources or about the mixing process itself.

In the most general case, see Figure 1, separation deals with N sources and M mixtures (microphones). The number of mixtures defines each particular case, and for each situation the literature provides several separation methods. Probably because the over-determined case has essentially been solved, the most extensively studied case is underdetermined separation, where N > M (although N > M does not always imply poorer results). For example, in stereo separation (through the DUET algorithm [1] and other time-frequency masking evolutions [2–4]), the delay and attenuation between the left- and right-channel information can be used to discriminate the sources present and to infer some kind of scene layout [5].

Figure 1

The general BASS task, with N mixed sources and M sensors.

In other applications, when a monaural solution is needed (i.e., when M = 1), the mathematical indeterminacy of the mixture significantly increases the difficulty of the task. Hence, monaural separation is probably the most difficult challenge for BASS; yet even in this case, the human auditory system itself can somehow segregate the acoustic signal into separate streams [6]. Several techniques for solving the BASS problem in general (and monaural separation in particular) have been developed.

Psychoacoustic studies, such as computational auditory scene analysis [7, 8], inspired by auditory scene analysis [6], attempt to explain this selective-attention capability of the human auditory system. Psychoacoustics also suggests that temporal and spectral coherence between sources can be used to discriminate between them [9]. Among the statistical techniques, independent component analysis (ICA) [10, 11] assumes statistical independence among sources, while independent subspace analysis [12] extends ICA to single-channel source separation. Sparse decomposition [13] assumes that a source is a weighted sum of bases from an overcomplete set, considering that most of these bases are inactive most of the time [14], that is, their relative weights are presumed to be mostly zero. Non-negative matrix factorization [15, 16] attempts to find a mixing matrix (with sparse weights [17, 18]) and a source matrix with non-negative elements such that the reconstruction error is minimized.

Finally, sinusoidal modeling techniques assume that every sound is a linear combination of sinusoids (partials) with time-varying frequencies, amplitudes, and phases. Therefore, sound separation requires a reliable estimation of these parameters for each source present in the mixture [19–21], or some a priori knowledge, e.g., rough pitch estimates of each source [22, 23]. One of the most important applications is monaural speech enhancement and separation [24]. These methods are generally based on some analysis of the speech or the interference and a subsequent speech amplification or noise reduction. Most authors have used the short-time Fourier transform (STFT) to analyze the mixed signal in order to obtain its main sinusoidal components or partials. Auditory-based representations [25] can also be used.

One of the most important and difficult problems to solve in the separation of pitched musical sounds is overlapping harmonics, that is, when the frequencies of two harmonics are approximately the same. The problem of overlapping harmonics has been studied during the past decades [26], but it is only in recent years that there has been a significant increase in research on this topic. Given that the information in overlapped regions is unreliable, several recent systems have attempted to utilize the information from neighboring non-overlapped harmonics. Some systems assume that the spectral envelope of the instrument sounds is smooth [27–29]; hence, the amplitude of an overlapped harmonic can be estimated from the amplitudes of non-overlapped harmonics from the same source, via weighted sum [20] or interpolation [21, 27]. However, the spectral smoothness approximation is often violated in real instrument recordings. A different approximation is known as common amplitude modulation (CAM) [22], which assumes that the amplitude envelopes of different harmonics from the same source tend to be similar. The authors of [30] propose an alternative technique for harmonic envelope estimation, called harmonic temporal envelope similarity (HTES). They use the information from the non-overlapped harmonics of notes of a given instrument, wherever they occur in a recording, to create a model of the instrument which can be used to reconstruct the harmonic envelopes of overlapped harmonics, allowing separation of completely overlapped notes. Another option is the average harmonic structure (AHS) model [31] which, given the number of sources, creates a harmonic structure model for each present source, using these models to separate notes showing overlapping harmonics.

In this study, we use an experimentally less restrictive version of the CAM assumption within a sinusoidal model generated by complex bandpass filtering of the signal. Non-overlapping harmonics are obtained using a binary masking approach derived from the complex wavelet additive synthesis (CWAS) algorithm [32], which is based on the complex continuous wavelet transform (CCWT). The main advantage of the proposed technique is the synthesis capability of the CWAS algorithm: using the CWAS wavelet coefficients, it is possible to synthesize an output signal which differs negligibly (numerically and acoustically) from the original input signal. Hence, the non-overlapped partials can be obtained accurately. The separated amplitudes of overlapping harmonics are reconstructed proportionally from the non-overlapping harmonics, following energy criteria in a least-squares framework. This way, it is possible to relax the phase restrictions, and the instantaneous phase of each overlapping source can also be constructed from the phase of non-overlapping partials. At its current stage, the proposed technique can be used to separate two or more musical instruments, each one playing a single note.

The rest of the article is organized as follows. “Complex bandpass filtering” section provides a brief introduction to the CCWT and the CWAS algorithms, including the interpretation of their results and the additive synthesis process. The proposed separation algorithm and its main blocks (such as the fundamental frequency estimation) are presented in “Separation algorithm” section, with a detailed example. The numerical results of the different experiments and tests are shown in “Experimental results” section. Finally, the main conclusions and current and future lines of work are presented in “Conclusions” section.

Complex bandpass filtering

The CCWT

The CCWT can be defined in several ways [33]. For a certain input signal x(t), it can be written as

W_x(a,b) = \int_{-\infty}^{+\infty} x(t)\, \Psi^{*}_{a,b}(t)\, dt
(1)

where * denotes the complex conjugate and Ψa,b(t) is the mother wavelet, frequency-scaled by a factor a and temporally shifted by a factor b:

\Psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \Psi\!\left(\frac{t-b}{a}\right)
(2)

In our case, we will choose a complex analyzing wavelet, specifically the Morlet wavelet. The Morlet wavelet is a complex exponential modulated by a Gaussian of width 2\sqrt{2}/\sigma, centered at the frequency ω0/a. Its Fourier transform is

\hat{\psi}_a(\omega) = C\, e^{-\frac{\sigma^2 (a\omega - \omega_0)^2}{2}}
(3)

where C is a normalization constant which can be calculated independently of the input signal in order to conserve the energy of the transform [34].
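For illustration, the following Python/NumPy sketch builds such a bank of Fourier-domain Morlet filters per Equation (3) and applies it to a signal. The values of σ and ω0, the scale grid, and the unit-energy normalization (playing the role of the constant C) are our assumptions for the sketch, not parameters taken from the original implementation.

    import numpy as np

    def morlet_filterbank(center_freqs_hz, n, fs, sigma=2.0, omega0=5.33):
        """One Fourier-domain Morlet filter per center frequency, Eq. (3).

        The scale a is chosen so that each filter peaks at w0/a = 2*pi*fc.
        sigma, omega0 and the normalization are illustrative choices.
        """
        omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # rad/s, signed
        bank = []
        for fc in center_freqs_hz:
            a = omega0 / (2.0 * np.pi * fc)
            g = np.exp(-0.5 * sigma ** 2 * (a * omega - omega0) ** 2)
            bank.append(g / np.sqrt((g ** 2).sum()))  # energy normalization C
        return np.array(bank)

    def cwt_coefficients(x, bank):
        """Complex coefficients W_x(a_m, t) of Eq. (1), via FFT-domain filtering.

        The filters are negligible at negative frequencies, so each output row
        is close to an analytic signal: its modulus gives the instantaneous
        amplitude and its argument the instantaneous phase.
        """
        X = np.fft.fft(x)
        return np.fft.ifft(X[None, :] * bank, axis=1)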

A general audio signal (and in particular a monocomponent signal) can be modeled as

x(t)=A(t)cos[ϕ(t)]
(4)

From the modulus and the argument of the complex wavelet coefficients of Equation (1) [32], it is possible to obtain a complex function, which can be written as

\rho(t) \simeq A(t)\, e^{j[\phi(t)]}
(5)

This result can be applied locally to every detected partial of the analyzed signal, providing a model of the audio signal close to its canonical pair. The output (synthetic) signal is the real part of ρ(t) (in the general case, the real part of the additive synthesis of the detected partials). This synthetic signal remains very close to the original input signal x(t) in both numerical and acoustical terms [32].

The CWAS algorithm

In the CWAS algorithm [32], a complex mother wavelet allows us to analyze the complex coefficients of Equation (1), stored in a matrix (the CWT matrix), in modulus and phase, directly obtaining the instantaneous amplitude and the instantaneous phase of each detected component [34, 35]. A single parameter, the number of divisions per octave D (a vector with as many entries as octaves present in the signal’s spectrum), controls the frequency resolution of the analysis.

Figure 2 (left) depicts the modulus of the complex wavelet coefficients (also called the wavelet spectrogram) of the mixture of a tenor trombone playing a C5 note and a trumpet playing a D5 vibrato. In the figure, the dark zones are associated with the main trajectories of information (each one related to a partial).

Figure 2

Left: the wavelet spectrogram, that is, the modulus of the CCWT coefficients (modulus of the CWT matrix). The dark zones are the different detected partials. Right: scalogram. The analyzed signal is the mixture of a tenor trombone playing a C5 note and a trumpet playing a D5 vibrato.

Summing the modulus of the wavelet coefficients along the time axis yields the scalogram of the signal. The scalogram presents a certain number of peaks, each one related to a detected component of the signal. We found that the quality of the resynthesis improves significantly by extending the definition of a partial not exclusively to the scalogram peaks, but to their regions of influence. So, in our model, a partial contains all the information situated between an upper and a lower frequency limit (the region of influence of a certain peak). These regions of influence can be seen in Figure 3, which shows the scalogram of a guitar playing an E4 note (330 Hz). Each maximum of the scalogram is marked with a black point. The associated upper and lower frequency limits for each partial are marked with red stars, located at the minimum point between adjacent maxima.

Figure 3

Scalogram of a guitar playing an E4 note. Each peak (black point) is related to a detected partial. Each partial has an upper and a lower frequency (band) limit (red stars), located at the minimum point between two adjacent maxima.
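A minimal sketch of this peak picking and band assignment follows, assuming the scalogram is already available as a 1-D array over scales. A plain local-maximum test is used here; any additional thresholds of the actual implementation are omitted.

    import numpy as np

    def partial_bands(scalogram):
        """Peaks of the scalogram and their regions of influence.

        Returns (peaks, bands): for every local maximum, the band [low, up]
        bounded by the minima between adjacent maxima, as described above.
        """
        s = np.asarray(scalogram, dtype=float)
        peaks = [m for m in range(1, len(s) - 1) if s[m - 1] < s[m] >= s[m + 1]]
        bands = []
        for i, p in enumerate(peaks):
            left = peaks[i - 1] if i > 0 else 0
            right = peaks[i + 1] if i + 1 < len(peaks) else len(s) - 1
            low = left + int(np.argmin(s[left:p + 1]))   # minimum below the peak
            up = p + int(np.argmin(s[p:right + 1]))      # minimum above the peak
            bands.append((low, up))
        return peaks, bands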

For a certain peak i of the scalogram, its complex partial function P i (t) can be defined as the summation of the complex wavelet coefficients obtained through Equation (1) between its related frequency limits [32]. Hence, we can write

P_i(t) = \sum_{m_i = m_i^{\mathrm{low}}}^{m_i^{\mathrm{up}}} W_x(a_{m_i}, t), \qquad i = 1, \ldots, n
(6)

where W x (a m i ,t) are the wavelet coefficients related to the i th peak (partial).

Studying the complex-valued function P i (t) in modulus and phase, we can obtain the instantaneous amplitude A i (t) and the instantaneous phase Φ i (t) of each detected partial. The instantaneous frequency of the partial can be written [36] as

f_{i,\mathrm{ins}}(t) = \frac{1}{2\pi}\, \frac{d\,\Phi_i(t)}{dt}
(7)

The global contribution of P i (t) to the scalogram of x(t) can be approximated by

E_i = \sum_{m=1}^{l_i} \left| P_i(t_m) \right|
(8)

where t m is the m th sample of the temporal duration of the partial i (whose length is l i , in samples). Obviously, E i is a measure of the energy of the partial.

As Equation (5) holds for every detected partial of the signal, the original signal x(t) can be recovered through a simple additive synthesis, summing the n detected partials, as was advanced in the previous section:

x(t) \simeq \sum_{i=1}^{n} \Re\left\{ P_i(t) \right\} = \sum_{i=1}^{n} A_i(t)\cos[\Phi_i(t)].
(9)
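The chain from Equation (6) to Equation (9) can be sketched as follows (Python/NumPy again; the band indices come from a peak-picking step such as the one above, and np.unwrap/np.gradient stand in for whatever phase unwrapping and differentiation the CWAS implementation actually uses):

    import numpy as np

    def partial_function(coeffs, band):
        """P_i(t): sum of the complex wavelet coefficients in the band, Eq. (6)."""
        low, up = band
        return coeffs[low:up + 1].sum(axis=0)

    def instantaneous_parameters(P, fs):
        """A_i(t), Phi_i(t), and the instantaneous frequency of Eq. (7)."""
        A = np.abs(P)
        phi = np.unwrap(np.angle(P))
        f_ins = np.gradient(phi) * fs / (2.0 * np.pi)  # (1/2pi) d(phi)/dt
        return A, phi, f_ins

    def resynthesize(partials):
        """Additive synthesis, Eq. (9): x(t) ~ sum_i A_i(t) cos(Phi_i(t))."""
        return sum(np.abs(P) * np.cos(np.unwrap(np.angle(P))) for P in partials)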

The objective of this study is to use this information to separate a signal composed of two or more mixed notes into the original isolated sources. The only input to the system is the mixed signal (no additional data are needed).

BASS

The monaural mixed signal x(t) can be written as

x(t) = \sum_{k=1}^{N} s_k(t)
(10)

As stated above, BASS attempts to obtain the original isolated sources s k (t) present in a certain signal x(t), when the mixture process is unknown.

As we do not know a priori the number N of sources present in x(t), the first problem is to divide the detected partials into as many families or categories as sources, with a minimum error between members of a class [19]. A first approximation to the BASS task using the CWAS technique was presented in [37]. There, we used an onset detection algorithm [38] to find a rough division of the partials, grouping them into the different sources. The main advantage of using the CWAS algorithm instead of the STFT is its proven ability to deliver high-quality resynthesis. As explained, the time and frequency errors in the synthesis of signals using the CWAS algorithm are remarkably small, and the acoustical differences between the original and synthetic signals are negligible for most listeners [32]. This high-fidelity synthesis makes the CWAS algorithm a very useful tool for source separation.

In the general case, when there are two or more audio sources present in the analyzed signal, a certain partial can belong to one of the sources, be shared by two or more sources, or belong to none of them (i.e., inharmonic or noisy partials). The algorithm searches for every fundamental frequency present in the mixed signal, and each f0 is considered an indicator of the presence of a source (see “Multiple f0 estimation” section). A harmonic analysis finds the set of partials belonging to each source and the set of overlapping partials (and, for each case, which sources overlap). Then, the information of the isolated partials is used to reconstruct an estimation of the contribution of each source to every overlapping partial, and the separated sources are generated by additive synthesis (see “The separation process” section). This idea was used in [22], but in this study the only input information is the mixed signal (we do not need the estimated pitch, because the f0 estimator gives us this information). The quality of the separation (see “Quality separation measurement” section) is measured using the standards proposed in [39].

Separation algorithm

In this section, we detail the proposed separation technique, and in particular its two main blocks: the estimation of the fundamental frequencies present in the mixed signal, and the separation process itself. A detailed example is developed in parallel in order to clarify the separation process. In this example, we use a signal chosen arbitrarily from the set of analyzed signals (see “Experimental results” section): the separation of a mixture composed of a trumpet playing a D5 (587 Hz) vibrato and a tenor trombone playing a C5 (523 Hz) note. For this signal, Equation (10) becomes

x(t) = s_1(t) + s_2(t)
(11)

The waveform, modulus of the CWT matrix, and scalogram of this signal can be seen in Figure 2. The numerical quality separation measurements for this signal are given in the following sections. In the example, we will concentrate on a single overlapping partial. The isolated original partials will also be used to test the robustness of the method.

The main steps of the separation algorithm are summarized below.

  • From x(t), obtain P i (t) → A i (t), Φ i (t) (CWAS).

  • From Φ i (t), through Equation (7), obtain f i (t).

  • Estimation of f0k and their harmonic partials, ∀k.

  • Separation of overlapping partials.

  • Additive synthesis → s k (t).

It is important to remark that, at its current stage, the separation process is performed using the information of the whole signal.

Multiple f0 estimation

In this study, we have considered that a musical instrument cannot play more than one note simultaneously (i.e., we work mainly with monophonic instruments). If an instrument plays two or more notes simultaneously (polyphony), the developed algorithm considers that each note comes from a different source. With such an approximation, the present fundamental frequencies f0j, j = 1,…,N, become the natural parameter used to calculate the number of sources present in the mixture, and the reliability of the f0 estimator becomes critically important.

A multipitch analysis algorithm has been developed, based on the work of Klapuri [28, 29] and specially adapted to the CWAS algorithm; it uses the spectral smoothing technique of Pérez-Sancho et al. [40]. Figure 4 shows a block diagram of this algorithm.

Figure 4

Block diagram of the fundamental frequency estimation algorithm. MIP is the most important (most energetic) partial. The output SHP#i is the set of harmonic partials corresponding to each detected source (i = 1,…,N). See text for details.

The input (mixed) signal is analyzed using the CWAS algorithm, which provides as a result the n complex functions that define the temporal evolution of each detected partial. Using Equations (7) and (8), the instantaneous frequencies of each partial (and their respective average values f̄ j , j = 1,…,n) and the energy distribution of the signal are obtained. This information is equivalent to the scalogram of the signal clustered around the set of detected partials. Only the partials with energy greater than the threshold E th = 1% are considered in the search for the harmonic sets associated with each source. From the remaining energy distribution, the most energetic partial (MIP in Figure 4) is selected, and the harmonic analysis is computed next.

Starting from the average frequency f̄ j of the most important partial, it is assumed that this partial is a harmonic of a certain fundamental frequency f0k, that is,

f_{0k} = \frac{\bar{f}_j}{k}, \qquad k = 1, 2, \ldots, N_A
(12)

In this study, we have taken N A = 10. In other words, the MIP will be at most the 10th harmonic of its related fundamental frequency. From the fundamental frequencies so obtained, the set of harmonic frequencies corresponding to each one is calculated:

f_{k,m} = m\, f_{0k}, \qquad m = 1, 2, \ldots, N_k
(13)

where N k is the largest natural number satisfying N k f0k ≤ f s /2, f s being the sampling rate.

In the next step, for each f k,m , its related partial is searched for. A partial of mean frequency f̄ i is the m th harmonic of a certain fundamental frequency f0k if

\left| \frac{\bar{f}_i}{f_{0k}} - \frac{f_{k,m}}{f_{0k}} \right| \leq \theta_a
(14)

where θ a is the inharmonicity threshold. Taking θ a = 0.03, the partials of an inharmonic instrument like the piano are correctly analyzed.
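In code, the candidate generation of Equation (12) and the harmonic matching of Equations (13)-(14) might look as follows. This is a sketch under our own simplifications: partials are represented only by their mean frequencies, and each harmonic is matched to the single closest partial.

    import numpy as np

    def candidate_f0s(f_mip, n_a=10):
        """Eq. (12): the MIP may be at most the n_a-th harmonic of its f0."""
        return [f_mip / k for k in range(1, n_a + 1)]

    def harmonic_partials(f0, mean_freqs, fs, theta_a=0.03):
        """Eqs. (13)-(14): match each harmonic of f0 to the closest partial,
        accepting it if the deviation, normalized by f0, is below theta_a."""
        freqs = np.asarray(mean_freqs, dtype=float)
        n_k = int((fs / 2.0) // f0)          # highest harmonic under fs/2
        matched = []
        for m in range(1, n_k + 1):
            i = int(np.argmin(np.abs(freqs - m * f0)))
            if abs(freqs[i] - m * f0) / f0 <= theta_a:
                matched.append(i)
        return matched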

The decision on which fundamental frequency is associated with the current MIP is taken through a weight function w k , calculated for each of the candidates. This weight function is proportional to the energy contribution of its set of partials:

w_k = \frac{n_{ip,k}^2}{n_{a,k}} \sum_{i=1}^{n_{a,k}} E_{i,k}
(15)

where n a,k is the total number of harmonics associated with f0k, n ip,k is the number of those partials with energy above the threshold E th , and E i,k is the energy of the i th partial associated with f0k.

The fundamental frequency related to the current MIP is the one whose weight w k is maximum. The algorithm stores the set of harmonic partials or spectral pattern, P k = {P 1,k , P 2,k , …, P n a,k }, which includes the obtained fundamental frequency, and proceeds to apply the spectral smoothing [40] to its energy distribution E k = {E 1,k , E 2,k , …, E n a,k }:

\tilde{E}_k = G_w \star E_k
(16)

where G w = {0.212, 0.576, 0.212} is a truncated normalized Gaussian window with three components and ⋆ is the convolution operator. The smoothed energy for each harmonic partial is calculated as

E_{i,k} = \begin{cases} E_{i,k} - \tilde{E}_{i,k} & \text{if } E_{i,k} - \tilde{E}_{i,k} > 0 \\ 0 & \text{if } E_{i,k} - \tilde{E}_{i,k} \leq 0 \end{cases}
(17)

Substituting these new energy values into the corresponding partials of the original energy distribution, a new MIP can be obtained. The process is iterated until the energy of the distribution falls below a threshold or the maximum number of sources (MNS in Figure 4) has been reached. In this study, we have limited the number of sources to MNS = 5. Using this technique, it is possible to obtain the fundamental frequencies even in the most difficult cases, for example when a fundamental frequency is overlapped with a harmonic of another source, or in the case of suppressed fundamentals. Overlapping fundamentals, however, will not be detected using this technique.
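The weight of Equation (15) and the smoothing and subtraction of Equations (16)-(17) admit a compact sketch; the iteration over successive MIPs and the stopping thresholds are as described in the text and omitted here.

    import numpy as np

    G_W = np.array([0.212, 0.576, 0.212])      # truncated Gaussian of Eq. (16)

    def source_weight(harmonic_energies, e_th):
        """Eq. (15): w_k = (n_ip^2 / n_a) * sum_i E_{i,k}."""
        e = np.asarray(harmonic_energies, dtype=float)
        n_a = len(e)
        n_ip = int((e > e_th).sum())           # harmonics above the threshold
        return (n_ip ** 2 / n_a) * e.sum()

    def smoothed_residual(harmonic_energies):
        """Eqs. (16)-(17): convolve with G_w and keep the positive residual,
        which replaces the pattern's energies before the next MIP search."""
        e = np.asarray(harmonic_energies, dtype=float)
        return np.maximum(e - np.convolve(e, G_W, mode="same"), 0.0)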

This algorithm has been tested using a set of more than 200 signals, most of them extracted from the musical instrument samples of the University of Iowa [41].a The experimental results are shown in Table 1, where the accuracy of the multipitch analysis is given for four categories: isolated instruments, synthetically generated mixtures of two and three harmonic instruments, and mixtures of one harmonic and one inharmonic instrument. Errors can be due to missed detections, wrong estimations, or false fundamentals.

Table 1 Accuracy results of the fundamental frequency estimation algorithm

In the signal of the example (Figure 2), the exact results given by the fundamental frequency estimator are f01 = 589.25 Hz for the trumpet and f02 = 525.96 Hz for the trombone. The instantaneous amplitudes of these fundamental partials are shown in Figure 5: the continuous line comes from the fundamental partial of the trumpet and the dashed line from the tenor trombone fundamental. The sets of harmonic partials of each instrument will be shown later.

Figure 5

Envelopes of the fundamental partials. Continuous line, trumpet fundamental. Dashed line, tenor trombone fundamental.

The separation process

Once the fundamental frequencies present in the mixed signal (and the number of sources N) have been obtained, the separation process begins. A detailed block diagram of the process is shown in Figure 6.

Figure 6

Block diagram of the separation algorithm. The input SHP#i is the set of the harmonic partials corresponding to each detected source (i = 1,…,N). See text for details.

Analyzing the sets of harmonic partials of each source, it is easy to distinguish between isolated harmonics (that is, partials which belong to a single source) and overlapping harmonics (partials shared by two or more sources). The isolated harmonics and the fundamental partial of each source will later be used to separate the overlapping partials, through their onset and offset times, instantaneous envelopes, and phases. Each separated source is eventually synthesized by the additive synthesis of its related set of partials (isolated and separated).

The inharmonic limit

Inharmonicity is a phenomenon occurring mainly in string instruments, due to the stiffness of the string and to non-rigid terminations. As a result, every partial has a frequency that is higher than the corresponding harmonic value. For example, the inharmonicity equation for a piano can be written [42] as

f_n = n f_0 \sqrt{1 + \beta n^2}
(18)

where n is the harmonic number and β is the inharmonicity parameter. In Equation (18), β is assumed constant, although it can be modeled more accurately by a polynomial of up to order 7 [43]. This means that the parameter β takes different values depending on the partials used to calculate it; partials situated in the 6th-7th octave provide the optimal result. Using two partials of orders m (lower) and n (higher), it is

\beta = \frac{\delta - \varepsilon}{\varepsilon n^2 - \delta m^2}
(19)

where δ = (m f n / n f m )² and ε is an induced error, due to the physical structure of the piano, which cannot be evaluated [42]. If partials m and n are correctly selected, ε ≈ 1.
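For reference, a direct transcription of Equation (19), with the default ε = 1 suggested by the text:

    def inharmonicity_beta(m, f_m, n, f_n, eps=1.0):
        """Eq. (19): inharmonicity parameter from two partials of orders m < n
        with measured frequencies f_m, f_n (eps ~ 1 for well-chosen partials)."""
        delta = (m * f_n / (n * f_m)) ** 2
        return (delta - eps) / (eps * n ** 2 - delta * m ** 2)

As a sanity check, feeding this function two partial frequencies generated from Equation (18) with a known β recovers that β, up to numerical precision.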

With the inharmonic model of Equation (19), it is possible to calculate the inharmonicity parameter β for each detected source, using (when possible) two isolated partials situated in the appropriate octaves. A priori, this technique includes inharmonic instruments (like the piano) in the proposed model. Unfortunately, the obtention of the parameter β does not significantly improve the quality separation measurements evaluated in the tests.

Assumptions

In order to obtain the envelopes and phases of an overlapping partial related to each source, we make two assumptions. The first one is a slightly less restrictive version of the CAM principle, which asserts that the amplitude envelopes of spectral components from the same source are correlated [22]:

  • The amplitudes (envelopes) of two harmonics P1 and P2, with similar energies E1 ≈ E2, both belonging to the same source, have a high correlation coefficient.

As long as this approximation holds, the separation results improve. As we use the global signal information, the correlation coefficient between the strongest harmonic (and/or the fundamental partial) and the other harmonics decreases as the amplitude/energy differences between the involved partials increase [22]. Hence, choosing for the reconstruction non-overlapping harmonics whose energy is similar to that of the overlapping harmonic suggests that the correlation factor between the involved partials will be higher. In fact, as the correlation between high-energy partials also tends to be high, while the errors related to this assumption in lower-energy partials tend to be energetically negligible, in most cases the quality measurement parameters reach a high value, and the acoustic differences between the original and the separated source are acceptable.
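This criterion can be checked numerically; a minimal sketch:

    import numpy as np

    def envelope_correlation(env_a, env_b):
        """Pearson correlation between two instantaneous-amplitude envelopes;
        under the assumption above, isolated harmonics with energy close to
        that of the overlapped one should score high."""
        n = min(len(env_a), len(env_b))
        return float(np.corrcoef(env_a[:n], env_b[:n])[0, 1])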

The second approximation is

  • The instantaneous phases of the p th and the q th harmonic partials belonging to the same source are approximately proportional with ratio p/q, except for an initial phase gap Δϕ0. That is,

    \phi_2(t) \approx \frac{p}{q}\, \phi_1(t) + \Delta\phi_0
    (20)

where Δϕ0 = 0 means that the initial phases of the involved partials are equal, that is, ϕ0p = ϕ0q.

We have found that, in our model of the audio signal, and even knowing the envelopes of the original overlapping harmonics, a difference in the initial phase as small as Δϕ0 = 10⁻³ is enough to make an adequate reconstruction of the mixed partial impossible. Each partial has a random initial phase (i.e., there is no relation between ϕ0p and ϕ0q). However, as the instantaneous frequency of the mixed harmonics can be retrieved accurately independently of the value of the initial phase, the original and the synthetically mixed partials (using the separated contribution from each source) sound similar (provided that the first assumption holds).
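Under these assumptions, the phase of each hidden contribution is rebuilt directly from a winner partial (selected as described in the next section); a one-line sketch of Equation (20), with Δϕ0 = 0 as in endnote b:

    import numpy as np

    def reconstruct_phase(phi_winner, p, q, delta_phi0=0.0):
        """Eq. (20): scale the winner's unwrapped instantaneous phase by p/q."""
        return (p / q) * np.asarray(phi_winner) + delta_phi0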

Reconstruction process and additive synthesis

As mentioned above, the proposed technique uses the information of the isolated partials to reconstruct the information of the overlapping partials. The output of the multipitch estimation algorithm is the harmonic set corresponding to each source present in the mixture. With this information, it is easy to distinguish between the isolated partials (partials belonging to a single source) and the shared partials. For each overlapping partial, the interfering sources are immediately known. We can write

P_k = P_k^{(\mathrm{iso})} \cup P_k^{(\mathrm{sh})}
(21)

In the example of the tenor trombone and trumpet mixture (Figure 2), the instantaneous amplitudes of the isolated partials are shown in Figure 7. The instantaneous amplitudes of the fundamental partials are depicted with bold lines.

Figure 7

Envelopes of the isolated set of harmonics from each source (dotted lines). The fundamental envelopes are marked with bold lines. Continuous trace, trumpet; dashed line, tenor trombone.

Using the information of the isolated partials and an onset detection algorithm [38], it is easy to detect the beginning and the end of each present note. This information is necessary to avoid the artifacts and/or noise caused by the mixing process, which tend to appear before and after active notes. This noise is acoustically annoying and worsens the numerical quality separation results.

Consider a certain mixed partial P m of mean frequency f̄ m . The mixed partial can be written as

P_m(t) = A_m(t)\, e^{j[\phi_m(t)]} = \sum_{s_k} P_{s_k}(t) = \sum_{s_k} A_{s_k}(t)\, e^{j[\phi_{s_k}(t)]}
(22)

where P s k (t) are the original harmonics which overlap in the mixed partial. In Equation (22), the only accessible information is the instantaneous amplitude and phase of the mixed partial, that is, A m (t) and ϕ m (t). The aim is to recover each A s k (t) and ϕ s k (t) as accurately as possible.

To do this, it is necessary to select a partial belonging to each overlapping source s k in order to separate the different contributions to P m . From each isolated set of partials P k (iso) corresponding to the interfering sources, we search for a partial j with an energy E j as similar to the energy of P m as possible, and with a mean frequency f̄ j as close to f̄ m as possible. If Δ(E j,m ) = |E j − E m | and Δ(f j,m ) = |f̄ j − f̄ m |, these conditions can be written as

P_{k,\mathrm{win}} = \underset{P_j \in P_k^{(\mathrm{iso})}}{\arg\min}\; \Delta(E_{j,m})
(23)

and

P_{k,\mathrm{win}} = \underset{P_j \in P_k^{(\mathrm{iso})}}{\arg\min}\; \Delta(f_{j,m})
(24)

The energy condition, Equation (23), is evaluated first. Only in doubtful cases is the frequency condition of Equation (24) evaluated; however, both conditions often lead to the same winner. For simplicity, let P wk denote the selected (winner) isolated partial of each source k. This can be written as

P_{wk}(t) = A_{wk}(t)\, e^{j[\phi_{wk}(t)]} \qquad \forall k
(25)

If f̄ wk is the mean frequency of the winner partial of source k, it is easy to see that

\frac{\bar{f}_{wk}}{\bar{f}_m} = \frac{p_k}{q_k}
(26)

for some p k , q k ∈ ℕ.
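A sketch of this winner selection (Equations (23)-(24)) and of the rational ratio of Equation (26) follows. The dictionary layout of a partial, the tie-breaking tolerance, and the maximum denominator are our assumptions for the sketch.

    from fractions import Fraction

    def pick_winner(iso_partials, e_m, f_m, tol=0.05):
        """Eqs. (23)-(24): choose the isolated partial whose energy is closest
        to that of the mixed partial; in doubtful (near-tie) cases, fall back
        to the closest mean frequency. Partials are dicts with 'energy' and
        'mean_freq' keys; tol is an assumed tie-breaking tolerance."""
        by_energy = sorted(iso_partials, key=lambda p: abs(p["energy"] - e_m))
        best = by_energy[0]
        if len(by_energy) > 1:
            runner = by_energy[1]
            gap = abs(abs(best["energy"] - e_m) - abs(runner["energy"] - e_m))
            if gap < tol * e_m:                  # doubtful case: apply Eq. (24)
                best = min((best, runner),
                           key=lambda p: abs(p["mean_freq"] - f_m))
        return best

    def harmonic_ratio(f_winner, f_mixed, max_den=20):
        """Eq. (26): rational approximation p_k/q_k of the frequency ratio."""
        r = Fraction(f_winner / f_mixed).limit_denominator(max_den)
        return r.numerator, r.denominator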

In fact, the same ratio p k /q k can be used to reconstruct the corresponding instantaneous frequency of each interfering source with high accuracy. In Figure 8, the instantaneous frequencies of the original (interfering) partials and the estimated instantaneous frequency of each separated contribution are shown for a certain overlapping partial. In this figure, the original instantaneous frequencies are depicted in blue, and the reconstructed instantaneous frequencies in red. Note the accuracy of the estimation of each instantaneous frequency. The blue line corresponding to the tenor trombone is shorter due to the signal duration.

Figure 8

Comparison between the original (isolated) instantaneous frequencies and the estimated (separated) instantaneous frequencies. (a) Results for the trumpet source (continuous blue line, the original f ins ; red dotted line, the estimated one). (b) Results for the tenor trombone source.

Hence, it is possible to use Equation (20) to reconstruct the phases ϕ s k of the separated partials for each overlapping source.b

Unlike other works [22, 23], to reconstruct the envelopes of the separated partials it is assumed that the instantaneous amplitude of the mixed partial, A m (t), is directly a linear combination of the amplitudes of the interacting components A wk (t) (hence, unlike other existing techniques [22, 23], the phases of the winner partials are not taken into account in this process). Therefore,

A_m(t_i) = \sum_{s_k} \alpha_k A_{wk}(t_i) \qquad \forall t_i
(27)

The solution of Equation (27) that minimizes the error in the sum is the least-squares solution of the linear system

\mathbf{A}\,\boldsymbol{\alpha} = \mathbf{b}
(28)

where A is a matrix whose columns contain the envelopes of the selected (winner) partials described by Equations (23) and (24), α is the mixture vector, and b = A m (t).
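A compact sketch of Equations (27)-(30): stack the winner envelopes as the columns of A, solve for α by least squares, and rebuild each source's share of the overlapped partial. Here np.linalg.lstsq stands in for whatever solver the actual implementation uses.

    import numpy as np

    def separate_overlapped(A_m, winner_envs, winner_phases, ratios):
        """A_m: mixed-partial envelope, shape (T,). winner_envs/winner_phases:
        per-source A_wk(t) and unwrapped phi_wk(t). ratios: (p_k, q_k) pairs."""
        A = np.column_stack(winner_envs)                 # T x N matrix of Eq. (28)
        alpha, *_ = np.linalg.lstsq(A, A_m, rcond=None)  # mixture vector alpha
        separated = [a_k * env * np.exp(1j * (p / q) * phi)   # Eq. (30)
                     for a_k, env, phi, (p, q)
                     in zip(alpha, winner_envs, winner_phases, ratios)]
        return alpha, separated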

Once α k , p k , and q k have been found for each source k, the overlapping partial can be written as

P_m(t) = \sum_{s_k} \alpha_k A_{wk}(t)\, e^{j \frac{p_k}{q_k} \phi_{wk}(t)}
(29)

and the separated contributions of each present source are of course

P_{s_k}(t) = \alpha_k A_{wk}(t)\, e^{j \frac{p_k}{q_k} \phi_{wk}(t)}
(30)

Once each separated partial has been obtained using the described technique, it is added to its corresponding source. This iterative process eventually yields the separated sources.

Figures 9 and 10 show the wavelet spectrograms and scalograms (obtained from the CWAS algorithm) corresponding to the isolated signals (tenor trombone and trumpet, respectively) and their related separated sources. From the spectrograms (modulus of the CWT matrix), it can be observed that most of the harmonic information has been properly recovered. This conclusion is reinforced by the scalogram information. Note that the harmonic reconstruction produces an artificial scalogram (red line) harmonically coincident with the original scalogram (blue line).

Figure 9

Spectrograms of the tenor trombone signals. (a) Wavelet spectrogram of the original (isolated) tenor trombone. (b) Blue line: Original scalogram. Red line: Scalogram of the separated source. (c) Wavelet spectrogram of the separated source.

Figure 10

Spectrograms of the trumpet signals. (a) Wavelet spectrogram of the original (isolated) trumpet. (b) Blue line: Original scalogram. Red line: Scalogram of the separated source. (c) Wavelet spectrogram of the separated source.

In the figures, the separated wavelet spectrogram shows that only the harmonic partials have been recovered. When the inharmonic partials carry important (non-noisy) information, the synthetic signal can sound somewhat different (as happens with the possible envelope errors in the high-frequency partials).

The values of the standard quality measurement parameters for this example and the rest of the analyzed signals will be detailed in “Summarizing: graphical results” section.

Main characteristics, advantages, and limitations

Because overlapping partials are reconstructed rather than copied, no information is wrongly assigned to the separated sources using this technique, except for the existing interference in the set of isolated partials. This means that the interference terms in the separation process are in general negligible. This result will be numerically confirmed in “Experimental results” section.

The advantages of this separation process are mainly two. First, the separation of overlapping harmonics (multipitch estimation, calculation of the best linear combination for reconstruction, additive synthesis) is not computationally expensive; in fact, the obtention of the wavelet coefficients and their grouping into partials consumes much more computation time. Second, the separation is completely blind: we do not need any a priori characteristic of the input signal, neither the pitch contours of the original sources nor the relative energies, number of present sources, etc.

One of the most important limitations of this method is that it is not valid for separating completely overlapping notes. Although the detailed fundamental frequency estimation algorithm is capable of detecting overlapping fundamentals, in such a case the set of isolated partials of the overlapped source would be essentially empty, and therefore no isolated information would be available to carry out the reconstruction of the phases and amplitudes of the corresponding source. To solve this problem (assuming the separation of musical pieces of longer duration), it is possible to use models of the instruments present in the mixture, or previously separated notes from the same source. These ideas are the basis of the HTES and AHS techniques (see “Introduction” section).

On the other hand, as was advanced in “Introduction” section, at its current stage the proposed technique can be used to separate two or more musical instruments, each one playing a single note. The final quality of the separation depends on the number of mixed sources. This is due to the accuracy of the estimation of the fundamental frequencies, and to the use of isolated partials to reconstruct the overlapping harmonics: the higher the number of sources, the lower the number of isolated harmonics and the poorer the final musical timbre of the separated sources.

Experimental results

The analyzed set of signals includes approximately 100 signals with two sources and 60 signals with three sources. All the analyzed signals are real recordings of musical instruments, most of them extracted from [41]. The final set of musical instruments includes flute, clarinet, sax, trombone, trumpet, oboe, bassoon, horn, tuba, violin, viola, guitar, and piano.

All the analyzed signals have been sub-sampled to f s = 22050 Hz and then synthetically mixed. The number of divisions per octave D and all the thresholds used in the CWAS and separation algorithms are the same for all the analyzed signals; specifically, D = {16, 32, 64, 128, 128, 100, 100, 100, 100}, θ a = 0.03, and E th = 1%. Observe that the number of divisions per octave depends on the octave, so we have a variable resolution.

We have performed eight experiments with two and three synthetically mixed sources, analyzing 20 signals in each experiment. These experiments are listed in Table 2 and explained in the following paragraphs. Graphical and numerical results are given in “Summarizing: graphical results” section.

Table 2 List of BASS experiments developed

Experiment 1: harmonic and inharmonic instruments

In the first experiment, we have mixed an inharmonic instrument (piano) with one harmonic instrument. Numerical data are presented in the first column of Figures 11, 12, and 13. The numerical separation results are not as good as those of Experiment 5, which is otherwise similar to this one (acoustically the situation is better). This is probably due to the uncertainty in the obtention of the inharmonicity parameter β [43] (see “The inharmonic limit” section).

Figure 11

Experimental results of the SDR parameter for the eight separation experiments.

Figure 12

Experimental results of the SIR parameter for the eight separation experiments.

Figure 13

Experimental results of the SAR parameter for the eight separation experiments.

Experiment 2: single instrument, same octave

In the second test, two musical instruments (alto sax and flute, respectively) were taken randomly from the original database. We have generated a total of 11 signals with each instrument, each containing two notes of the fourth octave (considering A4 = 440 Hz) played by the same instrument. One of the notes is always a C#4 (277 Hz); the other note belongs to the same octave (C4, D4, D#4, etc.). The experimental values of SDR, SIR, and SAR are presented in the second column of Figures 11, 12, and 13.

Experiment 3: single instrument, harmonic-related notes

In the third experiment, we mixed two harmonically related notes from the same instrument. The harmonic relations used are C-G, D-A, E-B, F-C, G-D, A-E, and A#-F, from the same or different octaves; that is, 5th and 12th intervals. We have generated three sets of signals, each one corresponding to one musical instrument (concretely, alto sax, flute, and Bb clarinet), with seven mixtures in each. Numerical results of this experiment are shown in the third column of Figures 11, 12, and 13.

Experiment 4: two instruments, harmonic-related notes

In the next experiment, we have mixed, in 20 signals, the same harmonic intervals of the previous experiment, this time played by different musical instruments: alto sax, guitar, bassoon, Bb and Eb clarinets, horn, oboe, and flute. The experimental values of the quality separation measurements are presented in the fourth column of Figures 11, 12, and 13.

Experiment 5: two instruments, inharmonic notes

In this experiment, each analyzed signal contains the mixture of two randomly chosen musical instruments playing random (non-harmonically related) notes. The experimental values of the quality separation parameters are presented in the fifth column of Figures 11, 12, and 13.

Experiment 6: one instrument, major chord

A major chord is the mixture of three notes, concretely C-E-G. We have generated 20 such chords, each played by a single musical instrument: bassoon, alto sax, Bb clarinet, flute, and trumpet. Numerical data are presented in the sixth column of Figures 11, 12, and 13.

Experiment 7: one instrument, minor chord

A minor chord is the mixture of the notes A-C-E. We have analyzed 20 signals, each one played by a single musical instrument: bassoon, Bb clarinet, horn, oboe, and trumpet. The SDR, SIR, and SAR values for this experiment are depicted in the seventh column of Figures 11, 12, and 13.

Experiment 8: three instruments, inharmonic notes

Finally, 20 signals with three random instruments playing random (non-harmonically related) notes have been analyzed. The notes are randomly distributed from octaves 2 to 6, and 10 of the signals present widely separated notes. The experimental values of the quality separation measurement parameters are presented in the last column of Figures 11, 12, and 13.

Quality separation measurement

We assume that the errors committed in the separation process can have three different origins: interference between sources, distortions inserted in the separated signal, and artifacts introduced by the separation algorithm itself.

We have used three standard parameters, related to these distortions, to test the final quality of the separation results obtained with the proposed method: the signal-to-interference ratio (SIR), the signal-to-distortion ratio (SDR), and the signal-to-artifacts ratio (SAR) [39, 44, 45]:

\mathrm{SIR} = 10 \log_{10} D_{\mathrm{interf}}^{-1}
(31)
\mathrm{SDR} = 10 \log_{10} D_{\mathrm{total}}^{-1}
(32)

and

\mathrm{SAR} = 10 \log_{10} D_{\mathrm{artif}}^{-1}
(33)

where D interf , D total , and D artif are energy ratios involving the separated signals and the target (isolated, assumed known) signals. The quality separation measurements of the following sections have been obtained with the MATLAB® toolbox BSS_EVAL, developed by Févotte, Gribonval, and Vincent and distributed online under the GNU Public License [44].
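For readers working in Python, the same three criteria of [39] are implemented, for instance, in the third-party mir_eval package (not used in this study); a sketch:

    import numpy as np
    import mir_eval

    def quality_measures(reference_sources, estimated_sources):
        """SDR, SIR, SAR for arrays of shape (n_sources, n_samples)."""
        sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
            np.asarray(reference_sources), np.asarray(estimated_sources))
        return sdr, sir, sar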

Summarizing: graphical results

As advanced before, Figures 11, 12, and 13 show the numerical results of the detailed tests: Figure 11 presents the experimental values of the SDR parameter for each experiment, Figure 12 the obtained SIR values, and Figure 13 the experimental values of the SAR parameter.

In Figure 11, the SDR mean result for each test is marked with squares, and the maximum and minimum values of the parameter with triangles. These results show significant differences in the quality separation measurements among the experiments involving two sources; in the experiments with three sources, the differences are smaller.

In Figure 12, the SIR mean result for each test is marked with circles, and the maximum and minimum values of the parameter with triangles. As can be seen in the figure, the experimental values of SIR present fewer variations than in the previous case, which means that the proposed technique does not show a significant tendency to high interference terms.

Finally, in Figure 13, the SAR results for each test are marked with stars, and the maxima and minima of the experiments are depicted with triangles. The conclusions are the same as in Figure 11.

If we consider globally the whole set of signals with two mixed sources, the mean values of the quality separation measurement parameters can be used as a measure of the final quality of the separation. These values (represented in Figures 11, 12, and 13 with horizontal dashed-dotted lines) are

  • \overline{\mathrm{SDR}}_{2s} \approx 16.07 dB.

  • \overline{\mathrm{SIR}}_{2s} \approx 58.85 dB.

  • \overline{\mathrm{SAR}}_{2s} \approx 16.08 dB.

The averages of the standard parameters in the case of three mixed sources (horizontal dashed lines in Figures 11, 12, and 13) are

  • \overline{\mathrm{SDR}}_{3s} \approx 12.81 dB.

  • \overline{\mathrm{SIR}}_{3s} \approx 52.03 dB.

  • \overline{\mathrm{SAR}}_{3s} \approx 12.82 dB.

These results are consistent with the increasing number of sources in the mixture: for the same precision in the frequency axis, the higher the number of sources, the smaller the separation between partials and the higher the probability of interference (lower SIR). Hence, the final distortions and artifacts tend to increase.

Conclusions

In this study, a BASS technique for monaural mixtures of musical notes has been presented. There are two main differences between the proposed algorithm and the existing ones. First, the time-frequency analysis tool is based not on the STFT but on the CCWT, which offers a highly coherent model of the audio signal in both time and frequency domains. This tool allows us to obtain with great accuracy the instantaneous evolution (in time and frequency) of the isolated harmonics, which are easily assignable to the sources present in the mixture. Second, the separation algorithm needs only the mixed signal as input; no additional information is required. The overlapping partials can be entirely reconstructed from the isolated partials by searching for the best linear combination which minimizes the amplitude error in the mixing process, assuming the CAM principle. Using non-overlapping partials with energy similar to that of the overlapping partial, if the overlapping partial has high energy the correlation factor tends to be high, and if its energy is low the errors associated with the lower correlation are usually acceptable. The phase reconstruction is not as critical as in other techniques, and the separated sources show both high quality separation measurement values and high acoustic resemblance to the original signals.

At its current stage, the proposed technique can be used to separate two or more (monophonic) sources, each playing a single (and non-proportional) note. As the polyphony of the mixture increases, the separated signals tend to show a timbre less resembling that of the original signals, because the set of isolated partials decreases in number of elements and, therefore, the information used in the reconstruction is smaller and less varied. Regarding the numerical quality results, the SDR and SAR parameters fall below the shown results from polyphony 5 onwards, while the SIR parameter, although it has a clear downward trend, remains high.

To develop a complete source separation algorithm, several improvements are needed.

First, it is necessary to implement this technique in a frame-by-frame algorithm to address the separation of long-duration signals. The fundamental frequency, onset, and offset estimation algorithms presented in “Separation algorithm” section and [38] are able to work dynamically, obtaining the pitch, starting time, and ending time of each note present in the mixture.

There are several useful techniques to properly assign each separated note to its corresponding source, for example using a rough estimation of the pitches of the mixture [22] or the score of the analyzed signal. Another possibility is to develop a timbre classification algorithm; this method has the advantage of maintaining the blindness of the system, but the drawback of a potential loss of generality. Both methods could also be used to overcome the limitation of the presented technique regarding the separation of polyphonic instruments.

Finally, as discussed briefly in “Main characteristics, advantages, and limitations” section, the appearance of completely overlapping notes is statistically inevitable in real recordings. This problem (one of the core problems in BASS) must be addressed to develop a complete separation algorithm. Therefore, future challenges remain to be tackled.

Endnotes

a Each original archive consists of a certain number of notes. Each note is approximately 2 s long and is immediately preceded and followed by ambient silence. The instruments were recorded in an anechoic chamber, some of them with and without vibrato. All samples are mono, 16-bit, 44.1 kHz, AIFF format. Resampled to 16-bit, 22.05 kHz, wav format, the excerpts consist of isolated notes, some of which have been synthetically mixed.

b We suppose Δϕ0 = 0 in Equation (20), but in fact a random initial phase can be inserted without any significant difference in either the numerical or the acoustical results.