Generalized independent low-rank matrix analysis using heavy-tailed distributions for blind source separation
Abstract
In this paper, statistical-model generalizations of independent low-rank matrix analysis (ILRMA) are proposed for achieving high-quality blind source separation (BSS). BSS is a crucial problem in realizing many audio applications, where the audio sources must be separated using only the observed mixture signal. Many algorithms for solving BSS have been proposed, especially in the history of independent component analysis and nonnegative matrix factorization. In particular, ILRMA can achieve the highest separation performance for music or speech mixtures, where ILRMA assumes both independence between sources and the low-rankness of time-frequency structure in each source. In this paper, we propose two extensions of the source distribution assumed in ILRMA. We introduce a heavy-tailed property by replacing the conventional Gaussian source distribution with a generalized Gaussian or Student’s t distribution. Convergence-guaranteed efficient algorithms are derived for the proposed methods, and the relationship between the generalized Gaussian and Student’s t distributions in the source model estimation is revealed. By experimental evaluation, the validity of the heavy-tailed generalizations of ILRMA is confirmed.
Keywords
Blind audio source separation; Independent low-rank matrix analysis; Nonnegative matrix factorization; Student’s t distribution; Generalized Gaussian distribution

Abbreviations
- BSS
Blind source separation
- FDICA
Frequency-domain independent component analysis
- GGD
Generalized Gaussian distribution
- GGD-ILRMA
Independent low-rank matrix analysis based on generalized Gaussian distribution
- GGD-IVA
Independent vector analysis based on spherical multivariate generalized Gaussian distribution
- ICA
Independent component analysis
- ILRMA
Independent low-rank matrix analysis
- IP
Iterative projection
- IS-NMF
Nonnegative matrix factorization based on Itakura–Saito divergence
- IVA
Independent vector analysis
- Laplace IVA
Independent vector analysis based on spherical multivariate Laplace distribution
- MM
Majorization-minimization
- MNMF
Multichannel nonnegative matrix factorization
- NMF
Nonnegative matrix factorization
- p.d.f.
Probability density function
- SDR
Signal-to-distortion ratio
- STFT
Short-time Fourier transform
- t-ILRMA
Independent low-rank matrix analysis based on Student’s t distribution
- t-MNMF
Multichannel nonnegative matrix factorization based on Student’s t distribution
- t-NMF
Nonnegative matrix factorization based on Student’s t distribution
- Time-varying Gaussian IVA
Independent vector analysis based on Gaussian distribution having a time-varying variance
1 Introduction
Blind source separation (BSS) is a technique for separating individual sources from an observed multichannel mixture without knowing the mixing system, such as the spatial locations of the sensors or sources, in advance. In particular, BSS for multichannel audio signals has been well studied. This problem can be divided into two situations: the underdetermined (number of microphones < number of sources) and (over-)determined (number of microphones ≥ number of sources) cases. In the underdetermined case, the mixing system of the sources has to be estimated using several assumptions. For example, sparseness-based methods are popular and reliable approaches [1, 2, 3]. In contrast, determined BSS methods typically estimate the inverse system of the mixing process and can achieve higher-quality separation than underdetermined BSS methods. In this paper, we focus only on the determined BSS problem.
The most popular and successful algorithm for solving the determined BSS problem is independent component analysis (ICA) [4], which assumes statistical independence between the sources and estimates a demixing matrix (the inverse system of the mixing process). For a mixture of audio signals, because the sources are mixed by convolution owing to room reverberation, ICA is often applied to the time-frequency representation (spectrogram) of the observed signal, which is obtained by a short-time Fourier transform (STFT). Frequency-domain ICA (FDICA) [5, 6, 7, 8] independently applies ICA to the complex-valued time-series signals in each frequency bin and estimates a frequency-wise demixing matrix. Then, the estimated components must be aligned over all frequency bins so that the components of the same source are grouped together. The need for this postprocessing in FDICA is the so-called permutation problem [6, 9, 10, 11], and several criteria have been used to resolve this ambiguity of the signal permutation.
Independent vector analysis (IVA) [12, 13, 14] is a sophisticated algorithm that can simultaneously estimate the frequency-wise demixing matrix and solve the permutation problem using only one objective function. IVA assumes higher-order dependences (co-occurrence among the frequency bins) of each source by employing a spherical generative model of the source frequency vector, thus avoiding the permutation problem. The original IVA employs the spherical multivariate Laplace distribution as the source model (hereafter referred to as Laplace IVA). To improve the statistical model flexibility and source separation performance, Laplace IVA has been extended by replacing its source model with a spherical generalized Gaussian distribution [15] (GGD, also known as an exponential power distribution) in many papers [16, 17, 18, 19, 20] (hereafter referred to as GGD-IVA), or with a Gaussian distribution having a time-varying variance [21] (hereafter referred to as time-varying Gaussian IVA). Note that the GGD includes the Laplace and Gaussian distributions as special cases.
As another means of audio source modeling and separation, nonnegative matrix factorization (NMF) [22, 23] has been a very common approach during the last decade. NMF is a nonnegative-parts-based low-rank decomposition of an observed nonnegative data matrix, typically a power or amplitude spectrogram. The decomposed nonnegative parts (bases and activations) can be used for source separation by clustering the parts into each source [24, 25, 26, 27, 28]. Also, NMF can be statistically interpreted as parameter estimation based on a generative model of the data, and the distribution of the model defines the objective function (divergence) in NMF. For example, it was revealed that NMF based on the Itakura–Saito divergence (IS-NMF) assumes an isotropic complex Gaussian distribution independently defined in each time-frequency slot [29], where the variance of each Gaussian distribution can fluctuate depending on time and frequency. For multichannel audio signals, spatial modeling of the mixing system was introduced into simple NMF, which is called multichannel NMF (MNMF) [30, 31, 32], to solve the BSS problem. Whereas MNMF estimates the spatial mixing system, ICA-based BSS techniques optimize the demixing matrix, which yields more stable and efficient algorithms than MNMF.
Motivated by this issue, a new BSS algorithm called independent low-rank matrix analysis (ILRMA) [33, 34, 35] has been proposed^{1}. In this method, IS-NMF-based low-rank source modeling is introduced into the source model of IVA, namely, a low-rank time-frequency structure (co-occurrence among the time-frequency slots) is estimated for each source by NMF, and the frequency-wise demixing matrix is optimized taking the NMF source model into account without causing the permutation problem. Since the vector source model in time-varying Gaussian IVA can be interpreted as NMF with a single spectral basis, ILRMA is a natural extension of IVA, where ILRMA utilizes an arbitrary number of bases in the source model. Also, ILRMA can be considered as a dual problem of MNMF (mixing) because ILRMA estimates the demixing matrix, i.e., the inverse of the mixing system (MNMF model), using the low-rank source modeling with NMF.
Note that this work extends our preliminary work on t-ILRMA in [44] by developing a new extension, GGD-ILRMA, and providing additional discussion that explains the theoretical relationship between GGD- and t-ILRMA. Also, the experimental results have been updated with new datasets and conditions for more difficult situations in BSS.
The rest of this paper is organized as follows. Section 2 describes the conventional algorithms including IVA and ILRMA, which are the basis for the proposed GGD- and t-ILRMA described in Section 3. Section 4 reports the validation of the proposed methods by conducting BSS experiments with music and speech sources. Finally, Section 5 concludes this paper.
2 Conventional method
2.1 Formulation
where w_{i,n} is the demixing filter for the nth source and ^{H} denotes a Hermitian transpose. The goal of BSS based on FDICA, IVA, or ILRMA is to estimate W_{ i } and obtain y_{ ij } from only the observations x_{ ij } by assuming statistical independence between s_{ij,n} and \(s_{ij,n'}\phantom {\dot {i}\!}\), where n^{′}≠n. In this paper, we only focus on BSS with the determined situation M=N. For the overdetermined situation M>N, principal component analysis is often applied to x_{ ij } for dimensionality reduction so that M=N [46].
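The frequency-wise demixing described above can be sketched in a few lines of NumPy. The array shapes below are our own assumptions for illustration, not the paper's notation:

```python
import numpy as np

def demix(X, W):
    """Frequency-wise linear demixing y_ij = W_i x_ij (a sketch).

    X: complex STFT array of shape (I, J, M) = (freq bins, frames, mics).
    W: demixing matrices of shape (I, N, M), one N x M matrix per bin,
       whose rows play the role of the Hermitian-transposed filters w_{i,n}^H.
    Returns Y of shape (I, J, N) with y_{ij,n} = w_{i,n}^H x_ij.
    """
    # Conjugate W because each row acts as w^H on the observation vector.
    return np.einsum('inm,ijm->ijn', W.conj(), X)
```

With `W` set to the identity at every bin, the output simply reproduces the observation, which is a handy sanity check before estimating the filters.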
2.2 IVA
where ∥·∥_{2} denotes the L_{2} norm and β>0 is the shape parameter of the GGD. Laplace IVA [12, 13, 14] corresponds to β=1. Since the probability of (6) depends only on the norm of \(\bar {\boldsymbol {y}}_{j,n}\) (spherical property), the components in the vector \(\bar {\boldsymbol {y}}_{j,n}\) have higher-order dependence. Therefore, frequency components that have similar activations, such as a fundamental frequency and its harmonic components, are merged into one source, avoiding the permutation problem.
where \(G(\bar {\boldsymbol {y}}_{j,n}) = -\log p(\bar {\boldsymbol {y}}_{j,n})\) is called a contrast function and detW_{ i } denotes the determinant of a matrix W_{ i }. Note that the separated signal y_{ij,n} in \(\bar {\boldsymbol {y}}_{j,n}\) includes the variable W_{ i } as \(y_{ij,n}=\boldsymbol {w}_{i,n}^{\mathrm {H}}\boldsymbol {x}_{ij}\).
where r_{j,n} is the time-varying variance shared over all frequency bins. Similar to (6), (8) also has a spherical property. Note that even though (8) consists of Gaussian distributions, its marginal distribution over j becomes a super-Gaussian distribution because the variance can fluctuate depending on j [17].
Regarding the optimization of W_{ i }, a fast and stable optimization algorithm called iterative projection (IP), which is based on a majorization-minimization (MM) algorithm [47], has been derived for ICA [48], Laplace IVA [49], GGD-IVA [17], and time-varying Gaussian IVA [21]. IP can achieve better convergence than classical gradient-based algorithms.
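As an illustration, one IP update of the filter for source n at a single frequency bin might look as follows. This is a sketch under our own shape conventions; the 1/r weighting corresponds to the time-varying Gaussian contrast function, and the variable names are ours:

```python
import numpy as np

def ip_update(W_i, X_i, r_n, n, eps=1e-12):
    """One iterative-projection (IP) update of the demixing filter w_{i,n}.

    W_i: (M, M) current demixing matrix at frequency bin i.
    X_i: (J, M) observed frames at bin i.
    r_n: (J,) time-varying variances of source n at bin i.
    Returns the updated length-M complex filter.
    """
    J, M = X_i.shape
    # Weighted covariance U = (1/J) sum_j x_j x_j^H / r_j
    U = (X_i.T * (1.0 / (r_n + eps))) @ X_i.conj() / J
    w = np.linalg.solve(W_i @ U, np.eye(M)[:, n])   # (W_i U)^{-1} e_n
    w = w / np.sqrt((w.conj() @ U @ w).real + eps)  # scale normalization
    return w
```

After the update, the filter satisfies w^H U w = 1, which can be checked numerically; this normalization is what removes the scale ambiguity of the closed-form solve.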
2.3 ILRMA based on Gaussian distribution
2.3.1 Generative model
where t_{ik,n}≥0 and v_{kj,n}≥0 are the nonnegative basis and activation elements (NMF variables) of \(\boldsymbol {T}_{n}\in \mathbb {R}_{\geq 0}^{I{\times }K}\) (basis matrix) and \(\boldsymbol {V}_{n}\in \mathbb {R}_{\geq 0}^{K{\times }J}\) (activation matrix), respectively, k=1,⋯,K is the integer index of the basis, and K is the number of NMF bases (spectral patterns). Also, r_{ij,n}≥0 is a sourcewise time-frequency-varying variance that corresponds to the low-rank source model. Therefore, the nonnegative matrix T_{ n }V_{ n } represents the rank-K model spectrogram of the nth source as |Y_{ n }|^{.2}≈T_{ n }V_{ n }, where |·|^{.q} for matrices denotes the element-wise absolute and qth-power operations. Because of the fluctuation of the variance r_{ij,n} over time and frequency, the marginal distribution of the generative model (9) over j becomes a super-Gaussian distribution, which can be used for independence-based BSS.
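As a quick sanity check of the rank-K model spectrogram: the product of an I×K basis matrix and a K×J activation matrix is nonnegative and has rank at most K (sizes below are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 6, 10, 2
T = rng.uniform(0.1, 1.0, (I, K))   # nonnegative bases (spectral patterns)
V = rng.uniform(0.1, 1.0, (K, J))   # nonnegative activations
R = T @ V                           # model power spectrogram r_{ij}
assert R.min() >= 0 and np.linalg.matrix_rank(R) <= K
```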
Note that the variances in p(y_{ij,n}) and p(c_{ij,nk}) are \(r_{ij,n}={\sum \nolimits }_{k} t_{ik,n}v_{kj,n}\) and t_{ik,n}v_{kj,n}, respectively, and they correspond to the expectation values of |y_{ij,n}|^{2} and |c_{ij,nk}|^{2} as r_{ij,n}=E[|y_{ij,n}|^{2}] and t_{ik,n}v_{kj,n}=E[|c_{ij,nk}|^{2}], respectively. Even if \(y_{ij,n}={\sum \nolimits }_{k} c_{ij,nk}\), the additivity of the power spectra does not hold (\(|y_{ij,n}|^{2}\neq {\sum \nolimits }_{k} |c_{ij,nk}|^{2}\)) because of the phase cancelation. However, (9) and (11) mean that the additivity of expectations t_{ik,n}v_{kj,n}=E[|c_{ij,nk}|^{2}] is satisfied as \(r_{ij,n}={\sum \nolimits }_{k} t_{ik,n}v_{kj,n}\) because of the stable property of the Gaussian distribution. Therefore, the generative model (9) theoretically justifies linearly decomposing the power spectrogram |y_{ij,n}|^{2} into K nonnegative parts t_{ik,n}v_{kj,n}. This advantage was extended to a more general domain in [42] using an α-stable distribution, which is a distribution family ensuring the stable property. When α=2, the α-stable distribution is equal to the Gaussian distribution (9) and the additivity of power spectra holds in the expectation sense. When α=1, the α-stable distribution becomes the Cauchy distribution, which ensures the additivity of amplitude spectra in the expectation sense [38].
2.3.2 Objective function and update rules
where X={X_{1},⋯,X_{ M }} and Y={Y_{1},⋯,Y_{ N }} are the sets of the observed and estimated signals, respectively, and the independence between sources, \(p(\mathsf {Y}) = \prod _{n} p(\boldsymbol {Y}_{n})\), is assumed. The first and third terms in (13) correspond to the objective function of time-varying Gaussian IVA [21], and the second and third terms correspond to the objective function of IS-NMF [29]. The task of the ILRMA algorithm is to minimize the objective function \({\mathcal {L}}\) w.r.t. T_{ n }, V_{ n }, and W_{ i }.
where \(\hat {\boldsymbol {y}}_{ij,n}=(\hat {y}_{ij,n1}~\cdots ~\hat {y}_{ij,nM})^{\mathrm {T}}\) is a separated source image whose scale is fitted to the observed signals at each microphone and ∘ denotes the Hadamard product (entrywise multiplication). The detailed implementation can be found in [52].
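The projection-back scale fitting can be sketched as follows. Shapes and names are our own assumptions: A = W_i^{-1} is the estimated mixing matrix, and its nth column carries the nth separated signal back to every microphone:

```python
import numpy as np

def projection_back(Y, W):
    """Scale fitting of separated signals by projection back (a sketch).

    Y: (I, J, N) separated signals; W: (I, N, N) demixing matrices
    acting linearly as y_ij = W_i x_ij.
    Returns source images of shape (I, J, N, M) with
    yhat_{ij,n,m} = [W_i^{-1}]_{m,n} * y_{ij,n}.
    """
    A = np.linalg.inv(W)                     # estimated mixing matrices
    return np.einsum('imn,ijn->ijnm', A, Y)
```

A useful consistency check: summing the images over the source axis reconstructs the observation exactly, since A y = W^{-1} W x = x.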
3 Proposed generalization of ILRMA
3.1 Motivation and strategy
The conventional ILRMA described in Section 2.3 is based on the isotropic complex Gaussian distribution (9) with a time-frequency-varying variance r_{ij,n}. For independence-based BSS, non-Gaussianity of the source signals is required for the separation, and the model (9) relies only on the fluctuation of the variance r_{ij,n}. If the variance r_{ij,n} is constant for all i and j, the model (9) becomes completely Gaussian and independence-based BSS collapses because the ICA algorithm cannot distinguish multiple Gaussian sources. Therefore, it is worth generalizing the distribution in ILRMA to a more flexible non-Gaussian source model. In fact, several approaches based on a non-Gaussian distribution with a time-frequency-varying parameter, such as t-NMF, have been proposed, and it has been reported that NMF audio source modeling based on a non-Gaussian distribution provides better separation performance [39]. From the IVA side, the source distribution has also been generalized by employing the GGD in many studies [16, 17, 18, 19, 20], which gave more accurate BSS results.
For the reasons mentioned above, in this section, we propose two generalizations of the source distribution (generative model) in ILRMA using heavy-tailed distributions: the isotropic complex GGD and the isotropic complex Student’s t distribution. The former is a natural extension of the conventional generative model (9) and has often been used for the generalization of Laplace IVA or time-varying Gaussian IVA as GGD-IVA. The GGD has a shape parameter that controls the super- or sub-Gaussianity. In particular, the GGD includes Laplace and Gaussian distributions as special cases. Since most audio sources follow super-Gaussian distributions, in this paper, we only focus on GGD-ILRMA with a super-Gaussian region.
The latter generalization was inspired by a recently developed framework [42] that ensures the stable property of complex-valued random variables, i.e., audio modeling based on an α-stable distribution. In this model, similar to IS-NMF (11), the decomposition of a complex-valued spectrogram into several nonnegative parts is theoretically justified by the stable property of this distribution family. Student’s t distribution has a degree-of-freedom parameter that determines the shape of the distribution and its super-Gaussianity. Similar to the GGD, Student’s t distribution includes Cauchy and Gaussian distributions as special cases, which are also special cases of the α-stable distribution. Therefore, NMF source modeling (decomposition of complex-valued spectrogram Y_{ n }) in t-ILRMA is partially justified when the Gaussian or Cauchy distribution is assumed, which is theoretically preferable for audio signal processing.
In addition, we introduce a new domain parameter for NMF modeling in GGD- and t-ILRMA because the generative model of a spectrogram strongly depends on the data domain, such as the selection of the amplitude- or power-domain spectrogram to be used. By controlling both the generative model and the modeling domain of data, we can find a suitable statistical assumption for the audio BSS problem.
3.2 ILRMA based on GGD
3.2.1 Generative model and objective function in GGD-ILRMA
It is obvious that GGD-ILRMA (30) coincides with the conventional ILRMA (13) when β=p=2.
3.2.2 Derivation of update rules for GGD-ILRMA
where \({\mathcal {C}}_{1}\) includes the constant terms that do not depend on w_{i,n}. Since (33) has the same form as the conventional ILRMA (14) w.r.t. w_{i,n}, we can apply IP to the majorization function (33). The update rules for w_{i,n} are derived as (34) with (32) and (17)–(19), where (34) coincides with (16) when β=p=2.
respectively. By applying (35) and (36) to (30), the majorization function of (30) can be designed as
Similar to (41), we can obtain the update rules for v_{kj,n} as
These algorithms can be interpreted as NMF based on the GGD (hereafter called GGD-NMF). Since the derivations of the update rules are based on the MM algorithm, they ensure the monotonic decrease in the objective function in each iteration.
3.3 ILRMA based on Student’s t distribution
3.3.1 Generative model and objective function in t-ILRMA
where ν>0 is the degree-of-freedom parameter that controls the super-Gaussianity of Student’s t distribution and σ_{ij,n} is defined as (29). The distribution (43) is depicted in Fig. 4b. Similar to (28), this p.d.f. also becomes identical to (9) when ν→∞, and the probability of (43) does not depend on the phase of y_{ij,n}. For ν=1, (43) corresponds to the complex Cauchy distribution. The objective function of t-ILRMA can be obtained from (43) as
3.3.2 Derivation of update rules for t-ILRMA
where \({\mathcal {C}}_{3}\) includes the constant terms that do not depend on w_{i,n}. Since (48) has the same form as the conventional ILRMA (14) w.r.t. w_{i,n}, we can apply IP to the majorization function (48). The update rules for w_{i,n} are derived as (49) with (46) and (17)–(19), where (49) coincides with (16) when ν→∞ and p=2.
where \({\mathcal {C}}_{4}\) includes the constant terms that do not depend on t_{ik,n} or v_{kj,n}. By setting the partial derivative of (52) w.r.t. t_{ik,n} to zero, we have
These update rules are similar to those in t-NMF [39], but they include the new domain parameter p. Similar to GGD-ILRMA, all the derivations of the update rules are based on the MM algorithm, thus ensuring their theoretical convergence.
3.4 Relationship between GGD- and t-ILRMA
where b is a new exponent parameter. Note that (58) and (59) with b=0.5 were originally derived on the basis of the MM algorithm [50], then the update rules with b=1 were derived using the majorization-equalization (ME) algorithm [53]. Recently, we have proven that (58) and (59) with any value of b in the range (0,1] can be interpreted as valid update rules of IS-NMF, which are obtained by applying the parametric ME algorithm to the objective function in IS-NMF, and can be used for IS-NMF or ILRMA without losing the theoretical convergence [54]. This parameter b controls the optimization speed of the NMF variables t_{ik,n} and v_{kj,n}, and b=1 provides the fastest convergence in IS-NMF.
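As a concrete illustration, the standard IS-NMF multiplicative updates with the exponent parameter b can be sketched as follows. This is a simplified NumPy version under our own naming; per the text, any b in (0,1] keeps the monotonic non-increase of the objective, with b = 0.5 corresponding to the MM-derived rule and b = 1 to the ME rule:

```python
import numpy as np

def is_nmf(X, K, n_iter=100, b=0.5, eps=1e-12, seed=0):
    """IS-NMF multiplicative updates with exponent b in (0, 1] (a sketch).

    X: nonnegative power spectrogram (I x J); returns T (I x K), V (K x J)
    such that T @ V approximates X in the Itakura-Saito sense.
    """
    rng = np.random.default_rng(seed)
    I, J = X.shape
    T = rng.uniform(0.1, 1.0, (I, K))
    V = rng.uniform(0.1, 1.0, (K, J))
    for _ in range(n_iter):
        R = T @ V + eps                                        # current model
        T *= (((X / R**2) @ V.T) / ((1.0 / R) @ V.T + eps)) ** b
        R = T @ V + eps
        V *= ((T.T @ (X / R**2)) / (T.T @ (1.0 / R) + eps)) ** b
    return T, V
```

Larger b takes bigger multiplicative steps per iteration, which matches the statement above that b = 1 provides the fastest convergence in IS-NMF.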
As mentioned in [39], the update rule of t-NMF (62) corresponds to that of IS-NMF (58) by assuming the observed signal to be (63), which is the “harmonic mean” of |y_{ij,n}|^{2} and \(\sigma _{ij,n}^{2}\) with a ratio of ν to two. The same reformulation can be found for the variable v_{kj,n}.
These facts mean that both NMF algorithms approximate the virtual observation z_{ij,n} by the low-rank model σ_{ij,n} in the IS-NMF sense. Since z_{ij,n} consists of the geometric or harmonic mean of the real observation |y_{ij,n}| and the current low-rank model σ_{ij,n}, the low-rankness of the estimated (updated) model σ_{ij,n} tends to be more emphasized compared with the IS-NMF decomposition using only the observation |y_{ij,n}|. In other words, the geometric or harmonic mean in z_{ij,n} prevents σ_{ij,n} from overfitting to |y_{ij,n}| by ignoring sparse outliers in |y_{ij,n}|, which enhances the low-rank decomposition. In (61) or (63), the shape parameter β or ν controls the intensity of such low-rank enhancement in the NMF decomposition. However, intriguingly, the domain parameter p also affects the estimation of the low-rank model σ_{ij,n}. In GGD-NMF (61), setting p<β makes the geometric mean correspond to the point externally dividing |y_{ij,n}| and σ_{ij,n}, which mitigates the intensity of the low-rank enhancement mentioned above. Also, in t-NMF, p<2 causes the same behavior because the term \(\sigma _{ij,n}^{p-2}\) exists in (63), where the inverse of \(\sigma _{ij,n}^{2-p}\) (2−p>0) mitigates the low-rankness.
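For the power domain (p = 2), the t-NMF virtual observation described above is a weighted harmonic mean of |y|² and σ² with weights in the ratio ν : 2, which can be sketched as (function and variable names are ours):

```python
import numpy as np

def t_nmf_virtual_obs(y_abs2, sigma2, nu):
    """Weighted harmonic mean of |y|^2 and sigma^2 with weights nu and 2:
    the 'virtual observation' z that t-NMF hands to the IS-NMF update.
    A sketch for the p = 2 case; per the text, the p < 2 variant carries
    an additional sigma^{p-2} factor."""
    return (nu + 2.0) / (nu / y_abs2 + 2.0 / sigma2)
```

As ν→∞ the mean collapses to |y|² and t-NMF reduces to plain IS-NMF, while for finite ν the virtual observation is pulled toward the current model σ², which is exactly the low-rank enhancement discussed above.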
Summary of parameterized properties in GGD- and t-ILRMA
| | Shape parameter (β, ν) | Domain parameter (p) |
|---|---|---|
| GGD-ILRMA | β→0: low-rankness injection via geometric mean; faster NMF update. Special cases: β=2 (Gaussian dist.), β=1 (Laplace dist.) | p→0: low-rankness mitigation; slower NMF update |
| t-ILRMA | ν→1: low-rankness injection via harmonic mean. Special cases: ν→∞ (Gaussian dist.), ν=1 (Cauchy dist.) | p→0: low-rankness mitigation; slower NMF update |
In addition, it is worth mentioning that the exponent value of the NMF update rules, b, is also important for ILRMA. It has been experimentally revealed that a smaller value of b is preferable for achieving better separation performance, although the optimization of r_{ij,n} then becomes slower. This may be because a smaller b helps avoid trapping at a poor local minimum in the early and middle stages of the iterations in ILRMA, since the optimization balance between W_{ i } and r_{ij,n} is critical for converging toward a better solution. In GGD- or t-ILRMA, the exponent value in (60) or (62) is p/(β+p) or p/(p+2), respectively. These values become small when p is small and β is large, which may explain the better separation results.
4 Results and discussion
To evaluate our proposed algorithms, we conducted some BSS experiments using music and speech mixtures. We first compared various conventional methods using observed signals in the case of two sources and two microphones. Then, we compared the conventional and proposed ILRMA in a more difficult situation with three sources and three microphones.
4.1 Dataset
Musical instruments used in the music dataset
Part | Instruments |
---|---|
Melody 1 | Oboe, trumpet, and horn |
Melody 2 | Flute, violin, and clarinet |
Midrange | Piano and harpsichord |
Bass | Trombone, bassoon, and cello |
4.2 BSS experiment with two sources
4.2.1 Conditions
Dry sources used in two-source case
Signal | Data name | Sources (1/2) | Signal length [s] |
---|---|---|---|
Music 1 | Melody 2/Midrange | Flute/piano | 5.0 |
Music 2 | Melody 1/Melody 2 | Oboe/flute | 5.0 |
Music 3 | Melody 1/Bass | Trumpet/bassoon | 5.0 |
Music 4 | Melody 2/Midrange | Violin/harpsichord | 5.0 |
Music 5 | Melody 1/Melody 2 | Horn/clarinet | 5.0
Music 6 | Midrange/Bass | Piano/cello | 5.0 |
Speech 1 | dev1_female4 | src_1/src_2 | 10.0 |
Speech 2 | dev1_female4 | src_3/src_4 | 10.0 |
Speech 3 | dev1_male4 | src_1/src_2 | 10.0 |
Speech 4 | dev1_male4 | src_3/src_4 | 10.0 |
Experimental conditions
Sampling frequency | 16 kHz |
Window function in STFT | Hamming window |
Window length in STFT | 4096 points (256 ms) |
Shift length in STFT | 2048 points (128 ms) |
Number of NMF bases K | Four for music case and two for |
speech case | |
Number of iterations of update rules | 200 |
Initial values of T_{ n } and V_{ n } | Uniform random values in the |
range (0,1) | |
Initial values of W_{ i } | Identity matrix |
4.2.2 Results using fixed parameters
Overall average SDR improvements (dB) in two-source case for the best parameter settings
Source and impulse response | Laplace IVA | GGD-IVA | MNMF | t-MNMF | ILRMA | GGD-ILRMA | t-ILRMA |
---|---|---|---|---|---|---|---|
Music and E2A_1 | 2.41 | 3.11 (β = 0.5) | 2.42 | 3.30 (ν = 1) | 6.24 | 7.52 (β = 1.94, p = 0.5) | 7.61 (ν = 1000, p = 0.5) |
Speech and E2A_1 | 3.94 | 4.89 (β = 0.5) | - 2.04 | 0.94 (ν = 15) | 7.73 | 8.70 (β = 1.94, p = 0.5) | 8.73 (ν = 1000, p = 1.0) |
Music and E2A_2 | 1.97 | 2.19 (β = 0.5) | - 2.25 | - 0.03 (ν = 1) | 4.97 | 6.30 (β = 1.98, p = 0.5) | 6.39 (ν = 1000, p = 0.5) |
Speech and E2A_2 | 3.76 | 4.63 (β = 0.5) | - 3.41 | 0.79 (ν = 15) | 5.76 | 6.36 (β = 1.94, p = 0.5) | 6.17 (ν = 1000, p = 1.0) |
4.2.3 Results using parameter tempering
Overall average SDR improvements (dB) in two-source case employing parameter tempering for the best parameter settings
Source and impulse response | ILRMA | GGD-ILRMA | t-ILRMA |
---|---|---|---|
Music and E2A_1 | 6.24 | 7.66 (β = 1.99, p = 0.5) | 7.47 (ν = 1000, p = 0.5) |
Speech and E2A_1 | 7.73 | 9.09 (β = 1.94, p = 0.5) | 8.61 (ν = 3, p = 1.0) |
Music and E2A_2 | 5.22 | 6.87 (β = 1.94, p = 0.5) | 6.81 (ν = 1000, p = 0.5) |
Speech and E2A_2 | 6.09 | 6.45 (β = 1.98, p = 0.5) | 6.05 (ν = 30, p = 2.0) |
4.2.4 Performance for various signal lengths
In a BSS framework, the length of the observed signal is important for achieving better separation performance. This is because the accuracy of statistical estimation decreases when the number of time frames J is insufficient [60, 61]. In the extreme case, the demixing matrix W_{ i } cannot be updated by IP when J=1 because the rank of U_{i,n} in (16), G_{i,n} in (34), or H_{i,n} in (49) becomes unity. However, it has not been clarified whether a heavy-tailed source distribution provides more robust statistical estimation when fewer time frames are available. Thus, in this subsection, we experimentally compare the separation performance of ILRMA, GGD-ILRMA, and t-ILRMA for observed signals with fewer or more time frames.
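The rank deficiency mentioned above is easy to verify: with a single frame, the weighted covariance reduces to a rank-1 outer product, so the linear system inside IP becomes singular. A minimal check:

```python
import numpy as np

# A single observation frame x (J = 1, M = 3 microphones, values arbitrary).
x = np.array([1.0 + 2.0j, 0.5 - 1.0j, 2.0 + 0.5j])
U = np.outer(x, x.conj())             # covariance built from one frame
assert np.linalg.matrix_rank(U) == 1  # rank 1: W_i U cannot be inverted
```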
To simulate short and long source signals, we utilized the dry sources described in Table 3. As the short dry sources, the music and speech signals were trimmed to their first half, giving signal lengths of 2.5 s (music) and 5.0 s (speech). In contrast, the long music and speech signals were produced by repeating the entire dry sources twice, so their signal lengths became 10.0 s (music) and 20.0 s (speech), respectively. These dry sources were convolved with E2A_1 to produce the observed mixture signals with two sources, where the combinations of dry sources were the same as those described in Table 3. The other experimental conditions were the same as those in Section 4.2.2.
Overall average SDR improvements (dB) in two-source case with various signal lengths for the best parameter settings
Source and signal length | ILRMA | GGD-ILRMA | t-ILRMA |
---|---|---|---|
Music (2.5 s, short) | 3.38 | 3.43 (β = 1.99, p = 2.0) | 3.51 (ν = 2, p = 1.0) |
Music (5.0 s, original) | 6.24 | 7.52 (β = 1.94, p = 0.5) | 7.61 (ν = 1000, p = 0.5) |
Music (10.0 s, long) | 7.29 | 8.83 (β = 2.00, p = 1.0) | 8.92 (ν = 1000, p = 0.5) |
Speech (5.0 s, short) | 7.26 | 7.69 (β = 1.98, p = 0.5) | 8.33 (ν = 1000, p = 1.0) |
Speech (10.0 s, original) | 7.73 | 8.70 (β = 1.94, p = 0.5) | 8.73 (ν = 1000, p = 1.0) |
Speech (20.0 s, long) | 8.05 | 8.41 (β = 1.94, p = 1.0) | 8.29 (ν = 300, p = 1.0) |
4.3 BSS experiment with three sources
Dry sources used in three-source case
Signal | Data name | Sources (1/2/3) | Signal lengths [s] |
---|---|---|---|
Music 1 | Melody 2/Midrange/Bass | Clarinet/Piano/Cello | 5.0 |
Music 2 | Melody 1/Melody 2/Bass | Horn/Clarinet/Bassoon | 5.0 |
Music 3 | Melody 1/Midrange/Bass | Trumpet/Piano/Bassoon | 5.0 |
Music 4 | Melody 2/Midrange/Bass | Violin/Harpsichord/Bassoon | 5.0 |
Speech 1 | dev1_female4 | src_1/src_2/src_3 | 10.0 |
Speech 2 | dev1_female4 | src_2/src_3/src_4 | 10.0 |
Speech 3 | dev1_male4 | src_1/src_2/src_3 | 10.0 |
Speech 4 | dev1_male4 | src_2/src_3/src_4 | 10.0 |
Overall average SDR improvements (dB) in three-source case for the best parameter settings
Source | ILRMA | GGD-ILRMA | t-ILRMA | GGD-ILRMA w/ tempering | t-ILRMA w/ tempering |
---|---|---|---|---|---|
Music | 1.76 | 3.24 (β = 1.94, p = 0.5) | 3.19 (ν = 300, p = 0.5) | 3.36 (β = 1.82, p = 1.0) | 3.29 (ν = 1, p = 1.0) |
Speech | 2.79 | 3.14 (β = 1.94, p = 1.0) | 2.94 (ν = 1000, p = 1.0) | 3.32 (β = 1.40, p = 0.5) | 3.22 (ν = 10, p = 2.0) |
4.4 Comparison of computational times
Relative computational times normalized by Laplace IVA based on IP, where the length of the observed signal is 10 s
5 Conclusions
In this paper, we proposed two generalizations of the source distribution assumed in ILRMA that introduce a heavy-tailed property by using the GGD and Student’s t distribution. The GGD can be considered as a natural extension of the conventional Gaussian source model, and Student’s t distribution partially satisfies the stable property of complex-valued random variables, which is desirable for NMF-based low-rank decomposition. We derived efficient optimization algorithms for GGD- and t-ILRMA, which ensure a monotonic decrease in the objective function and provide faster computation than existing MNMF-based BSS methods. Also, we revealed an interesting relationship between GGD- and t-NMF: GGD-NMF is equivalent to IS-NMF upon assuming the geometric mean of the data and the low-rank model as an observation, whereas t-NMF corresponds to the same algorithm with the harmonic mean of the data and the low-rank model, as previously mentioned. These properties lead to more accurate parameter estimation in an ILRMA-based BSS framework, resulting in higher separation accuracy than that of conventional ILRMA with the Gaussian source distribution. The experiments confirmed that the proposed generalized ILRMA improves the separation accuracy, especially for music mixture signals. However, the improvement for speech mixture signals is still limited. This is because typical speech sources do not have an apparent low-rank time-frequency structure, and the NMF-based source model in ILRMA cannot capture the precise spectral structures of speech sources even when the source model is generalized by heavy-tailed distributions. Better modeling of speech sources remains as future work.
Footnotes
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped improve the quality of this manuscript.
Funding
This work was partly supported by the ImPACT Program of Council for Science, SECOM Science and Technology Foundation, and JSPS KAKENHI Grant Numbers JP16H01735, JP17H06101, and JP17H06572.
Availability of data and materials
Not available online. Please contact the corresponding author for data requests.
Authors’ contributions
All authors have contributed equally. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. P Bofill, M Zibulevsky, Underdetermined blind source separation using sparse representations. Signal Process. 81(11), 2353–2362 (2001).
- 2. S Araki, H Sawada, R Mukai, S Makino, Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Process. 87(8), 1833–1847 (2007).
- 3. L Zhen, D Peng, Z Yi, Y Xiang, P Chen, Underdetermined blind source separation using sparse coding. IEEE Trans. Neural Netw. Learn. Syst. 28(12), 3102–3108 (2017).
- 4. P Comon, Independent component analysis, a new concept? Signal Process. 36(3), 287–314 (1994).
- 5. P Smaragdis, Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22(1), 21–34 (1998).
- 6. S Kurita, H Saruwatari, S Kajita, K Takeda, F Itakura, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Evaluation of blind signal separation method using directivity pattern under reverberant conditions (2000), pp. 3140–3143.
- 7. H Sawada, R Mukai, S Araki, S Makino, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Convolutive blind source separation for more than two sources in the frequency domain (2004), pp. 885–888.
- 8. H Saruwatari, T Kawamura, T Nishikawa, A Lee, K Shikano, Blind source separation based on a fast-convergence algorithm combining ICA and beamforming. IEEE Trans. Audio Speech Lang. Process. 14(2), 666–678 (2006).
- 9. N Murata, S Ikeda, A Ziehe, An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 41(1–4), 1–24 (2001).
- 10. H Sawada, R Mukai, S Araki, S Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process. 12(5), 530–538 (2004).
- 11. H Sawada, S Araki, S Makino, in Proc. IEEE Int. Symp. Circuits Syst. Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS (2007), pp. 3247–3250.
- 12. A Hiroe, in Proc. Int. Conf. Independent Compon. Anal. Blind Source Separation. Solution of permutation problem in frequency domain ICA using multivariate probability density functions (2006), pp. 601–608.
- 13. T Kim, T Eltoft, T-W Lee, in Proc. Int. Conf. Independent Compon. Anal. Blind Source Separation. Independent vector analysis: an extension of ICA to multivariate components (2006), pp. 165–172.
- 14. T Kim, HT Attias, S-Y Lee, T-W Lee, Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. Audio Speech Lang. Process. 15(1), 70–79 (2007).
- 15. G Box, G Tiao, Bayesian Inference in Statistical Analysis (Addison Wesley, Reading, Mass., 1973).
- 16. T Itahashi, K Matsuoka, Stability of independent vector analysis. Signal Process. 92(8), 1809–1820 (2012).
- 17. N Ono, in Proc. Asia-Pacific Signal and Info. Process. Assoc. Annual Summit and Conf. Auxiliary-function-based independent vector analysis with power of vector-norm type weighting functions (2012).
- 18. M Anderson, GS Fu, R Phlypo, T Adalı, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Independent vector analysis, the Kotz distribution, and performance bounds (2013), pp. 3243–3247.
- 19. Y Liang, J Harris, SM Naqvi, G Chen, JA Chambers, Independent vector analysis with a generalized multivariate Gaussian source prior for frequency domain blind source separation. Signal Process. 105, 175–184 (2014).
- 20. Z Boukouvalas, GS Fu, T Adalı, in Proc. Annual Conf. Info. Sci. and Syst. An efficient multivariate generalized Gaussian distribution estimator: application to IVA (2015).
- 21. T Ono, N Ono, S Sagayama, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. User-guided independent vector analysis with source activity tuning (2012), pp. 2417–2420.
- 22. DD Lee, HS Seung, Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999).
- 23. DD Lee, HS Seung, in Proc. Neural Info. Process. Syst. Algorithms for non-negative matrix factorization (2000), pp. 556–562.
- 24. T Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech, Lang. Process. 15(3), 1066–1074 (2007).
- 25. P Smaragdis, B Raj, M Shashanka, in Proc. Int. Conf. Independent Compon. Anal. Signal Separation. Supervised and semi-supervised separation of sounds from single-channel mixtures (2007), pp. 414–421.
- 26. A Ozerov, C Févotte, M Charbit, in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. Factorial scaled hidden Markov model for polyphonic audio representation and source separation (2009), pp. 121–124.
- 27. D Kitamura, H Saruwatari, K Yagi, K Shikano, Y Takahashi, K Kondo, Music signal separation based on supervised nonnegative matrix factorization with orthogonality and maximum-divergence penalties. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E97-A(5), 1113–1118 (2014).
- 28. D Kitamura, H Saruwatari, H Kameoka, Y Takahashi, K Kondo, S Nakamura, Multichannel signal separation combining directional clustering and nonnegative matrix factorization with spectrogram restoration. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23(4), 654–669 (2015).
- 29. C Févotte, N Bertin, J-L Durrieu, Nonnegative matrix factorization with the Itakura–Saito divergence. With application to music analysis. Neural Comput. 21(3), 793–830 (2009).
- 30. A Ozerov, C Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio, Speech, Lang. Process. 18(3), 550–563 (2010).
- 31. H Kameoka, T Yoshioka, M Hamamura, JL Roux, K Kashino, in Proc. Int. Conf. Latent Variable Anal. Signal Separation. Statistical model of speech signals based on composite autoregressive system with application to blind source separation (2010), pp. 245–253.
- 32. H Sawada, H Kameoka, S Araki, N Ueda, Multichannel extensions of non-negative matrix factorization with complex-valued data. IEEE Trans. Audio, Speech, Lang. Process. 21(5), 971–982 (2013).
- 33. D Kitamura, N Ono, H Sawada, H Kameoka, H Saruwatari, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Efficient multichannel nonnegative matrix factorization exploiting rank-1 spatial model (2015), pp. 276–280.
- 34. D Kitamura, N Ono, H Sawada, H Kameoka, H Saruwatari, Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 24(9), 1626–1641 (2016).
- 35. D Kitamura, N Ono, H Sawada, H Kameoka, H Saruwatari, in Audio Source Separation, ed. by S Makino. Determined blind source separation with independent low-rank matrix analysis (Springer, Cham, 2018), pp. 125–155. https://link.springer.com/chapter/10.1007%2F978-3-319-73031-8_6#citeas.
- 36. C Févotte, SJ Godsill, A Bayesian approach for blind separation of sparse sources. IEEE Trans. Audio, Speech, Lang. Process. 14(6), 2174–2188 (2006).
- 37. S Leglaive, R Badeau, G Richard, in Proc. Eur. Signal Process. Conf. Semi-blind Student’s t source separation for multichannel audio convolutive mixtures (2017).
- 38. A Liutkus, D FitzGerald, R Badeau, in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. Cauchy nonnegative matrix factorization (2015).
- 39. K Yoshii, K Itoyama, M Goto, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Student’s t nonnegative matrix factorization and positive semidefinite tensor factorization for single-channel audio source separation (2016), pp. 51–55.
- 40. K Kitamura, Y Bando, K Itoyama, K Yoshii, in Proc. Int. Workshop Acoust. Signal Enh. Student’s t multichannel nonnegative matrix factorization for blind source separation (2016).
- 41. G Samorodnitsky, MS Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance (Chapman & Hall/CRC Press, Florida, 1994).
- 42. A Liutkus, R Badeau, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Generalized Wiener filtering with fractional power spectrograms (2015), pp. 266–270.
- 43. S Leglaive, U Simsekli, A Liutkus, R Badeau, G Richard, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Alpha-stable multichannel audio source separation (2017), pp. 576–580.
- 44. S Mogami, D Kitamura, Y Mitsui, N Takamune, H Saruwatari, N Ono, in Proc. IEEE Int. Workshop Mach. Learn. Signal Process. Independent low-rank matrix analysis based on complex Student’s t-distribution for blind audio source separation (2017).
- 45. NQK Duong, E Vincent, R Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010).
- 46. D Kitamura, N Ono, H Sawada, H Kameoka, H Saruwatari, in Proc. Eur. Signal Process. Conf. Relaxation of rank-1 spatial constraint in overdetermined blind source separation (2015), pp. 1271–1275.
- 47. DR Hunter, K Lange, Quantile regression via an MM algorithm. J. Comput. Graph. Stat. 9(1), 60–77 (2000).
- 48. N Ono, S Miyabe, in Proc. Int. Conf. Latent Variable Anal. Signal Separation. Auxiliary-function-based independent component analysis for super-Gaussian sources (2010), pp. 165–172.
- 49. N Ono, in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. Stable and fast update rules for independent vector analysis based on auxiliary function technique (2011), pp. 189–192.
- 50. M Nakano, H Kameoka, JL Roux, Y Kitano, N Ono, S Sagayama, in Proc. IEEE Int. Workshop Mach. Learn. Signal Process. Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with beta-divergence (2010), pp. 283–288.
- 51. N Murata, S Ikeda, A Ziehe, An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 41(1–4), 1–24 (2001).
- 52. D Kitamura, Algorithms for Independent Low-rank Matrix Analysis. http://d-kitamura.net/pdf/misc/AlgorithmsForIndependentLowRankMatrixAnalysis.pdf. Accessed 27 Apr 2018.
- 53. C Févotte, J Idier, Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23(9), 2421–2456 (2011).
- 54. Y Mitsui, D Kitamura, N Takamune, H Saruwatari, Y Takahashi, K Kondo, in Proc. IEEE Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process. Independent low-rank matrix analysis based on parametric majorization-equalization algorithm (2017), pp. 98–102.
- 55. D Kitamura, Open Dataset: songKitamura. http://d-kitamura.net/en/dataset_en.htm. Accessed 27 Apr 2018.
- 56. S Araki, F Nesta, E Vincent, Z Koldovský, G Nolte, A Ziehe, A Benichoux, in Proc. Int. Conf. Latent Variable Anal. Signal Separation. The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation (2012), pp. 414–422.
- 57. Third Community-based Signal Separation Evaluation Campaign (SiSEC 2011). http://sisec2011.wiki.irisa.fr. Accessed 27 Apr 2018.
- 58. S Nakamura, K Hiyane, F Asano, T Nishiura, T Yamada, in Proc. Int. Conf. Lang. Res. Eval. Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition (2000), pp. 965–968.
- 59. E Vincent, R Gribonval, C Févotte, Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech, Lang. Process. 14(4), 1462–1469 (2006).
- 60. S Araki, R Mukai, S Makino, T Nishikawa, H Saruwatari, The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Trans. Speech and Audio Process. 11(2), 109–116 (2003).
- 61. D Kitamura, N Ono, H Saruwatari, in Proc. Eur. Signal Process. Conf. Experimental analysis of optimal window length for independent low-rank matrix analysis (2017), pp. 1210–1214.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.