Abstract
Optimal transport as a loss for machine learning optimization problems has recently gained a lot of attention. Building upon recent advances in computational optimal transport, we develop an optimal transport nonnegative matrix factorization (NMF) algorithm for supervised speech blind source separation (BSS). Optimal transport allows us to design and leverage a cost between short-time Fourier transform (STFT) spectrogram frequencies, which takes into account how humans perceive sound. We give empirical evidence that our proposed optimal transport NMF leads to perceptually better results than NMF with other losses, for both isolated voice reconstruction and speech denoising using BSS. Finally, we demonstrate how to use optimal transport for cross-domain sound processing tasks, where frequencies represented in the input spectrograms may be different from one spectrogram to another.
1 Introduction
Source separation is the task of separating a mixed signal into different components, usually referred to as sources. In the context of sound processing, it can be used to separate speakers whose voices have been recorded simultaneously. Blind source separation (BSS) aims at doing so with only sound data, that is, without information such as the time when each source is active or the location of the sources with respect to the recording devices. A common way to address this task is to decompose the signal spectrogram by nonnegative matrix factorization (NMF) [15], as proposed for example by [25] as well as [29]. Denoting by \(\tilde {x}_{j, i}\) the (complex) short-time Fourier transform (STFT) coefficient of the input signal at frequency bin j and time frame i, and by X its magnitude spectrogram defined as \(x_{j, i} = |\tilde {x}_{j, i}|\), the BSS problem can be tackled by solving the NMF problem
\[ \min_{D^{(k)},\, W^{(k)}} \; \sum_{i=1}^{t} \ell \left ({x}_{i}, \sum_{k=1}^{N} D^{(k)} {w}_{i}^{(k)}\right), \]
where N is the number of sources, t is the number of time windows, x_{i} is the ith column of X, \({w}_{i}^{(k)}\) is the ith column of W^{(k)}, and ℓ is a loss function. Each dictionary matrix D^{(k)} and weight matrix W^{(k)} are related to a single source. In a supervised setting, each source has training data and all the D^{(k)}s are learned in advance during a training phase. At test time, given a new signal, separated spectrograms are recovered from the D^{(k)}s and W^{(k)}s and corresponding signals can be reconstructed with suitable post-processing. Several loss functions ℓ have been considered in the literature, such as the squared Euclidean distance [15, 25], the Kullback-Leibler divergence [15, 28], or the Itakura-Saito divergence [9, 24].
In the present article, we propose to use optimal transport as a loss between spectrograms to perform supervised speech BSS with NMF. Optimal transport is defined as the minimum cost of moving the mass from one histogram to another. By taking into account a transportation cost between frequencies, this provides a powerful metric to compare STFT spectrograms. One of the main advantages of using optimal transport as a loss is that it can quantify the amplitude of a frequency-shift noise, coming for example from quantization or the tuning of a musical instrument. Other metrics such as the Euclidean distance or Kullback-Leibler divergence, which compare spectrograms elementwise, are almost blind to this type of noise (see Fig. 1). Another advantage over elementwise metrics is that optimal transport enables the use of different quantizations, i.e., frequency supports, at training and test times. Indeed, the frequencies represented on a spectrogram depend on the sampling rate of the signal and the time windows used for its computation, both of which can change between training and test times. With optimal transport, we do not need to requantize the training and testing data so that they share the same frequency support: optimal transport is well defined between spectrograms with distinct supports as long as we can define a transportation cost between frequencies. Finally, the optimal transport framework enables us to generalize the Wiener filter, a common post-processing step for source separation, by using optimal transport plans, so that it can be applied to data quantized on different frequencies.
NMF with an optimal transport loss was first proposed by [23]. They solved this problem by using a bi-convex formulation and relied on an approximation of optimal transport based on wavelets [27]. Recently, [22] proposed fast algorithms to compute NMF with an entropy-regularized optimal transport loss, which are more flexible in the sense that they do not require any assumption on the frequency quantization or on the cost function used. However, their approach requires all columns x_{i} of the input matrix to be normalized so that they sum to 1. Normalizing all time frames is not desirable in sound processing tasks because time frames with low energy usually correspond to noise, and it would amplify their contribution to the objective.
Similarly, optimal transport was proposed as a loss for performing principal component analysis (PCA) [3, 5], a task which is closely related to dictionary learning and NMF. However, rather than learning a dictionary on data histograms directly, they proposed to learn a dictionary on optimal mappings between a reference histogram and these histograms. Their approach was motivated by the Riemannian geometry of the space of histograms equipped with the optimal transport distance with respect to the squared Euclidean cost. Since their framework is limited to the squared Euclidean cost, their approach is unsuitable for spectrogram data, where a specifically designed cost should be considered, as we advocate in this article.
Using optimal transport as a loss between spectrograms was also proposed by [10] under the name “optimal spectral transportation.” They developed a novel method for unsupervised music transcription which achieves state-of-the-art performance. Their method relies on a cost matrix designed specifically for musical instruments, allowing them to use Diracs as dictionary columns. That is, they fix each dictionary column to a vector with a single nonzero entry and learn only the corresponding coefficients. This trivial structure of the dictionary results in efficient coefficient computation. However, this approach cannot be applied as is to speech separation since it relies on the assumption that a musical note can be represented as its fundamental. It also requires designing the cost of moving the fundamental to its harmonics and neighboring frequencies. Because human voices are intrinsically more complex, it is necessary to learn both the dictionary and the coefficients, i.e., to solve full NMF problems.
1.1 Our contributions
In this paper, we extend the optimal transport NMF of [22] to the case where the columns of the input matrix X are not normalized in order to propose an algorithm which is suitable for spectrogram data. We define a cost between frequencies so that the optimal transport objective between spectrograms provides a relevant metric between them. We apply our NMF framework to isolated voice reconstruction and show that an optimal transport loss yields better results than other classical losses. We show that optimal transport yields comparable results to other losses for BSS, where the sources to separate are voices. Moreover, we show that optimal transport achieves better results than other losses for learning a “universal” voice model, i.e., a model that can be applied to any voice, regardless of the speaker. We use this universal voice model to perform speech denoising, which is BSS where one of the sources is a voice and the other is noise. Finally, we show how to use our framework for cross-domain BSS, where frequencies represented in the test spectrograms may be different from the ones in the dictionary. This may happen for example when train and test data are recorded with different equipment, or when the STFT is computed with different parameters.
1.2 Notations
We denote matrices in uppercase, vectors in bold lowercase, and scalars in lowercase. If M is a matrix, M^{⊤} is its transpose, m_{i} is its ith column, m_{j} its jth row and Im M its image. 1_{n} denotes the all-ones vector in \(\mathbb {R}^{n}\); when the dimension can be deduced from context, we simply write 1. For two matrices A and B of the same size, we denote their inner product 〈A,B〉:=tr(A^{⊤}B). We denote Σ_{n} the (n−1)-dimensional simplex: \(\Sigma _{n}:= \left \{ x \in \mathbb {R}_{+}^{n} \colon \|x\|_{1}=1 \right \}\).
2 Background
We start by introducing optimal transport, its entropy regularization, which we will use as the loss ℓ, and previous works on optimal transport NMF. For a more comprehensive overview of optimal transport from a computational perspective, see [20].
2.1 Optimal transport
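Exact optimal transport. Let a∈Σ_{m}, b∈Σ_{n}. The polytope of transportation matrices between a and b is defined as
\[ U(a,b) := \left \{ T \in \mathbb {R}_{+}^{m \times n} \colon T 1_{n} = a,\; T^{\top } 1_{m} = b \right \}. \]
Given a cost matrix \(C\in \mathbb {R}^{m \times n}\), the minimum transportation cost between a and b is defined as
\[ \text {OT}(a,b) := \min _{T \in U(a,b)} \langle C, T \rangle . \]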
When n=m and the cost matrix is the pth power (\(p\geqslant 1\)) of a distance matrix, i.e., c_{i,j}=d(y_{i},y_{j})^{p} for some (y_{i}) in a metric space (Ω, d), then OT(·,·)^{1/p} is a distance on the set of vectors in \(\mathbb {R}_{+}^{n}\) with the same ℓ_{1} norm ([31], Theorem 7.3). We can see the vectors y_{i} as features and a and b as the quantization weights of the data onto these features. In sound processing applications, the vectors y_{i} are real numbers corresponding to the frequencies of the spectrogram and a and b are their corresponding magnitudes. By computing the minimal transportation cost between frequencies of two spectrograms, optimal transport exhibits variations in accordance with the frequency noise involved in the signal generative process, which results for instance from the tuning of musical instruments or the subject’s condition in speech processing.
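To make the sensitivity to frequency shifts concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy setting with bins on a regular 1-D grid and the cost |i − j| (p = 1), for which the optimal transport cost is known to equal the ℓ_1 distance between cumulative sums. The helper names `ot_1d`, `euclidean`, and `dirac` are hypothetical.

```python
from itertools import accumulate
from math import sqrt

def ot_1d(a, b):
    """OT cost between histograms a and b on bins 0..n-1 with cost |i - j|.

    For this 1-D cost, the optimal cost is the L1 distance between the
    cumulative sums of a and b (both must carry the same total mass).
    """
    assert abs(sum(a) - sum(b)) < 1e-12, "histograms must have equal mass"
    return sum(abs(ca - cb) for ca, cb in zip(accumulate(a), accumulate(b)))

def euclidean(a, b):
    """Elementwise (Euclidean) distance, for comparison."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dirac(n, j):
    """Histogram with all its mass on bin j."""
    return [1.0 if i == j else 0.0 for i in range(n)]

n = 32
ref = dirac(n, 10)
small_shift = dirac(n, 11)  # mass moved by one bin
large_shift = dirac(n, 15)  # mass moved by five bins

# The Euclidean distance is identical for both shifts...
print(euclidean(ref, small_shift), euclidean(ref, large_shift))
# ...whereas the OT cost grows with the size of the shift.
print(ot_1d(ref, small_shift), ot_1d(ref, large_shift))
```

In this toy case the elementwise distance cannot tell a one-bin shift from a five-bin shift, while the transport cost scales with the shift.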
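Unnormalized optimal transport. In this work, we wish to define optimal transport when a and b are nonnegative but not necessarily normalized. Note that the transportation polytope is not empty as long as a and b sum to the same value: U(a,b)=∅ if and only if ∥a∥_{1}≠∥b∥_{1}. Hence, we define optimal transport between possibly unnormalized vectors a and b as
\[ \text {OT}(a,b) := \begin {cases} \min _{T \in U(a,b)} \langle C, T \rangle & \text {if } \|a\|_{1} = \|b\|_{1}, \\ +\infty & \text {otherwise.} \end {cases} \]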
Computing the optimal transport cost (1) amounts to solving a linear program (LP), which can be done with specialized versions of the simplex algorithm with worst-case complexity in \(\mathcal {O}\left (n^{3}\log n\right)\) when n=m [19]. When considering OT as a loss between histograms supported on more than a few hundred bins, such computation quickly becomes intractable. Moreover, using OT as a loss involves differentiating OT, which is not differentiable everywhere. Hence, one would have to resort to subgradient methods. This would be prohibitively slow since each iteration would require obtaining a subgradient at the current iterate, which requires solving the LP (1).
Entropy-regularized optimal transport. To remedy these limitations, [7] proposed to add an entropy regularization term to the optimal transport objective, thus making the OT loss differentiable everywhere and strictly convex. This entropy-regularized optimal transport has since been used in numerous works as a loss for diverse tasks (see for example [11, 12, 22]).
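Let γ>0. We define the (unnormalized) entropy-regularized OT between \({a}\in \mathbb {R}_{+}^{m},\, {b}\in \mathbb {R}_{+}^{n}\) as
\[ \text {OT}_{\gamma }(a,b) := \min _{T \in U(a,b)} \langle C, T \rangle - \gamma E(T), \]
where \(E(T) := -\sum _{ij}T_{ij}\log {T_{ij}}\) is the entropy of the transport plan T, with the convention that \(\text {OT}_{\gamma }(a,b) = +\infty \) when U(a,b) is empty. Let us denote \(\text {OT}_{\gamma }^{\star }\) the convex conjugate of OT_{γ} with respect to its second variable:
\[ \text {OT}_{\gamma }^{\star }(a,y) := \max _{b \in \mathbb {R}^{n}} \langle y, b \rangle - \text {OT}_{\gamma }(a,b). \]
Cuturi and Peyré [8] showed that its value and gradient can be computed in closed form:
\[ \text {OT}_{\gamma }^{\star }(a,y) = \gamma \left (E(a) + \langle a, \log K\alpha \rangle \right), \qquad \nabla _{y} \text {OT}_{\gamma }^{\star }(a,y) = \alpha \odot \left (K^{\top } \frac {a}{K\alpha }\right), \]
where K:=e^{−C/γ} and α:=e^{y/γ}, with exponentials, logarithms, and the division taken elementwise.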
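The closed form can be sketched in a few lines of NumPy. The sketch below is illustrative rather than the authors' implementation. It assumes the regularized objective is 〈C,T〉 + γ Σ T_{ij} log T_{ij}, under which the conjugate is γ(E(a) + 〈a, log Kα〉) with E(a) = −Σ a_i log a_i, and its gradient is α ⊙ K^{⊤}(a / Kα). The function name `ot_conjugate` and the toy sizes are hypothetical.

```python
import numpy as np

def ot_conjugate(a, y, C, gamma):
    """Value and gradient (w.r.t. y) of the conjugate of b -> OT_gamma(a, b).

    Assumes the entropic objective <C, T> + gamma * sum(T * log T).
    a: (m,) nonnegative weights, y: (n,) dual variable, C: (m, n) cost.
    """
    K = np.exp(-C / gamma)        # Gibbs kernel
    alpha = np.exp(y / gamma)
    Ka = K @ alpha                # shape (m,)
    pos = a > 0                   # convention 0 * log 0 = 0
    entropy = -np.sum(a[pos] * np.log(a[pos]))
    value = gamma * (entropy + np.sum(a[pos] * np.log(Ka[pos])))
    grad = alpha * (K.T @ (a / Ka))   # second marginal of the optimal plan
    return value, grad

# Toy problem (arbitrary sizes, random data).
rng = np.random.default_rng(0)
m, n, gamma = 5, 4, 0.5
C = rng.random((m, n))
a = rng.random(m)
y = rng.normal(size=n)

value, grad = ot_conjugate(a, y, C, gamma)
print(value, grad)
```

Each evaluation costs two matrix-vector products with K, and the gradient carries the same total mass as a.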
2.2 Optimal transport NMF
NMF can be cast as an optimization problem of the form
where both D and W are optimized at train time, and D is fixed at test time. When ℓ is OT, problem (2) is convex in W and D separately, but not jointly. It can be solved by alternating full optimization with respect to W and D. Each resulting subproblem is a very high-dimensional linear program with many constraints [23], which is intractable with standard LP solvers even for short sound signals. In addition, convergence proofs of such alternating minimization methods for NMF typically assume strictly convex subproblems (see e.g., [2, 30] Prop. 2.7.1), which is not the case when using nonregularized OT as a loss.
To address this issue, [22] proposed to use OT_{γ} instead and showed how to solve each subproblem in the dual using fast gradient computations. Formally, they tackle problems of the form:
where R_{1} and R_{2} are convex regularizers that enforce nonnegativity constraints, and Σ_{n} is the (n−1)-dimensional simplex.
It was shown that each subproblem of (3) with either D or W fixed has a smooth Fenchel-Rockafellar dual, which can be solved efficiently, leading to a fast overall algorithm. However, their definition of optimal transport requires inputs and reconstructions to have an ℓ_{1} norm equal to 1. This is achieved by normalizing the input beforehand, restricting the columns of D and W to the simplex, and using as regularizers negative entropies defined on the simplex:
where
They showed that the coefficients and dictionary can be updated according to the following duality results.
Coefficient update. For D fixed, the optimizer of \(\underset {W\in \Sigma _{k}^{t}}{\min } \sum \limits _{i=1}^{t} \text {OT}_{\gamma }({x}_{i},D{w}_{i}) + R_{1}({w}_{i})\) is
with
We can solve problem (5) with accelerated gradient descent [18] and recover the optimal weight matrix with the primal-dual relationship (4). The value and gradient of the convex conjugate of R with respect to its second variable are:
Dictionary update. For W fixed, the optimizer of \(\underset {D \in \Sigma _{m}^{k}}{\min }\sum \limits _{i=1}^{t} \text {OT}_{\gamma }({x}_{i}, D{w}_{i}) + \sum \limits _{i=1}^{k}R_{2}({D}_{i})\) is
with
Likewise, we can solve problem (7) with accelerated gradient descent and recover the optimal dictionary matrix with the primal-dual relationship (6).
These duality results allow us to go from a constrained primal problem, for which each evaluation of the objective and its gradient requires solving an optimal transport problem, to an unconstrained dual problem whose objective and gradient can be evaluated in closed form. The primal constraints ∥x_{i}∥_{1}=∥Dw_{i}∥_{1} and Dw_{i}≥0 ∀i are enforced by the primal-dual relationship. Moreover, the use of an entropy regularization, with γ>0, makes OT_{γ} smooth with respect to its second variable.
3 Method
We now present our approach for optimal transport BSS. First, we introduce the changes to [22] that are necessary for computing optimal transport NMF on STFT spectrograms of sound data. We then define a transportation cost between frequencies. Finally, we show how to reconstruct sound signals from the separated spectrograms.
3.1 Signal separation with NMF
We use a supervised BSS setting similar to the one described in [25]. For each source k, we have access to training data X^{(k)}, on which we learn a dictionary D^{(k)} with NMF
Then, given the STFT spectrum of a mixture of sources X, we reconstruct separated spectrograms X^{(k)}=D^{(k)}W^{(k)} for k=1,…,N, where the W^{(k)}s are the solutions of
The separated spectrograms \(\hat {X}^{(k)}\) are then reconstructed from each X^{(k)} with the process described in Section 3.4.
In practice at test time, the dictionaries are concatenated in a single matrix \(D=\left (D^{(k)}\right)_{k=1}^{N}\), and a single matrix of coefficients W is learned, which we decompose as \(W=\left (W^{(k)}\right)_{k=1}^{N}\). This allows us to focus on problems of the form
Voicevoice separation. We use the method described to separate the voices of two speakers on the same soundtrack. In this case, we have access to training data on each speaker.
Denoising with universal models. We can also use BSS to denoise speech data. In this case, we do not have access to training data for speakers in the test set. We only have access to data of other speakers, which we use to learn a “universal” voice model, as in [29]. We also have two sources, the first one being a speaker and the second one a noise source. Here, we are only interested in the reconstruction of the voice, that is \(\hat {X}^{(1)}\).
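The dictionary concatenation used at test time in this section can be illustrated with a short NumPy sketch. The matrices below are random stand-ins rather than learned dictionaries, and the sizes `k1`, `k2`, `n_bins`, `n_frames` are hypothetical. The only point is the bookkeeping: stack the per-source dictionaries column-wise, split the coefficient matrix row-wise, and recover one spectrogram per source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames = 6, 4
k1, k2 = 3, 2  # hypothetical dictionary sizes for the two sources

# Stand-ins for dictionaries learned during training.
D1 = rng.random((n_bins, k1))
D2 = rng.random((n_bins, k2))

# At test time, the dictionaries are concatenated into a single matrix...
D = np.hstack([D1, D2])

# ...and a single coefficient matrix is estimated (random stand-in here).
W = rng.random((k1 + k2, n_frames))

# Rows of W are split back into per-source blocks.
W1, W2 = np.split(W, [k1], axis=0)

# Separated spectrograms X^(k) = D^(k) W^(k).
X1 = D1 @ W1
X2 = D2 @ W2

# The full reconstruction is the sum of the per-source reconstructions.
print(np.allclose(X1 + X2, D @ W))
```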
3.2 Nonnormalized optimal transport NMF
Normalizing the columns of the input X, as in [22], is not a good option in the context of signal processing since frames with low amplitudes are typically noise and it would amplify them. Although this is not a problem for learning the coefficient matrix W, which is a column-independent process, it would increase the contribution of noise when learning the dictionary matrix D.
With our definition of optimal transport however, inputs are not required to be in the simplex, but only to have the same ℓ_{1} norm. With this definition, the convex conjugate OT^{⋆} of OT and its gradient still have the same value as in [8], and we can simply relax the constraint on W to be W≥0 in problem (3). We keep a simplex constraint on the columns of the dictionary D so that each update is guaranteed to stay in a compact set. We use R_{1}=−ρ_{1}E, a negative entropy defined on the nonnegative orthant, as the coefficient matrix regularizer, and for R_{2}, we keep the negative entropy defined on the simplex. The problem then becomes
This change of constraints yields the same dictionary update as in Section 2.2, Eq. 6. However, the coefficient updates need to be modified as follows.
Theorem 1
(coefficient update) For D fixed, the optimizer of
is \(W^{*} =\left (e^{D^{\top } {g}_{i}^{*}/\rho _{1} - 1}\right)_{i=1}^{t}\), with
Proof
The terms in the sum are independent across the columns of X and W. Let us thus solve the problem separately for each column. Let 1≤i≤t, the problem is
Its Fenchel dual is
OT_{γ}(x_{i},·) and R_{1} are proper convex and continuous. Moreover, \(\text {dom}\: \text {OT}_{\gamma }^{\star }({x}_{i},\cdot) = \text {dom} R_{1}^{\star } = \mathbb {R}^{k}\) so \(D^{\top }\text {dom} \:\text {OT}_{\gamma }^{\star }({x}_{i},\cdot) = \text {Im} D^{\top }\) and
These conditions are sufficient for strong duality to hold, with the primal-dual relation \({w}^{*} \in \nabla R_{1}^{\star }\left (D^{\top } {g}\right)\) ([21], Example 11.41). □
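The convex conjugate of R_{1} and its gradient can be evaluated with:
\[ R_{1}^{\star }(x) = \rho _{1} \sum _{i} e^{x_{i}/\rho _{1} - 1}, \qquad \nabla R_{1}^{\star }(x) = e^{x/\rho _{1} - 1}, \]
where the exponential is taken elementwise.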
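As an illustration of how the coefficient update could be evaluated for one column, the sketch below assembles a dual objective and its gradient in NumPy. It rests on several assumptions that are not confirmed by the text: the sign convention is chosen so that the primal solution reads w = e^{D^{⊤}g/ρ_{1} − 1}, which gives the dual objective OT_{γ}^{⋆}(x, −g) + R_{1}^{⋆}(D^{⊤}g); x is assumed strictly positive; and the names `ot_conj`, `r1_conj`, and `dual_objective` are hypothetical. Any smooth first-order solver could then be applied to this function.

```python
import numpy as np

def ot_conj(x, y, K, gamma):
    """Conjugate of OT_gamma(x, .) at y and its gradient (assumes x > 0)."""
    alpha = np.exp(y / gamma)
    Ka = K @ alpha
    value = gamma * np.sum(x * (np.log(Ka) - np.log(x)))
    grad = alpha * (K.T @ (x / Ka))
    return value, grad

def r1_conj(z, rho):
    """Conjugate of R1 = rho * sum(w log w) on w >= 0, and its gradient."""
    w = np.exp(z / rho - 1.0)
    return rho * w.sum(), w

def dual_objective(g, x, D, K, gamma, rho):
    """J(g) = OT*_gamma(x, -g) + R1*(D^T g), its gradient, and primal w.

    Sign convention assumed here so that w = exp(D^T g / rho - 1).
    """
    ot_val, ot_grad = ot_conj(x, -g, K, gamma)
    r_val, w = r1_conj(D.T @ g, rho)
    grad = -ot_grad + D @ w   # vanishes when D w matches the plan's marginal
    return ot_val + r_val, grad, w

# Toy problem: m input bins, n reconstruction bins, k dictionary atoms.
rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
gamma, rho = 0.5, 1.0
K = np.exp(-rng.random((m, n)) / gamma)
x = rng.random(m) + 0.1
D = rng.random((n, k))
D /= D.sum(axis=0)            # dictionary columns on the simplex
g = rng.normal(size=n)

value, grad, w = dual_objective(g, x, D, K, gamma, rho)
print(value, grad, w)
```

At a stationary point the gradient condition reads Dw = ∇OT_{γ}^{⋆}(x, −g), so the reconstruction Dw is nonnegative and has the same ℓ_{1} norm as x without any explicit constraint.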
3.3 Cost matrix design
In order to compute optimal transport on spectrograms and perform NMF, we need a cost matrix C, which represents the cost of moving weight from frequencies in the original spectrogram to frequencies in the reconstructed spectrogram. Schmidt and Olsson [25] use the mel scale to quantize spectrograms, relying on the fact that the perceptual difference between frequencies is smaller for the high frequency than for the low frequency domain. Following the same intuition, we propose to map frequencies to a log domain and apply a cost function in that domain. Let f_{j} be the frequency of the jth bin in an input data spectrogram, where 1≤j≤m. Let \(\hat {f}_{\hat {j}}\) be the frequency of the \(\hat {j}\)th bin in a reconstruction spectrogram, where \(1 \le \hat {j} \le n\). We define the cost matrix \(C \in \mathbb {R}^{m \times n}\) as
with parameters λ≥0 and p>0. Since the mel scale is a log scale, it is included in this definition for some parameter λ. Some illustrations of our cost matrix for different values of λ are shown in Fig. 2, with p=0.5. It shows that with our definition, moving weights locally is less costly for high frequencies than low ones and that this effect can be tuned by selecting λ.
Figure 3 shows the effect of p on the learned dictionaries. Using p=0.5 yields a cost that is more spiked, leading to dictionary elements that can have several spikes in the same frequency bands, whereas p≥1 tends to produce smoother dictionary elements.
Note that with this definition and p≥1, C is a distance matrix to the power p when the source and target frequencies are the same, so that OT(·,·)^{1/p} is a distance (Section 2.1). If p=0.5, C is the pointwise square root of a distance matrix and as such is a distance matrix itself.
Parameters p=0.5 and λ=100 yielded better results for blind source separation on the validation set and were accordingly used in all our experiments.
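The following NumPy sketch builds a cost matrix of this kind. Because the defining equation is not reproduced in this text, the sketch assumes the form c_{j,ĵ} = |log(λ + f_{j}) − log(λ + f̂_{ĵ})|^{p}, which is consistent with the description (a log-domain mapping with parameters λ and p) but may differ from the authors' exact definition. The function name `cost_matrix` is hypothetical; the frequency grid follows the preprocessing in Section 4.1.

```python
import numpy as np

def cost_matrix(f_in, f_rec, lam=100.0, p=0.5):
    """Log-domain cost between two frequency supports (assumed form).

    C[j, jh] = |log(lam + f_in[j]) - log(lam + f_rec[jh])| ** p
    """
    u = np.log(lam + np.asarray(f_in, dtype=float))
    v = np.log(lam + np.asarray(f_rec, dtype=float))
    return np.abs(u[:, None] - v[None, :]) ** p

# Frequency grid for a 16 kHz signal and a window of 1024 samples,
# with the constant (0 Hz) coefficient removed: 512 bins up to 8 kHz.
sample_rate, window_size = 16000, 1024
freqs = np.arange(1, window_size // 2 + 1) * sample_rate / window_size

C = cost_matrix(freqs, freqs, lam=100.0, p=0.5)

# Moving mass to the neighboring bin is cheaper at high frequencies.
print(C[0, 1], C[-2, -1])
```

Because the function takes two frequency supports, the same code yields a rectangular cost matrix when the input and reconstruction grids differ.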
3.4 Postprocessing
Wiener filter. In the case where the reconstruction is in the same frequency domain as the original signal, the classical way to recover each voice in the time domain is to apply a Wiener filter. Let X be the original Fourier spectrum, X^{(1)} and X^{(2)} the separated spectra such that X≈X^{(1)}+X^{(2)}. The Wiener filter builds \(\hat {X}^{(1)} = X\odot \frac {X^{(1)}}{X^{(1)} + X^{(2)}}\) and \(\hat {X}^{(2)} = X\odot \frac {X^{(2)}}{X^{(1)} + X^{(2)}}\), before applying the original spectra’s phase and performing the inverse STFT.
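The Wiener filter described above can be sketched directly from its formula. In the NumPy snippet below, the small constant `eps` guarding against division by zero and the function name `wiener_filter` are additions for illustration; the data are random stand-ins.

```python
import numpy as np

def wiener_filter(X, X1, X2, eps=1e-12):
    """Split the magnitude spectrogram X according to the ratios
    X1 / (X1 + X2) and X2 / (X1 + X2), elementwise."""
    total = X1 + X2 + eps   # eps (an addition) avoids division by zero
    return X * X1 / total, X * X2 / total

rng = np.random.default_rng(0)
n_bins, n_frames = 8, 5
X_complex = rng.normal(size=(n_bins, n_frames)) + 1j * rng.normal(size=(n_bins, n_frames))
X = np.abs(X_complex)                 # magnitude spectrogram of the mixture
X1 = rng.random((n_bins, n_frames))   # stand-in for D^(1) W^(1)
X2 = rng.random((n_bins, n_frames))   # stand-in for D^(2) W^(2)

X1_hat, X2_hat = wiener_filter(X, X1, X2)

# Re-apply the phase of the original spectrum before the inverse STFT.
phase = np.exp(1j * np.angle(X_complex))
S1, S2 = X1_hat * phase, X2_hat * phase

print(np.allclose(S1 + S2, X_complex))
```

The filtered components add up to the original mixture, which is what allows each source to be resynthesized by inverse STFT.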
Generalized filter. We propose to extend this filtering to the case where X^{(1)} and X^{(2)} are not in the same domain as X. This may happen for example if the test data is recorded using a different sampling frequency, or if the STFT is performed with a different time window than the train data. In such a case, D^{(1)} and D^{(2)} are in the domain of the train data and so are X^{(1)} and X^{(2)}, but X is in a different domain, and its coefficients correspond to different sound frequencies. As such, we cannot use Wiener filtering.
Instead, we propose to use the optimal transportation matrices to produce separated signals \(\hat {X}^{(1)}\) and \(\hat {X}^{(2)}\) in the same domain as X. Let \(T_{(i)} \in \underset {\Pi \in U\left ({x}_{i}, {x}_{i}^{(1)}+ {x}_{i}^{(2)}\right)}{\arg \min } \langle {C}, {\Pi }\rangle \). With Wiener filtering, x_{i} is decomposed into its components generated by \({x}_{i}^{(1)}\) and \({x}_{i}^{(2)}\). We use the same idea and separate the transport matrix T_{(i)} into:
\(T^{(1)}_{(i)} \left (\text {resp. } T^{(2)}_{(i)}\right)\) is a transport matrix between \(\frac {{x}_{i}^{(1)}}{{x}_{i}^{(1)}+ {x}_{i}^{(2)}} \left (\text {resp. } \frac {{x}_{i}^{(2)}}{{x}_{i}^{(1)}+ {x}_{i}^{(2)}}\right)\) and \(\hat {{x}_{i}}^{(1)} \left (\text {resp. } \hat {{x}_{i}}^{(2)}\right)\), where
Similar to the classical Wiener filter, we have
Because of this property, the couple \(\left (\hat {{x}_{i}}^{(1)},\hat {{x}_{i}}^{(2)}\right)\) is a fixed point of the Wiener filter.
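A rough NumPy sketch of a filter of this kind for a single time frame is given below. It makes several assumptions that the text does not confirm: the transport plan is split by scaling its columns with the Wiener ratios, the separated frames are the row sums of the resulting matrices, the plan is approximated with Sinkhorn iterations (an entropy-regularized substitute for the exact arg min), and the cost takes the assumed log-domain form |log(λ + f) − log(λ + f̂)|^{p}. All function names are hypothetical.

```python
import numpy as np

def sinkhorn_plan(a, b, C, gamma, n_iter=200):
    """Approximate transport plan between a and b (entropic regularization)."""
    K = np.exp(-C / gamma)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    u = a / (K @ v)          # finish on the row update so that T @ 1 = a
    return u[:, None] * K * v[None, :]

def generalized_filter(x, x1, x2, C, gamma=0.1, eps=1e-12):
    """Map x1 and x2 (dictionary domain) into the domain of x (assumed split)."""
    s = x1 + x2
    T = sinkhorn_plan(x, s, C, gamma)
    r1 = x1 / (s + eps)      # Wiener ratio of source 1, per dictionary bin
    x1_hat = T @ r1          # row sums of T diag(r1)
    x2_hat = T @ (1.0 - r1)  # row sums of T diag(1 - r1)
    return x1_hat, x2_hat

# Cross-domain toy frame: test window of 800 samples, dictionary window of 512.
sample_rate = 16000
f_test = np.fft.rfftfreq(800, d=1 / sample_rate)[1:]   # 400 bins
f_dict = np.fft.rfftfreq(512, d=1 / sample_rate)[1:]   # 256 bins
C = np.abs(np.log(100 + f_test)[:, None] - np.log(100 + f_dict)[None, :]) ** 0.5

rng = np.random.default_rng(0)
x = rng.random(f_test.size) + 0.1
x1, x2 = rng.random(f_dict.size), rng.random(f_dict.size)
scale = x.sum() / (x1 + x2).sum()       # equal mass, so a plan exists
x1, x2 = scale * x1, scale * x2

x1_hat, x2_hat = generalized_filter(x, x1, x2, C)
print(x1_hat.shape, np.allclose(x1_hat + x2_hat, x))
```

Under these assumptions the two filtered frames live on the test frequency grid and sum to the input frame, mirroring the property of the classical Wiener filter.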
Separated signal reconstruction. Separated sounds are reconstructed by inverse STFT after applying either the Wiener filter or the generalized filter to X^{(1)} and X^{(2)}.
4 Results
In this section, we present the main empirical findings of this paper. We start by describing the dataset that we used and the preprocessing we applied to it. We then show that the optimal transport loss allows us to have perceptually good reconstructions of single voices, even with few dictionary elements. We show that the optimal transport loss yields comparable results to other classical losses for voice-voice BSS with an NMF model. We also show that our generalized filter yields very similar results to the Wiener filter in the single-domain setting and can improve upon it in the cross-domain setting. Finally, we show that the optimal transport loss improves upon these other losses when using a universal voice model for voice denoising.
4.1 Dataset and preprocessing
Voice data. We evaluate our method on the English part of the Multi-Lingual Speech Database for Telephonometry 1994 dataset^{Footnote 1}. The data consists of recordings of the voices of four males and four females, each pronouncing 24 different English sentences. We split each person’s audio file time-wise into 25% training and 75% test data.
Noise data. For the speech denoising experiment, we consider four types of noise: cicadas, drums, subway, and sea. For each, we gathered one file for training and one file for testing from non-copyrighted sources on the internet^{Footnote 2}. We trimmed the training files so that they are approximately 20 s long and made sure that test files were longer than the voice test sounds. Note that for each noise type, the training and testing files were gathered using the same keywords, but can still have quite a bit of variability.
Preprocessing. All sound files are resampled to 16 kHz and treated as mono signals. The signals are analyzed by STFT with a Hann window and a window size of 1024, leading to 513 frequency bins ranging from 0 to 8 kHz. The constant coefficient is removed from the NMF analysis and added for reconstruction in post-processing.
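The bin count quoted above follows from the standard one-sided STFT grid, where bin k of a window of n samples at sampling rate s sits at k·s/n Hz for k = 0, …, n/2. A small pure-Python check (the helper name `stft_frequencies` is hypothetical):

```python
def stft_frequencies(sample_rate, window_size):
    """Center frequencies (Hz) of the one-sided STFT bins."""
    return [k * sample_rate / window_size for k in range(window_size // 2 + 1)]

freqs = stft_frequencies(16000, 1024)
print(len(freqs), freqs[0], freqs[-1])   # 513 0.0 8000.0

# Dropping the constant (0 Hz) coefficient leaves 512 bins for the NMF.
nmf_freqs = freqs[1:]
print(len(nmf_freqs), nmf_freqs[0])      # 512 15.625
```

The same helper gives 257 bins for a window of 512 samples and 401 bins for a window of 800 samples, which are the two settings used in the cross-domain experiment.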
Parameter selection. Hyperparameters are selected on validation data consisting of the first male and female voices, which are excluded from the evaluation set. We choose the parameters which yield the best SDR score in the voice-voice BSS experiment for these voices. We also use these voices as the training data for the universal voice model.
Initialization. Initialization is performed by setting each value of the dictionary matrix to a random number picked uniformly in [0,1]. It would be possible to set each dictionary column to the optimal transport barycenter (computed for example with [1]) of all the time frames of the training data, and to add Gaussian noise (separately for each column). However, we did not notice a significant improvement with this initialization, and we only report here the scores with completely random initialization so that the results are comparable to the other methods. When training a model for any loss, we perform the NMF four times and keep the model with minimum training loss to reduce the impact of random initialization.
4.2 NMF audio quality
We first show that using an optimal transport loss for NMF leads to better perceptual reconstruction of voice data. To that end, we evaluated the PEMO-Q score [13] of isolated test voices.
Personal voice model. Figure 4 shows the mean and standard deviation of the scores for k∈{5, 10, 15, 20} with optimal transport (OT), Kullback-Leibler (KL), Itakura-Saito (IS), or Euclidean (E) NMF. In this setting, the dictionaries are learned separately on the training data for each voice. These dictionaries are the same as in the following single-domain voice-voice separation experiment. The PEMO-Q score of optimal transport NMF is higher for any value of k, although KL and IS results are still competitive. We found empirically that other scores such as SDR or SNR tend to be better for the Euclidean NMF, even though the reconstructed voices are clearly worse when listening to them (see Additional files 1 and 2). Optimal transport can reconstruct clear and intelligible voices with as few as five dictionary elements.
Universal voice model. Figure 5 shows the mean and standard deviation of the scores for k∈{5, 10, 15, 20} with optimal transport, Kullback-Leibler, Itakura-Saito, or Euclidean NMF, in the universal voice model setting. Here, only one dictionary is learned for all voices, with the training data of our validation voices. We kept this dictionary for the speech denoising experiment. The PEMO-Q score of optimal transport NMF is significantly higher for any value of k. We believe that because optimal transport compares spectrograms by looking at the optimal flow between their frequencies, the variation of pitch between two speakers becomes less important than the overall patterns of human voices. Indeed, the scores with optimal transport are very similar whether we use a universal or a personal voice model, whereas they drop significantly for the other losses when using a universal model.
4.3 Voice-voice blind source separation
We evaluate our blind source separation using the classical signal-to-distortion ratio (SDR) scores evaluated on reconstructed audio files using the MATLAB toolbox BSS eval v2.1 [32].
Single-domain blind source separation. We first use NMF to perform BSS in the case of mixtures of two voices, where we have training data for each voice. Here, the spectrograms of the training and test data represent the same frequencies: both the training and test data are processed in exactly the same way, so that at train and test time \((f_{i})_{i} = \left (\hat {f}_{i}\right)_{i}\). We compare using the optimal transport loss for NMF to the Kullback-Leibler divergence, the Itakura-Saito divergence, or the Euclidean distance. For baseline methods, we reconstruct the signal using a Wiener filter before applying inverse STFT. For optimal transport-based source separation, we evaluate separation using either the Wiener filter or our generalized filter.
Figure 6 shows mean and standard deviation of the SDR, SIR, and SAR scores for each method. We can see that although KL NMF achieves a better SDR score, the variability is actually high and the results are comparable for all methods.
Cross-domain blind source separation. In this experiment, we artificially generate spectrograms which represent different frequencies for the training and test data by simply changing the STFT window size. We use a window of size 512 for the training data and a window of size 800 for the test data.
Although \((f_{i})_{i} \neq \left (\hat {f}_{i}\right)_{i}\), we can still compute optimal transport between the spectrograms, thanks to our cost matrix, and thus, we can use the trained dictionary as is to compute the weight matrix at test time.
In order to compute the weight matrix for the other losses however, we first need to requantize the dictionary matrix so that it represents the same frequencies as the test data. We do it by assigning each frequency in the smaller spectrogram to its closest frequency in the larger one. This can be done with the simple linear operation D←AD with
Figure 7 shows mean and standard deviation of the SDR, SIR, and SAR scores for each method. In the case of the optimal transport loss, we report both the result with the generalized filter, and the Wiener filter applied to AX^{(k)}. We can see that the SDR scores have dropped a lot, except with the optimal transport loss combined with our generalized filter. We notice a similar effect on the signal-to-artifact ratio (SAR), meaning that the separation process has created artifacts, which are actually very noticeable when listening to the reconstructed sound, except when using the generalized filter. This is probably due to the fact that the heuristic mapping process cancels a lot of frequencies which were in the test data.
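The nearest-frequency requantization used for the baseline losses can be sketched as follows. This is one plausible reading of the description ("assigning each frequency in the smaller spectrogram to its closest frequency in the larger one"), not the authors' code; the function name `nearest_frequency_map` is hypothetical and ties are broken by taking the lower frequency.

```python
import numpy as np

def nearest_frequency_map(f_dict, f_test):
    """Binary matrix A such that A @ D moves each dictionary row to the
    test-domain bin whose frequency is closest (assumed construction)."""
    A = np.zeros((len(f_test), len(f_dict)))
    # For every dictionary frequency, index of the closest test frequency.
    nearest = np.abs(f_test[:, None] - f_dict[None, :]).argmin(axis=0)
    A[nearest, np.arange(len(f_dict))] = 1.0
    return A

sample_rate = 16000
f_dict = np.fft.rfftfreq(512, d=1 / sample_rate)   # training grid, 257 bins
f_test = np.fft.rfftfreq(800, d=1 / sample_rate)   # test grid, 401 bins

A = nearest_frequency_map(f_dict, f_test)

rng = np.random.default_rng(0)
D = rng.random((len(f_dict), 5))    # stand-in dictionary with 5 atoms
D_test = A @ D                      # requantized dictionary, shape (401, 5)

# Test-domain bins that receive no dictionary row stay at zero.
empty_rows = int((A.sum(axis=1) == 0).sum())
print(D_test.shape, empty_rows)
```

With these two grids, a sizeable share of the test-domain bins receives no mass from the dictionary, which is in line with the remark that the heuristic mapping cancels many frequencies present in the test data.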
4.4 Universal voice model for speech denoising
Setting. We now use NMF to first learn a universal speech model and noise models and then apply these models for speech denoising. The universal speech model is learned on the concatenated training data of the first male and first female voices of our dataset. For each noise type, we learn a model with NMF on its training data. We then mix test voices with test noise with a pSNR of 0 and use our BSS approach to separate the voice. All the scores reported are evaluated on the voices only since reconstruction of the noise is not our goal here.
In this experiment, we kept the same parameters for the cost matrix of optimal transport as those selected in the voice-voice BSS experiment. We report the scores for each dictionary size k in {5,10,15,20}.
Results. Tables 1, 2 and 3 show the SDR, SIR and SAR scores with their standard deviation for all methods and all noise types. We can see from Tables 1 and 2 that optimal transport yields significantly better SDR and SIR than other methods for all noises except “sea.” This is consistent with our observation that the optimal transport loss allows good reconstruction with a universal model.
Dictionaries. Figures 8 and 9 show the dictionaries learned for the universal voice model and the cicada noise, respectively, with all losses and a dictionary size of 5 and 10. The dictionaries learned with optimal transport tend to be smoother and possibly have less overlap between dictionary elements. They seem to have high activation on bands, rather than isolated frequencies, and each dictionary element has only a few bands with high activation. The IS loss seems to induce a similar effect to a lesser extent, while the dictionaries learned with the KL and even more so the Euclidean loss tend to be spiked, with a lot of spikes for the same dictionary element, and more redundancy between elements.
Running times. Our implementation of the method in Python with numpy on 3 CPU cores of 2.93 GHz takes about 3 min to fully learn a dictionary of 5 elements on the cicada training data, which is about 20 s long, leading to spectrograms in \(\mathbb {R}^{512~\times ~724}\). Test times are around 2 min for sound files of around 50 s, which is not real time but close. We used rather tight convergence criteria in these experiments, and we believe that these times could be reduced by using better hardware (multicore, GPUs) and looser convergence criteria. For comparison, computing times for the KL loss, with a similar alternating minimization scheme (with inner optimizations performed with the multiplicative updates of [15]) and the same convergence criteria, are about 50 s for training and about 20 s for testing.
5 Discussion
Regularization of the transport plan. In this work, we considered entropy-regularized optimal transport as introduced by [7]. This yields an easy-to-solve dual problem, since the convex conjugate of the regularized loss is smooth and can be computed in closed form. However, any convex regularizer would yield the same duality results and could be considered as long as its conjugate is computable. For instance, squared L^{2} norm regularization was considered in several recent works [4, 26] and was shown to have desirable properties such as better numerical stability or sparsity of the optimal transport plan. Moreover, as with entropic regularization, the convex conjugate and its gradient can be computed in closed form [4].
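To make the closed-form claim concrete, here is a minimal sketch of the conjugate of the entropy-regularized loss with respect to its first argument, together with its gradient. It assumes the convention OT_γ(p, q) = min_T ⟨T, C⟩ − γE(T) over plans T with marginals p and q, where E(T) = −Σ T_{ij} log T_{ij}; the paper's exact normalization may differ by a constant. Under that assumption, with K = exp(−C/γ) and α = exp(g/γ), the conjugate at g is γ(E(q) + ⟨q, log(K^T α)⟩) and its gradient is α ⊙ K(q / (K^T α)). The function name `ot_conjugate` and the toy data are hypothetical.

```python
import numpy as np

def ot_conjugate(g, q, C, gamma):
    """Value and gradient at g of the conjugate of p -> OT_gamma(p, q).

    g: dual variable, shape (n,).  q: target histogram, shape (m,).
    C: cost matrix, shape (n, m).  gamma: regularization strength > 0.
    """
    A = (g[:, None] - C) / gamma                # A[i, j] = (g_i - C_ij) / gamma
    A_max = A.max(axis=0)                       # shift for a stable log-sum-exp
    E = np.exp(A - A_max)
    col_sum = E.sum(axis=0)
    lse = A_max + np.log(col_sum)               # equals log (K^T alpha)_j
    pos = q > 0
    entropy = -np.sum(q[pos] * np.log(q[pos]))  # E(q), with 0 log 0 = 0
    value = gamma * (entropy + np.dot(q, lse))
    grad = (E / col_sum) @ q                    # alpha * (K @ (q / (K^T alpha)))
    return value, grad

# Toy problem with random costs and a random target histogram.
rng = np.random.default_rng(0)
n, m, gamma = 5, 4, 0.5
C = rng.random((n, m))
q = rng.random(m)
q /= q.sum()
g = rng.standard_normal(n)
value, grad = ot_conjugate(g, q, C, gamma)
```

The gradient is the row marginal of the plan that attains the maximum in the conjugate, so it is nonnegative and carries the same total mass as q. Both quantities cost one matrix-vector product, which is what makes the dual attractive compared with solving a transport problem per column.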
Learning procedure. Following the work of [22], we solved the NMF problem with an alternating minimization approach in which, at each iteration, a complete optimization is performed on either the dictionary or the coefficients. While this seems to work well in our experiments, it would be interesting to compare it with smaller-step approaches such as the multiplicative updates of [15]. Unfortunately, to our knowledge, such updates do not exist for the optimal transport loss, and gradient methods in the primal would be prohibitively slow since they involve solving t large optimal transport problems at each iteration.
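The difference between the two schemes can be sketched for the KL loss, where the multiplicative updates of [15] are available. The sketch below is an illustration only: it uses a fixed number of inner iterations instead of a convergence criterion, it leaves the factors unconstrained, and the names `kl_nmf`, `update_W`, `update_D`, `n_outer`, and `n_inner` are hypothetical. A large `n_inner` approximates a complete optimization of one factor before switching to the other, while `n_inner=1` gives the small-step scheme.

```python
import numpy as np

def kl_div(X, Y, eps=1e-12):
    """Generalized KL divergence between nonnegative arrays."""
    return np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y)

def update_W(X, D, W, n_inner, eps=1e-12):
    """Multiplicative KL updates on the coefficients with D held fixed."""
    for _ in range(n_inner):
        W = W * (D.T @ (X / (D @ W + eps))) / (D.sum(axis=0)[:, None] + eps)
    return W

def update_D(X, D, W, n_inner, eps=1e-12):
    """Multiplicative KL updates on the dictionary with W held fixed."""
    for _ in range(n_inner):
        D = D * ((X / (D @ W + eps)) @ W.T) / (W.sum(axis=1)[None, :] + eps)
    return D

def kl_nmf(X, k, n_outer=20, n_inner=50, seed=0):
    """Alternate between the two factors, recording the loss per outer step."""
    rng = np.random.default_rng(seed)
    D = rng.random((X.shape[0], k)) + 0.1
    W = rng.random((k, X.shape[1])) + 0.1
    losses = [kl_div(X, D @ W)]
    for _ in range(n_outer):
        W = update_W(X, D, W, n_inner)
        D = update_D(X, D, W, n_inner)
        losses.append(kl_div(X, D @ W))
    return D, W, losses

# Toy nonnegative data standing in for a magnitude spectrogram.
rng = np.random.default_rng(1)
X = rng.random((30, 40)) + 0.05
D_full, W_full, loss_full = kl_nmf(X, k=5, n_outer=20, n_inner=50)
D_small, W_small, loss_small = kl_nmf(X, k=5, n_outer=20, n_inner=1)
```

Both settings decrease the loss at every outer step; they differ in how much work is spent on one factor before the other is revisited, which is the trade-off the comparison suggested above would measure.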
5.1 Future work
Sparsity. Many works using NMF for sound processing add sparsity-inducing regularization to the NMF loss. This is usually achieved with an l1 regularization on the coefficient matrix W [16, 29]. We believe such sparsity would also benefit our approach, although l1 regularization cannot be applied directly. Indeed, we have constraints of the form ∥DW_{i}∥_{1}=∥X_{i}∥_{1}, and since all columns of D are in the simplex, this is equivalent to ∥W_{i}∥_{1}=∥X_{i}∥_{1}, so we already have a hard constraint on the l1 norm of each column of W. One solution to this problem is to use an “unbalanced” optimal transport loss [6, 11], for which the two inputs do not need to have the same total weight. To the best of our knowledge, unbalanced versions of optimal transport as defined in [6] do not have an easy-to-compute convex conjugate, but [12] casts unbalanced optimal transport as a regular optimal transport problem, and our approach should work with this loss.
Multichannel sound processing. To use our framework with multichannel sound input, the main issue is to define an optimal transport loss between multichannel spectrograms. A simple solution is to treat channels separately and sum the losses over the channels. A more interesting approach, in our opinion, would be to design a cost matrix that encodes the cost of moving power not only between frequencies but also between channels.
Optimal transport in other models. We believe optimal transport can improve upon other losses in many sound processing tasks, as long as the loss is evaluated between spectrograms. For instance, one can use a speech-denoising autoencoder as done by [14] and use the optimal transport loss with our proposed cost matrix on the reconstructed spectrograms. However, the simple linear model of NMF used in this paper allows us to have simple and easy-to-optimize duals. This is not the case with deep neural networks, and one would have to resort to more computationally involved primal gradient-based approaches as in [11] or [17].
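The equivalence between the two mass constraints follows from nonnegativity: for nonnegative vectors the l1 norm is the sum of the entries, so ∥DW_{i}∥_{1} = 1^T D W_{i} = (1^T D) W_{i} = 1^T W_{i} = ∥W_{i}∥_{1} whenever every column of D sums to one. A small numerical check, with hypothetical sizes and random data, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, k, n_frames = 8, 3, 5           # illustrative sizes only

D = rng.random((n_freq, k))
D /= D.sum(axis=0, keepdims=True)       # each column of D lies in the simplex
W = rng.random((k, n_frames))           # nonnegative coefficients

recon_mass = (D @ W).sum(axis=0)        # l1 norm of D W_i for each frame i
coef_mass = W.sum(axis=0)               # l1 norm of W_i for each frame i
```

The two vectors agree up to rounding, so once the reconstruction is required to carry the same mass as each input frame, the l1 norm of each coefficient column is fixed and an l1 penalty has nothing left to shrink.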
6 Conclusion
We showed that using an optimal transport-based loss can improve the performance of NMF-based models for voice reconstruction and separation tasks. We believe this is a first step towards using optimal transport as a loss for speech processing, possibly using more complicated models such as sparse NMF or deep neural networks. The versatility of optimal transport, which can compare spectrograms on different frequency domains, lets us use dictionaries on sounds that are not recorded or processed in the same way as the training set. This property could also be beneficial to learn common representations (e.g., dictionaries) for different datasets.
Notes
See the “Availability of data and materials” section.
Abbreviations
BSS: Blind source separation
E: Euclidean
IS: Itakura-Saito
KL: Kullback-Leibler
LP: Linear program
NMF: Nonnegative matrix factorization
OT: Optimal transport
SAR: Signal-to-artifact ratio
SDR: Signal-to-distortion ratio
SIR: Signal-to-interference ratio
SNR: Signal-to-noise ratio
STFT: Short-time Fourier transform
References
[1] J. D. Benamou, G. Carlier, M. Cuturi, L. Nenna, G. Peyré, Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 37(2), A1111–A1138 (2015).
[2] D. P. Bertsekas, Nonlinear programming (Athena Scientific, Belmont, 1999).
[3] J. Bigot, R. Gouet, T. Klein, A. López, et al., in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Institut Henri Poincaré, vol 53. Geodesic PCA in the Wasserstein space by Convex PCA, (2017), pp. 1–26.
[4] M. Blondel, V. Seguy, A. Rolet, in Artificial Intelligence and Statistics. Smooth and sparse optimal transport, (2018).
[5] E. Cazelles, V. Seguy, J. Bigot, M. Cuturi, N. Papadakis, Geodesic PCA versus log-PCA of histograms in the Wasserstein space. SIAM J. Sci. Comput. 40(2), B429–B456 (2018).
[6] L. Chizat, G. Peyré, B. Schmitzer, F. X. Vialard, Scaling algorithms for unbalanced optimal transport problems. Math. Comput. 87, 2563–2609 (2018). American Mathematical Soc.
[7] M. Cuturi, in Advances in Neural Information Processing Systems. Sinkhorn distances: lightspeed computation of optimal transport, (2013), pp. 2292–2300.
[8] M. Cuturi, G. Peyré, A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sci. 9(1), 320–343 (2016).
[9] C. Févotte, N. Bertin, J. L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009).
[10] R. Flamary, C. Févotte, N. Courty, V. Emiya, in Advances in Neural Information Processing Systems. Optimal spectral transportation with application to music transcription, (2016), pp. 703–711.
[11] C. Frogner, C. Zhang, H. Mobahi, M. Araya, T. A. Poggio, in Advances in Neural Information Processing Systems. Learning with a Wasserstein loss, (2015), pp. 2053–2061.
[12] A. Gramfort, G. Peyré, M. Cuturi, in International Conference on Information Processing in Medical Imaging. Fast optimal transport averaging of neuroimaging data (Springer, Sabhal Mor Ostaig, 2015), pp. 261–272.
[13] R. Huber, B. Kollmeier, PEMO-Q—a new method for objective audio quality assessment using a model of auditory perception. IEEE Trans. Audio Speech Lang. Process. 14(6), 1902–1911 (2006).
[14] T. Ishii, H. Komiyama, T. Shinozaki, Y. Horiuchi, S. Kuroiwa, Reverberant speech recognition based on denoising autoencoder. Interspeech, 3512–3516 (2013).
[15] D. D. Lee, H. S. Seung, Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing Systems. 14, 556–562 (2001).
[16] Y. Li, A. Cichocki, S. I. Amari, Analysis of sparse representation and blind source separation. Neural Comput. 16(6), 1193–1234 (2004).
[17] G. Montavon, K. R. Müller, M. Cuturi, Wasserstein training of restricted Boltzmann machines. Advances in Neural Information Processing Systems. 29, 3718–3726 (2016).
[18] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Dokl. 27(2), 372–376 (1983).
[19] J. Orlin, A polynomial time primal network simplex algorithm for minimum cost flows. Math. Program. 78(2), 109–129 (1997).
[20] G. Peyré, M. Cuturi, Computational optimal transport, (2017).
[21] R. T. Rockafellar, R. J. B. Wets, Variational analysis, vol 317 (Springer-Verlag, Berlin Heidelberg, 2009).
[22] A. Rolet, M. Cuturi, G. Peyré, in Artificial Intelligence and Statistics. Fast dictionary learning with a smoothed Wasserstein loss, (2016), pp. 630–638.
[23] R. Sandler, M. Lindenbaum, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Nonnegative matrix factorization with earth mover’s distance metric (IEEE, Miami, 2009), pp. 1873–1880.
[24] H. Sawada, H. Kameoka, S. Araki, N. Ueda, Multichannel extensions of nonnegative matrix factorization with complex-valued data. IEEE Trans. Audio Speech Lang. Process. 21(5), 971–982 (2013).
[25] M. N. Schmidt, R. K. Olsson, in Spoken Language Processing, ISCA International Conference on (INTERSPEECH). Single-channel speech separation using sparse nonnegative matrix factorization, (2006).
[26] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, M. Blondel, in Proceedings of the International Conference in Learning Representations. Large-scale optimal transport and mapping estimation, (2018).
[27] S. Shirdhonkar, D. Jacobs, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. Approximate earth mover’s distance in linear time (IEEE, Anchorage, 2008), pp. 1–8.
[28] P. Smaragdis, J. C. Brown, in Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. Nonnegative matrix factorization for polyphonic music transcription (IEEE, New Paltz, 2003), pp. 177–180.
[29] D. L. Sun, G. J. Mysore, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. Universal speech models for speaker independent single channel source separation (IEEE, Vancouver, 2013), pp. 141–145.
[30] J. A. Tropp, An alternating minimization algorithm for nonnegative matrix approximation, (2003).
[31] C. Villani, Topics in optimal transportation. Am. Math. Soc. 58 (2003).
[32] E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006).
Acknowledgements
The authors would like to thank Arnaud Dessein, who gave helpful insight on the cost matrix design.
Funding
The authors do not have any particular funding to declare.
Availability of data and materials
The voice dataset we used for our experiments is the English part of the Multi-Lingual Speech Database for Telephonometry 1994 dataset, distributed by NTT. Requests for the data are handled through the form at http://www.nttat.com/product/speech2002/. The noise sound files we used were gathered from https://freesound.org/. We provide them in a zip archive at the following link: https://drive.google.com/file/d/1A9A3Bfc6SvztZGA9wYcLJBCPzqWQUlx/view?usp=sharing.
Author information
Contributions
AR, VS, MB, and HS designed the research and wrote the paper. Experiments were performed by AR. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Work performed during an internship at NTT Communication Science Laboratories
Additional files
Additional file 1
Reconstruction with optimal transport NMF. This WAV file contains the reconstructed test sentences of the male validation voice with optimal transport NMF and a dictionary of rank 5 (five columns), where the dictionary was learned on the training sentences of the same voice. (WAV 2831 kb)
Additional file 2
Reconstruction with Euclidean NMF. This WAV file contains the reconstructed test sentences of the male validation voice with Euclidean NMF and a dictionary of rank 5 (five columns), where the dictionary was learned on the training sentences of the same voice. (WAV 2831 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Rolet, A., Seguy, V., Blondel, M. et al. Blind source separation with optimal transport nonnegative matrix factorization. EURASIP J. Adv. Signal Process. 2018, 53 (2018). https://doi.org/10.1186/s1363401805762