Abstract
In this paper, we report the results of the 2016 community-based Signal Separation Evaluation Campaign (SiSEC 2016). This edition comprises four tasks: three focus on the separation of speech and music audio recordings, while one concerns biomedical signals. We summarize these tasks and the performance of the submitted systems, and briefly discuss future trends for SiSEC.
Keywords
- Empirical Mode Decomposition
- Source Separation
- Nonnegative Matrix Factorization
- Deep Neural Network
- Ensemble Empirical Mode Decomposition
1 Introduction
Evaluating source separation algorithms is a challenging topic in its own right, as is finding appropriate datasets on which to train and evaluate separation systems. In this respect, the Signal Separation Evaluation Campaign (SiSEC) has played an important role. SiSEC has been held roughly every year and a half since 2008, in conjunction with the LVA/ICA conference. Its purpose is twofold.
The primary objective of SiSEC is to regularly report the progress of the source separation community, in order to serve as a reference for a comparison of as many methods as possible on the topic of source separation. This involves adapting both the evaluations and the metrics to current trends in the field.
The second important objective of SiSEC is to provide data that the community can use for the design and evaluation of new methods, even outside the scope of the campaign itself. These efforts have led to a noticeable, although moderate, impact of SiSEC in the community, as depicted in Fig. 1.
For the objective evaluation of source separation, two options are now widely accepted and were used for SiSEC 2016. First, the BSS Eval toolbox [3] features the signal-to-distortion ratio (SDR), the source-image-to-spatial-distortion ratio (ISR), the signal-to-interference ratio (SIR), and the signal-to-artifacts ratio (SAR). All are expressed in dB, with higher values indicating better separation. Second, the PEASS toolbox [4] was used in some tasks to provide four perceptually motivated criteria: the overall perceptual score (OPS), the target-related perceptual score (TPS), the interference-related perceptual score (IPS), and the artifact-related perceptual score (APS).
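To illustrate how such energy-ratio metrics behave, here is a minimal SDR-style score in Python. This is a simplified sketch: BSS Eval additionally allows a short distortion filter on the reference before computing the ratios, which is omitted here.

```python
import numpy as np

def sdr_score(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: energy of the target part of the estimate
    over the energy of everything else. (BSS Eval additionally allows a
    short distortion filter on the reference, omitted in this sketch.)"""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Orthogonal projection of the estimate onto the reference.
    alpha = (reference @ estimate) / (reference @ reference + eps)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((target @ target) / (error @ error + eps))

# Higher is better: a perfect estimate scores far above a noisy one.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * rng.standard_normal(len(t))
```

On this toy example, `sdr_score(clean, clean)` is very large (the error term is essentially zero), while `sdr_score(clean, noisy)` drops to a moderate value.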
This sixth SiSEC features the same UND and BGN tasks as proposed last year and summarized in Sects. 2 and 3, respectively. The BIO task presented in Sect. 4 is new. Finally, the MUS task presented in Sect. 5 features new data and accompanying software.
2 UND: Underdetermined-Speech and Music Mixtures
The datasets for the UND task are the same as those described in detail in [1]. The results presented here include those from previous editions, as well as a new contribution [14] that utilizes both generalized cross-correlation (GCC, [21]) and nonnegative matrix factorization (NMF, [22]). GCC was previously used for sound source localization in reverberant environments [23]. NMF is a well-known framework for many applications, especially source separation: for acoustic signals, NMF extracts spectral patterns (bases) and their activations (time-varying gains), and separation is achieved by clustering the bases and assigning them to sources. Wood et al. combined GCC with NMF to localize individual bases over time, such that they may be attributed to individual sources. Wood's algorithm took between 6 and 7 min per mixture on a dual 2.8 GHz Intel Xeon E5462 quad-core processor with 16 GB of RAM.
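To make the NMF step concrete, here is a minimal sketch of NMF with multiplicative updates under a Euclidean cost. The submitted system's exact cost function and initialization are not described above, so treat this purely as an illustration of extracting bases and activations from a magnitude spectrogram:

```python
import numpy as np

def nmf(V, n_bases, n_iter=200, eps=1e-9):
    """Basic NMF with multiplicative updates for the Euclidean cost.
    V (frequency x time) is approximated as W @ H: the columns of W are
    spectral bases, the rows of H their time-varying activations."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_bases)) + eps
    H = rng.random((n_bases, T)) + eps
    for _ in range(n_iter):
        # Standard Lee-Seung multiplicative update rules.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" with exact rank-2 structure: the factorization
# recovers it almost perfectly.
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 40))
W, H = nmf(V, n_bases=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In a separation system, each column of `W` would then be attributed to a source (in [14], via its GCC-based spatial origin), and the per-source spectrograms reconstructed from the corresponding partial sums.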
Comparing the results in Table 1, Wood's algorithm does not outperform the best performance achieved so far on this dataset. Further results for microphone spacings of 5 cm and 1 m with reverberation times of 130 ms and 250 ms may be found on the SiSEC 2016 website.
3 BGN: Two-Channel Mixtures of Speech and Real-World Background Noise
As for the UND task, the dataset for the task 'two-channel mixtures of speech and real-world background noise' (BGN) is the same as in SiSEC 2013 [1].
Three algorithms were submitted to the BGN task this year, as shown in Table 2. Duong’s method [24] is based on NMF with pre-trained speech and noise spectral dictionaries. Liu’s method performs Time Difference of Arrival (TDOA) clustering based on GCC-PHAT. Wood’s method [14] first applies NMF to the magnitude spectrograms of the mixture signals with channels concatenated in time. Each dictionary atom is then attributed to either the speech or the noise according to its spatial origin.
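The GCC-PHAT statistic used by Liu's method for TDOA estimation can be sketched as follows. This is the standard textbook formulation, not the submitted code; the recentering and peak-picking details are common implementation choices:

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the delay of x2 relative to x1 (in seconds) via GCC-PHAT.
    Whitening the cross-spectrum by its magnitude keeps only the phase,
    which sharpens the correlation peak in reverberant conditions."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Recenter the circular cross-correlation around lag zero.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# A white-noise signal and a copy delayed by 40 samples (2.5 ms at 16 kHz).
fs = 16000
rng = np.random.default_rng(2)
sig = rng.standard_normal(fs)
shift = 40
x2 = np.concatenate((np.zeros(shift), sig[:-shift]))
tdoa = gcc_phat(sig, x2, fs)            # close to shift / fs
```

Clustering such framewise TDOA estimates then lets each time-frequency bin be assigned to the nearest source, as in Liu's submission.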
Considering the results in Table 2, we can see that each method presents some advantages. Whereas Duong's method [24] clearly shows significant superiority on the BSS Eval metrics, this is much less clear when analyzing the PEASS perceptual scores: Wood's method [14] gives the best OPS and IPS scores, suggesting better overall and interference-related perceptual quality of the estimates, while on APS, Liu's method consistently produces results with few annoying artifacts. These contradictions illustrate the limitations of objective metrics, and a real perceptual evaluation seems necessary to draw further conclusions.
4 BIO: Separation of Biomedical Signals
Phonocardiography (PCG) is the recording of the sounds generated by the heart, and allows the evaluation of some of its vital functions. However, raw PCG recordings are not always directly exploitable because of ambient interference (e.g., speech, coughs, gastric noise). Consequently, the raw PCG must be denoised before interpretation. An example of a clean PCG is plotted in Fig. 2.
The aim of this challenge is to extract the heart activity from raw PCG recordings made with a single microphone held by a belt on the skin, in front of the heart. Sixteen sessions were recorded from three healthy participants under different conditions. The quality of the separation was evaluated with the BSS Eval toolbox: the SDR, SIR and SAR indexes were computed on sliding windows of 1 s with an overlap of 0.5 s, and only the indexes related to the heart sounds were retained.
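The sliding-window evaluation can be sketched as follows. This is simplified in the same way as before: it computes a plain energy-ratio SDR per window rather than BSS Eval's full decomposition:

```python
import numpy as np

def framewise_sdr(reference, estimate, fs, win=1.0, hop=0.5, eps=1e-12):
    """SDR in dB on sliding windows of `win` seconds spaced `hop` seconds
    apart (1 s windows with 0.5 s overlap, as in the BIO task).
    Simplified: plain energy ratio, not BSS Eval's full decomposition."""
    w, h = int(win * fs), int(hop * fs)
    scores = []
    for start in range(0, len(reference) - w + 1, h):
        r = reference[start:start + w]
        err = estimate[start:start + w] - r
        scores.append(10 * np.log10((r @ r + eps) / (err @ err + eps)))
    return np.array(scores)

# 3 s of signal at 1 kHz yields five 1 s windows with 0.5 s overlap.
fs = 1000
ref = np.sin(2 * np.pi * 5 * np.arange(3 * fs) / fs)
est = ref + 0.01                        # small constant estimation error
scores = framewise_sdr(ref, est, fs)
```

Reporting the distribution of such per-window scores (rather than a single global value) is what allows the box plots of Fig. 3.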
Two participants have submitted their results on this specific task:
- The first participant (Part. 1) proposed a method based on Empirical Mode Decomposition (EMD) combined with a Lempel-Ziv complexity measure to extract the denoised signal.
- The second participant (Part. 2) proposed a method that decomposes the signal using ensemble empirical mode decomposition (EEMD) and selects some of the resulting intrinsic mode functions (IMFs) to filter the signal. The estimated signal is then post-processed to reject spurious peaks based on the characteristics of PCG signals.
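Neither submission's exact IMF-selection rule is detailed above, but the general "keep some IMFs, sum them back" filtering step might look like the following sketch. The 20-150 Hz band and the dominant-frequency criterion are illustrative assumptions, and the `imfs` array would in practice come from an (E)EMD routine (e.g., the PyEMD package); here two toy components stand in for it:

```python
import numpy as np

def select_imfs(imfs, fs, band=(20.0, 150.0)):
    """Keep IMFs whose dominant frequency falls inside `band` and sum
    them back into a filtered signal. The band and the criterion are
    illustrative assumptions, not the submitted systems' actual rules."""
    kept = []
    freqs = np.fft.rfftfreq(imfs.shape[1], d=1.0 / fs)
    for imf in imfs:
        peak = freqs[np.argmax(np.abs(np.fft.rfft(imf)))]
        if band[0] <= peak <= band[1]:
            kept.append(imf)
    return np.sum(kept, axis=0) if kept else np.zeros(imfs.shape[1])

# Toy example: a 50 Hz "heart" component plus 1 kHz interference,
# standing in for two IMFs produced by (E)EMD.
fs = 4000
t = np.arange(fs) / fs
heart = np.sin(2 * np.pi * 50 * t)
noise = 0.5 * np.sin(2 * np.pi * 1000 * t)
imfs = np.stack([heart, noise])
denoised = select_imfs(imfs, fs)
```

In this toy case the interference component is rejected and only the low-frequency component survives.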
The results achieved by the submitted methods are shown in Fig. 3, which plots the distribution of SDR, SIR and SAR for the two participants as well as for the noisy data. The red line is the median, the box edges are the 25th and 75th percentiles, the whiskers extend to the most extreme values, and outliers are plotted as red crosses. In terms of SIR, i.e., noise rejection, Part. 2 performs slightly better than Part. 1: the average SIR improvements are 10.4 dB and 9.6 dB, respectively, while the average SIR of the noisy data is \(-3\) dB. Part. 2's method also yields better SDR and SAR than Part. 1's: average SDR gains of 5.7 dB and 1.4 dB, and average SAR of 5.5 dB and 0.5 dB, respectively. Interestingly, both participants proposed methods based on empirical mode decomposition.
5 MUS: Professionally-Produced Music Recordings
The MUS task aims to evaluate the performance of music separation methods. In SiSEC 2015 [2], a new dataset was introduced for this task, comprising 100 full-track songs of different musical styles and genres, divided into development and test subsets. This year, this dataset was further heavily remastered so that each track now features a set of four semi-professionally engineered stereo source images (bass, drums, vocals, and other), summing up to realistic mixtures. This corpus is called the Demixing Secret Database (DSD100), as a reference to the 'Mixing Secrets' Free Multitrack Download Library from which it was built. The duration of the songs ranges from 2 min 22 s to 7 min 20 s, with an average of 4 min 10 s.
Additionally, an accompanying software toolbox was developed in Matlab and Python that permits straightforward processing of the DSD100 dataset. This software is open source and was publicly released so as to allow the participants to run the evaluation themselves (see Note 3).
Similarly to previous SiSEC editions, MUS was the task attracting the most participants, with 24 systems evaluated. Due to page constraints, we cannot detail each method, but we encourage the interested reader to refer to the SiSEC 2016 website and to the references given therein.
Among the systems evaluated, 10 are blind methods: CHA [5], DUR [6], KAM [8], OZE [10], RAF [11,12,13], HUA [7], and JEO [28]. The other 14 are supervised methods exploiting variants of deep neural networks: GRA [27], KON [29], UHL [26], NUG [9], and the methods proposed by F.-R. Stöter (STO), consisting of variants of [25, 26] with various input representations. Finally, the evaluation also includes the scores of the Ideal Binary Mask (IBM), computed for the left and right channels independently.
Again due to space constraints, Fig. 4 shows the box plots of the vocals SDR only, over the whole DSD100 dataset, excluding the few 30 s excerpts for which the IBM method was ill-behaved (yielding NaN values for its SDR). More results may be found online, and for the first time in SiSEC, 30 s excerpts of all separated results are also available on the webpage dedicated to the results. The striking fact is that most of the proposed supervised systems considerably outperform the blind methods, a trend that is also noticeable on the SIR and SAR metrics. Systems like [26] that use additional data augmentation also seem to generalize better, resulting in a smaller gap between the Dev and Test sets.
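The Ideal Binary Mask used as the oracle baseline can be sketched as follows: each time-frequency bin of the mixture is kept iff the target source dominates it there, using the true source spectrograms (which is why it is an upper bound rather than a practical method). The STFT helper and the toy signals are illustrative; in the campaign the mask is computed per channel:

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Minimal STFT (Hann window, no padding), for illustration only."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def ideal_binary_mask(target_mag, other_mag):
    """Keep a time-frequency bin iff the target dominates there."""
    return (target_mag > other_mag).astype(float)

# Oracle separation on a toy mixture: mask the mixture spectrogram with
# the IBM computed from the true source spectrograms.
fs = 8000
t = np.arange(2 * fs) / fs
vocals = np.sin(2 * np.pi * 440 * t)
accomp = np.sin(2 * np.pi * 100 * t)
V, A, M = stft(vocals), stft(accomp), stft(vocals + accomp)
mask = ideal_binary_mask(np.abs(V), np.abs(A))
vocals_est = mask * M       # much closer to V than the raw mixture is
```

On excerpts where the target is entirely silent, the target energy in the SDR ratio vanishes, which is one way the NaN values mentioned above can arise.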
A Friedman test revealed a significant effect of the separation method on SDR (Dev: \(\chi ^2=1083.23, p < 0.0001\); Test: \(\chi ^2=1004.29, p < 0.0001\)). Inspired by recent studies [30], we also tested, for each pair of methods, whether the difference in performance was significant. A post-hoc pairwise comparison (Wilcoxon signed-rank test, two-tailed, Bonferroni corrected) is depicted in Fig. 5.
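This testing procedure can be sketched with SciPy on simulated scores. The data below are entirely synthetic (three hypothetical systems, with "dnn" constructed to be consistently better); only the statistical procedure mirrors the text:

```python
import numpy as np
from itertools import combinations
from scipy import stats

# Synthetic per-track SDRs for three hypothetical systems.
rng = np.random.default_rng(0)
n_tracks = 50
sdr = {
    "blind": rng.normal(3.0, 1.0, n_tracks),
    "nmf": rng.normal(4.0, 1.0, n_tracks),
    "dnn": rng.normal(7.0, 1.0, n_tracks),
}

# Omnibus Friedman test: does the choice of method affect SDR at all?
chi2, p = stats.friedmanchisquare(*sdr.values())

# Post-hoc: two-tailed Wilcoxon signed-rank tests on every pair of
# methods, with Bonferroni correction of the significance level.
pairs = list(combinations(sdr, 2))
alpha = 0.05 / len(pairs)
significant = {
    (a, b): stats.wilcoxon(sdr[a], sdr[b])[1] < alpha
    for a, b in pairs
}
```

The `significant` dictionary corresponds to one cell per method pair, which is exactly what a figure like Fig. 5 visualizes.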
These pairwise comparisons suggest that state-of-the-art music separation systems ought to feature multichannel modelling (introduced in NUG) and data augmentation (UHL). As shown by the best scores being obtained by UHL3, fusing different systems is also a promising idea.
6 Conclusion
In this paper, we reported on the different tasks and results of SiSEC 2016. This edition enjoyed good participation in the long-running tasks, as well as several novelties. Among those, a new task on biomedical signal processing was proposed this year, together with important improvements to the music separation dataset and its accompanying software.
In recent years, we have witnessed a very strong increase of interest in supervised methods for separation. A corresponding objective of SiSEC is to make it easier for machine learning practitioners to adapt learning algorithms to the task of source separation, widening the audience of this fascinating topic.
In the future, we plan to continue in this direction and focus on two important moves for SiSEC. First, the problem of quality assessment remains largely unsolved, and SiSEC should play a role in this respect. Second, facilitating the reproducibility and comparison of research is a challenge when methods involve large-scale machine learning systems. SiSEC will shortly host and distribute the separation results of various techniques along with datasets, to promote easy comparison with the state of the art.
Notes
- 3.
More info at github.com/faroit/dsdtools.
References
Ono, N., Koldovsky, Z., Miyabe, S., Ito, N.: The 2013 signal separation evaluation campaign. In: Proceedings of MLSP, pp. 1–6, September 2013
Ono, N., Rafii, Z., Kitamura, D., Ito, N., Liutkus, A.: The 2015 signal separation evaluation campaign. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds.) LVA/ICA 2015. LNCS, vol. 9237, pp. 387–395. Springer, Heidelberg (2015). doi:10.1007/978-3-319-22482-4_45
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. ASLP 14(4), 1462–1469 (2006)
Emiya, V., Vincent, E., Harlander, N., Hohmann, V.: Subjective and objective quality assessment of audio source separation. IEEE Trans. ASLP 19(7), 2046–2057 (2011)
Chan, T-S., Yeh, T-C., Fan, Z-C., Chen, H-W., Su, L., Yang, Y-H., Jang, R.: Vocal activity informed singing voice separation with the iKala dataset. In: Proceedings of ICASSP, pp. 718–722, April 2015
Durrieu, J.-L., David, B., Richard, G.: A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE J. Sel. Top. Sig. Process. 5(6), 1180–1191 (2011)
Huang, P., Chen, S., Smaragdis, P., Hasegawa-Johnson, M.: Singing-voice separation from monaural recordings using robust principal component analysis. In: Proceedings of ICASSP, pp. 57–60, March 2012
Liutkus, A., FitzGerald, D., Rafii, Z., Daudet, L.: Scalable audio separation with light kernel additive modelling. In: Proceedings of ICASSP, pp. 76–80, April 2015
Nugraha, A., Liutkus, A., Vincent, E.: Multichannel music separation with deep neural networks. In: Proceedings of EUSIPCO (2016)
Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. ASLP 20(4), 1118–1133 (2012)
Rafii, Z., Pardo, B.: REpeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Trans. ASLP 21(1), 71–82 (2013)
Liutkus, A., Rafii, Z., Badeau, R., Pardo, B., Richard, G.: Adaptive filtering for music/voice separation exploiting the repeating musical structure. In: Proceedings of ICASSP, pp. 53–56, March 2012
Rafii, Z., Pardo, B.: Music/voice separation using the similarity matrix. In: Proceedings of ISMIR, pp. 583–588, October 2012
Wood, S., Rouat, J.: Blind speech separation with GCC-NMF. In: Proceedings of Interspeech (2016)
Cho, J., Yoo, C.D.: Underdetermined convolutive BSS: Bayes risk minimization based on a mixture of super-Gaussian posterior approximation. IEEE Trans. Audio Speech Lang. Process. 23(5), 828–839 (2011)
Adiloglu, K., Vincent, E.: Variational Bayesian inference for source separation and robust feature extraction. Technical report, INRIA (2012). https://hal.inria.fr/hal-00726146
Hirasawa, Y., Yasuraoka, N., Takahashi, T., Ogata, T., Okuno, H.G.: A GMM sound source model for blind speech separation in under-determined conditions. In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) LVA/ICA 2012. LNCS, vol. 7191, pp. 446–453. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28551-6_55
Iso, K., Araki, S., Makino, S., Nakatani, T., Sawada, H., Yamada, T., Nakamura, A.: Blind source separation of mixed speech in a high reverberation environment. In: Proceedings of Hands-free Speech Communication and Microphone Arrays, pp. 36–39 (2011)
Cho, J., Choi, J., Yoo, C.D.: Underdetermined convolutive blind source separation using a novel mixing matrix estimation and MMSE-based source estimation. In: Proceedings of IEEE MLSP (2011)
Nesta, F., Omologo, M.: Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation. In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) LVA/ICA 2012. LNCS, vol. 7191, pp. 222–230. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28551-6_28
Knapp, C.H., Carter, G.C.: The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Sig. Process. 24(4), 320–327 (1976)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
Blandin, C., Ozerov, A., Vincent, E.: Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Sig. Process. 92(8), 1950–1960 (2012)
Duong, H.-T.T., Nguyen, Q.-C., Nguyen, C.-P., Tran, T.-H., Duong, N.Q.K.: Speech enhancement based on nonnegative matrix factorization with mixed group sparsity constraint. In: Proceedings of ACM International Symposium on Information and Communication Technology, pp. 247–251 (2015)
Stöter, F.-R., Liutkus, A., Badeau, R., Edler, B., Magron, P.: Common fate model for unison source separation. In: Proceedings of ICASSP (2016)
Uhlich, S., Porcu, M., Giron, F., Enenkl, M., Kemp, T., Takahashi, N., Mitsufuji, Y.: Improving music source separation based on deep neural networks through data augmentation and network blending (2017). Submitted to ICASSP
Grais, E., Roma, G., Simpson, A.J., Plumbley, M.: Single-channel audio source separation using deep neural network ensembles. In: Proceedings of AES 140, May 2016
Jeong, I.-Y., Lee, K.: Singing voice separation using RPCA with weighted l1-norm. In: Proceedings of LVA/ICA (2017)
Huang, P., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)
Simpson, A., Roma, G., Grais, E., Mason, R., Hummersone, C., Plumbley, M., Liutkus, A.: Evaluation of audio source separation models using hypothesis-driven non-parametric statistical methods. In: Proceedings of EUSIPCO (2016)
© 2017 Springer International Publishing AG
Liutkus, A. et al. (2017). The 2016 Signal Separation Evaluation Campaign. In: Tichavský, P., Babaie-Zadeh, M., Michel, O., Thirion-Moreau, N. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2017. Lecture Notes in Computer Science, vol 10169. Springer, Cham. https://doi.org/10.1007/978-3-319-53547-0_31