Abstract
Monaural speech separation has long been an interesting but challenging problem. Its goal is to recover a specific target speech signal from background interference, and it has traditionally been treated as a signal processing problem. In recent years, rapid advances in deep learning have led to major breakthroughs in speech separation. In this paper, we propose recurrent neural networks (RNNs) that integrate multiple nonlinear masking layers (NMLs) to learn a two-level estimation for speech separation. Experimental results show that the proposed model, “RNN + SMMs + 3 NMLs”, outperforms a baseline RNN without any mask on all of the SDR, SIR, and SAR indices, and also achieves much better SDR and SIR than an RNN using only the original deterministic time-frequency masks.
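The abstract describes an RNN whose output masks are produced through multiple nonlinear masking layers, with the masks then applied to the mixture spectrogram to give the source estimates (one plausible reading of “two-level estimation”: the masks are the first level, the masked source spectra the second). The following minimal PyTorch sketch illustrates that general structure only; the class name, layer sizes, and the interpretation of an NML as a stacked Linear + ReLU transform are illustrative assumptions, not the authors’ implementation.

import torch
import torch.nn as nn

class MaskingRNNSeparator(nn.Module):
    # Illustrative sketch of an RNN separator with stacked nonlinear
    # masking layers (NMLs); names and sizes are assumptions.
    def __init__(self, n_freq=513, hidden=300, n_nml=3):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        # "Nonlinear masking layers": nonlinear transforms refining the
        # representation from which the masks are predicted.
        self.nmls = nn.Sequential(*[
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(n_nml)
        ])
        self.mask_head = nn.Linear(hidden, 2 * n_freq)  # one mask per speaker

    def forward(self, mix_mag):
        # mix_mag: (batch, frames, n_freq) mixture magnitude spectrogram
        h, _ = self.rnn(mix_mag)
        h = self.nmls(h)
        masks = torch.sigmoid(self.mask_head(h))  # mask values in (0, 1)
        m1, m2 = masks.chunk(2, dim=-1)
        # Level 1: the estimated masks; level 2: the masked source spectra
        # obtained by applying them to the mixture.
        return m1 * mix_mag, m2 * mix_mag

# Example usage on a random non-negative "spectrogram":
model = MaskingRNNSeparator()
est1, est2 = model(torch.randn(4, 100, 513).abs())

For evaluation, SDR, SIR, and SAR are the standard BSS-eval metrics and can be computed on reference and estimated time-domain signals with, for example, mir_eval.separation.bss_eval_sources.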
Cite this paper
Zhang, J., Zhang, P. (2019). Deep Recurrent Neural Networks with Nonlinear Masking Layers and Two-Level Estimation for Speech Separation. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series. ICANN 2019. Lecture Notes in Computer Science, vol. 11730. Springer, Cham. https://doi.org/10.1007/978-3-030-30490-4_32