Abstract
Identifying and extracting the singing voice from musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNNs) have been used to estimate ‘ideal’ binary masks in carefully controlled cocktail-party speech separation problems. However, it is not yet known whether these methods generalize to discriminating voice from non-voice within musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separating vocal sounds from real-world musical mixtures. We contrast our DNN results with those of more traditional linear methods. Our approach may be useful for the automatic removal of vocal sounds from musical mixtures in ‘karaoke’-type applications.
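To make the masking step concrete, the sketch below (illustrative only, not the authors' code) shows how an ideal binary mask can be computed from known vocal and accompaniment magnitude spectrograms and applied to the mixture spectrogram; in the approach described above, a DNN is trained to predict such a mask from the mixture alone. All array names, shapes, and the additive-magnitude assumption are hypothetical.

```python
# Minimal sketch, assuming magnitude spectrograms are available for both sources.
import numpy as np

def ideal_binary_mask(vocal_mag, accomp_mag):
    """Mask is 1 in time-frequency bins where the vocal dominates, 0 elsewhere."""
    return (vocal_mag > accomp_mag).astype(float)

# Hypothetical magnitude spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
vocal_mag = rng.random((513, 100))
accomp_mag = rng.random((513, 100))
mixture_mag = vocal_mag + accomp_mag  # simplifying assumption: magnitudes add

mask = ideal_binary_mask(vocal_mag, accomp_mag)
estimated_vocal_mag = mask * mixture_mag  # masked mixture approximates the vocal
```

In practice, the masked mixture magnitude would be combined with the mixture phase and inverted (e.g. via an inverse STFT) to obtain an estimated vocal waveform.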
Acknowledgment
AJRS, GR and MDP were supported by grant EP/L027119/1 from the UK Engineering and Physical Sciences Research Council (EPSRC). Data and materials are available at doi:10.15126/surreydata.00807909.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Simpson, A.J.R., Roma, G., Plumbley, M.D. (2015). Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2015. Lecture Notes in Computer Science, vol. 9237. Springer, Cham. https://doi.org/10.1007/978-3-319-22482-4_50
