Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network

  • Andrew J. R. Simpson
  • Gerard Roma
  • Mark D. Plumbley
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9237)

Abstract

Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNNs) have been used to estimate ‘ideal’ binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for karaoke-type applications.
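As a rough illustration of the ideal binary mask mentioned in the abstract, the following sketch marks each time-frequency cell as vocal when the vocal magnitude exceeds the accompaniment magnitude, then applies a mask to the mixture. It is not the authors' implementation: the function names, the toy spectrograms, the 0.5 threshold, and the simplification that mixture magnitudes are the sum of source magnitudes are all assumptions made for the example.

```python
def ideal_binary_mask(vocal_mag, accomp_mag):
    """Return 1 where the vocal magnitude exceeds the accompaniment, else 0."""
    return [
        [1 if v > a else 0 for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(vocal_mag, accomp_mag)
    ]


def apply_mask(mixture_mag, mask, threshold=0.5):
    """Keep mixture cells whose mask value exceeds the (assumed) threshold."""
    return [
        [m if p > threshold else 0.0 for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(mixture_mag, mask)
    ]


# Toy magnitude spectrograms (hypothetical values):
# rows are frequency bins, columns are time frames.
vocal = [[4.0, 1.0], [2.0, 8.0]]
accomp = [[1.0, 3.0], [5.0, 2.0]]

# Simplifying assumption: magnitudes add in the mixture.
mixture = [[v + a for v, a in zip(v_row, a_row)]
           for v_row, a_row in zip(vocal, accomp)]

mask = ideal_binary_mask(vocal, accomp)
vocal_estimate = apply_mask(mixture, mask)

# A probabilistic mask (values in [0, 1], e.g. from a trained model)
# can be thresholded with the same function.
soft_mask = [[0.9, 0.4], [0.2, 0.6]]
vocal_estimate_soft = apply_mask(mixture, soft_mask)
```

For a karaoke-style result, the complementary cells (those the mask rejects) would be kept instead, which leaves an estimate of the accompaniment.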

Keywords

Deep learning, Supervised learning, Convolution, Source separation

Notes

Acknowledgment

AJRS, GR and MDP were supported by grant EP/L027119/1 from the UK Engineering and Physical Sciences Research Council (EPSRC). Data and materials are available at doi:10.15126/surreydata.00807909.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Andrew J. R. Simpson (1)
  • Gerard Roma (1)
  • Mark D. Plumbley (1)
  1. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
