Low-Latency Instrument Separation in Polyphonic Audio Using Timbre Models
This research focuses on the removal of the singing voice in polyphonic audio recordings under real-time constraints. It is based on time-frequency binary masks resulting from the combination of azimuth, phase difference and absolute frequency spectral bin classification and harmonic-derived masks. For the harmonic-derived masks, a pitch likelihood estimation technique based on Tikhonov regularization is proposed. A method for target instrument pitch tracking makes use of supervised timbre models. This approach runs in real-time on off-the-shelf computers with latency below 250ms. The method was compared to a state of the art Non-negative Matrix Factorization (NMF) offline technique and to the ideal binary mask separation. For the evaluation we used a dataset of multi-track versions of professional audio recordings.
KeywordsSource separation Singing voice Predominant pitch tracking
Unable to display preview. Download preview PDF.
- 2.Benaroya, L., Bimbot, F., Gribonval, R.: Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing 14(1) (2006)Google Scholar
- 3.Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
- 6.Fujihara, H., Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.: F0 estimation method for singing voice in polyphonic audio signal based on statistical vocal model and viterbi search. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, p. V (May 2006)Google Scholar
- 7.Goto, M., Hayamizu, S.: A real-time music scene description system: Detecting melody and bass lines in audio signals. Speech Communication (1999)Google Scholar
- 8.Jourjine, A., Rickard, S., Yilmaz, O.: Blind separation of disjoint orthogonal signals: demixing n sources from 2 mixtures. In: Proc (ICASSP) International Conference on Acoustics, Speech, and Signal Processing (2000)Google Scholar
- 10.Ryynänen, M., Klapuri, A.: Transcription of the singing melody in polyphonic music. In: Proc. 7th International Conference on Music Information Retrieval, Victoria, BC, Canada, pp. 222–227 (October 2006)Google Scholar
- 11.Sha, F., Saul, L.K.: Real-time pitch determination of one or more voices by nonnegative matrix factorization. In: Advances in Neural Information Processing Systems, vol. 17, pp. 1233–1240. MIT Press (2005)Google Scholar
- 12.Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.P.: First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 552–559. Springer, Heidelberg (2007)CrossRefGoogle Scholar
- 13.Vinyes, M., Bonada, J., Loscos, A.: Demixing commercial music productions via human-assisted time-frequency masking. In: Proceedings of Audio Engineering Society 120th Convention (2006)Google Scholar