Abstract
Singing voice detection is the task of segmenting a song into vocal and non-vocal parts. Common methods train a model on a set of frame-based features and then use that model to classify unseen frames. However, the multi-dimensional features are usually just concatenated for each frame, with little consideration of the spatial information across feature dimensions. Hence, a deep fusion method over the multiple feature dimensions using a Convolutional Neural Network (CNN) is proposed. A one-dimensional convolution is applied along the feature dimension of each frame, and the resulting high-level features are used directly for binary classification. The performance of the proposed method is on par with state-of-the-art methods on a public dataset.
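The per-frame fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the feature set, kernel width, number of filters, activation, or classifier, so the values below (20 features per frame, 4 kernels of width 3, ReLU, a single logistic output) and all function names are assumptions chosen only for illustration, and the weights are random rather than trained.

```python
import math
import random


def conv1d_valid(values, kernel, bias=0.0):
    """Slide `kernel` along one frame's feature vector (stride 1, no padding)."""
    width = len(kernel)
    return [
        sum(values[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(values) - width + 1)
    ]


def relu(values):
    """Element-wise rectifier; an assumed choice of non-linearity."""
    return [max(0.0, v) for v in values]


def vocal_probability(frame, kernels, weights, offset=0.0):
    """Convolve one frame with every kernel, flatten, then apply a logistic unit."""
    fused = []
    for kernel in kernels:
        fused.extend(relu(conv1d_valid(frame, kernel)))
    logit = sum(w * x for w, x in zip(weights, fused)) + offset
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical sizes and random (untrained) parameters, for illustration only.
random.seed(0)
n_features, n_kernels, width = 20, 4, 3
kernels = [[random.gauss(0, 0.1) for _ in range(width)] for _ in range(n_kernels)]
weights = [
    random.gauss(0, 0.1) for _ in range(n_kernels * (n_features - width + 1))
]

# Five synthetic frames standing in for concatenated per-frame audio features.
frames = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(5)]
probs = [vocal_probability(frame, kernels, weights) for frame in frames]
labels = ["vocal" if p >= 0.5 else "non-vocal" for p in probs]
```

The point of the sketch is that each kernel sees neighbouring feature dimensions within a single frame, so the ordering of features in the vector matters, unlike with a classifier that treats the concatenated vector as an unordered set of inputs.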
Acknowledgements
This work is supported by NSFC 61671156.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, X., Li, S., Li, Z., Chen, S., Gao, Y., Li, W. (2020). Singing Voice Detection Using Multi-Feature Deep Fusion with CNN. In: Li, H., Li, S., Ma, L., Fang, C., Zhu, Y. (eds) Proceedings of the 7th Conference on Sound and Music Technology (CSMT). Lecture Notes in Electrical Engineering, vol 635. Springer, Singapore. https://doi.org/10.1007/978-981-15-2756-2_4
DOI: https://doi.org/10.1007/978-981-15-2756-2_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2755-5
Online ISBN: 978-981-15-2756-2
eBook Packages: Engineering, Engineering (R0)