Abstract
With the growth of acoustic data from multimedia tools, mobile phones, and the Internet of Multimedia Things (IoMT), recent studies have explored machine-hearing models capable of capturing sounds and classifying and separating them into speech, music, and environmental sounds. The separation of speech, music, and environmental sounds plays an important role in automatic machine hearing and in future applications for big acoustic data (BAD) processing. This paper critically reviews the approaches and methods adopted for speech and music separation and highlights how these algorithms and techniques can support machine-hearing applications. First, we describe the main sound characteristics and features used to separate sounds into speech and music, and we categorize the related literature accordingly. Next, we present the processing of voice, speech, and music separately, and we analyze existing machine-hearing approaches. Subsequently, we propose a new BAD model and identify the challenges that music and speech separation algorithms must address, along with the novel components required for big data processing in the future. Finally, we discuss the existing metrics and data sets, as well as the metrics and data sets required to evaluate new BAD multimedia applications, and we conclude with future research directions.
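Several of the separation families surveyed here (e.g., the NMF-based approaches of Fevotte et al. and Mai et al.) factor a nonnegative magnitude spectrogram V into spectral templates W and activations H, then reconstruct each source from its own template/activation pair. The sketch below is illustrative only, not any cited author's implementation: it runs Lee–Seung multiplicative-update NMF on a small toy matrix standing in for a real spectrogram, with hand-built `tones` and `noise` blocks as assumed stand-ins for two sources.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0):
    """Euclidean NMF via Lee-Seung multiplicative updates: V ~ W @ H."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + 1e-3   # spectral templates (columns)
    H = rng.random((rank, n)) + 1e-3   # temporal activations (rows)
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "spectrogram": two sources with disjoint spectral/temporal support
tones = np.outer([1, 0, 0, 1], [1, 1, 0, 0, 1, 1])  # source A activity
noise = np.outer([0, 1, 1, 0], [0, 0, 1, 1, 0, 0])  # source B activity
V = (tones + noise).astype(float)

W, H = nmf(V, rank=2)
# Each (W column, H row) pair models one source; a soft mask
# (W[:, k:k+1] @ H[k:k+1, :]) / (W @ H) applied to V would recover
# source k's magnitude spectrogram in the real pipeline.
print(np.abs(W @ H - V).max())
```

In practice V would come from an STFT of the mixture, the rank would exceed two (several components per source), and the recovered magnitudes would be combined with the mixture phase before inverting the STFT.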
References
Aissa-El-Bey A, Abed-Meriem K, Grenier Y (2007) Underdetermined Blind Audio Source Separation using Modal Decomposition. EURASIP Journal on Audio, Speech, and Music Processing, pp 1–15
Ajmera J, McCowan IA, Bourlard H (2002) Robust HMM-based Speech/Music Segmentation. IEEE Int Conf Acoust Speech Signal Process 1:1–297
Alias F, Socoro JC, Sevillano X (2016) A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds. Appl Sci 6(5):143
Amodei D et al (2016) Deep Speech 2: End-to-end Speech Recognition in English and Mandarin. International Conference on Machine Learning, pp 173–182
Arqub OA, Al-Smadi M (2020) Fuzzy conformable fractional differential equations: novel extended approach and new numerical solutions. Soft Comput:1–22
Arqub OA et al (2017) Application of reproducing kernel algorithm for solving second-order, two-point fuzzy boundary value problems. Soft Comput 21(23):7191–7206
Barchiesi D, Giannoulis D, Stowell D, Plumbley MD (2015) Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Proc Mag 32(3):16–34
Beerends GC et al (2016) Quantifying sound quality in loudspeaker reproduction. J Audio Eng Soc 64(10):784–799
Burute H, Mane PB (2015) Separation of singing voice from music background. Int J Comput Appl 129(4):22–26
Burute H, Mane PB (2015) Separation of Singing Voice from Music Accompaniment using matrix Factorization Method. IEEE International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp 166–171
Chachada S, Kuo CCJ (2014) Environmental sound recognition: a survey. APSIPA Trans Signal Inf Process 3:1–15
Chan TS et al (2015) Vocal activity informed singing voice separation with the ikala dataset. IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), pp 718–722
Chien JT, Yang P (2016) Bayesian Factorization and Learning for Monaural Source Separation. IEEE Trans Audio Speech Lang Process 24(1):185–195
Cichocki A et al (2009) Nonnegative Matrix and Tensor Factorizations-Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Ch.1 ISBN:9780470746660
Dafforn KA et al (2016) Big data opportunities and challenges for assessing multiple stressors across scales in aquatic ecosystems. Mar Freshw Res 67(4):393–413
Delić V et al (2019) Speech technology progress based on new machine learning paradigm. Computational intelligence and neuroscience, pp 25
Ding N et al (2017) Temporal modulations in speech and music. Neurosci Biobehav Rev 81:181–187
Driedger J, Müller M (2015) Extracting singing voice from music recordings by cascading audio decomposition techniques. IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), pp 126–130
Duan S, Zhang J, Roe P, Towsey M (2014) A survey of tagging techniques for music, speech and environmental sound. Artif Intell Rev 42(4):637–661
Dubey H, Mehl MR, Mankodiya K (2016) Bigear: Inferring the Ambient and Emotional Correlates from Smartphone-based Acoustic Big Data. IEEE International Workshop on Big Data Analytics for Smart and Connected Health, pp 78–83
Dugan P et al (2015) High Performance Computer Acoustic Data Accelerator: A New System for Exploring Marine Mammal Acoustics for Big Data Applications, arXiv:1509.03591
El-Maleh K et al (2000) Speech/Music Discrimination for multimedia applications. IEEE Int Conf Acoust Speech Signal Process 4:2445–2448
Fevotte C, Gribonval R, Vincent E (2005) BSS-EVAL Toolbox User Guide-Revision 2.0
Fevotte C, Kowalski M (2015) Hybrid Sparse and Low-Rank Time-Frequency Signal Decomposition, 23rd European Signal Processing Conference, pp 1–5
Fevotte C, Vincent E, Ozerov A (2018) Single channel audio source separation with NMF: divergences, Constraints and Algorithms, Audio Source Separation. Springer, pp 1–24
Goel P, Sharma P, Srivastava S (2016) Design of electrical ultrasonic converter model to generate electricity. IEEE International Conference on Computational Intelligence & Communication Technology (CICT), pp 403–405
Grondin F, Michaud F (2016) Robust Speech/Non-Speech Discrimination Based on Pitch Estimation for mobile Robots. IEEE International Conference on Robotics and Automation, pp 1650–1655
Guo J et al (2016) GPU-Based fast signal processing for large amounts of snore sound data. IEEE Glob Conf Consum Electron, pp 1–3
Han J, Chen C (2011) Improving melody extraction using probabilistic latent component analysis. IEEE international conference on acoustics Speech and Signal Processing (ICASSP), pp 33–36
Hobson-Webb L, Cartwright M (2017) Advancing neuromuscular ultrasound through research: Finding common sound. Muscle Nerve 56(3):375–378
Holmes T (2021) Defining voice design in video games
Hsu CL, Wang D, Jang JR, Hu K (2012) A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Trans Audio Speech Lang Process 20(5):1482–1491
Huang P et al (2012) Singing-voice Separation from Monaural Recordings using Robust Principal Component Analysis. International Conference on Acoustics, Speech and Signal Processing, pp 57–60
Hurley N et al (2005) Blind source separation of speech in hardware. IEEE Workshop on Signal Processing Design and Implementation, pp 442–445
Igarashi Y et al (2013) Evaluation of Sinusoidal Modeling for Polyphonic Music Signal. 9th International Conference on Intelligent Hiding and Multimedia Signal Processing, pp 464–467
Ikemiya Y, Itoyama K, Yoshii K (2015) Singing Voice Analysis and Editing based on Mutually dependent F0 Estimation and Source Separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 574–578
Ikemiya Y, Itoyama K, Yoshii K (2016) Singing voice separation and vocal F0 estimation based on mutual combination of robust principal component analysis and subharmonic summation. IEEE Trans Audio Speech Lang Process 24(11):2084–2095
Kent G et al (2017) Low-power image recognition challenge. IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), pp 99–104
Khonglah BK, Prasanna SM (2016) Speech / music classification using speech-specific features. Digital Signal Process 48:71–83
Kune R et al (2016) The anatomy of big data computing. Softw Practice Exper 46(1):79–105
Kune R et al (2017) XHAMI-Extended HDFS and MapReduce interface for big data image processing applications in cloud computing environments. Softw Practice Exper 47(3):455–472
Lagrange M et al (2008) Normalized cuts for predominant melodic source separation. IEEE Trans Audio Speech Lang Process 16(2):278–290
Li F, Akagi M (2018) Unsupervised Singing Voice Separation Based on Robust Principal Component Analysis Exploiting Rank-1 Constraint. 26th IEEE European Signal Processing Conference (EUSIPCO), pp 1920–1924
Li Y, Wang D (2006) Singing Voice Separation from Monaural Recordings. 7th International Society for Music Information Retrieval Conference (ISMIR), pp 176–179
Li Y, Wang D (2007) Separation of singing voice from music accompaniment for monaural recordings. IEEE Trans Audio Speech Lang Process 15 (4):1475–1487
Liutkus A et al (2012) Adaptive filtering for Music/Voice separation exploiting the repeating musical structure. IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), pp 53–56
Lyon RF (2010) Machine hearing: an emerging field [ExploratoryDSP]. IEEE Signal Proc Mag 27(7):131–139
Mai Y et al (2015) Transductive Convolutive Nonnegative Matrix Factorization for Speech Separation, 4th IEEE International Conference on Computer Science and Network Technology (ICCSNT), vol 1, pp 1400–1404
McFee B et al (2012) The million song dataset challenges. International Conference on World Wide Web, pp 909–916
McLeod A, Steedman M (2016) HMM-Based Voice Separation of MIDI Performance. J New Music Res 45(1):17–26
McLoughlin I (2009) Applied Speech and Audio Processing with Matlab Examples. Cambridge University Press, Ch.3, ISBN:978-0-511-51654-2
Meneghesso G et al (2017) Smart power devices nanotechnology, nanoelectronics: materials, Devices. Applications, vol 2
Mimilkis SI, Drossos K, Schuller G (2021) Unsupervised interpretable representation learning for singing voice separation. European Signal Processing Conference (EUSIPCO), pp 1412–1416
Mirbeygi M et al (2021) RPCA-Based real-time speech and music separation method. Speech Comm 126:22–34
Miyazaki K et al (2019) Environmental sound processing and its applications. IEEJ Trans Electr Electron Eng 14(3):340–351
Mohammed A, Ballal T, Grbic N (2007) Blind source separation using time - frequency masking. RadioEngineering-Prague 16(4):96–100
Mowlavi P, Froghani A, Sayadiyan A (2008) Sparse sinusoidal signal representation for speech and music signals. Springer, Berlin, pp 469–476
Müller M (2015) Fundamentals of music processing. Springer, ch.1, 8 ISBN:978-3-319-21944-8
Munoz-Exposito JE, Garcia-Galan S, Ruiz-Reyes N, Vera-Candeas P, Rivas-Pena F (2005) Speech/music discrimination using a single warped LPC-based feature. Int Conf Music Inf 5:16–25
Munoz-Exposito JE et al (2006) Speech/Music Discrimination using a Warped LPC-Based Feature and A Fuzzy System for Intelligent Audio Coding. 14th European Signal Processing Conference, pp 1–5
Nugraha AA, Liutkus A, Vincent E (2018) Deep Neural Network based Multichannel Audio Source Separation, Audio Source Separation. Springer, pp 157–185
Ozerov A, Vincent E, Bimbot F (2012) A general flexible framework for handling of prior information in audio source separation. IEEE Trans Audio Speech Lang Process 20(4):1118–1133
Ozerov A et al (2005) One Microphone Singing Voice Separation using Source-Adapted Models. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 90–93
Ozerov A et al (2007) Adaptation of Bayesian models for single channel source separation and its application to Voice/Music separation in popular songs. IEEE Trans Audio Speech Lang Process 15(5):1564–1578
Pikrakis A, Theodoridis S (2014) Speech-Music Discrimination: a deep learning perspective. IEEE European signal processing conference (EUSIPCO), pp 616–620
Pulkki V, Karjalainen M (2015) Communication acoustics: an introduction to speech, audio and psychoacoustics. Wiley. ISBN:978-1-118-86654-2
Puy G, Ozerov A, Duong N, Perez P (2017) Informed source separation via compressive graph sampling. IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), pp 1–5
Radhakrishnan R, Divakaran A, Smaragdis A (2005) Audio analysis for surveillance applications. IEEE Workshop Applications of Signal Processing to Audio and Acoustics, pp 158–161
Rafii Z, Duan Z, Pardo B (2014) Combining rhythm-based and pitch-based methods for background and melody separation. IEEE Trans Audio Speech Lang Process 22(12):1884–1893
Rafii Z, Liutkus A, Pardo B (2015) A simple user interface system for recovering patterns repeating in time and frequency in mixtures of sounds. IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), pp 271–275
Rafii Z, Pardo B (2011) Degenerate Unmixing Estimation Technique using the Constant Q Transform. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 217–220
Rafii Z, Pardo B (2012) Music/voice Separation using the Similarity Matrix. International Society for Music Information Retrieval Conference (ISMIR), pp 583–588
Rafii Z, Pardo B (2012) Repeating pattern extraction technique (REPET): a simple method for Music/Voice separation. IEEE Trans Audio Speech Lang Process 21(1):71–84
Rafii Z, Pardo B (2011) A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 221–224
Rafii Z et al (2013) Combining Modeling of Singing Voice and Background Music for Automatic Separation of Musical Mixtures. Int Soc Music Inf Retr Conf (ISMIR) 10:645–680
Rajapakse M, Wyse L (2005) Generic Audio Classification using a Hybrid Model based on GMMs and HMMs. IEEE International Multimedia Modeling Conference, pp 53–58
Rao V, Ramakrishnan S, Rao P (2009) Singing Voice Detection in Polyphonic Music using Predominant Pitch. Annual Conference of the International Speech Communication Association (Interspeech)
Regnier L, Peeters G (2012) Singing Voice Detection in Music Tracks using Direct Voice Vibrato Detection. IEEE International conference on acoustics, Speech and Signal Processing (ICASSP), pp 1685–1688
Rickard S (2007) The duet blind source separation algorithm. Blind Speech Separation (Springer), pp 217–241
Roads C, Pope ST, Piccialli A, poli GD (1997) Musical signal processing. Swets & Zeitlinger Publishers ISBN:9026514824
Rossing TD (2007) Springer Handbook of Acoustics, vol 1. Springer, ISBN:978-0-387-30446-5
Roux JL, Hershey J, Weninger F (2015) Deep NMF for Speech Separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 66–70
Rumsey F, McCormick T (2009) Sound and recording. Elsevier ltd, ch. 1 ISBN:978-0-240-52163-3
Sagiroglu S, Sinanc D (2013) Big data: a review. IEEE International Conference on Collaboration Technologies and Systems (CTS), pp 42–47
Sarasola X et al (2019) Application of pitch derived parameters to speech and monophonic singing classification. Appl Sci 9(15):3140
Sell G, Clark P (2014) Music tonality features for Speech/Music discrimination. IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2489–2493
Shamim HM et al (2016) Audio-visual emotion recognition using big data towards 5G. Mob Netw Appl 21(5):753–763
Songnian L et al (2016) Geospatial big data handling theory and methods: a review and research challenges. ISPRS J Photogramm Remote Sens 115:119–133
Sprechmann P, Bronstein A, Sapiro G (2012) Real-Time Online Singing Voice Separation from Monaural Recordings using Robust Low-Rank Modeling, 13th International Society for Music Information Retrieval Conference (ISMIR), pp 67–72
Stanev M et al (2016) Speech and Music Discrimination: Human Detection of Differences between Music and Speech based on Rhythm. Speech Prosody Conference,International Speech Communication Association, pp 222–226
Snyder D, Chen G, Povey D (2015) MUSAN: A Music, Speech, and Noise Corpus, arXiv:1510.08484
Taniguchi T, Tohyama M, Shirai K (2008) Detection of speech and music based on spectral tracking. Speech Comm 50(7):547–563
Tjandra A, Sakti S, Nakamura S (2020) Machine speech chain. IEEE/ACM Trans Audio Speech Lang Process 2(28):976–89
Toroghi RM (2016) Blind Speech Separation in Distant Speech Recognition Front-end Processing. PhD Dissertation, Saarland University Germany
Tsai WH, Ma CH (2014) Speech and singing discrimination for audio data indexing. IEEE International Congress on Big Data, pp 276–280
Tsipas N et al (2017) Efficient Audio-Driven Multimedia Indexing through Similarity-based Speech/Music Discrimination. Multimed Tools Appl 76(24):25603–25621
Ullo SL, Khare SK, Bajaj V, Sinha GR (2020) Hybrid computerized method for environmental sound classification. IEEE Access 8:124055–124065
Vacher M, Serignat JF, Chaillol S (2007) Sound classification in a smart room environment: an approach using GMM and HMM methods. 4th IEEE Conference on Speech Technology and Human-Computer Dialogue, vol 1, pp 135–146
Vallin J et al (2016) Low-Complexity Iterative Sinusoidal Parameter Estimation, arXiv:1603.01824
Vaseghi S (2007) Multimedia signal processing theory and applications in speech, music and communication. Wiley, Ch. 6
Vaseghi S (2008) Advanced digital signal processing and noise reduction. John Wiley, pp 29–43
Verma JP et al (2016) Big data analytics: Challenges and applications for text, audio, video, and social media data. Int J Soft Comput Artif Intell Appl 5(1):41–51
Virtanen T (2000) Audio signal modeling with sinusoids plus noise. Master of Science Thesis, Tampere University of Technology
Virtanen T, Mesaros A, Ryynanen M (2008) Combining Pitch-Based Inference and Non-negative Spectrogram Factorization in Separating Vocals from Polyphonic Music. ITRW on Statistical and Perceptual Audio Processing, pp 17–22
Wolfe J (2002) Speech and music, acoustics and coding, and what music might be for. 7th International Conference on Music Perception and Cognition, pp 10-13
Wu X, Zhu X, Wu G, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107
Xu X, Flynn R, Russell M (2017) Speech Intelligibility and Quality: A Comparative Study of Speech Enhancement Algorithms, 28th IEEE Irish Signal and System Conference, pp 1–6
Zeremdini J, Messaoud MB, Bouzid A (2015) A comparison of several computational auditory scene analysis (CASA) techniques for monaural speech segregation. Brain Inf (Springer) 2(3):155–166
Zhang Z et al (2021) Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 453:896–903
Mirbeygi, M., Mahabadi, A. & Ranjbar, A. Speech and music separation approaches - a survey. Multimed Tools Appl 81, 21155–21197 (2022). https://doi.org/10.1007/s11042-022-11994-1