
Speech and music separation approaches - a survey

Published in: Multimedia Tools and Applications

Abstract

With the growth of acoustic data from multimedia tools, mobile phones, and the Internet of Multimedia Things (IoMT), recent studies have exploited machine-hearing models capable of capturing sounds and classifying and separating them into speech, music, and environmental sounds. This separation plays an important role in automatic machine hearing and in developing future applications for big acoustic data (BAD) processing. This paper critically reviews the approaches and methods adopted for speech and music separation, and highlights how these algorithms and techniques can support machine-hearing applications. First, we describe the main sound characteristics and features used to separate sounds into speech and music, and use them to categorize the related literature. Next, we present the processing of voice, speech, and music separately, and explain machine hearing in order to analyze existing approaches. Subsequently, we propose a new BAD model and identify the challenges that future music- and speech-processing algorithms should address for big data processing. Finally, we discuss the existing metrics and data sets, outline the metrics and data sets that future BAD research will require to evaluate new multimedia applications, and conclude with future directions.
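A large family of the separation methods this survey covers factor a non-negative magnitude spectrogram into a small number of additive components, from which speech and music parts can be reconstructed. As a minimal sketch of that idea (not the authors' own method), the following applies Lee-Seung multiplicative-update NMF to a toy "spectrogram" built from a steady component and an intermittent one, standing in for accompaniment and voice activity; all array shapes and the synthetic data are illustrative assumptions.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9):
    """Factor a non-negative matrix V ~= W @ H using multiplicative
    updates (Lee-Seung, Euclidean distance)."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, r)) + eps   # spectral templates (F x r)
    H = rng.random((r, T)) + eps   # temporal activations (r x T)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy magnitude "spectrogram": a steady harmonic-like source plus an
# on/off source, standing in for music and intermittent voice.
F, T = 64, 100
music = np.outer(np.abs(np.sin(np.arange(F))), np.ones(T))        # steady
voice = np.outer(np.exp(-np.arange(F) / 10.0),
                 (np.arange(T) % 20 < 10).astype(float))          # on/off
V = music + voice

W, H = nmf(V, r=2)
V_hat = W @ H
rel_err = np.linalg.norm(V - V_hat) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)
```

In a real system, V would come from an STFT of the mixture, each learned component would be assigned to a source (speech or music), and soft time-frequency masks built from `W @ H` per component would filter the mixture before inverting the STFT.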






Author information

Correspondence to Aminollah Mahabadi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mirbeygi, M., Mahabadi, A. & Ranjbar, A. Speech and music separation approaches - a survey. Multimed Tools Appl 81, 21155–21197 (2022). https://doi.org/10.1007/s11042-022-11994-1
