Audio Source Separation in a Musical Context

  • Bryan Pardo
  • Zafar Rafii
  • Zhiyao Duan
Part of the Springer Handbooks book series (SHB)


When musical instruments are recorded in isolation, modern editing and mixing tools allow correction of small errors without requiring the group to re-record an entire passage. Isolated recording also allows rebalancing of levels between musicians and application of audio effects to individual instruments, all without re-recording. Unfortunately, in many recording situations (e.g., a stereo recording of a 10-piece ensemble) there are many more instruments than microphones, and the lack of isolated tracks makes many editing or remixing tasks difficult or impossible.

Audio source separation is the process of extracting individual sound sources (e.g., a single flute) from a mixture of sounds (e.g., a recording of a concert band using a single microphone). Effective source separation would allow application of editing and remixing techniques to existing recordings with multiple instruments on a single track.

In this chapter we focus on a pair of source separation approaches designed to work with music audio. The first seeks repeated elements in the musical scene and separates the repeating from the nonrepeating. The second looks for melodic elements, using pitch tracking and streaming to segregate the audio into individual sources. Finally, we consider informing source separation with information from the musical score.
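As a rough illustration of the repetition-based idea (in the spirit of REPET), the sketch below estimates a soft mask for the repeating background directly from a magnitude spectrogram, assuming the repeating period (in frames) is already known. The function name, the synthetic setup, and the use of a plain NumPy median are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def repet_mask(V, period):
    """Estimate a soft mask for the repeating part of a mixture.

    V      -- magnitude spectrogram, shape (n_freq, n_time)
    period -- repeating period in frames (assumed known here;
              REPET estimates it from a beat spectrum)
    """
    n_freq, n_time = V.shape
    n_seg = n_time // period

    # Stack the whole periods as (freq, repetition, phase-within-period).
    segs = V[:, :n_seg * period].reshape(n_freq, n_seg, period)

    # Repeating model: element-wise median across the repetitions,
    # which suppresses the sparse, nonrepeating foreground.
    W = np.median(segs, axis=1)

    # Tile the one-period model back out to the full length.
    reps = int(np.ceil(n_time / period))
    W_full = np.tile(W, reps)[:, :n_time]

    # The repeating energy cannot exceed the mixture energy.
    W_full = np.minimum(W_full, V)

    # Ratio (Wiener-like) soft mask, guarding against division by zero.
    return W_full / np.maximum(V, 1e-12)
```

Multiplying the mask by the mixture spectrogram (and reusing the mixture phase) yields the repeating background; one minus the mask yields the nonrepeating foreground, e.g., a singing voice over an accompaniment.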


Keywords: source separation, time-frequency bin, mixture spectrogram, non-peak regions, pitch estimation

Abbreviations:

  • BPM: beats per minute
  • ICA: independent component analysis
  • MAP: maximum a posteriori
  • MFCC: Mel-frequency cepstral coefficient
  • MIDI: musical instrument digital interface
  • University of Iowa musical instrument samples
  • MMSE: minimum mean square error
  • MPE: multipitch estimation
  • NMF: nonnegative matrix factorization
  • NTF: nonnegative tensor factorization
  • PLCA: probabilistic latent component analysis
  • REPET: repeating pattern extraction technique
  • RPCA: robust principal component analysis
  • STFT: short-time Fourier transform
  • UDC: uniform discrete cepstrum



Copyright information

© Springer-Verlag Berlin Heidelberg 2018

Authors and Affiliations

  • Bryan Pardo (1)
  • Zafar Rafii (2)
  • Zhiyao Duan (3)

  1. Ford Engineering Design Center, Northwestern University, Evanston, USA
  2. Gracenote, Emeryville, USA
  3. Dept. of Electrical and Computer Engineering, University of Rochester, Rochester, USA
