Perceptual Evaluation of Blind Source Separation in Object-Based Audio Production

  • Philip Coleman
  • Qingju Liu
  • Jon Francombe
  • Philip J. B. Jackson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10891)


Object-based audio has the potential to enable multimedia content to be tailored to individual listeners and their reproduction equipment. In general, object-based production assumes that the objects—the assets comprising the scene—are free of noise and interference. However, there are many applications in which signal separation could be useful to an object-based audio workflow, e.g., extracting individual objects from channel-based recordings or legacy content, or recording a sound scene with a single microphone array. This paper describes the application and evaluation of blind source separation (BSS) for sound recording in a hybrid channel-based and object-based workflow, in which BSS-estimated objects are mixed with the original stereo recording. A subjective experiment was conducted using simultaneously spoken speech recorded with omnidirectional microphones in a reverberant room. Compared to the raw stereo recording, listeners mixed a BSS-extracted speech object into the scene to make the quieter talker clearer, while retaining acceptable audio quality. Objective evaluations show that the relative short-time objective intelligibility and speech quality scores increase when BSS is used. Further objective evaluations are used to discuss the influence of the BSS method on the remixing scenario; the scenario shown by human listeners to be useful in object-based audio is the worst-case scenario among those tested.
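The hybrid workflow the abstract describes—adding a BSS-estimated object back into the original stereo recording at a listener-chosen gain—can be sketched as follows. This is a minimal illustrative example, not the authors' implementation; the function name, the sample-pair representation, and the gain handling are assumptions for clarity.

```python
# Hypothetical sketch of the hybrid remixing scenario: a separated (mono)
# speech object is added back into the original stereo mixture at a gain
# chosen by the listener, e.g. to make the quieter talker clearer.

def remix(stereo, obj, gain_db):
    """Mix a mono separated object into a stereo recording.

    stereo  : list of (left, right) sample pairs (the raw recording)
    obj     : list of mono samples, same length as stereo (BSS estimate)
    gain_db : object gain in decibels, chosen by the listener
    """
    g = 10.0 ** (gain_db / 20.0)  # convert dB to linear amplitude
    return [(l + g * s, r + g * s) for (l, r), s in zip(stereo, obj)]

# Example: boost the extracted talker by 6 dB relative to the mixture.
mix = remix([(0.1, -0.1), (0.2, 0.0)], [0.05, 0.1], 6.0)
```

In practice the object would be rendered with its own spatial position and the loudness of the final mix would be controlled (e.g., per ITU-R BS.1770), but the core operation is this linear remix.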



This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1). Relevant data can be accessed via



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Philip Coleman (1, 2)
  • Qingju Liu (2)
  • Jon Francombe (1, 3)
  • Philip J. B. Jackson (2)
  1. Institute of Sound Recording, University of Surrey, Guildford, UK
  2. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
  3. BBC Research & Development, Salford, UK
