Learning to Separate Object Sounds by Watching Unlabeled Video

  • Ruohan Gao
  • Rogerio Feris
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results:



This research was supported in part by an IBM Faculty Award, IBM Open Collaboration Research Award, and DARPA Lifelong Learning Machines. We thank members of the UT Austin vision group and Wenguang Mao, Yuzhong Wu, Dongguang You, Xingyi Zhou and Xinying Hao for helpful input. We also gratefully acknowledge a GPU donation from Facebook.


  1. 1.
    Afouras, T., Chung, J.S., Zisserman, A.: The conversation: deep audio-visual speech enhancement. arXiv preprint arXiv:1804.04121 (2018)
  2. 2.
    Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. PAMI 32, 288–303 (2010)CrossRefGoogle Scholar
  3. 3.
    Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)Google Scholar
  4. 4.
    Arandjelović, R., Zisserman, A.: Objects that sound. arXiv preprint arXiv:1712.06651 (2017)
  5. 5.
    Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: NIPS (2016)Google Scholar
  6. 6.
    Aytar, Y., Vondrick, C., Torralba, A.: See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932 (2017)
  7. 7.
    Barnard, K., Duygulu, P., de Freitas, N., Blei, D., Jordan, M.: Matching words and pictures. JMLR 3, 1107–1135 (2003)zbMATHGoogle Scholar
  8. 8.
    Barzelay, Z., Schechner, Y.Y.: Harmony in motion. In: CVPR (2007)Google Scholar
  9. 9.
    Berg, T., et al.: Names and faces in the news. In: CVPR (2004)Google Scholar
  10. 10.
    Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016)Google Scholar
  11. 11.
    Bryan, N.: Interactive Sound Source Separation. Ph.D. thesis, Stanford University (2014)Google Scholar
  12. 12.
    Casanovas, A.L., Monaci, G., Vandergheynst, P., Gribonval, R.: Blind audiovisual source separation based on sparse redundant representations. IEEE Trans. Multimed. 12, 358–371 (2010)CrossRefGoogle Scholar
  13. 13.
    Chen, L., Srivastava, S., Duan, Z., Xu, C.: Deep cross-modal audio-visual generation. In: Proceedings of the on Thematic Workshops of ACM Multimedia (2017)Google Scholar
  14. 14.
    Cinbis, R., Verbeek, J., Schmid, C.: Weakly supervised object localization with multi-fold multiple instance learning. PAMI 39, 189–203 (2017)CrossRefGoogle Scholar
  15. 15.
    Darrell, T., Fisher, J.W., Viola, P.: Audio-visual segmentation and the cocktail party effect. In: Tan, T., Shi, Y., Gao, W. (eds.) ICMI 2000. LNCS, vol. 1948, pp. 32–40. Springer, Heidelberg (2000). Scholar
  16. 16.
    Deselaers, T., Alexe, B., Ferrari, V.: Weakly supervised localization and learning with generic knowledge. IJCV 100, 275–293 (2012)MathSciNetCrossRefGoogle Scholar
  17. 17.
    Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. intell. 89, 31–71 (1997)CrossRefGoogle Scholar
  18. 18.
    Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)Google Scholar
  19. 19.
    Duong, N.Q., Ozerov, A., Chevallier, L., Sirot, J.: An interactive audio source separation framework based on non-negative matrix factorization. In: ICASSP (2014)Google Scholar
  20. 20.
    Duong, N.Q., Vincent, E., Gribonval, R.: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18, 1830–1840 (2010)CrossRefGoogle Scholar
  21. 21.
    Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.A.: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002). Scholar
  22. 22.
    Ellis, D.P.W.: Prediction-driven computational auditory scene analysis. Ph.D. thesis, Massachusetts Institute of Technology (1996)Google Scholar
  23. 23.
    Ephrat, A., et al.: Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619 (2018)
  24. 24.
    Feng, J., Zhou, Z.H.: Deep MIML network. In: AAAI (2017)Google Scholar
  25. 25.
    Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the itakura-saito divergence: with application to music analysis. Neural comput. 21, 793–830 (2009)CrossRefGoogle Scholar
  26. 26.
    Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural comput. 23, 2421–2456 (2011)MathSciNetCrossRefGoogle Scholar
  27. 27.
    Fisher III, J.W., Darrell, T., Freeman, W.T., Viola, P.A.: Learning joint statistical models for audio-visual fusion and segregation. In: NIPS (2001)Google Scholar
  28. 28.
    Gabbay, A., Shamir, A., Peleg, S.: Visual speech enhancement using noise-invariant training. arXiv preprint arXiv:1711.08789 (2017)
  29. 29.
    Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: ICASSP (2017)Google Scholar
  30. 30.
    Griffin, D., Lim, J.: Signal estimation from modified short-time fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32, 236–243 (1984)CrossRefGoogle Scholar
  31. 31.
    Guo, X., Uhlich, S., Mitsufuji, Y.: NMF-based blind source separation using a linear predictive coding error clustering criterion. In: ICASSP (2015)Google Scholar
  32. 32.
    Harwath, D., Glass, J.: Learning word-like units from joint audio-visual analysis. In: ACL (2017)Google Scholar
  33. 33.
    Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., Glass, J.: Jointly discovering visual objects and spoken words from raw sensory input. arXiv preprint arXiv:1804.01452 (2018)
  34. 34.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  35. 35.
    Hennequin, R., David, B., Badeau, R.: Score informed audio source separation using a parametric model of non-negative spectrogram. In: ICASSP (2011)Google Scholar
  36. 36.
    Hershey, J.R., Chen, Z., Le Roux, J., Watanabe, S.: Deep clustering: discriminative embeddings for segmentation and separation. In: ICASSP (2016)Google Scholar
  37. 37.
    Hershey, J.R., Movellan, J.R.: Audio vision: using audio-visual synchrony to locate sounds. In: NIPS (2000)Google Scholar
  38. 38.
    Hofmann, T.: Probabilistic latent semantic indexing. In: International ACM SIGIR Conference on Research and Development in Information Retrieval (1999)Google Scholar
  39. 39.
    Huang, P.S., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Deep learning for monaural speech separation. In: ICASSP (2014)Google Scholar
  40. 40.
    Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430 (2000)CrossRefGoogle Scholar
  41. 41.
    Innami, S., Kasai, H.: NMF-based environmental sound source separation using time-variant gain features. Comput. Math. Appl. 64, 1333–1342 (2012)CrossRefGoogle Scholar
  42. 42.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  43. 43.
    Izadinia, H., Saleemi, I., Shah, M.: Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Trans. Multimed. 15, 378–390 (2013)CrossRefGoogle Scholar
  44. 44.
    Jaiswal, R., FitzGerald, D., Barry, D., Coyle, E., Rickard, S.: Clustering NMF basis functions using shifted NMF for monaural sound source separation. In: ICASSP (2011)Google Scholar
  45. 45.
    Jhuo, I.H., Ye, G., Gao, S., Liu, D., Jiang, Y.G., Lee, D., Chang, S.F.: Discovering joint audio-visual codewords for video event detection. Machine Vis. Appl. 25, 33–47 (2014)CrossRefGoogle Scholar
  46. 46.
    Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)Google Scholar
  47. 47.
    Kidron, E., Schechner, Y.Y., Elad, M.: Pixels that sound. In: CVPR (2005)Google Scholar
  48. 48.
    Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  49. 49.
    Korbar, B., Tran, D., Torresani, L.: Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230 (2018)
  50. 50.
    Le Magoarou, L., Ozerov, A., Duong, N.Q.: Text-informed audio source separation. Example-based approach using non-negative matrix partial co-factorization. J. Signal Process. Syst. 79, 117–131 (2015)CrossRefGoogle Scholar
  51. 51.
    Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems (2001)Google Scholar
  52. 52.
    Li, B., Dinesh, K., Duan, Z., Sharma, G.: See and listen: score-informed association of sound tracks to players in chamber music performance videos. In: ICASSP (2017)Google Scholar
  53. 53.
    Li, K., Ye, J., Hua, K.A.: What’s making that sound? In: ACMMM (2014)Google Scholar
  54. 54.
    Liutkus, A., Fitzgerald, D., Rafii, Z., Pardo, B., Daudet, L.: Kernel additive models for source separation. IEEE Trans. Signal Process. 62, 4298–4310 (2014)MathSciNetCrossRefGoogle Scholar
  55. 55.
    Lock, E.F., Hoadley, K.A., Marron, J.S., Nobel, A.B.: Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7(1), 523 (2013)MathSciNetCrossRefGoogle Scholar
  56. 56.
    Nakadai, K., Hidai, K.I., Okuno, H.G., Kitano, H.: Real-time speaker localization and speech separation by audio-visual integration. In: IEEE International Conference on Robotics and Automation (2002)Google Scholar
  57. 57.
    Naphade, M., Smith, J.R., Tesic, J., Chang, S.F., Hsu, W., Kennedy, L., Hauptmann, A., Curtis, J.: Large-scale concept ontology for multimedia. IEEE Multimed. 13, 86–91 (2006)CrossRefGoogle Scholar
  58. 58.
    Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641 (2018)
  59. 59.
    Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR (2016)Google Scholar
  60. 60.
    Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). Scholar
  61. 61.
    Parekh, S., Essid, S., Ozerov, A., Duong, N.Q., Pérez, P., Richard, G.: Motion informed audio source separation. In: ICASSP (2017)Google Scholar
  62. 62.
    Pu, J., Panagakis, Y., Petridis, S., Pantic, M.: Audio-visual object localization and separation using low-rank and sparsity. In: ICASSP (2017)Google Scholar
  63. 63.
    Rahne, T., Böckmann, M., von Specht, H., Sussman, E.S.: Visual cues can modulate integration and segregation of objects in auditory scene analysis. Brain Res. 1144, 127–135 (2007)CrossRefGoogle Scholar
  64. 64.
    Rivet, B., Girin, L., Jutten, C.: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Trans. Audio Speech Lang. Process. 15, 96–108 (2007)CrossRefGoogle Scholar
  65. 65.
    Sedighin, F., Babaie-Zadeh, M., Rivet, B., Jutten, C.: Two multimodal approaches for single microphone source separation. In: 24th European Signal Processing Conference (2016)Google Scholar
  66. 66.
    Simpson, A.J.R., Roma, G., Plumbley, M.D.: Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds.) LVA/ICA 2015. LNCS, vol. 9237, pp. 429–436. Springer, Cham (2015). Scholar
  67. 67.
    Smaragdis, P., Casey, M.: Audio/visual independent components. In: International Conference on Independent Component Analysis and Signal Separation (2003)Google Scholar
  68. 68.
    Smaragdis, P., Raj, B., Shashanka, M.: A probabilistic latent variable model for acoustic modeling. In: NIPS (2006)Google Scholar
  69. 69.
    Smaragdis, P., Raj, B., Shashanka, M.: Supervised and semi-supervised separation of sounds from single-channel mixtures. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 414–421. Springer, Heidelberg (2007). Scholar
  70. 70.
    Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (2006)Google Scholar
  71. 71.
    Snoek, C.G., Worring, M.: Multimodal video indexing: a review of the state-of-the-art. Multimed. Tools Appl. 25, 5–35 (2005)CrossRefGoogle Scholar
  72. 72.
    SPIERTZ, M.: Source-filter based clustering for monaural blind source separation. In: 12th International Conference on Digital Audio Effects (2009)Google Scholar
  73. 73.
    Vijayanarasimhan, S., Grauman, K.: Keywords to visual categories: multiple-instance learning for weakly supervised object categorization. In: CVPR (2008)Google Scholar
  74. 74.
    Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14, 1462–1469 (2006)CrossRefGoogle Scholar
  75. 75.
    Virtanen, T.: Sound source separation using sparse coding with temporal continuity objective. In: International Computer Music Conference (2003)Google Scholar
  76. 76.
    Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15, 1066–1074 (2007)CrossRefGoogle Scholar
  77. 77.
    Wang, B.: Investigating single-channel audio source separation methods based on non-negative matrix factorization. In: ICA Research Network International Workshop (2006)Google Scholar
  78. 78.
    Wang, L., Xiong, Y., Lin, D., Gool, L.V.: Untrimmednets for weakly supervised action recognition and detection. In: CVPR (2017)Google Scholar
  79. 79.
    Wang, Z., et al.: Truly multi-modal YouTube-8m video classification with video, audio, and text. arXiv preprint arXiv:1706.05461 (2017)
  80. 80.
    Wu, J., Yu, Y., Huang, C., Yu, K.: Deep multiple instance learning for image classification and auto-annotation. In: CVPR (2015)Google Scholar
  81. 81.
    Yang, H., Zhou, J.T., Cai, J., Ong, Y.S.: MIML-FCN+: multi-instance multi-label learning via fully convolutional networks with privileged information. In: CVPR (2017)Google Scholar
  82. 82.
    Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52, 1830–1847 (2004)MathSciNetCrossRefGoogle Scholar
  83. 83.
    Zhang, Z., et al.: Generative modeling of audible shapes for object perception. In: ICCV (2017)Google Scholar
  84. 84.
    Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. arXiv preprint arXiv:1804.03160 (2018)
  85. 85.
    Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.L.: Visual to sound: generating natural sound for videos in the wild. arXiv preprint arXiv:1712.01393 (2017)
  86. 86.
    Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computat. 13, 863–882 (2001)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.The University of Texas at AustinAustinUSA
  2. 2.IBM ResearchNew YorkUSA
  3. 3.Facebook AI ResearchMenlo ParkUSA

Personalised recommendations