Ambient Sound Provides Supervision for Visual Learning

  • Andrew Owens
  • Jiajun Wu
  • Josh H. McDermott
  • William T. Freeman
  • Antonio Torralba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9905)


The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
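The abstract describes training a convolutional network to predict a statistical summary of a video frame's audio track. As a rough, illustrative sketch of what such a fixed-length summary target might look like (this is a hypothetical simplification, not the authors' actual sound-texture features, which follow McDermott and Simoncelli's auditory model), one can band-filter the waveform and summarize each band's envelope with a few moments:

```python
import numpy as np

def sound_summary(waveform, n_bands=32, frame=512):
    """Crude stand-in for a sound-texture summary: compute short-time
    spectral magnitudes, pool them into coarse frequency bands, and
    summarize each band's log-envelope with its mean and standard
    deviation. (Hypothetical simplification of the paper's features.)"""
    # Short-time Fourier magnitudes over non-overlapping windowed frames
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    # Pool FFT bins into n_bands coarse frequency bands
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    bands = np.stack(
        [spec[:, a:b].mean(axis=1) for a, b in zip(edges[:-1], edges[1:])],
        axis=1,
    )
    env = np.log1p(bands)  # compressed band envelopes, shape (n_frames, n_bands)
    # Summary statistics per band, concatenated into one fixed-length vector
    return np.concatenate([env.mean(axis=0), env.std(axis=0)])

rng = np.random.default_rng(0)
audio = rng.standard_normal(21000)  # one second of dummy audio
target = sound_summary(audio)
print(target.shape)  # (64,): two statistics per band
```

In the paper's setup, summaries like this are computed from the audio alone; the CNN sees only the video frame and is trained to predict (in their case, to classify a clustering of) the corresponding summary, so that visual features predictive of ambient sound emerge without manual labels.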


Keywords: Sound · Convolutional networks · Unsupervised learning



Acknowledgments

This work was supported by NSF grant #1524817 to A.T.; NSF grants #1447476 and #1212849 to W.T.F.; a McDonnell Scholar Award to J.H.M.; and a Microsoft Ph.D. Fellowship to A.O. It was also supported by Shell Research, and by a donation of GPUs from NVIDIA. We thank Phillip Isola and Carl Vondrick for helpful discussions, and the anonymous reviewers for their comments (in particular, for suggesting the comparison with texton features in Sect. 4.2).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Andrew Owens (1)
  • Jiajun Wu (1)
  • Josh H. McDermott (1)
  • William T. Freeman (1, 2)
  • Antonio Torralba (1)

  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Google Research, Cambridge, USA
