VisualEchoes: Spatial Image Representation Learning Through Echolocation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)


Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First, we capture echo responses in photo-realistic 3D indoor scene environments. Then, we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning—monocular depth estimation, surface normal estimation, and visual navigation—with results comparable to, or even better than, heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
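The interaction-based pretext task described above can be sketched in code: an agent's image and the echo it receives are encoded separately, fused, and used to predict a spatial property of the interaction (here, which of a few candidate orientations the echo corresponds to). This is a minimal, heavily simplified sketch — the encoder weights, dimensions, and the orientation-classification framing are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: flattened image features, echo spectrogram
# features, fused feature size, and number of candidate orientations.
IMG_DIM, ECHO_DIM, FEAT, N_ORIENT = 128, 64, 32, 4

# Random linear projections stand in for learned CNN encoders.
w_img = rng.normal(size=(IMG_DIM, FEAT)) * 0.1
w_echo = rng.normal(size=(ECHO_DIM, FEAT)) * 0.1
w_cls = rng.normal(size=(2 * FEAT, N_ORIENT)) * 0.1


def encode(x, w):
    # Linear projection followed by ReLU: a stand-in for a deep encoder.
    return np.maximum(x @ w, 0.0)


def predict_orientation(image_vec, echo_vec):
    """Fuse visual and echo features, then score candidate orientations.

    Training such a classifier end-to-end forces the image encoder to
    capture spatial cues (room layout, distances) implied by the echo.
    """
    fused = np.concatenate([encode(image_vec, w_img),
                            encode(echo_vec, w_echo)])
    logits = fused @ w_cls
    return int(np.argmax(logits))


# One forward pass on random stand-in inputs.
image = rng.normal(size=IMG_DIM)
echo = rng.normal(size=ECHO_DIM)
orientation = predict_orientation(image, echo)
assert 0 <= orientation < N_ORIENT
```

After pre-training on such a task, the image encoder (here, `w_img`) would be kept and fine-tuned for downstream spatial tasks such as depth or surface normal estimation.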



UT Austin is supported in part by DARPA Lifelong Learning Machines and an ONR PECASE award. RG is supported by a Google PhD Fellowship and an Adobe Research Fellowship.

Supplementary material

Supplementary material 1 (PDF, 3.4 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
  2. Facebook Reality Lab, Seattle, USA
  3. Facebook AI Research, Austin, USA
