SoundSpaces: Audio-Visual Navigation in 3D Environments

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)


Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf—restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception.
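To make the abstract's "multi-modal deep reinforcement learning" concrete, below is a minimal sketch of how such a policy might fuse egocentric visual and audio observations into a distribution over navigation actions. This is a hypothetical illustration, not the authors' actual architecture: the linear-ReLU encoders stand in for the CNN encoders trained end-to-end in the paper, the feature dimensions are arbitrary, and the four discrete actions (forward, turn left, turn right, stop) follow common navigation setups.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    # Linear map + ReLU, a stand-in for a learned CNN encoder.
    return np.maximum(x @ w, 0.0)

# Hypothetical flattened observations: an RGB-D frame embedding and a
# binaural audio spectrogram embedding, fused by concatenation.
visual_obs = rng.normal(size=16)
audio_obs = rng.normal(size=8)

# Randomly initialized weights; in training these would be optimized
# end-to-end with a policy-gradient method.
w_v = rng.normal(size=(16, 32))   # visual encoder
w_a = rng.normal(size=(8, 32))    # audio encoder
w_pi = rng.normal(size=(64, 4))   # policy head over 4 discrete actions

fused = np.concatenate([encode(visual_obs, w_v), encode(audio_obs, w_a)])
logits = fused @ w_pi

# Softmax over action logits yields the navigation policy.
policy = np.exp(logits - logits.max())
policy /= policy.sum()

print(policy.shape)  # (4,): a probability distribution over actions
```

In an actual agent, this forward pass would run at every timestep inside an RNN-based actor-critic loop, with the fused features also feeding a value head.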



UT Austin is supported in part by DARPA Lifelong Learning Machines. We thank Alexander Schwing, Dhruv Batra, Erik Wijmans, Oleksandr Maksymets, Ruohan Gao, and Svetlana Lazebnik for valuable discussions and support with the AI-Habitat platform.

Supplementary material

Supplementary material 1: 504443_1_En_2_MOESM1_ESM.pdf (5.7 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. UT Austin, Austin, USA
  2. UIUC, Champaign, USA
  3. Facebook Reality Labs, Pittsburgh, USA
  4. Facebook AI Research, Pittsburgh, USA
