
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12349)

Abstract

Humans robustly recognize and localize objects by integrating visual and auditory cues. While machines can already do this with images, far less work has addressed sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision ‘teacher’ method and a sound ‘student’ method, where the student is trained to generate the same results as the teacher. This way, the auditory system can be trained without human annotations. We also propose two auxiliary tasks: a) a novel task of spatial sound super-resolution, which increases the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network to boost the overall performance. Experimental results on the dataset show that 1) our method achieves good results for all three tasks; 2) the three tasks are mutually beneficial, with joint training achieving the best performance; and 3) the number and the orientations of microphones are both important. The data and code will be released on the project page.
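
The cross-modal distillation and multi-task setup described above can be summarized in a short sketch. The following is a minimal, illustrative PyTorch example, not the authors' implementation: `AudioStudent`, its layer sizes, the equal loss weights, and all tensor shapes are hypothetical. It assumes a vision teacher (e.g., semantic segmentation and depth networks run on the 360° image) has already produced per-pixel pseudo-labels, and shows how an audio student operating on binaural spectrograms could be trained to mimic them while also regressing additional microphone channels for spatial sound super-resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStudent(nn.Module):
    """Hypothetical audio 'student': encodes binaural spectrograms and predicts
    (a) semantic masks, (b) depth, and (c) the spectrograms of additional
    microphone channels (the spatial sound super-resolution auxiliary task)."""
    def __init__(self, in_channels=2, num_classes=3, num_extra_mics=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Three task-specific heads share one audio encoder (multi-task learning).
        self.seg_head = nn.Conv2d(64, num_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.s3r_head = nn.Conv2d(64, num_extra_mics, 1)

    def forward(self, spec):
        feats = self.encoder(spec)
        return self.seg_head(feats), self.depth_head(feats), self.s3r_head(feats)

def distillation_step(student, optimizer, spec, teacher_seg, teacher_depth, target_spec):
    """One training step: the audio student mimics the vision teacher's outputs,
    so no human annotations are required."""
    seg_logits, depth, s3r = student(spec)
    # Upsample predictions to the resolution of the corresponding targets.
    seg_logits = F.interpolate(seg_logits, size=teacher_seg.shape[-2:], mode='bilinear', align_corners=False)
    depth = F.interpolate(depth, size=teacher_depth.shape[-2:], mode='bilinear', align_corners=False)
    s3r = F.interpolate(s3r, size=target_spec.shape[-2:], mode='bilinear', align_corners=False)
    loss = (F.cross_entropy(seg_logits, teacher_seg)   # distill teacher's semantic labels
            + F.l1_loss(depth, teacher_depth)          # auxiliary dense depth prediction
            + F.mse_loss(s3r, target_spec))            # spatial sound super-resolution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random tensors (all shapes are illustrative only):
student = AudioStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
spec = torch.randn(2, 2, 128, 128)               # binaural log-spectrograms
teacher_seg = torch.randint(0, 3, (2, 64, 64))   # teacher's per-pixel class labels
teacher_depth = torch.rand(2, 1, 64, 64)         # teacher's depth map
target_spec = torch.randn(2, 6, 64, 64)          # spectrograms of held-out microphones
print(distillation_step(student, optimizer, spec, teacher_seg, teacher_depth, target_spec))
```

Summing the three losses with equal weights is only a placeholder; the abstract's point is that the semantic, depth, and super-resolution heads are mutually beneficial when trained jointly, and in practice the loss weights would be tuned.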



Acknowledgement

This work is funded by Toyota Motor Europe via the research project TRACE-Zurich. We would like to thank Danda Pani Paudel, Suryansh Kumar and Vaishakh Patil for helpful discussions.

Author information


Corresponding author

Correspondence to Arun Balajee Vasudevan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 67124 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Vasudevan, A.B., Dai, D., Van Gool, L. (2020). Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12349. Springer, Cham. https://doi.org/10.1007/978-3-030-58548-8_37


  • DOI: https://doi.org/10.1007/978-3-030-58548-8_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58547-1

  • Online ISBN: 978-3-030-58548-8

  • eBook Packages: Computer Science, Computer Science (R0)
