Abstract
Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can now do the same with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, based purely on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360\(^{\circ }\) camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision ‘teacher’ method and a sound ‘student’ method: the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks, namely a) a novel task on Spatial Sound Super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves good results for all three tasks; 2) the three tasks are mutually beneficial, as training them together achieves the best performance; and 3) the number and the orientations of microphones are both important. The data and code will be released on the project page.
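As a rough illustration of the supervision transfer described in the abstract, the sketch below scores a sound 'student' prediction against pseudo-labels from a vision 'teacher' with a per-pixel cross-entropy, then adds the two auxiliary terms into a single multi-task objective. It is a minimal sketch, not the authors' implementation: the function names, the L1 depth term, the L2 super-resolution term, and the weights `w_depth` and `w_audio` are all assumptions made for illustration, since the abstract does not specify the loss forms.

```python
import numpy as np


def pixelwise_cross_entropy(student_logits, teacher_labels):
    """Mean per-pixel cross-entropy between student logits and teacher labels.

    student_logits: float array of shape (C, H, W), produced from audio.
    teacher_labels: int array of shape (H, W) with values in [0, C),
                    produced by a vision model (no human annotation).
    """
    # Numerically stable log-softmax over the class axis.
    shifted = student_logits - student_logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))

    # Pick the log-probability of the teacher's class at every pixel.
    h, w = teacher_labels.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    picked = log_probs[teacher_labels, rows, cols]
    return float(-picked.mean())


def multitask_loss(seg_logits, teacher_labels,
                   depth_pred, depth_target,
                   audio_pred, audio_target,
                   w_depth=1.0, w_audio=1.0):
    """Weighted sum of the three task losses (loss forms are assumptions).

    - semantic term: cross-entropy against teacher pseudo-labels
    - depth term: mean absolute error (assumed)
    - spatial sound super-resolution term: mean squared error (assumed)
    """
    seg = pixelwise_cross_entropy(seg_logits, teacher_labels)
    depth = float(np.abs(depth_pred - depth_target).mean())
    audio = float(((audio_pred - audio_target) ** 2).mean())
    return seg + w_depth * depth + w_audio * audio
```

In this sketch the teacher's output is treated as a fixed target, so gradients would flow only into the student; how the three terms are actually balanced is left to the hypothetical weights.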
Acknowledgement
This work is funded by Toyota Motor Europe via the research project TRACE-Zurich. We would like to thank Danda Pani Paudel, Suryansh Kumar and Vaishakh Patil for helpful discussions.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vasudevan, A.B., Dai, D., Van Gool, L. (2020). Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12349. Springer, Cham. https://doi.org/10.1007/978-3-030-58548-8_37
DOI: https://doi.org/10.1007/978-3-030-58548-8_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58547-1
Online ISBN: 978-3-030-58548-8
eBook Packages: Computer Science, Computer Science (R0)