
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)

Abstract

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can now do the same with images, far less work has addressed sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision ‘teacher’ method and a sound ‘student’ method: the student is trained to generate the same results as the teacher, so the auditory system can be trained without human annotations. We also propose two auxiliary tasks, namely a) a novel Spatial Sound Super-Resolution task to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-task network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves good results for all three tasks; 2) the three tasks are mutually beneficial, as training them together achieves the best performance; and 3) the number and the orientations of the microphones are both important. The data and code will be released on the project page.
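To make the supervision-transfer idea concrete, the following minimal sketch (PyTorch-style Python, with hypothetical module and variable names; not the authors' released code) shows one training step of such a cross-modal distillation loop: a frozen vision teacher produces semantic pseudo-labels from the 360° frame, and the audio student is trained to reproduce them from binaural spectrograms alone.

    # Minimal sketch of cross-modal distillation, assuming PyTorch-style
    # teacher/student modules; names below are illustrative placeholders.
    import torch
    import torch.nn.functional as F

    def distillation_step(vision_teacher, audio_student, optimizer, frame, spectrogram):
        """One training step: the audio student mimics the vision teacher,
        so no human annotations are required."""
        with torch.no_grad():                              # teacher stays frozen
            teacher_logits = vision_teacher(frame)         # (B, C, H, W) semantic logits
        pseudo_labels = teacher_logits.argmax(dim=1)       # (B, H, W) pseudo ground truth
        student_logits = audio_student(spectrogram)        # prediction from sound only
        loss = F.cross_entropy(student_logits, pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In the full setup described above, this distillation loss would be combined with the spatial sound super-resolution and dense depth losses in a single multi-task objective; the sketch shows only the distillation term.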

Notes

Acknowledgement

This work is funded by Toyota Motor Europe via the research project TRACE-Zurich. We would like to thank Danda Pani Paudel, Suryansh Kumar and Vaishakh Patil for helpful discussions.

Supplementary material

Supplementary material 1: 504439_1_En_37_MOESM1_ESM.zip (65.6 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. CVL, ETH Zurich, Zurich, Switzerland
  2. PSI, KU Leuven, Leuven, Belgium
