
Leveraging Acoustic Images for Effective Self-supervised Audio Representation Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12367)

Abstract

In this paper, we propose the use of a new modality with richer information content, namely acoustic images, for audio-visual scene understanding. Each pixel of such an image is characterized by a spectral signature, associated with a specific direction in space and obtained by processing the audio signals coming from an array of microphones. By coupling such an array with a video camera, we obtain spatio-temporal alignment of acoustic images and video frames. This constitutes a powerful source of self-supervision, which the proposed learning pipeline exploits without resorting to expensive data annotations. However, since 2D planar arrays are cumbersome and not as widespread as ordinary microphones, we propose to distill the richer information content of acoustic images, through a self-supervised learning scheme, into more powerful audio and visual feature representations. The learnt representations can then be employed for downstream tasks such as classification and cross-modal retrieval, without the need for a microphone array. To validate this, we introduce a novel multimodal dataset consisting of RGB videos, raw audio signals, and acoustic images, aligned in space and synchronized in time. Experimental results demonstrate the validity of our hypothesis and the effectiveness of the proposed pipeline, even when tested on tasks and datasets different from those used for training.
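To make the acoustic-image modality concrete, the following is a minimal sketch (not the authors' implementation) of how such an image can be formed from a planar microphone array with frequency-domain delay-and-sum beamforming: each pixel corresponds to a look direction, and its value is the magnitude spectrum of the signal beamformed towards that direction. Array geometry, grid resolution, and signal shapes are illustrative assumptions.

import numpy as np

def acoustic_image(signals, mic_positions, directions, fs, c=343.0):
    """
    signals:       (n_mics, n_samples) time-domain microphone signals
    mic_positions: (n_mics, 3) microphone coordinates in metres
    directions:    (H, W, 3) unit look-direction vectors, one per pixel
    fs:            sampling rate in Hz
    Returns an (H, W, n_freqs) acoustic image: one spectral signature per pixel.
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)           # (n_mics, n_freqs)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)   # (n_freqs,)

    H, W, _ = directions.shape
    image = np.zeros((H, W, freqs.size))
    for i in range(H):
        for j in range(W):
            d = directions[i, j]
            # Geometric delay of each microphone for a plane wave from d.
            delays = mic_positions @ d / c           # (n_mics,) seconds
            # Phase-align the spectra towards the look direction, then
            # sum coherently (delay-and-sum in the frequency domain).
            steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
            beam = (spectra * steering).sum(axis=0) / n_mics
            image[i, j] = np.abs(beam)               # spectral signature
    return image

Likewise, a sketch of the kind of self-supervised objective such a pipeline can use for distillation: embeddings of audio, video, and acoustic images from the same clip are pulled together, while embeddings from different clips are pushed apart. This is a standard triplet formulation under assumed feature extractors; the tensor names and margin below are illustrative, not the paper's exact loss.

import torch
import torch.nn.functional as F

def alignment_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on L2-normalised embeddings of shape (batch, dim)."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    d_pos = (anchor - positive).pow(2).sum(dim=1)    # matched-clip distance
    d_neg = (anchor - negative).pow(2).sum(dim=1)    # mismatched-clip distance
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical usage: the acoustic-image embedding acts as the teacher
# anchor, with audio embeddings from the same / a different clip as
# positive / negative (random tensors stand in for real features).
acimg = torch.randn(8, 128)
audio_pos = torch.randn(8, 128)
audio_neg = torch.randn(8, 128)
loss = alignment_loss(acimg, audio_pos, audio_neg)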

Keywords

Audio-visual representations · Acoustic images · Audio- and video-based classification · Cross-modal retrieval · Self-supervised learning

Supplementary material

Supplementary material 1 (pdf 14053 KB)

Supplementary material 2 (mp4 214 KB)

Supplementary material 3 (mp4 122 KB)

Supplementary material 4 (mp4 447 KB)

Supplementary material 5 (mp4 461 KB)

Supplementary material 6 (mp4 460 KB)

Supplementary material 7 (mp4 344 KB)

Supplementary material 8 (mp4 306 KB)

Supplementary material 9 (mp4 307 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy
  2. University of Genova, Genoa, Italy
  3. Huawei Technologies Ltd., Ireland Research Center, Dublin, Ireland
  4. University of Verona, Verona, Italy
