
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

  • David Harwath
  • Adrià Recasens
  • Dídac Surís
  • Galen Chuang
  • Antonio Torralba
  • James Glass
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20K datasets, demonstrating that our models implicitly learn semantically coupled object and word detectors.
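For readers who want a concrete picture of the training setup described above, the following is a minimal PyTorch sketch of a two-branch audio-visual embedding model trained with a retrieval-style ranking objective. The layer configurations, the spectrogram input to the audio branch, the pooled image-location-by-audio-frame similarity, the within-batch negative sampling, and the margin value are illustrative assumptions, not the authors' exact architecture or hyperparameters.

```python
# Minimal sketch (illustrative, not the paper's exact architecture) of a
# two-branch model that embeds images and spoken captions into a shared space
# and is trained with a retrieval-style margin ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBranch(nn.Module):
    """Maps an RGB image to a spatial grid of D-dimensional embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )

    def forward(self, img):          # img: (B, 3, H, W)
        return self.conv(img)        # (B, D, H', W')

class AudioBranch(nn.Module):
    """Maps a speech spectrogram to a temporal sequence of D-dim embeddings."""
    def __init__(self, dim=512, n_mels=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, dim, 5, stride=2, padding=2),
        )

    def forward(self, spec):         # spec: (B, n_mels, T)
        return self.conv(spec)       # (B, D, T')

def pair_similarity(img_feat, aud_feat):
    """Score one image/caption pair: dot products between every image location
    and every audio frame, averaged over the resulting (H'*W') x T' grid."""
    img = img_feat.flatten(2)                            # (B, D, H'*W')
    sim = torch.einsum('bdi,bdt->bit', img, aud_feat)    # (B, H'*W', T')
    return sim.mean(dim=(1, 2))                          # (B,)

def retrieval_loss(img_feat, aud_feat, margin=1.0):
    """Margin ranking loss: a matched pair should outscore mismatched pairs.
    Negatives are formed by shifting the batch, so batch size must be > 1."""
    pos = pair_similarity(img_feat, aud_feat)
    neg_img = pair_similarity(img_feat.roll(1, dims=0), aud_feat)
    neg_aud = pair_similarity(img_feat, aud_feat.roll(1, dims=0))
    return (F.relu(margin - pos + neg_img) +
            F.relu(margin - pos + neg_aud)).mean()
```

In a model of this kind, the unpooled similarity volume (image locations by audio frames) is where the localizations described in the abstract can be read off: image regions and caption segments whose embeddings score highly against one another behave like coupled object and word detectors, even though the only training signal is whether an image and a caption belong together.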

Keywords

Vision and language · Sound · Speech · Convolutional networks · Multimodal learning · Unsupervised learning

Acknowledgments

The authors would like to thank Toyota Research Institute, Inc. for supporting this work.

Supplementary material

Supplementary material 1: 474211_1_En_40_MOESM1_ESM.zip (ZIP, 62.5 MB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • David Harwath¹
  • Adrià Recasens¹
  • Dídac Surís¹
  • Galen Chuang¹
  • Antonio Torralba¹
  • James Glass¹

  1. Massachusetts Institute of Technology, Cambridge, USA
