Cross-modal Embeddings for Video and Audio Retrieval

  • Didac Surís
  • Amanda Duarte
  • Amaia Salvador
  • Jordi Torres
  • Xavier Giró-i-Nieto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting its audio and visual features into a common feature space, obtaining joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results, reported in terms of Recall@K over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning.
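
As a rough illustration of the Recall@K measure mentioned above, the following sketch ranks audio embeddings against video queries in a shared space and counts how often the true pair lands in the top K. It is not the authors' implementation: the use of cosine similarity, the function names cosine and recall_at_k, and the toy two-dimensional embeddings are all assumptions made for the example. The only convention it relies on is that query i and candidate i come from the same video.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Candidate indices sorted from most to least similar to the query.
        ranked = sorted(range(len(candidates)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical joint embeddings: row i of each list comes from the same video.
video_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

for k in (1, 2, 3):
    print(f"Recall@{k} (video -> audio): {recall_at_k(video_emb, audio_emb, k):.3f}")
```

Swapping the two arguments gives the opposite retrieval direction (audio query, visual candidates). With real data the embeddings would come from the learned projections rather than hand-written lists.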


Cross-modal Retrieval YouTube-8M 



This work was partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (ERDF) under contract TEC2016-75976-R. Amanda Duarte was funded by the mobility grant of the Severo Ochoa Program at Barcelona Supercomputing Center (BSC-CNS).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Didac Surís (1)
  • Amanda Duarte (1, 2)
  • Amaia Salvador (1)
  • Jordi Torres (1, 2)
  • Xavier Giró-i-Nieto (1, 2)

  1. Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
  2. Barcelona Supercomputing Center (BSC), Barcelona, Spain
