
Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2019)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11859)

Abstract

Thanks to the development of deep learning, voice-visual cross-modal retrieval has made remarkable progress in recent years. However, two bottlenecks remain: how to establish an effective correlation between voices and images so as to improve retrieval precision, and how to reduce data storage and speed up retrieval on large-scale cross-modal data. In this paper, we propose a novel Voice-Visual Cross-Modal Hashing (V2CMH) method, which generates hash codes with low storage cost and fast retrieval properties. Specifically, the proposed V2CMH method leverages deep feature similarity to establish the semantic relationship between voices and images. In addition, for hash code learning, our method attempts to preserve the semantic similarity of the binary codes and to reduce the information loss incurred when generating them. Experiments illustrate that the V2CMH algorithm achieves better retrieval performance than other state-of-the-art cross-modal retrieval algorithms.

The first author is a student.
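
The abstract names two technical ingredients: a deep-feature similarity that couples voices with images, and a hash-learning objective that preserves that similarity while limiting the information lost at binarization. As a minimal illustrative sketch only (the loss form, the trade-off weight `lam`, and all shapes below are assumptions, not the published V2CMH formulation), the two pieces could look like this in Python:

```python
import numpy as np

def feature_similarity(voice_feats, image_feats):
    """Pairwise cosine similarity between deep voice features (n, d)
    and deep image features (n, d); entries lie in [-1, 1]."""
    v = voice_feats / np.linalg.norm(voice_feats, axis=1, keepdims=True)
    i = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return v @ i.T

def hashing_loss(voice_codes, image_codes, sim, lam=0.1):
    """Similarity-preserving term plus a quantization penalty.
    voice_codes, image_codes: relaxed codes in (-1, 1), shape (n, k).
    sim: target similarity matrix from feature_similarity, shape (n, n).
    lam: assumed trade-off weight (hypothetical)."""
    k = voice_codes.shape[1]
    agreement = (voice_codes @ image_codes.T) / k   # code-level similarity in (-1, 1)
    sim_term = np.mean((agreement - sim) ** 2)      # keep code similarity close to sim
    quant_term = (np.mean((np.sign(voice_codes) - voice_codes) ** 2)
                  + np.mean((np.sign(image_codes) - image_codes) ** 2))
    return sim_term + lam * quant_term
```

Minimizing the quantization term pushes the relaxed codes toward the binary vertices {-1, +1}, so little information is lost when the sign function finally produces the hash codes used for storage and Hamming-distance retrieval.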


Notes

  1. https://github.com/fchollet/keras.
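
Since this footnote points to Keras as the implementation framework, a two-branch hashing network of the kind the abstract describes might be sketched as follows; the layer widths, 64-bit code length, and input feature dimensions are illustrative assumptions, not the authors' published architecture:

```python
from tensorflow import keras

CODE_LEN = 64  # assumed hash code length

def make_branch(input_dim, name):
    """One modality branch mapping deep features to relaxed codes in (-1, 1).
    The tanh output keeps activations near the binary vertices, so taking
    the sign at retrieval time loses little information."""
    feats = keras.Input(shape=(input_dim,), name=f"{name}_features")
    x = keras.layers.Dense(1024, activation="relu")(feats)
    x = keras.layers.Dense(512, activation="relu")(x)
    codes = keras.layers.Dense(CODE_LEN, activation="tanh", name=f"{name}_codes")(x)
    return keras.Model(feats, codes, name=f"{name}_branch")

voice_branch = make_branch(input_dim=4096, name="voice")  # assumed dimensions
image_branch = make_branch(input_dim=2048, name="image")
```

At query time each branch's output would be binarized with the sign function and compared by Hamming distance, which is the source of the low storage cost and fast retrieval the abstract claims.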


Acknowledgments

We thank all the reviewers and area chairs. This work was supported in part by the National Key R&D Program of China under Grant 2017YFB0502900, in part by the National Natural Science Foundation of China under Grant 61702498, and in part by the CAS "Light of West China" Program under Grant XAB2017B15. In addition, Y. Chen especially wishes to thank and bless B. Fei on the sixth of August in the lunar calendar.

Author information

Correspondence to Yachuang Feng.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, Y., Lu, X., Feng, Y. (2019). Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning. In: Lin, Z., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol. 11859. Springer, Cham. https://doi.org/10.1007/978-3-030-31726-3_39


  • DOI: https://doi.org/10.1007/978-3-030-31726-3_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31725-6

  • Online ISBN: 978-3-030-31726-3

  • eBook Packages: Computer Science (R0)
