Abstract
Thanks to the development of deep learning, voice-visual cross-modal retrieval has made remarkable progress in recent years. However, two bottlenecks remain: how to establish effective correlations between voices and images to improve retrieval precision, and how to reduce data storage and speed up retrieval over large-scale cross-modal data. In this paper, we propose a novel Voice-Visual Cross-Modal Hashing (V2CMH) method, which generates hash codes with low storage cost and fast retrieval properties. Specifically, the proposed V2CMH method leverages deep feature similarity to establish the semantic relationship between voices and images. In addition, for hash code learning, our method attempts to preserve the semantic similarity of the binary codes and to reduce the information loss incurred when generating them. Experiments show that the V2CMH algorithm achieves better retrieval performance than other state-of-the-art cross-modal retrieval algorithms.
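The storage and speed advantages claimed above come from replacing real-valued embeddings with short binary codes compared by Hamming distance. The sketch below is not the paper's method; it is a minimal, hypothetical illustration of the general hashing-retrieval idea: embeddings (here random stand-ins for deep voice/image features) are binarized with the sign function, and retrieval ranks database codes by Hamming distance to the query code.

```python
import numpy as np

def binarize(features):
    """Binarize real-valued embeddings into +/-1 hash codes via the sign function."""
    return np.where(features >= 0, 1, -1).astype(np.int8)

def hamming_rank(query_code, db_codes):
    """Rank database codes by Hamming distance to the query code.

    For +/-1 codes of length k, Hamming distance = (k - <q, d>) / 2,
    so ranking reduces to a single integer matrix-vector product."""
    k = query_code.shape[0]
    dists = (k - db_codes.astype(int) @ query_code.astype(int)) // 2
    return np.argsort(dists), dists

# Hypothetical 64-dimensional embeddings standing in for deep features.
rng = np.random.default_rng(0)
voice_emb = rng.normal(size=64)                      # query (voice) embedding
image_embs = rng.normal(size=(5, 64))                # database (image) embeddings
image_embs[2] = voice_emb + 0.1 * rng.normal(size=64)  # make item 2 semantically close

q = binarize(voice_emb)
db = binarize(image_embs)
order, dists = hamming_rank(q, db)
print(order[0])  # the near-duplicate item 2 should rank first
```

Because the codes are binary, each 64-bit code fits in a single machine word and the distance computation can be done with XOR and popcount in a production system, which is the source of the "low storage, fast retrieval" property.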
The first author is a student.
Acknowledgments
We thank all the reviewers and ACs. This work was supported in part by the National Key R&D Program of China under Grant 2017YFB0502900, in part by the National Natural Science Foundation of China under Grant 61702498, and in part by the CAS "Light of West China" Program under Grant XAB2017B15. In addition, Y. Chen especially wishes to thank and bless B. Fei on August sixth in the lunar calendar.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Y., Lu, X., Feng, Y. (2019). Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science(), vol 11859. Springer, Cham. https://doi.org/10.1007/978-3-030-31726-3_39
Print ISBN: 978-3-030-31725-6
Online ISBN: 978-3-030-31726-3
eBook Packages: Computer Science (R0)