
Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2019)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11859)

Abstract

Thanks to the development of deep learning, voice-visual cross-modal retrieval has made remarkable progress in recent years. However, two bottlenecks remain: how to establish an effective correlation between voices and images so as to improve retrieval precision, and how to reduce data storage and speed up retrieval on large-scale cross-modal data. In this paper, we propose a novel Voice-Visual Cross-Modal Hashing (V2CMH) method, which generates hash codes with low storage cost and fast retrieval properties. Specifically, the proposed V2CMH method leverages deep feature similarity to establish the semantic relationship between voices and images. In addition, for hash code learning, our method attempts to preserve the semantic similarity of the binary codes and to reduce the information loss incurred when generating them. Experiments illustrate that the V2CMH algorithm achieves better retrieval performance than other state-of-the-art cross-modal retrieval algorithms.

The first author is a student.
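
The abstract names two technical ingredients: a deep-feature similarity that couples voices with images, and a hash-learning objective that preserves that similarity while limiting the information lost at binarization. As a minimal illustrative sketch only (the loss form, the trade-off weight `lam`, and all shapes below are assumptions, not the published V2CMH formulation), the two pieces could look like this in Python:

```python
import numpy as np

def feature_similarity(voice_feats, image_feats):
    """Pairwise cosine similarity between deep voice features (n, d)
    and deep image features (n, d); entries lie in [-1, 1]."""
    v = voice_feats / np.linalg.norm(voice_feats, axis=1, keepdims=True)
    i = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return v @ i.T

def hashing_loss(voice_codes, image_codes, sim, lam=0.1):
    """Similarity-preserving term plus a quantization penalty.
    voice_codes, image_codes: relaxed codes in (-1, 1), shape (n, k).
    sim: target similarity matrix from feature_similarity, shape (n, n).
    lam: assumed trade-off weight (hypothetical)."""
    k = voice_codes.shape[1]
    agreement = (voice_codes @ image_codes.T) / k   # code-level similarity in (-1, 1)
    sim_term = np.mean((agreement - sim) ** 2)      # keep code similarity close to sim
    quant_term = (np.mean((np.sign(voice_codes) - voice_codes) ** 2)
                  + np.mean((np.sign(image_codes) - image_codes) ** 2))
    return sim_term + lam * quant_term
```

Minimizing the quantization term pushes the relaxed codes toward the binary vertices {-1, +1}, so little information is lost when the sign function finally produces the hash codes used for storage and Hamming-distance retrieval.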


Notes

  1. https://github.com/fchollet/keras.
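
Since this footnote points to Keras as the implementation framework, a two-branch hashing network of the kind the abstract describes might be sketched as follows; the layer widths, 64-bit code length, and input feature dimensions are illustrative assumptions, not the authors' published architecture:

```python
from tensorflow import keras

CODE_LEN = 64  # assumed hash code length

def make_branch(input_dim, name):
    """One modality branch mapping deep features to relaxed codes in (-1, 1).
    The tanh output keeps activations near the binary vertices, so taking
    the sign at retrieval time loses little information."""
    feats = keras.Input(shape=(input_dim,), name=f"{name}_features")
    x = keras.layers.Dense(1024, activation="relu")(feats)
    x = keras.layers.Dense(512, activation="relu")(x)
    codes = keras.layers.Dense(CODE_LEN, activation="tanh", name=f"{name}_codes")(x)
    return keras.Model(feats, codes, name=f"{name}_branch")

voice_branch = make_branch(input_dim=4096, name="voice")  # assumed dimensions
image_branch = make_branch(input_dim=2048, name="image")
```

At query time each branch's output would be binarized with the sign function and compared by Hamming distance, which is the source of the low storage cost and fast retrieval the abstract claims.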


Acknowledgments

We thank all the reviewers and area chairs. This work was supported in part by the National Key R&D Program of China under Grant 2017YFB0502900, in part by the National Natural Science Foundation of China under Grant 61702498, and in part by the CAS "Light of West China" Program under Grant XAB2017B15. In addition, Y. Chen especially wishes to thank and bless B. Fei on the sixth of August in the lunar calendar.

Author information

Correspondence to Yachuang Feng.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, Y., Lu, X., Feng, Y. (2019). Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning. In: Lin, Z., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol. 11859. Springer, Cham. https://doi.org/10.1007/978-3-030-31726-3_39


  • DOI: https://doi.org/10.1007/978-3-030-31726-3_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31725-6

  • Online ISBN: 978-3-030-31726-3

  • eBook Packages: Computer Science (R0)
