
Double-scale similarity with rich features for cross-modal retrieval


Abstract

This paper proposes a method named Double-scale Similarity with Rich Features for Cross-modal Retrieval (DSRF) to handle the retrieval task between images and texts. The main difficulties of cross-modal retrieval lie in establishing a good similarity metric and obtaining rich, accurate semantic features. Most existing approaches map data from different modalities into a common space using only category labels and pairwise relations, which is insufficient to model the complex semantic relationships of multimodal data. A new similarity measure (double-scale similarity) is proposed, in which the similarity of multimodal data depends not only on category labels but also on the objects involved: retrieval results from the same category that do not contain the same objects are penalized appropriately, while correct results are pulled closer to the query. Moreover, a semantic feature extraction framework is designed to provide rich semantic features for the similarity metric. Multiple attention maps are created to focus on local features from different perspectives and to obtain numerous semantic features. Unlike other works that accumulate multiple semantic representations and average them, we use an LSTM with only a forget gate to eliminate redundant, repetitive information. Specifically, a forgetting factor is generated for each semantic feature, and a larger forgetting factor removes useless semantic information. We evaluate DSRF on two public benchmarks, where it achieves competitive performance.
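The full text is not included in this preview, so the sketch below is only an illustrative reading of the two ideas the abstract describes (a forget-gate-only fusion of multiple semantic features, and an object-aware penalty in the double-scale similarity); it is not the authors' implementation. All names, tensor shapes, the fusion rule, and the fixed penalty value are assumptions.

```python
# Minimal sketch, assuming K semantic features per sample and a fixed penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ForgetGateFusion(nn.Module):
    """Fuse K semantic feature vectors with a forget-gate-only recurrence.

    Instead of averaging the K representations, each step computes a
    forgetting factor for the previous state versus the incoming feature;
    a large factor discards redundant information (assumed formulation).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.forget = nn.Linear(2 * dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, K, dim) -- one semantic feature per attention map
        state = feats[:, 0, :]
        for k in range(1, feats.size(1)):
            x = feats[:, k, :]
            f = torch.sigmoid(self.forget(torch.cat([state, x], dim=-1)))
            # keep (1 - f) of the previous state, admit f of the new feature
            state = (1.0 - f) * state + f * x
        return state


def double_scale_similarity(img_emb, txt_emb, same_category, same_objects,
                            penalty=0.2):
    """Cosine similarity adjusted at two scales (assumed form).

    Pairs that share the category but not the objects are penalized, so
    only results containing the queried objects stay closest to the query.
    """
    sim = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    mismatch = same_category & ~same_objects
    return torch.where(mismatch, sim - penalty, sim)
```

In training, such a penalized similarity would typically feed a ranking loss, so that correct image-text pairs are ranked above same-category distractors that lack the queried objects.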



Author information


Corresponding author

Correspondence to Dexin Zhao.

Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest with any individual or organization regarding the present work.

Additional information

Communicated by B.-K. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, K., Wang, H. & Zhao, D. Double-scale similarity with rich features for cross-modal retrieval. Multimedia Systems 28, 1767–1777 (2022). https://doi.org/10.1007/s00530-022-00933-7
