Cross-Media and Multilingual Image Understanding Method Based on Attention Mechanism

  • Conference paper
  • In: Advances in Intelligent Automation and Soft Computing (IASC 2021)

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 80)


Abstract

Different types of media data are represented by low-level features with different dimensions and attributes, which makes them heterogeneous and directly incomparable. On the other hand, exploiting the low-level feature information extracted across media requires additional information from the different modalities, so there is a trade-off between bridging cross-modal differences and the semantic ambiguity that arises when only low-level features are used. These problems make traditional feature-learning methods unsuitable for cross-media analysis. This paper proposes a neural network architecture that combines feature extraction with contextual semantics and introduces a new attention mechanism. Experiments show that the proposed architecture achieves a higher BLEU score than traditional methods.
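The chapter's full method is not shown on this page, so the sketch below is only a rough illustration of the kind of mechanism the abstract describes: an attention layer that scores image-region features against a context (decoder) state and pools them into a single attended vector. The additive scoring form, module names, and dimensions are all assumptions for illustration, not the authors' actual design.

# A minimal additive-attention sketch in PyTorch. Hypothetical names and
# dimensions throughout; this is not the paper's network.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each image-region feature against a context vector."""

    def __init__(self, feat_dim: int, ctx_dim: int, attn_dim: int) -> None:
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)  # project region features
        self.proj_ctx = nn.Linear(ctx_dim, attn_dim)    # project context state
        self.score = nn.Linear(attn_dim, 1)             # scalar score per region

    def forward(self, feats: torch.Tensor, ctx: torch.Tensor):
        # feats: (batch, regions, feat_dim); ctx: (batch, ctx_dim)
        energy = torch.tanh(self.proj_feat(feats) + self.proj_ctx(ctx).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, regions)
        attended = (weights.unsqueeze(-1) * feats).sum(dim=1)           # (batch, feat_dim)
        return attended, weights

# Toy usage: 4 images, 49 regions (a 7x7 CNN grid), 512-d features,
# conditioned on a 256-d decoder state.
attn = AdditiveAttention(feat_dim=512, ctx_dim=256, attn_dim=128)
attended, weights = attn(torch.randn(4, 49, 512), torch.randn(4, 256))
print(attended.shape, weights.shape)  # torch.Size([4, 512]) torch.Size([4, 49])

The BLEU evaluation mentioned in the abstract can be reproduced in spirit with NLTK's corpus-level BLEU; the tokenized sentences below are placeholders, not the paper's data.

# Hedged BLEU example using NLTK (placeholder captions, not the paper's data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "man", "rides", "a", "horse"]]]       # one reference set per image
hypotheses = [["a", "man", "is", "riding", "a", "horse"]]  # model outputs
smooth = SmoothingFunction().method1
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))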





Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gao, L., Chen, Z., Wang, L., Li, B., Nie, L., Zheng, F. (2022). Cross-Media and Multilingual Image Understanding Method Based on Attention Mechanism. In: Li, X. (eds) Advances in Intelligent Automation and Soft Computing. IASC 2021. Lecture Notes on Data Engineering and Communications Technologies, vol 80. Springer, Cham. https://doi.org/10.1007/978-3-030-81007-8_90
