CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

  • Conference paper
Neural Information Processing (ICONIP 2023)

Abstract

Scene text recognition, a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information that refines the visual recognition result. However, these methods ignore the guidance of visual cues during semantic mining, which limits their performance on irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch performs recognition based on the visual features extracted by a CNN and the position encoding provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. Experiments demonstrate that the proposed CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
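
The abstract does not give the formula of the cross-modal fusion gate, so the following is only a minimal sketch of one common gating form, offered to illustrate how visual and semantic features for a single character position might be blended. The function name `fusion_gate`, the parameters `weights` and `bias`, and the gating equation itself are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(
    visual: Sequence[float],
    semantic: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Blend visual and semantic features for one character position.

    Assumed gating form (hypothetical, not taken from the paper):
        g     = sigmoid(W [visual; semantic] + b)
        fused = g * visual + (1 - g) * semantic
    `weights` has one row per feature dimension; each row has
    len(visual) + len(semantic) entries.
    """
    if len(visual) != len(semantic):
        raise ValueError("visual and semantic features must have equal length")
    concat = list(visual) + list(semantic)
    fused = []
    for i, (v, s) in enumerate(zip(visual, semantic)):
        # Per-dimension gate computed from both modalities.
        pre_activation = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(pre_activation)
        fused.append(g * v + (1.0 - g) * s)
    return fused


if __name__ == "__main__":
    # With zero weights and bias the gate is 0.5, so the output is the mean.
    zero_w = [[0.0] * 4, [0.0] * 4]
    print(fusion_gate([1.0, 0.0], [0.0, 1.0], zero_w, [0.0, 0.0]))  # [0.5, 0.5]
```

In this sketch a gate value near 1 lets the visual feature dominate and a value near 0 favors the semantic feature. In an iterative setting, the fused output could be fed back to the language module for another refinement pass, though the number of iterations and the feedback path are likewise not specified in the abstract.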

Author information

Corresponding author

Correspondence to Ruyi Ji.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zheng, J., Ji, R., Zhang, L., Wu, Y., Zhao, C. (2024). CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14452. Springer, Singapore. https://doi.org/10.1007/978-981-99-8076-5_31

  • DOI: https://doi.org/10.1007/978-981-99-8076-5_31

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8075-8

  • Online ISBN: 978-981-99-8076-5

  • eBook Packages: Computer Science (R0)
