CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Zheng, Jinzhi; Ji, Ruyi; Zhang, Libo; Wu, Yanjun; Zhao, Chen

doi:10.1007/978-981-99-8076-5_31

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14452)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Scene text recognition, a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information that refines the visual recognition result. However, these methods ignore visual cues during semantic mining, which limits their performance on irregular scene text. To address this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both branches. The visual recognition branch performs recognition from the visual features extracted by a CNN and the position encoding provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. Experiments demonstrate that CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
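
The abstract names a cross-modal fusion gate but does not give its formulation. As a rough illustration only, the sketch below shows one common way such a gate can be built: a sigmoid gate computed from the concatenated visual and semantic features decides, per character position and per channel, how much of each modality to keep. The function name `fusion_gate`, the parameters `weight` and `bias`, and the feature shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def fusion_gate(visual, semantic, weight, bias):
    """Blend visual and semantic character features with a sigmoid gate.

    Hypothetical formulation, not the paper's definition.
    visual, semantic: arrays of shape (T, D), one row per character position.
    weight: array of shape (2 * D, D); bias: array of shape (D,).
    """
    joint = np.concatenate([visual, semantic], axis=-1)  # (T, 2D)
    gate = sigmoid(joint @ weight + bias)                # (T, D), values in (0, 1)
    # gate near 1 favors the visual feature, near 0 the semantic one
    return gate * visual + (1.0 - gate) * semantic


# Toy inputs: 5 character positions, 8 feature channels (arbitrary sizes).
rng = np.random.default_rng(0)
T, D = 5, 8
visual = rng.normal(size=(T, D))
semantic = rng.normal(size=(T, D))
weight = rng.normal(scale=0.1, size=(2 * D, D))
bias = np.zeros(D)

fused = fusion_gate(visual, semantic, weight, bias)
print(fused.shape)
```

Because the gate is a convex weight, every fused value lies between the corresponding visual and semantic values. In an iterative branch of the kind the abstract describes, a fused feature like this could be fed back to the language module on the next pass, though the paper's exact loop is not stated here.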

Author information

Authors and Affiliations

Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Jinzhi Zheng, Libo Zhang, Yanjun Wu & Chen Zhao
The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
Ruyi Ji
University of Chinese Academy of Sciences, Beijing, China
Jinzhi Zheng
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Libo Zhang, Yanjun Wu & Chen Zhao

Corresponding author

Correspondence to Ruyi Ji.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Biao Luo
Chinese Academy of Sciences, Beijing, China
Long Cheng
Zhejiang University, Hangzhou, China
Zheng-Guang Wu
Guangdong University of Technology, Guangzhou, China
Hongyi Li
UNSW Sydney, Sydney, NSW, Australia
Chaojie Li

Cite this paper

Zheng, J., Ji, R., Zhang, L., Wu, Y., Zhao, C. (2024). CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14452. Springer, Singapore. https://doi.org/10.1007/978-981-99-8076-5_31

DOI: https://doi.org/10.1007/978-981-99-8076-5_31
Published: 14 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8075-8
Online ISBN: 978-981-99-8076-5
eBook Packages: Computer Science, Computer Science (R0)
