CDistNet: Perceiving Multi-domain Character Distance for Robust Text Recognition

International Journal of Computer Vision

Abstract

The transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered; consequently, features and characters may be misaligned in difficult text (e.g., text with a rare shape). As a result, constraints such as character position have been introduced to alleviate this problem. Despite some success, visual and semantic information are still modeled separately and are only loosely associated. In this paper, we propose a novel module called multi-domain character distance perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that perceives character spacing and orientation variations, character semantic affinities, and clues tying the two kinds of information together; we summarize these as the multi-domain character distance. We develop CDistNet, which stacks multiple MDCDPs to guide increasingly precise distance modeling. Thus, feature-character alignment is well established even when various recognition difficulties are present. We verify CDistNet on ten challenging public datasets and two series of augmented datasets we created. The experiments demonstrate that CDistNet is highly competitive: it not only ranks top-tier on standard benchmarks, but also outperforms recent popular methods by clear margins on real and augmented datasets exhibiting severe text deformation, poor linguistic support, and rare character layouts. In addition, visualizations show that CDistNet achieves appropriate information utilization in both the visual and semantic domains. Our code is available at https://github.com/simplify23/CDistNet.
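The mechanism the abstract describes for MDCDP (a position embedding that queries both visual and semantic features through cross-attention, with the results fused back into the position branch, and several such modules stacked in CDistNet) can be made concrete with a short sketch. The following PyTorch code is a minimal illustration only: the shapes, the initial self-attention over the position branch, the concatenate-and-project fusion, and all module names are our assumptions, not the authors' implementation (see the linked repository for that).

    import torch
    import torch.nn as nn

    class MDCDP(nn.Module):
        # One multi-domain character distance perception block: the position
        # branch queries visual and semantic features via cross-attention and
        # fuses the two results back into itself.
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.sem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.fuse = nn.Linear(2 * d_model, d_model)  # assumed fusion step
            self.norm = nn.LayerNorm(d_model)

        def forward(self, pos, vis, sem):
            # pos: (B, T, D) position embedding; vis: (B, N, D) encoder features;
            # sem: (B, T, D) character (semantic) embeddings.
            pos = pos + self.self_attn(pos, pos, pos, need_weights=False)[0]
            v = self.vis_attn(pos, vis, vis, need_weights=False)[0]  # query vision
            s = self.sem_attn(pos, sem, sem, need_weights=False)[0]  # query semantics
            return self.norm(pos + self.fuse(torch.cat([v, s], dim=-1)))

    class StackedMDCDP(nn.Module):
        # CDistNet-style stacking: each block refines the content-aware
        # position embedding produced by the previous one.
        def __init__(self, d_model=512, n_heads=8, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList(MDCDP(d_model, n_heads) for _ in range(n_layers))

        def forward(self, pos, vis, sem):
            for layer in self.layers:
                pos = layer(pos, vis, sem)
            return pos  # projected onto the character vocabulary downstream

In a full decoder, the stacked output would feed a linear classifier over the character set; causal masks for autoregressive decoding are omitted here for brevity.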



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62172103 and 62102384.

Author information

Corresponding author

Correspondence to Zhineng Chen.

Additional information

Communicated by Dimosthenis Karatzas.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zheng, T., Chen, Z., Fang, S. et al. CDistNet: Perceiving Multi-domain Character Distance for Robust Text Recognition. Int J Comput Vis 132, 300–318 (2024). https://doi.org/10.1007/s11263-023-01880-0

