
CarveNet: a channel-wise attention-based network for irregular scene text recognition

Abstract

Although considerable progress has been made in recent years, recognizing irregular text in natural scenes remains challenging due to distortion and background interference. Prior works use either a spatial transformer network (STN) or a 2D attention mechanism to improve recognition accuracy. However, STN-based methods are not robust because of their limited network capacity, while 2D-attention-based methods are highly sensitive to blur, distortion, and background clutter. In this paper, we propose a text recognition model, CarveNet, which consists of three components: a feature extractor, a feature filter, and a decoder. The feature extractor uses a Feature Pyramid Network (FPN) to aggregate multi-scale hierarchical feature maps and obtain a larger receptive field. A feature filter composed of stacked residual channel attention blocks then separates text features from background interference. Finally, a 2D self-attention-based decoder generates the text sequence from the output of the feature filter and the previously generated symbols. Extensive evaluation shows that CarveNet achieves state-of-the-art results on both regular and irregular scene text recognition benchmarks. Compared with previous work based on 2D self-attention, CarveNet improves accuracy by 2.3% and 4.6% on the irregular datasets SVTP and CT80, respectively.
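The feature filter described above is built from channel attention blocks. A minimal sketch of that general mechanism in plain Python is shown below: squeeze each channel to a scalar by global average pooling, compute a per-channel sigmoid gate through a bottleneck, rescale the channels, and add a residual connection. The weights `w_reduce` and `w_expand` are hypothetical and untrained, and the convolutions of the paper's actual block are omitted; this illustrates channel-wise attention in general, not CarveNet's exact layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(w, v):
    # Multiply a weight matrix (list of rows) by a vector.
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def residual_channel_attention(x, w_reduce, w_expand):
    """Sketch of a residual channel attention block.

    x: feature map as nested lists with shape C x H x W.
    w_reduce: (C // r) x C bottleneck weights (hypothetical).
    w_expand: C x (C // r) expansion weights (hypothetical).
    """
    # 1. Squeeze: global average pooling per channel -> descriptor z of length C.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # 2. Excitation: bottleneck MLP with ReLU, then a sigmoid gate per channel.
    h = [max(0.0, v) for v in matvec(w_reduce, z)]
    s = [sigmoid(v) for v in matvec(w_expand, h)]
    # 3. Rescale each channel by its gate and add the residual connection.
    return [[[v + s[c] * v for v in row] for row in x[c]]
            for c in range(len(x))]
```

Because the gate is computed per channel from a global pooled descriptor, channels whose responses correlate with background clutter can be suppressed while text-bearing channels are preserved, which is the intuition behind using channel attention as a feature filter.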




Author information

Correspondence to Yongping Xiong.


About this article


Cite this article

Wu, G., Zhang, Z. & Xiong, Y. CarveNet: a channel-wise attention-based network for irregular scene text recognition. IJDAR 25, 177–186 (2022). https://doi.org/10.1007/s10032-022-00398-4


Keywords

  • Scene text recognition
  • Optical character recognition
  • Channel-wise attention
  • Transformer