Coarse-to-fine document localization in natural scene image with regional attention and recursive corner refinement

Abstract

Document localization is a promising step for document-based optical character recognition. The task becomes harder when documents appear in complex natural scene images. In this paper, we propose a coarse-to-fine document localization approach that detects the four corner points of a document in natural scene images. In the first stage, the four corners are roughly predicted by a deep neural network-based Joint Corner Detector (JCD) with an attention mechanism, which coarsely localizes the document region via an attention map. As the key to accurate corner inference, the JCD module substantially suppresses background interference in the convolutional features. In the second stage, a corner-specific refiner module refines the previously predicted corners. Because the four document corners have different characteristics, the patches cropped around the predicted corners are fed into four different corner-specific CNN models, which search for the accurate corner locations recursively. Three datasets (the ICDAR 2015 SmartDoc competition 1 dataset, the SEECS-NUSF dataset, and a self-collected dataset) are used to evaluate the performance of our method. The experimental results demonstrate the superiority of the proposed method in localizing documents in natural images, especially those with complex backgrounds. Compared with state-of-the-art methods, our method outperforms most of them.
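The two-stage control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the JCD and the corner-specific CNNs are replaced by plain callables, and every name and parameter here (`refine_corner`, `localize_document`, `half_size`, `shrink`, `min_half_size`, `make_toy_refiner`) is a hypothetical choice made for illustration. The sketch only shows how a coarse corner estimate might be re-estimated recursively inside a shrinking patch, with one refiner per corner.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
# A refiner takes (image, patch centre, patch half-size) and returns a new
# corner estimate in image coordinates. In the paper this role is played by
# a corner-specific CNN; here it is any callable (assumption).
Refiner = Callable[[object, Point, float], Point]


def refine_corner(image, corner: Point, refiner: Refiner,
                  half_size: float = 64.0, shrink: float = 0.5,
                  min_half_size: float = 4.0) -> Point:
    """Re-estimate one corner repeatedly inside a shrinking patch."""
    x, y = corner
    while half_size >= min_half_size:
        # The patch is centred on the current estimate; the refiner
        # proposes a better location from that patch.
        x, y = refiner(image, (x, y), half_size)
        half_size *= shrink  # look more closely on the next pass
    return (x, y)


def localize_document(image,
                      coarse_detector: Callable[[object], Sequence[Point]],
                      refiners: Sequence[Refiner]) -> List[Point]:
    """Stage 1: coarse corners. Stage 2: one dedicated refiner per corner."""
    coarse = coarse_detector(image)
    return [refine_corner(image, c, r) for c, r in zip(coarse, refiners)]


def make_toy_refiner(true_corner: Point) -> Refiner:
    """Stand-in for a trained model: moves halfway toward a known corner."""
    tx, ty = true_corner

    def refiner(image, center: Point, half_size: float) -> Point:
        cx, cy = center
        # The step is clamped so the estimate never leaves the current patch.
        dx = max(-half_size, min(half_size, 0.5 * (tx - cx)))
        dy = max(-half_size, min(half_size, 0.5 * (ty - cy)))
        return (cx + dx, cy + dy)

    return refiner


# Toy run: the four true corners and a deliberately imprecise coarse stage.
truth = [(100.0, 100.0), (500.0, 110.0), (510.0, 700.0), (90.0, 690.0)]
coarse_stage = lambda image: [(x - 10.0, y + 12.0) for x, y in truth]
corners = localize_document(None, coarse_stage,
                            [make_toy_refiner(t) for t in truth])
```

In this toy run each refinement pass halves the remaining error, so the refined corners end up much closer to the true ones than the coarse estimates. A real system would crop the patch from the image and run a trained model at each pass instead of the toy refiner.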



Acknowledgements

This work was partly supported by the National Natural Science Foundation of China under Grant 61703316 and the National Key R&D Program of China (Grant No. 2017YFB1402200).

Author information

Corresponding author

Correspondence to Shengwu Xiong.

Cite this article

Zhu, A., Zhang, C., Li, Z. et al. Coarse-to-fine document localization in natural scene image with regional attention and recursive corner refinement. IJDAR 22, 351–360 (2019). https://doi.org/10.1007/s10032-019-00341-0

Keywords

  • Document localization
  • Coarse-to-fine procedure
  • Attention mechanism
  • Corner-specified refiner