
MSNet: A Multi-scale Segmentation Network for Documents Layout Analysis

  • Conference paper in: Learning Technologies and Systems (SETE 2020, ICWL 2020)
  • Part of the book series: Lecture Notes in Computer Science, volume 12511


Abstract

Layout analysis is often a crucial step in document image analysis and understanding. In this paper, we propose a deep learning-based layout analysis approach to identify and categorize the regions of interest in scanned images of text documents. Although semantic segmentation has been applied at the pixel level of document images for geometric layout analysis with considerable progress, many challenges remain for complex and heterogeneous documents, which often have sparse structures without closed boundaries and fine typologies at variable scales. We propose a multi-scale segmentation network, called MSNet, for high-resolution document images. The model is characterized by an enlarged receptive field size and multi-scale feature extraction. Experiments conducted on a Chinese document dataset show satisfactory performance.
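The full paper is behind the access wall, but the abstract's two key ideas, enlarging the receptive field and extracting features at multiple scales, are commonly realized with dilated (atrous) convolutions applied at several rates and then aggregated, in the style of DeepLab's ASPP module. The following 1-D NumPy sketch is an illustration of that general technique, not the authors' MSNet implementation; the function names and dilation rates are assumptions chosen for clarity:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1-D convolution with a dilated kernel. No learned weights here;
    the point is that dilation enlarges the effective receptive field
    without adding parameters: rf = (k - 1) * dilation + 1."""
    k = len(kernel)
    rf = (k - 1) * dilation + 1        # effective receptive field size
    out_len = len(x) - rf + 1
    out = np.empty(out_len)
    for i in range(out_len):
        # Sample the input with gaps of `dilation` between kernel taps.
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out, rf

def multi_scale_features(x, kernel, dilations=(1, 2, 4)):
    """ASPP-style aggregation: apply the same kernel at several dilation
    rates and crop the results to a common length before stacking, so each
    output position carries features from several receptive-field sizes."""
    outs, rfs = zip(*(dilated_conv1d(x, kernel, d) for d in dilations))
    n = min(len(o) for o in outs)
    return np.stack([o[:n] for o in outs]), rfs

# Example: a length-3 averaging-style kernel at dilations 1, 2, 4 sees
# receptive fields of 3, 5, and 9 input positions respectively.
feats, rfs = multi_scale_features(np.arange(20.0), np.array([1.0, 1.0, 1.0]))
```

In a real segmentation network the parallel branches would be learned 2-D convolutions over feature maps, but the receptive-field arithmetic and the stack-then-fuse pattern are the same.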



Author information

Correspondence to Bo Wang, Ju Zhou, or Bailing Zhang.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, B., Zhou, J., Zhang, B. (2021). MSNet: A Multi-scale Segmentation Network for Documents Layout Analysis. In: Pang, C., et al. (eds.) Learning Technologies and Systems. SETE/ICWL 2020. Lecture Notes in Computer Science, vol 12511. Springer, Cham. https://doi.org/10.1007/978-3-030-66906-5_21


  • DOI: https://doi.org/10.1007/978-3-030-66906-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66905-8

  • Online ISBN: 978-3-030-66906-5

  • eBook Packages: Computer Science; Computer Science (R0)
