The YOLO model that still excels in document layout analysis

Deng, Qilin; Ibrayim, Mayire; Hamdulla, Askar; Zhang, Chunhu

doi:10.1007/s11760-023-02838-y

The YOLO model that still excels in document layout analysis

Original Paper
Published: 19 November 2023

Volume 18, pages 1539–1548, (2024)
Cite this article

Signal, Image and Video Processing Aims and scope Submit manuscript

Qilin Deng^1,2,
Mayire Ibrayim^1,2,
Askar Hamdulla^1,2 &
…
Chunhu Zhang^1,2

448 Accesses
Explore all metrics

Abstract

Document layout analysis can help people better understand and use the information in a document. However, the diversity of document layouts and considerable variation in aspect ratios among document objects pose significant challenges. In this study, we designed the multi-convolutional deformable separation (MCDS) module as the main structure of the network, using the YOLO model as a baseline. Integration of this module into the backbone and neck layers enhances the image feature extraction process significantly. Moreover, we incorporate ParNet-Attention to direct the network’s focus toward document objects through parallel networks, thereby facilitating a more exhaustive feature extraction. To optimize the model’s predictive potential, the decouple fusion head is employed within the head layer. This technique leverages multi-scale features based on the decoupled head, thereby enhancing the accuracy of predictions. Our proposed model achieves remarkable performance on three distinct public datasets with varying characteristics, namely ICDAR-POD, PubLayNet, and IIIT-AR-13K. Notably, in ICDAR-POD, both IoU\(_{0.6}\) and IoU\(_{0.8}\) achieve the optimal mean Average Precision (mAP), 96.2 and 94.4, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 4

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

HiM: hierarchical multimodal network for document layout analysis

Article 23 July 2023

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

References

Ren, S., He, K., Girshick, R., et al.: Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. 9199(105555), 2969239–50 (2015)
Google Scholar
He, K., Gkioxari, G., Dollár, P., et al.: Mask r-cnn[C]//In: Proceedings of the IEEE International Conference on Computer Vision. (2017): 2961-2969
Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area Voronoi diagram. Comput. Vis. Image Underst. 70(3), 370–382 (1998)
Article Google Scholar
Nagy, George, Seth, Sharad: (1984). Hierarchical representation of optically scanned documents. In: The International Conference on Pattern Recognition. IEEE, 347-349
Yun, Jia, Xuedong, Tian, Lina, Zuo: A method for analyzing ancient book layout images based on local outlier factors and fluctuation thresholds. Sci. Technol. Eng. 20(29), 12021–12027 (2020)
Google Scholar
Saha, R., Mondal, A., Jawahar, C.V.: Graphical object detection in document images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE: 51-58 (2019)
Alaasam, R., Kurar, B., El-Sana, J.: Layout analysis on challenging historical Arabic manuscripts using Siamese network. Int. Conf. Document Anal. Recognit. (ICDAR) 2019, 738–742 (2019). https://doi.org/10.1109/ICDAR.2019.00123
Article Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In CVPR, (2016)
Yang, H., Hsu, W.H.: ”Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks,” In: 2020 25th International Conference on Pattern Recognition (ICPR), (2021), pp. 6455-6462, https://doi.org/10.1109/ICPR48806.2021.9412557.
Lee, Y., Hwang, J., Lee, S., et al.: An energy and GPU-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. (2019): 0-0
Huang, Y., et al.: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking." (2022)
Liu, Zhuang et al. “A ConvNet for the 2020s.” In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 11966-11976
Han, K., Xiao, A., Wu, E., et al.: Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021)
Google Scholar
Dai, Jifeng et al. “Deformable Convolutional Networks.” In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 764-773
Goyal, A., Bochkovskiy, A., Deng, J., et al.: Non-deep networks. Adv. Neural. Inf. Process. Syst. 35, 6789–6801 (2022)
Google Scholar
Gao, L., et al. “ICDAR2017 competition on page object detection,” In: 2017 14th IAPR Int. Conf. Doc. Anal. Recogn. (ICDAR), IEEE, vol. 1, pp. 1417-1422, (2017)
Zhong, X., Tang, J., Yepes, A.J.: “PubLayNet: largest dataset ever for document layout analysis’.’ In: 2019 Int. Conf. Document Anal Recog. (ICDAR), IEEE, pp. 1015-1022, (2019)
Mondal, A., Lipps, P., Jawahar, C.: IIIT-AR-13K: A new dataset for graphical object detection in documents. In: International Workshop on Document Analysis Systems; Springer: Cham, Switzerland, (2020); pp. 216-230
Gao, L., Yi, X., Jiang, Z., Hao, L., Tang, Z.: “ICDAR2017 competition on page object detection”. In ICDAR, (2017)
Younas, J., Siddiqui, S.A., Munir, M., et al.: Fi-fo detector: Figure and formula detection using deformable networks. Appl. Sci. 10(18), 6460 (2020)
Article CAS Google Scholar
Bi, H., Xu, C., Shi, C., et al.: SRRV: A novel document object detector based on spatial-related relation and vision. IEEE Trans. Multimed. 25, 3788–3798 (2023). https://doi.org/10.1109/TMM.2022.3165717
Zhang, H., Xu, C., Shi, C., et al.: HSCA-Net: A hybrid spatial-channel attention network in multiscale feature pyramid for document layout analysis. J. Artif. Intell. Technol. 3(1), 10–17 (2023)
CAS Google Scholar
Li, X.-H., Yin, F., Liu, C.-L.: “Page object detection from pdf document images by deep structured prediction and supervised clustering”. In: 2018 24th International Conference on Pattern Recognition (ICPR), (2018), pp. 3627-3632
Li, K., Wigington, C., Tensmeyer, C., Zhao, H., Barmpalios, N., Morariu, V.I., Manjunatha, V., Sun, T., Fu, Y.: "Cross-domain documentobject detection: Benchmark suite and method”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12 915-12 924 (2020)
He, K. et al. "Masked Autoencoders Are Scalable Vision Learners." (2021)
Gu, J., et al.: UniDoc: Unified Pretraining Framework for Document Understanding. Adv. Neural Inform. Process. Syst. 34, 39–50 (2021)
Li, Junlong, Xu, Yiheng, Lv, Tengchao, Cui, Lei, Zhang, Cha, Wei, Furu: Dit: Self-supervised pre-training for docu-ment image transformer. arXiv preprint arXiv:2203.02378,2022
Bao, H., Dong, L., Piao, S., et al. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, (2021)
Zhang, P., Li, C., Qiao, L., Cheng, Z., Pu, S., Niu, Y., Wu, F.: VSR:a unified framework for document layout analysis combining vision, semantics and relations [C]//Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I 16. Springer International Publishing, pp. 115–130
Xie, S., Girshick, R., Dollár, P., et al.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017): 1492-1500
Nguyen, P., Ngo, L., Truong, T.: Nguyen, T.T.; Vo, N.D.; Nguyen, K. Page Object Detection with YOLOF. In: Proceedings of the 2021 8th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi, Vietnam, 21-22 December 2021;pp. 205-210
Kallempudi, G., Hashmi, K.A., Pagani, A., et al.: Toward semi-supervised graphical object detection in document images. Future Internet 14(6), 176 (2022)
Article Google Scholar
Naik, S., Hashmi, K.A., Pagani, A., et al.: "Investigating attention mechanism for page object detection in document images". Appl. Sci. 12(15), 7486 (2022)

Download references

Funding

Funding This work was supported by the National Natural Science Foundation of China (No.62166043, U2003207).

Author information

Authors and Affiliations

School of Computer Science and Technology, Xinjiang University, Shuimagou District, Ürümqi, 830017, Xinjiang, China
Qilin Deng, Mayire Ibrayim, Askar Hamdulla & Chunhu Zhang
Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Shuimagou District, Ürümqi, 830017, Xinjiang, China
Qilin Deng, Mayire Ibrayim, Askar Hamdulla & Chunhu Zhang

Authors

Qilin Deng
View author publications
You can also search for this author in PubMed Google Scholar
Mayire Ibrayim
View author publications
You can also search for this author in PubMed Google Scholar
Askar Hamdulla
View author publications
You can also search for this author in PubMed Google Scholar
Chunhu Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

QD: Writing - original draft, Conceptualization. MI: Supervision, Funding acquisition, Writing - review & editing. CZ and AH: Validation, Writing - review & editing, Formal analysis.

Corresponding author

Correspondence to Mayire Ibrayim.

Ethics declarations

Conflict of interest

Conflict of Interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Deng, Q., Ibrayim, M., Hamdulla, A. et al. The YOLO model that still excels in document layout analysis. SIViP 18, 1539–1548 (2024). https://doi.org/10.1007/s11760-023-02838-y

Download citation

Received: 16 August 2023
Revised: 16 August 2023
Accepted: 13 October 2023
Published: 19 November 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11760-023-02838-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

The YOLO model that still excels in document layout analysis

Abstract

Access this article

Similar content being viewed by others

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

HiM: hierarchical multimodal network for document layout analysis

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

The YOLO model that still excels in document layout analysis

Abstract

Access this article

Similar content being viewed by others

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

HiM: hierarchical multimodal network for document layout analysis

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation