Abstract
In recent years, document image understanding (DIU) has received considerable attention from the research community. Localizing page objects (tables, figures, equations) in document images is a fundamental problem in DIU and the foundation for extracting information from document images. However, many challenges remain due to the high degree of intra-class variability in document pages, and object detection in Vietnamese document images in particular is still underexplored. In this paper, we propose CDeRSNet, a novel end-to-end trainable deep learning network for object detection in Vietnamese documents. The proposed network combines Cascade R-CNN with a deformable-convolution backbone and the Rank & Sort (RS) loss. CDeRSNet localizes objects that vary in scale with high detection accuracy at high IoU thresholds. We empirically evaluate CDeRSNet on the Vietnamese document image dataset UIT-DODV, which contains four object classes: table, figure, caption, and formula. CDeRSNet achieves the best performance on UIT-DODV with 79.9% mAP, 5.4% higher than the previous best result. In addition, we provide a comprehensive evaluation and insightful analysis of CDeRSNet. Finally, we show that CDeRSNet outperforms state-of-the-art object detectors such as GFocal, GFocalV2, VFNet, and DetectoRS on the UIT-DODV dataset. Code is available at: https://github.com/trongthuan205/CDeRSNet.git.
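The components named in the abstract (Cascade R-CNN, a deformable-convolution backbone, and RS Loss) are typically assembled through a detection framework's configuration. The fragment below is a minimal, hypothetical MMDetection-style sketch of such a combination; it is not the authors' released code, and every key, value, and type name here is an illustrative assumption rather than the paper's actual settings.

```python
# Hypothetical MMDetection-style config sketch for a CDeRSNet-like detector:
# a Cascade R-CNN head, a backbone with deformable convolutions (DCN), and
# the Rank & Sort (RS) loss in place of the standard classification loss.
# All values below are illustrative assumptions.

model = dict(
    type="CascadeRCNN",  # multi-stage detector (Cai & Vasconcelos)
    backbone=dict(
        type="ResNeXt",
        depth=101,
        # Deformable convolutions let sampling locations adapt to object shape,
        # which helps with the high intra-class variability of page objects.
        dcn=dict(type="DCNv2", deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True),
    ),
    neck=dict(type="FPN", in_channels=[256, 512, 1024, 2048], out_channels=256),
    roi_head=dict(
        type="CascadeRoIHead",
        num_stages=3,  # cascade stages trained at increasing IoU thresholds
        stage_loss_weights=[1.0, 0.5, 0.25],
        bbox_head=[
            dict(
                type="Shared2FCBBoxHead",
                num_classes=4,  # table, figure, caption, formula (UIT-DODV)
                loss_cls=dict(type="RankSortLoss"),  # RS Loss (Oksuz et al.)
            )
            for _ in range(3)
        ],
    ),
)

# Cascade R-CNN conventionally raises the positive-sample IoU threshold per
# stage, pushing the detector toward high-quality localization.
cascade_iou_thresholds = [0.5, 0.6, 0.7]
```

The cascade's rising IoU thresholds are what make detection "high quality at a higher IoU threshold": each stage is trained on progressively better-localized proposals produced by the previous one.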
References
Agarwal, M., Mondal, A., Jawahar, C.V.: CDeC-Net: composite deformable cascade network for table detection in document images. arXiv: 2008.10831 [cs.CV] (2020)
Alcácer, V., Cruz-Machado, V.: Scanning the industry 4.0: a literature review on technologies for manufacturing systems. Eng. Sci. Technol. Int. J. 22(3), 899–919 (2019)
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018). https://doi.org/10.1109/CVPR.2018.00644
Cesarini, F., Marinai, S., Sarti, L., Soda, G.: Trainable table location in document images. In: Object Recognition Supported by User Interaction for Service Robots, vol. 3, pp. 236–240. IEEE (2002)
Dai, J., et al.: Deformable convolutional networks. arXiv: 1703.06211 [cs.CV] (2017)
Dieu, L.T., Nguyen, T.T., Vo, N.D., Nguyen, T.V., Nguyen, K.: Parsing digitized Vietnamese paper documents. In: Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M. (eds.) CAIP 2021. LNCS, vol. 13052, pp. 382–392. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89128-2_37
Gatos, B., Danatsas, D., Pratikakis, I., Perantonis, S.J.: Automatic table detection in document images. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686, pp. 609–618. Springer, Heidelberg (2005). https://doi.org/10.1007/11551188_67
Li, X., et al.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. arXiv: 2011.12885 [cs.CV] (2020)
Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv: 2006.04388 [cs.CV] (2020)
Li, Y., et al.: Rethinking table structure recognition using sequence labeling methods. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 541–553. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_35
Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: A ranking based, balanced loss function unifying classification and localisation in object detection. arXiv preprint arXiv:2009.13592 (2020)
Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: Rank & sort loss for object detection and instance segmentation. arXiv preprint arXiv:2107.11669 (2021)
Peng, S., Gao, L., Yuan, K., Tang, Z.: Image to LaTeX with graph neural network for mathematical formula recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 648–663. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_42
Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. arXiv: 2004.12629 [cs.CV] (2020)
Qiao, L., et al.: LGPMA: complicated table structure recognition with local and global pyramid mask alignment. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 99–114. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_7
Qiao, S., Chen, L.-C., Yuille, A.: DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv: 2006.02334 [cs.CV] (2020)
Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)
Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Empirical evaluation of state-of-the-art object detection methods for document image understanding. In: Fundamental and Applied IT Research Conference, pp. 180–184 (2017). https://doi.org/10.15625/vap.2017.00022
Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Ensemble of deep object detectors for page object detection. In: Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, pp. 1–6 (2018)
Zhang, H., Wang, Y., Dayoub, F., Sünderhauf, N.: VarifocalNet: an IoU-aware dense object detector. arXiv: 2008.13367 [cs.CV] (2021)
Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X.: Dynamic R-CNN: towards high quality object detection via dynamic training. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 260–275. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_16
Zhang, P., et al.: VSR: a unified framework for document layout analysis combining vision, semantics and relations. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 115–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_8
Acknowledgment
This research is funded by the University of Information Technology − Vietnam National University Ho Chi Minh City under grant number D1-2021-32. We express our sincere thanks to the UIT-Together Research Group, Multimedia Communications Laboratory (MMLab), Faculty of Information Science and Engineering − University of Information Technology − Vietnam National University − Ho Chi Minh City, for supporting our team in this research.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Nguyen, T.T., Nguyen, T.Q., Duong, L., Vo, N.D., Nguyen, K. (2022). CDeRSNet: Towards High Performance Object Detection in Vietnamese Document Images. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_36
Print ISBN: 978-3-030-98354-3
Online ISBN: 978-3-030-98355-0