
CDeRSNet: Towards High Performance Object Detection in Vietnamese Document Images

  • Conference paper
  • MultiMedia Modeling (MMM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13142)

Abstract

In recent years, document image understanding (DIU) has received much attention from the research community. Localizing page objects (tables, figures, equations) in document images is an important problem in DIU and the foundation for extracting information from document images. However, many challenges remain due to the high degree of intra-class variability among page objects, and object detection in Vietnamese document images in particular is still under-explored. In this paper, we propose CDeRSNet, a novel end-to-end trainable deep learning network for object detection in Vietnamese documents. The proposed network combines Cascade R-CNN with a deformable-convolution backbone and the Rank & Sort (RS) Loss, allowing CDeRSNet to localize objects that vary widely in scale with high detection accuracy even at high IoU thresholds. We empirically evaluate CDeRSNet on the Vietnamese document image dataset UIT-DODV, which contains four object classes: table, figure, caption, and formula. CDeRSNet achieves the best performance reported on UIT-DODV, with 79.9% mAP, 5.4% higher than previous results. In addition, we provide a comprehensive evaluation and insightful analysis of CDeRSNet, and demonstrate that it outperforms state-of-the-art object detectors such as GFocal, GFocalV2, VFNet, and DetectoRS on the UIT-DODV dataset. Code is available at: https://github.com/trongthuan205/CDeRSNet.git.
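
The composition described above can be sketched as an MMDetection-style configuration; the repositories footnoted under Notes below suggest the model was built on MMDetection. The following is a minimal, hypothetical sketch under that assumption, not the authors' released configuration: the base config path, the DCN settings, and the head overrides are illustrative only.

    # Hypothetical MMDetection-style config sketch (not the authors' released
    # configuration) of the composition the abstract describes: Cascade R-CNN,
    # a backbone with deformable convolutions (DCN), and four object classes.
    _base_ = './cascade_rcnn_r50_fpn_1x_coco.py'  # assumed base config

    model = dict(
        backbone=dict(
            # Enable deformable convolutions in backbone stages 2-4, as in
            # the Deformable ConvNets paper the authors cite.
            dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
            stage_with_dcn=(False, True, True, True),
        ),
        roi_head=dict(
            # One bounding-box head per cascade stage; num_classes=4 covers
            # table, figure, caption, and formula. In a real MMDetection
            # config the full head dicts must be repeated, since list-valued
            # fields replace (rather than merge with) the base config; the
            # remaining head fields are elided here for brevity.
            bbox_head=[
                dict(type='Shared2FCBBoxHead', num_classes=4),
                dict(type='Shared2FCBBoxHead', num_classes=4),
                dict(type='Shared2FCBBoxHead', num_classes=4),
            ],
        ),
    )

    # Replacing the default classification loss with Rank & Sort (RS) Loss
    # would follow the authors' third footnoted repository
    # (kemaloksuz/RankSortLoss); its integration is omitted from this sketch.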


Notes

  1. https://github.com/open-mmlab/mmdetection.

  2. https://github.com/hkzhang95/DynamicRCNN.

  3. https://github.com/kemaloksuz/RankSortLoss.

References

  1. Agarwal, M., Mondal, A., Jawahar, C.V.: CDeC-Net: composite deformable cascade network for table detection in document images. arXiv preprint arXiv:2008.10831 (2020)

  2. Alcácer, V., Cruz-Machado, V.: Scanning the industry 4.0: a literature review on technologies for manufacturing systems. Eng. Sci. Technol. Int. J. 22(3), 899–919 (2019)


  3. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018). https://doi.org/10.1109/CVPR.2018.00644

  4. Cesarini, F., Marinai, S., Sarti, L., Soda, G.: Trainable table location in document images. In: Object Recognition Supported by User Interaction for Service Robots, vol. 3, pp. 236–240. IEEE (2002)


  5. Dai, J., et al.: Deformable convolutional networks. arXiv preprint arXiv:1703.06211 (2017)

  6. Dieu, L.T., Nguyen, T.T., Vo, N.D., Nguyen, T.V., Nguyen, K.: Parsing digitized Vietnamese paper documents. In: Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M. (eds.) CAIP 2021. LNCS, vol. 13052, pp. 382–392. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89128-2_37


  7. Gatos, B., Danatsas, D., Pratikakis, I., Perantonis, S.J.: Automatic table detection in document images. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686, pp. 609–618. Springer, Heidelberg (2005). https://doi.org/10.1007/11551188_67


  8. Li, X., et al.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. arXiv preprint arXiv:2011.12885 (2020)

  9. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388 (2020)

  10. Li, Y., et al.: Rethinking table structure recognition using sequence labeling methods. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 541–553. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_35


  11. Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: A ranking based, balanced loss function unifying classification and localisation in object detection. arXiv preprint arXiv:2009.13592 (2020)

  12. Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: Rank & sort loss for object detection and instance segmentation. arXiv preprint arXiv:2107.11669 (2021)

  13. Peng, S., Gao, L., Yuan, K., Tang, Z.: Image to LaTeX with graph neural network for mathematical formula recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 648–663. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_42


  14. Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. arXiv preprint arXiv:2004.12629 (2020)

  15. Qiao, L., et al.: LGPMA: complicated table structure recognition with local and global pyramid mask alignment. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 99–114. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_7


  16. Qiao, S., Chen, L.-C., Yuille, A.: DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334 (2020)

  17. Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)


  18. Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Empirical evaluation of state-of-the-art object detection methods for document image understanding. In: Fundamental and Applied IT Research Conference, pp. 180–184 (2017). https://doi.org/10.15625/vap.2017.00022

  19. Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Ensemble of deep object detectors for page object detection. In: Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, pp. 1–6 (2018)


  20. Zhang, H., Wang, Y., Dayoub, F., Sünderhauf, N.: VarifocalNet: an IoU-aware dense object detector. arXiv preprint arXiv:2008.13367 (2021)

  21. Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X.: Dynamic R-CNN: towards high quality object detection via dynamic training. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 260–275. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_16


  22. Zhang, P., et al.: VSR: a unified framework for document layout analysis combining vision, semantics and relations. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 115–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_8



Acknowledgment

This research is funded by the University of Information Technology, Vietnam National University Ho Chi Minh City, under grant number D1-2021-32. We express our sincere thanks to the UIT-Together Research Group, Multimedia Communications Laboratory (MMLab), Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University Ho Chi Minh City, for supporting our team throughout this research.

Author information

Corresponding author: Thuan Trong Nguyen.



Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Nguyen, T.T., Nguyen, T.Q., Duong, L., Vo, N.D., Nguyen, K. (2022). CDeRSNet: Towards High Performance Object Detection in Vietnamese Document Images. In: Þór Jónsson, B., et al. (eds.) MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_36


  • DOI: https://doi.org/10.1007/978-3-030-98355-0_36


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98354-3

  • Online ISBN: 978-3-030-98355-0

  • eBook Packages: Computer Science, Computer Science (R0)
