Abstract
In recent years, document image understanding (DIU) has received considerable attention from the research community. Localizing page objects (tables, figures, equations) in document images is a fundamental problem in DIU and the foundation for extracting information from document images. However, many challenges remain due to the high degree of intra-class variability in document pages, and object detection in Vietnamese document images in particular is still underexplored. In this paper, we propose CDeRSNet, a novel end-to-end trainable deep learning network for object detection in Vietnamese documents. The proposed network combines Cascade R-CNN with a deformable-convolution backbone and the Rank & Sort (RS) loss. CDeRSNet localizes objects that vary in scale with high detection accuracy at high IoU thresholds. We empirically evaluate CDeRSNet on the Vietnamese document image dataset UIT-DODV, which contains four object classes: table, figure, caption, and formula. CDeRSNet achieves the best performance on UIT-DODV with 79.9% mAP, 5.4% higher than the previous best result. In addition, we provide a comprehensive evaluation and insightful analysis of CDeRSNet. Finally, we show that CDeRSNet outperforms state-of-the-art object detectors such as GFocal, GFocalV2, VFNet, and DetectoRS on the UIT-DODV dataset. Code is available at: https://github.com/trongthuan205/CDeRSNet.git.
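The components named in the abstract (Cascade R-CNN, a deformable-convolution backbone, and RS Loss) are typically assembled through a detection framework's configuration. The fragment below is a minimal, hypothetical MMDetection-style sketch of such a combination; it is not the authors' released code, and every key, value, and type name here is an illustrative assumption rather than the paper's actual settings.

```python
# Hypothetical MMDetection-style config sketch for a CDeRSNet-like detector:
# a Cascade R-CNN head, a backbone with deformable convolutions (DCN), and
# the Rank & Sort (RS) loss in place of the standard classification loss.
# All values below are illustrative assumptions.

model = dict(
    type="CascadeRCNN",  # multi-stage detector (Cai & Vasconcelos)
    backbone=dict(
        type="ResNeXt",
        depth=101,
        # Deformable convolutions let sampling locations adapt to object shape,
        # which helps with the high intra-class variability of page objects.
        dcn=dict(type="DCNv2", deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True),
    ),
    neck=dict(type="FPN", in_channels=[256, 512, 1024, 2048], out_channels=256),
    roi_head=dict(
        type="CascadeRoIHead",
        num_stages=3,  # cascade stages trained at increasing IoU thresholds
        stage_loss_weights=[1.0, 0.5, 0.25],
        bbox_head=[
            dict(
                type="Shared2FCBBoxHead",
                num_classes=4,  # table, figure, caption, formula (UIT-DODV)
                loss_cls=dict(type="RankSortLoss"),  # RS Loss (Oksuz et al.)
            )
            for _ in range(3)
        ],
    ),
)

# Cascade R-CNN conventionally raises the positive-sample IoU threshold per
# stage, pushing the detector toward high-quality localization.
cascade_iou_thresholds = [0.5, 0.6, 0.7]
```

The cascade's rising IoU thresholds are what make detection "high quality at a higher IoU threshold": each stage is trained on progressively better-localized proposals produced by the previous one.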
References
Agarwal, M., Mondal, A., Jawahar, C.V.: CDeC-Net: composite deformable cascade network for table detection in document images. arXiv: 2008.10831 [cs.CV] (2020)
Alcácer, V., Cruz-Machado, V.: Scanning the industry 4.0: a literature review on technologies for manufacturing systems. Eng. Sci. Technol. Int. J. 22(3), 899–919 (2019)
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018). https://doi.org/10.1109/CVPR.2018.00644
Cesarini, F., Marinai, S., Sarti, L., Soda, G.: Trainable table location in document images. In: Object Recognition Supported by User Interaction for Service Robots, vol. 3, pp. 236–240. IEEE (2002)
Dai, J., et al.: Deformable convolutional networks. arXiv: 1703.06211 [cs.CV] (2017)
Dieu, L.T., Nguyen, T.T., Vo, N.D., Nguyen, T.V., Nguyen, K.: Parsing digitized Vietnamese paper documents. In: Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M. (eds.) CAIP 2021. LNCS, vol. 13052, pp. 382–392. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89128-2_37
Gatos, B., Danatsas, D., Pratikakis, I., Perantonis, S.J.: Automatic table detection in document images. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686, pp. 609–618. Springer, Heidelberg (2005). https://doi.org/10.1007/11551188_67
Li, X., et al.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. arXiv: 2011.12885 [cs.CV] (2020)
Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv: 2006.04388 [cs.CV] (2020)
Li, Y., et al.: Rethinking table structure recognition using sequence labeling methods. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 541–553. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_35
Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: A ranking based, balanced loss function unifying classification and localisation in object detection. arXiv preprint arXiv:2009.13592 (2020)
Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: Rank & sort loss for object detection and instance segmentation. arXiv preprint arXiv:2107.11669 (2021)
Peng, S., Gao, L., Yuan, K., Tang, Z.: Image to LaTeX with graph neural network for mathematical formula recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 648–663. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_42
Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. arXiv: 2004.12629 [cs.CV] (2020)
Qiao, L., et al.: LGPMA: complicated table structure recognition with local and global pyramid mask alignment. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 99–114. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_7
Qiao, S., Chen, L.-C., Yuille, A.: DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv: 2006.02334 [cs.CV] (2020)
Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)
Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Empirical evaluation of state-of-the-art object detection methods for document image understanding. In: Fundamental and Applied IT Research Conference, pp. 180–184 (2017). https://doi.org/10.15625/vap.2017.00022
Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Ensemble of deep object detectors for page object detection. In: Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, pp. 1–6 (2018)
Zhang, H., Wang, Y., Dayoub, F., Sünderhauf, N.: VarifocalNet: an IoU-aware dense object detector. arXiv: 2008.13367 [cs.CV] (2021)
Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X.: Dynamic R-CNN: towards high quality object detection via dynamic training. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 260–275. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_16
Zhang, P., et al.: VSR: a unified framework for document layout analysis combining vision, semantics and relations. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 115–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_8
Acknowledgment
This research is funded by the University of Information Technology − Vietnam National University Ho Chi Minh City under grant number D1-2021-32. We express our sincere thanks to the UIT-Together Research Group, Multimedia Communications Laboratory (MMLab), Faculty of Information Science and Engineering − University of Information Technology − Vietnam National University − Ho Chi Minh City, for supporting our team in this research.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Nguyen, T.T., Nguyen, T.Q., Duong, L., Vo, N.D., Nguyen, K. (2022). CDeRSNet: Towards High Performance Object Detection in Vietnamese Document Images. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_36
Print ISBN: 978-3-030-98354-3
Online ISBN: 978-3-030-98355-0