Document Domain Randomization for Deep Learning Document Layout Extraction

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

We present document domain randomization (DDR), the first successful transfer of CNNs trained only on graphically rendered pseudo-paper pages to real-world document segmentation. DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest, with user-defined layout and font styles to support joint learning of fine-grained classes. We demonstrate competitive results using our DDR approach to extract nine document classes from the benchmark CS-150 and from papers published in two domains, namely the annual meetings of the Association for Computational Linguistics (ACL) and IEEE Visualization (VIS). We compare DDR to conditions of style mismatch and of fewer or noisier samples, which are more easily obtained in the real world. We show that high-fidelity semantic information is not necessary to label semantic classes, but that style mismatch between training and test data can lower model accuracy. Using smaller training samples had a slightly detrimental effect. Finally, network models still achieved high test accuracy when correct labels were diluted towards confusing labels; this behavior held across several classes.
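
To make the idea of randomized page rendering more concrete, the following is a minimal sketch of how labeled layouts for pseudo-pages might be sampled. It is not the authors' implementation: the class names, page dimensions, gutter width, and the `random_page` function are all assumptions chosen for illustration, and the sketch only produces labeled bounding boxes rather than rendering any text or images.

```python
import random
from dataclasses import dataclass

# Hypothetical class names for illustration only; the abstract does not
# list the nine classes used in the paper.
CLASSES = ["title", "body_text", "figure", "table", "caption"]


@dataclass
class Box:
    label: str  # semantic class of the region
    x: float    # left edge, in points
    y: float    # top edge, in points
    w: float    # width
    h: float    # height


def random_page(rng, page_w=612.0, page_h=792.0, margin=54.0, gutter=18.0):
    """Sample labeled boxes for one pseudo-page with a randomized layout.

    The column count, block heights, vertical gaps, and block labels are
    all drawn at random, so every call yields a different page style.
    """
    n_cols = rng.choice([1, 2])
    col_w = (page_w - 2 * margin - (n_cols - 1) * gutter) / n_cols
    boxes = []
    for c in range(n_cols):
        x = margin + c * (col_w + gutter)
        y = margin
        while True:
            h = rng.uniform(40.0, 200.0)
            if y + h > page_h - margin:
                break  # column is full
            boxes.append(Box(rng.choice(CLASSES), x, y, col_w, h))
            y += h + rng.uniform(6.0, 24.0)  # random gap before next block
    return boxes


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is reproducible
    for box in random_page(rng):
        print(box)
```

In a fuller pipeline, each box would be filled with randomized text or image content and rasterized, with the box list serving as the ground-truth annotation for training.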

Acknowledgements

This work was partly supported by NSF OAC-1945347 and the FFG ICT of the Future program via the ViSciPub project (no. 867378).

Author information

Corresponding author

Correspondence to Meng Ling.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ling, M. et al. (2021). Document Domain Randomization for Deep Learning Document Layout Extraction. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_32

  • DOI: https://doi.org/10.1007/978-3-030-86549-8_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science, Computer Science (R0)
