Document Domain Randomization for Deep Learning Document Layout Extraction

Ling, Meng; Chen, Jian; Möller, Torsten; Isenberg, Petra; Isenberg, Tobias; Sedlmair, Michael; Laramee, Robert S.; Shen, Han-Wei; Wu, Jian; Giles, C. Lee

doi:10.1007/978-3-030-86549-8_32

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12821)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

We present document domain randomization (DDR), the first successful transfer of CNNs trained only on graphically rendered pseudo-paper pages to real-world document segmentation. DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest, with user-defined layout and font styles to support joint learning of fine-grained classes. We demonstrate competitive results using our DDR approach to extract nine document classes from the benchmark CS-150 and from papers published in two domains, namely the annual meetings of the Association for Computational Linguistics (ACL) and IEEE Visualization (VIS). We compare DDR to conditions of style mismatch and to fewer or noisier samples, which are more easily obtained in the real world. We show that high-fidelity semantic information is not necessary to label semantic classes, but that style mismatch between training and test data can lower model accuracy. Using smaller training samples had a slightly detrimental effect. Finally, network models still achieved high test accuracy when correct labels were diluted towards confusing labels; this behavior held across several classes.
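The page-randomization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' renderer: the class names, font list, page size, and the `sample_page` helper are all hypothetical stand-ins, since the abstract does not list the nine classes or describe the rendering pipeline. The sketch only samples labeled bounding boxes for one pseudo-page; drawing actual text and figures into those boxes is left out.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative class names only; the abstract does not name the nine classes.
TEXT_CLASSES = ["title", "body_text", "caption"]
NON_TEXT_CLASSES = ["figure", "table"]
# Stand-ins for the user-defined font styles mentioned in the abstract.
FONTS = ["serif", "sans-serif"]


@dataclass
class Region:
    label: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units
    font: Optional[str]  # None for non-textual regions


def sample_page(rng, width=612, height=792, columns=2, margin=36, gap=12):
    """Sample one pseudo-page by stacking randomly sized regions per column."""
    col_w = (width - 2 * margin - (columns - 1) * gap) / columns
    regions = []
    for c in range(columns):
        x0 = margin + c * (col_w + gap)
        y = margin
        while True:
            h = rng.uniform(40, 200)  # random region height
            if y + h > height - margin:
                break  # column is full
            if rng.random() < 0.3:
                label, font = rng.choice(NON_TEXT_CLASSES), None
            else:
                label, font = rng.choice(TEXT_CLASSES), rng.choice(FONTS)
            regions.append(Region(label, (x0, y, x0 + col_w, y + h), font))
            y += h + gap
    return regions


if __name__ == "__main__":
    for region in sample_page(random.Random(0)):
        print(region)
```

Because the generator places each region itself, the class label and bounding box of every region are known exactly, which is what would let a detector be trained without manual annotation. Passing a seeded `random.Random` keeps each sampled page reproducible.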

Acknowledgements

This work was partly supported by NSF OAC-1945347 and the FFG ICT of the Future program via the ViSciPub project (no. 867378).

Author information

Authors and Affiliations

The Ohio State University, Columbus, USA
Meng Ling, Jian Chen & Han-Wei Shen
University of Vienna, Wien, Austria
Torsten Möller
Université Paris-Saclay, CNRS, Inria, LISN, Gif-sur-Yvette, France
Petra Isenberg & Tobias Isenberg
University of Stuttgart, Stuttgart, Germany
Michael Sedlmair
University of Nottingham, Nottingham, UK
Robert S. Laramee
Old Dominion University, Norfolk, USA
Jian Wu
The Pennsylvania State University, State College, USA
C. Lee Giles

Corresponding author

Correspondence to Meng Ling.

Editor information

Editors and Affiliations

Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Lladós
Lehigh University, Bethlehem, PA, USA
Daniel Lopresti
Kyushu University, Fukuoka-shi, Japan
Seiichi Uchida

About this paper

Cite this paper

Ling, M. et al. (2021). Document Domain Randomization for Deep Learning Document Layout Extraction. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_32

DOI: https://doi.org/10.1007/978-3-030-86549-8_32
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86548-1
Online ISBN: 978-3-030-86549-8
eBook Packages: Computer Science, Computer Science (R0)
