Abstract
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts, and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer, applied to both tasks; their results are shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.
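For a quick start, the following is a minimal sketch of iterating over the dataset with the docile package from the repository above. The module path, Dataset constructor and the annotation/image accessors shown are assumptions based on our reading of the repository's README, not a confirmed API; check the released code before use.

```python
# A minimal sketch, assuming (not confirming) the docile package API:
# Dataset(split, data_root), document.annotation.fields/li_fields and
# document.page_image(i) are taken to exist as named here.
from docile.dataset import Dataset

dataset = Dataset("val", "path/to/docile")  # split name and data root
for document in dataset:
    kile_fields = document.annotation.fields    # Key Information fields
    line_items = document.annotation.li_fields  # Line Item fields
    first_page = document.page_image(0)         # rendered page image
```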
Notes
1.
2. Using a hash of the page images to capture duplicates that differ only in PDF metadata (a minimal hashing sketch is given after these notes).
3. Invoice-like document types are tax invoice, order, purchase order, receipt, sales order, proforma invoice, credit note, utility bill and debit note. We used a proprietary document-type classifier provided by Rossum.ai.
4. The document date was retrieved from the UCSF IDL metadata. Note that the majority of the documents in this source are from the 20th century.
5. We loosely define layout as the positioning of fields of each type in a document. We allow, e.g., different lengths of values, missing values, and the resulting translations of whole sections.
6. For the test set, documents in both the training and validation sets are considered as seen during training. Note that some test set layouts may be present in the validation set but not in the training set.
7.
8. Such as generators of names, emails, addresses, bank account numbers, etc., some of which utilize the Mimesis library [16] (see the sketch after these notes). Some content, such as keys, is copied from the annotated document.
9. Pre-processing consists of correcting page orientation, de-skewing scanned documents and normalizing them to 150 DPI (illustrated after these notes).
10. Axis-aligned bounding boxes, optionally with additional snapping to reduce white space around word predictions, as described in the Supplementary Material (an illustrative snapping sketch follows these notes).
11.
12. Note that LayoutLMv3_BASE [28] used two additional pre-training objectives, namely masked image modelling and word-patch alignment. Since the pre-training code is not publicly available and some of the implementation details are missing, LayoutLMv3_OURS used only masked language modelling (see the MLM sketch after these notes).
13.
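Note 2 de-duplicates by hashing rendered page images rather than the PDF bytes, so copies that differ only in metadata collapse together. The hash function and rendering settings are not specified in the notes; the sketch below, using SHA-256 over page image files, illustrates the idea rather than the authors' exact procedure.

```python
import hashlib
from pathlib import Path

def page_hash(image_path: Path) -> str:
    """Hash the raw bytes of one rendered page image."""
    return hashlib.sha256(image_path.read_bytes()).hexdigest()

def document_key(page_images: list[Path]) -> str:
    """Combine per-page hashes: two PDFs differing only in metadata
    render to identical pixels and therefore get the same key."""
    joined = "".join(page_hash(p) for p in page_images)
    return hashlib.sha256(joined.encode()).hexdigest()
```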
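Note 8's generators produce values such as names, emails and addresses, some via the Mimesis library [16]. A minimal sketch of that kind of generation follows; the field names are illustrative and do not come from the benchmark's annotation schema.

```python
from mimesis import Address, Person

person = Person()    # default (English) locale
address = Address()

synthetic_values = {
    "customer_name": person.full_name(),
    "customer_email": person.email(),
    "customer_address": address.address(),
}
print(synthetic_values)
```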
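Note 9's pre-processing could look roughly like the sketch below; pdf2image and OpenCV are our choice of tooling, not necessarily the authors', and the angle convention of cv2.minAreaRect differs between OpenCV versions, so the correction is approximate.

```python
import cv2
import numpy as np
from pdf2image import convert_from_path

def deskew(img: np.ndarray) -> np.ndarray:
    """Estimate the dominant skew angle from ink pixels and rotate it out."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    binary = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # minAreaRect reports angles within a 90-degree range
        angle -= 90
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)

# Render at the normalized 150 DPI, then de-skew each page.
pages = convert_from_path("document.pdf", dpi=150)
deskewed = [deskew(cv2.cvtColor(np.array(p), cv2.COLOR_RGB2BGR)) for p in pages]
```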
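The snapping of note 10 is specified in the paper's Supplementary Material; as a rough approximation of the idea, a predicted axis-aligned box can be shrunk to the tightest box around the ink pixels it contains:

```python
import numpy as np

def snap_bbox(binary_page: np.ndarray,
              bbox: tuple[int, int, int, int]) -> tuple[int, int, int, int]:
    """Shrink (left, top, right, bottom) to the tightest box around
    nonzero (ink) pixels inside it; keep the box if the crop is empty."""
    left, top, right, bottom = bbox
    ys, xs = np.nonzero(binary_page[top:bottom, left:right])
    if xs.size == 0:
        return bbox
    return (left + int(xs.min()), top + int(ys.min()),
            left + int(xs.max()) + 1, top + int(ys.max()) + 1)
```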
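The masked language modelling kept in LayoutLMv3_OURS (note 12) is the standard BERT-style objective. In the HuggingFace transformers library it is typically set up as below; this is a generic sketch of the objective, not the authors' pre-training code.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Randomly selects 15% of tokens; of those, 80% are replaced by the mask
# token, 10% by random tokens and 10% kept, and the model must reconstruct
# the original tokens at the selected positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```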
References
Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: DocFormer: end-to-end transformer for document understanding. In: ICCV (2021)
Baek, Y., et al.: CLEval: character-level evaluation for text detection and recognition tasks. In: CVPR Workshops (2020)
Bensch, O., Popa, M., Spille, C.: Key information extraction from documents: evaluation and generator. In: Abbès, S.B., et al. (eds.) Proceedings of DeepOntoNLP and X-SENTIMENT (2021)
Biten, A.F., Tito, R., Gomez, L., Valveny, E., Karatzas, D.: OCR-IDL: OCR annotations for industry document library dataset. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) Computer Vision – ECCV 2022 Workshops. ECCV 2022. LNCS, vol. 13804, pp. 241–252. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-25069-9_16
Borchmann, Ł., et al.: DUE: end-to-end document understanding benchmark. In: NeurIPS (2021)
Bušta, M., Patel, Y., Matas, J.: E2E-MLT - an unconstrained end-to-end method for multi-language scene text. In: ACCV Workshops (2019)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Davis, B., Morse, B., Cohen, S., Price, B., Tensmeyer, C.: Deep visual template-free form parsing. In: ICDAR (2019)
Denk, T.I., Reisswig, C.: BERTgrid: contextualized embedding for 2D document representation and understanding. arXiv (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv (2018)
Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic document data capture. In: Artificial Intelligence for Transforming Business and Society (AITB) (2019)
Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)
Du, Y., et al.: PP-OCR: a practical ultra lightweight OCR system. arXiv (2020)
Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: Blumenstein, M., Pal, U., Uchida, S. (eds.) DAS (2012)
Garncarek, Ł., et al.: LAMBERT: layout-aware language modeling for information extraction. In: ICDAR (2021)
Geimfari, L.: Mimesis: the fake data generator (2022). http://github.com/lk-geimfari/mimesis
Gu, J., et al.: UniDoc: unified pretraining framework for document understanding. In: NeurIPS (2021)
Gu, Z., et al.: XYLayoutLM: towards layout-aware multimodal networks for visually-rich document understanding. In: CVPR (2022)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)
Hamad, K.A., Mehmet, K.: A detailed analysis of optical character recognition technology. Int. J. Appl. Math. Electron. Comput. 2016, 244–249 (2016)
Hamdi, A., Carel, E., Joseph, A., Coustaty, M., Doucet, A.: Information extraction from invoices. In: ICDAR (2021)
Hammami, M., Héroux, P., Adam, S., d’Andecy, V.P.: One-shot field spotting on colored forms using subgraph isomorphism. In: ICDAR (2015)
Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: ICDAR (2015)
Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: TaPas: weakly supervised table parsing via pre-training. arXiv (2020)
Holeček, M., Hoskovec, A., Baudiš, P., Klinger, P.: Table understanding in structured documents. In: ICDAR Workshops (2019)
Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings of the Australasian Language Technology Association Workshop 2018, pp. 53–59 (2018)
Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: BROS: a pre-trained language model focusing on text and layout for better key information extraction from documents. In: AAAI (2022)
Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: pre-training for document AI with unified text and image masking. In: ACM-MM (2022)
Huang, Z., et al.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: ICDAR (2019)
Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for semi-structured document information extraction. arXiv (2020)
Islam, N., Islam, Z., Noor, N.: A survey on optical character recognition system. arXiv (2017)
Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: a dataset for form understanding in noisy scanned documents. In: ICDAR (2019)
Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP (2018)
Kil, J., Chao, W.L.: Revisiting document representations for large-scale zero-shot learning. arXiv (2021)
Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices: a graph neural network approach for datasets with high layout variety. In: Innovation Through Information Systems: Volume II: A Collection of Latest Research on Technology Issues (2021)
Lee, C.Y., et al.: FormNet: structural encoding beyond sequential modeling in form document information extraction. In: ACL (2022)
Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: SIGIR (2006)
Li, C., et al.: StructuralLM: structural pre-training for form understanding. In: ACL (2021)
Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34, 50–70 (2020)
Li, Y., et al.: StrucTexT: structured text understanding with multi-modal transformers. In: ACM-MM (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lin, W., et al.: ViBERTgrid: a jointly trained multi-modal 2D document representation for key information extraction from documents. In: ICDAR (2021)
Liu, W., Zhang, Y., Wan, B.: Unstructured document recognition on business invoice. Technical report (2016)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019)
Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convolutional network. In: ACCV Workshops (2018)
Majumder, B.P., Potti, N., Tata, S., Wendt, J.B., Zhao, Q., Najork, M.: Representation learning for information extraction from form-like documents. In: ACL (2020)
Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: InfographicVQA. In: WACV (2022)
Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: a dataset for VQA on document images. In: WACV (2021)
Medvet, E., Bartoli, A., Davanzo, G.: A probabilistic approach to printed document understanding. In: ICDAR (2011)
Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character recognition (OCR): a comprehensive systematic literature review (SLR). IEEE Access 8, 142642–142668 (2020)
Mindee: docTR: Document text recognition (2021). http://github.com/mindee/doctr
Nassar, A., Livathinos, N., Lysak, M., Staar, P.W.J.: TableFormer: table structure understanding with transformers. arXiv (2022)
Nayef, N., et al.: ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition-RRC-MLT-2019. In: ICDAR (2019)
Olejniczak, K., Šulc, M.: Text detection forgot about document OCR. In: CVWW (2023)
Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: ICDAR (2019)
Palm, R.B., Winther, O., Laws, F.: CloudScan - a configuration-free invoice analysis system using recurrent neural networks. In: ICDAR (2017)
Pampari, A., Ermon, S.: Unsupervised calibration under covariate shift. arXiv (2020)
Park, S., et al.: CORD: a consolidated receipt dataset for post-OCR parsing. In: NeurIPS Workshops (2019)
Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka, G.: Going full-tilt boogie on document understanding with text-image-layout transformer. In: ICDAR (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 5485–5551 (2020)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)
Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: ICDAR (2017)
Schuster, D., et al.: Intellix: end-user trained information extraction for document archiving. In: ICDAR (2013)
Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: Chen, J., Gonçalves, M.A., Allen, J.M., Fox, E.A., Kan, M., Petras, V. (eds.) Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL (2018)
Šimsa, Š., Šulc, M., Skalický, M., Patel, Y., Hamdi, A.: DocILE 2023 teaser: document information localization and extraction. In: ECIR (2023)
Šipka, T., Šulc, M., Matas, J.: The hitchhiker’s guide to prior-shift adaptation. In: WACV (2022)
Skalický, M., Šimsa, Š., Uřičář, M., Šulc, M.: Business document information extraction: towards practical benchmarks. In: CLEF (2022)
Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR (2007)
Smock, B., Pesala, R., Abraham, R.: PubTables-1M: towards comprehensive table extraction from unstructured documents. In: CVPR (2022)
Stanisławek, T., et al.: Kleister: key information extraction datasets involving long documents with complex layouts. In: ICDAR (2021)
Stray, J., Svetlichnaya, S.: DeepForm: extract information from documents (2020). http://wandb.ai/deepform/political-ad-extraction, benchmark
Sun, H., Kuang, Z., Yue, X., Lin, C., Zhang, W.: Spatial dual-modality graph reasoning for key information extraction. arXiv (2021)
Sunder, V., Srinivasan, A., Vig, L., Shroff, G., Rahul, R.: One-shot information extraction from document images using neuro-deductive program synthesis. arXiv (2019)
Tanaka, R., Nishida, K., Yoshida, S.: VisualMRC: machine reading comprehension on document images. In: AAAI (2021)
Tang, Z., et al.: Unifying vision, text, and layout for universal document processing. arXiv (2022)
Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: ICDAR (2019)
Wang, J., et al.: Towards robust visual information extraction in real world: new dataset and novel solution. In: AAAI (2021)
Web: Industry Documents Library. www.industrydocuments.ucsf.edu/. Accessed 20 Oct 2022
Web: Industry Documents Library API. www.industrydocuments.ucsf.edu/research-tools/api/. Accessed 20 Oct 2022
Web: Public Inspection Files. http://publicfiles.fcc.gov/. Accessed 20 Oct 2022
Xu, Y., et al.: LayoutLMv2: multi-modal pre-training for visually-rich document understanding. In: ACL (2021)
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)
Xu, Y., et al.: LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding. arXiv (2021)
Zhang, Z., Ma, J., Du, J., Wang, L., Zhang, J.: Multimodal pre-training based on graph attention network for document understanding. IEEE Trans. Multimed. (2022)
Zhao, X., Wu, Z., Wang, X.: CUTIE: learning to understand documents with convolutional universal text information extractor. arXiv (2019)
Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. In: WACV (2021)
Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)
Zhou, J., Yu, H., Xie, C., Cai, H., Jiang, L.: IRMP: from printed forms to relational data model. In: HPCC (2016)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: ICCV (2015)
Acknowledgements
We acknowledge the funding and support from Rossum and the intensive work of its annotation team, particularly Petra Hrdličková and Kateřina Večerková. YP and JM were supported by the Research Center for Informatics (project CZ.02.1.01/0.0/0.0/16_019/0000765 funded by OP VVV), by the Grant Agency of the Czech Technical University in Prague, grant No. SGS20/171/OHK3/3T/13, by project StratDL in the realm of the COMET K1 center Software Competence Center Hagenberg, and by an Amazon Research Award. DK was supported by grant PID2020-116298GB-I00 funded by MCIN/AEI/NextGenerationEU and by ELSA (GA 101070617) funded by the EU.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Šimsa, Š. et al. (2023). DocILE Benchmark for Document Information Localization and Extraction. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_9
DOI: https://doi.org/10.1007/978-3-031-41679-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8