DocILE Benchmark for Document Information Localization and Extraction

  • Conference paper
  • Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer; the baselines are applied to both tasks of the DocILE benchmark, and the results shared in this paper offer a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.
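
As a quick start with the benchmark, the linked repository ships a Python package for working with the data. The sketch below assumes the docile package's Dataset class and its annotation accessors as we read them from the repository README; names and signatures should be verified against the current version of the repository.

from docile.dataset import Dataset

# Load the annotated training split from an unpacked copy of the dataset
# (the split name and directory layout are assumptions based on the README).
dataset = Dataset("train", "data/docile")

for document in dataset:
    annotation = document.annotation
    # KILE fields vs. Line Item fields per document (accessor names assumed).
    print(document.docid, len(annotation.fields), len(annotation.li_fields))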

Notes

  1. We use the term semi-structured documents as in [62, 69]: the visual structure is strongly related to the document semantics, but the layout is variable.

  2. Using a hash of the page images to capture duplicates that differ only in PDF metadata; a minimal sketch follows this list.

  3. Invoice-like documents are tax invoice, order, purchase order, receipt, sales order, proforma invoice, credit note, utility bill and debit note. We used a proprietary document-type classifier provided by Rossum.ai.

  4. The document date was retrieved from the UCSF IDL metadata. Note that the majority of the documents in this source are from the 20th century.

  5. We loosely define layout as the positioning of fields of each type in a document. We allow, e.g., different lengths of values, missing values, and the resulting translations of whole sections.

  6. For the test set, documents in both the training and validation sets are considered as seen during training. Note that some test set layouts may be present in the validation set but not in the training set.

  7. https://rrc.cvc.uab.es/.

  8. Such as generators of names, emails, addresses, bank account numbers, etc. Some utilize the Mimesis library [16]. Some content, such as keys, is copied from the annotated document.

  9. Pre-processing consists of correcting page orientation, de-skewing scanned documents and normalizing them to 150 DPI.

  10. Axis-aligned bounding boxes, optionally with additional snapping to reduce white space around word predictions, as described in the Supplementary Material; a snapping sketch also follows this list.

  11. https://github.com/rossumai/docile.

  12. Note that LayoutLMv3-BASE [28] used two additional pre-training objectives, namely masked image modelling and word-patch alignment. Since the pre-training code is not publicly available and some of the implementation details are missing, LayoutLMv3-OURS used only masked language modelling.

  13. https://huggingface.co/facebook/detr-resnet-50.
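
Two of the notes above describe small but concrete procedures, so we illustrate each with a minimal Python sketch. First, the page-image-hash de-duplication from note 2. This is a sketch of the general idea only: the renderer (pdf2image), the rendering DPI and the SHA-256 digest are our assumptions, not the authors' exact pipeline.

import hashlib
from pathlib import Path

from pdf2image import convert_from_path  # assumed rasterizer; any PDF renderer works


def page_image_hash(pdf_path: Path, dpi: int = 72) -> str:
    """Hash raw page pixels, so PDF metadata never enters the digest."""
    digest = hashlib.sha256()
    for page in convert_from_path(str(pdf_path), dpi=dpi):
        digest.update(page.tobytes())
    return digest.hexdigest()


def deduplicate(pdf_paths: list[Path]) -> dict[str, list[Path]]:
    """Group documents whose rendered pages are pixel-identical."""
    groups: dict[str, list[Path]] = {}
    for path in pdf_paths:
        groups.setdefault(page_image_hash(path), []).append(path)
    return groups

Second, the bounding-box snapping from note 10: an axis-aligned box is shrunk to the tightest box around the non-white pixels it contains. The grayscale conversion and the white threshold below are illustrative assumptions; the exact procedure is described in the Supplementary Material.

import numpy as np
from PIL import Image


def snap_bbox(page: Image.Image, bbox: tuple[int, int, int, int],
              white_threshold: int = 230) -> tuple[int, int, int, int]:
    """Shrink bbox = (left, top, right, bottom) to the ink inside it."""
    left, top, _, _ = bbox
    region = np.asarray(page.convert("L").crop(bbox))  # grayscale crop
    rows, cols = np.where(region < white_threshold)    # non-white pixels
    if rows.size == 0:                                 # blank region: keep the box
        return bbox
    return (left + int(cols.min()), top + int(rows.min()),
            left + int(cols.max()) + 1, top + int(rows.max()) + 1)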

References

  1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: DocFormer: end-to-end transformer for document understanding. In: ICCV (2021)

  2. Baek, Y., et al.: CLEval: character-level evaluation for text detection and recognition tasks. In: CVPR Workshops (2020)

  3. Bensch, O., Popa, M., Spille, C.: Key information extraction from documents: evaluation and generator. In: Abbès, S.B., et al. (eds.) Proceedings of DeepOntoNLP and X-SENTIMENT (2021)

  4. Biten, A.F., Tito, R., Gomez, L., Valveny, E., Karatzas, D.: OCR-IDL: OCR annotations for industry document library dataset. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) Computer Vision – ECCV 2022 Workshops. ECCV 2022. LNCS, vol. 13804, pp. 241–252. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-25069-9_16

  5. Borchmann, Ł., et al.: DUE: end-to-end document understanding benchmark. In: NeurIPS (2021)

  6. Bušta, M., Patel, Y., Matas, J.: E2E-MLT - an unconstrained end-to-end method for multi-language scene text. In: ACCV Workshops (2019)

  7. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  8. Davis, B., Morse, B., Cohen, S., Price, B., Tensmeyer, C.: Deep visual template-free form parsing. In: ICDAR (2019)

  9. Denk, T.I., Reisswig, C.: BERTgrid: contextualized embedding for 2D document representation and understanding. arXiv (2019)

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv (2018)

  11. Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic document data capture. In: Artificial Intelligence for Transforming Business and Society (AITB) (2019)

  12. Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)

  13. Du, Y., et al.: PP-OCR: a practical ultra lightweight OCR system. arXiv (2020)

  14. Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: Blumenstein, M., Pal, U., Uchida, S. (eds.) DAS (2012)

  15. Garncarek, Ł., et al.: LAMBERT: layout-aware language modeling for information extraction. In: ICDAR (2021)

  16. Geimfari, L.: Mimesis: the fake data generator (2022). http://github.com/lk-geimfari/mimesis

  17. Gu, J., et al.: UniDoc: unified pretraining framework for document understanding. In: NeurIPS (2021)

  18. Gu, Z., et al.: XYLayoutLM: towards layout-aware multimodal networks for visually-rich document understanding. In: CVPR (2022)

  19. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)

  20. Hamad, K.A., Mehmet, K.: A detailed analysis of optical character recognition technology. Int. J. Appl. Math. Electron. Comput. 2016, 244–249 (2016)

  21. Hamdi, A., Carel, E., Joseph, A., Coustaty, M., Doucet, A.: Information extraction from invoices. In: ICDAR (2021)

  22. Hammami, M., Héroux, P., Adam, S., d’Andecy, V.P.: One-shot field spotting on colored forms using subgraph isomorphism. In: ICDAR (2015)

  23. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: ICDAR (2015)

  24. Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: TaPas: weakly supervised table parsing via pre-training. arXiv (2020)

  25. Holeček, M., Hoskovec, A., Baudiš, P., Klinger, P.: Table understanding in structured documents. In: ICDAR Workshops (2019)

  26. Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings of the Australasian Language Technology Association Workshop 2018, pp. 53–59 (2018)

  27. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: BROS: a pre-trained language model focusing on text and layout for better key information extraction from documents. In: AAAI (2022)

  28. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: pre-training for document AI with unified text and image masking. In: ACM-MM (2022)

  29. Huang, Z., et al.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: ICDAR (2019)

  30. Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for semi-structured document information extraction. arXiv (2020)

  31. Islam, N., Islam, Z., Noor, N.: A survey on optical character recognition system. arXiv (2017)

  32. Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: a dataset for form understanding in noisy scanned documents. In: ICDAR (2019)

  33. Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP (2018)

  34. Kil, J., Chao, W.L.: Revisiting document representations for large-scale zero-shot learning. arXiv (2021)

  35. Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices: a graph neural network approach for datasets with high layout variety. In: Innovation Through Information Systems: Volume II: A Collection of Latest Research on Technology Issues (2021)

  36. Lee, C.Y., et al.: FormNet: structural encoding beyond sequential modeling in form document information extraction. In: ACL (2022)

  37. Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: SIGIR (2006)

  38. Li, C., et al.: StructuralLM: structural pre-training for form understanding. In: ACL (2021)

  39. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34, 50–70 (2020)

  40. Li, Y., et al.: StrucTexT: structured text understanding with multi-modal transformers. In: ACM-MM (2021)

  41. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  42. Lin, W., et al.: ViBERTgrid: a jointly trained multi-modal 2D document representation for key information extraction from documents. In: ICDAR (2021)

  43. Liu, W., Zhang, Y., Wan, B.: Unstructured document recognition on business invoice. Technical report (2016)

  44. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019)

  45. Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convolutional network. In: ACCV Workshops (2018)

  46. Majumder, B.P., Potti, N., Tata, S., Wendt, J.B., Zhao, Q., Najork, M.: Representation learning for information extraction from form-like documents. In: ACL (2020)

  47. Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: InfographicVQA. In: WACV (2022)

  48. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: a dataset for VQA on document images. In: WACV (2021)

  49. Medvet, E., Bartoli, A., Davanzo, G.: A probabilistic approach to printed document understanding. In: ICDAR (2011)

  50. Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character recognition (OCR): a comprehensive systematic literature review (SLR). IEEE Access 8, 142642–142668 (2020)

  51. Mindee: docTR: document text recognition (2021). http://github.com/mindee/doctr

  52. Nassar, A., Livathinos, N., Lysak, M., Staar, P.W.J.: TableFormer: table structure understanding with transformers. arXiv (2022)

  53. Nayef, N., et al.: ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition - RRC-MLT-2019. In: ICDAR (2019)

  54. Olejniczak, K., Šulc, M.: Text detection forgot about document OCR. In: CVWW (2023)

  55. Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse: end-to-end information extraction from documents. In: ICDAR (2019)

  56. Palm, R.B., Winther, O., Laws, F.: CloudScan - a configuration-free invoice analysis system using recurrent neural networks. In: ICDAR (2017)

  57. Pampari, A., Ermon, S.: Unsupervised calibration under covariate shift. arXiv (2020)

  58. Park, S., et al.: CORD: a consolidated receipt dataset for post-OCR parsing. In: NeurIPS Workshops (2019)

  59. Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka, G.: Going full-TILT boogie on document understanding with text-image-layout transformer. In: ICDAR (2021)

  60. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 5485–5551 (2020)

  61. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)

  62. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)

  63. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)

  64. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: ICDAR (2017)

  65. Schuster, D., et al.: Intellix - end-user trained information extraction for document archiving. In: ICDAR (2013)

  66. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: Chen, J., Gonçalves, M.A., Allen, J.M., Fox, E.A., Kan, M., Petras, V. (eds.) Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL (2018)

  67. Šimsa, Š., Šulc, M., Skalický, M., Patel, Y., Hamdi, A.: DocILE 2023 teaser: document information localization and extraction. In: ECIR (2023)

  68. Šipka, T., Šulc, M., Matas, J.: The hitchhiker’s guide to prior-shift adaptation. In: WACV (2022)

  69. Skalický, M., Šimsa, Š., Uřičář, M., Šulc, M.: Business document information extraction: towards practical benchmarks. In: CLEF (2022)

  70. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR (2007)

  71. Smock, B., Pesala, R., Abraham, R.: PubTables-1M: towards comprehensive table extraction from unstructured documents. In: CVPR (2022)

  72. Stanisławek, T., et al.: Kleister: key information extraction datasets involving long documents with complex layouts. In: ICDAR (2021)

  73. Stray, J., Svetlichnaya, S.: DeepForm: extract information from documents (2020). http://wandb.ai/deepform/political-ad-extraction (benchmark)

  74. Sun, H., Kuang, Z., Yue, X., Lin, C., Zhang, W.: Spatial dual-modality graph reasoning for key information extraction. arXiv (2021)

  75. Sunder, V., Srinivasan, A., Vig, L., Shroff, G., Rahul, R.: One-shot information extraction from document images using neuro-deductive program synthesis. arXiv (2019)

  76. Tanaka, R., Nishida, K., Yoshida, S.: VisualMRC: machine reading comprehension on document images. In: AAAI (2021)

  77. Tang, Z., et al.: Unifying vision, text, and layout for universal document processing. arXiv (2022)

  78. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: ICDAR (2019)

  79. Wang, J., et al.: Towards robust visual information extraction in real world: new dataset and novel solution. In: AAAI (2021)

  80. Web: Industry Documents Library. www.industrydocuments.ucsf.edu/. Accessed 20 Oct 2022

  81. Web: Industry Documents Library API. www.industrydocuments.ucsf.edu/research-tools/api/. Accessed 20 Oct 2022

  82. Web: Public Inspection Files. http://publicfiles.fcc.gov/. Accessed 20 Oct 2022

  83. Xu, Y., et al.: LayoutLMv2: multi-modal pre-training for visually-rich document understanding. In: ACL (2021)

  84. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)

  85. Xu, Y., et al.: LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding. arXiv (2021)

  86. Zhang, Z., Ma, J., Du, J., Wang, L., Zhang, J.: Multimodal pre-training based on graph attention network for document understanding. IEEE Trans. Multimed. (2022)

  87. Zhao, X., Wu, Z., Wang, X.: CUTIE: learning to understand documents with convolutional universal text information extractor. arXiv (2019)

  88. Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global Table Extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. In: WACV (2021)

  89. Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)

  90. Zhou, J., Yu, H., Xie, C., Cai, H., Jiang, L.: IRMP: from printed forms to relational data model. In: HPCC (2016)

  91. Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: ICCV (2015)

Acknowledgements

We acknowledge the funding and support from Rossum and the intensive work of its annotation team, particularly Petra Hrdličková and Kateřina Večerková. YP and JM were supported by the Research Center for Informatics (project CZ.02.1.01/0.0/0.0/16_019/0000765 funded by OP VVV), by the Grant Agency of the Czech Technical University in Prague, grant No. SGS20/171/OHK3/3T/13, by Project StratDL in the realm of COMET K1 center Software Competence Center Hagenberg, and by an Amazon Research Award. DK was supported by grant PID2020-116298GB-I00 funded by MCIN/AE/NextGenerationEU and by ELSA (GA 101070617) funded by the EU.

Author information

Corresponding author

Correspondence to Štěpán Šimsa.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Šimsa, Š. et al. (2023). DocILE Benchmark for Document Information Localization and Extraction. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_9

  • DOI: https://doi.org/10.1007/978-3-031-41679-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41678-1

  • Online ISBN: 978-3-031-41679-8

  • eBook Packages: Computer Science (R0)
