Automated Structured Data Extraction from Scanned Document Images

  • Conference paper
Data Management, Analytics and Innovation (ICDMAI 2022)

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 137)

Abstract

Digital technologies are becoming part of every sector, including banking, automobile, and infrastructure. These technologies are powered by data, which increases the need to digitize documents so that data is available to drive digital transformation across sectors. Digitization requires extracting large amounts of data from paper-based documents. Automating this extraction helps process large volumes of data at lower cost and with less effort. A solution is proposed that uses open-source components to automate data extraction from scanned documents with minimal user input. The solution generates structured output that reflects the document layout along with the data in the document, and it extracts data from tables and stamps present in documents in a well-structured format. It is driven by a configuration file, which helps fine-tune the individual processes to improve the extracted data. The solution generates an XML file for each scanned document, which other digital processes can then use to store and process the data present in paper-based documents.
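
The abstract does not give the configuration keys or the XML schema, so the following is only a minimal sketch of the general idea: detected layout blocks (text, table, stamp) are filtered according to a configuration and serialized to XML in reading order. Every name here (`CONFIG`, `min_confidence`, `include_stamps`, `BLOCKS`, `blocks_to_xml`, and the element names) is a hypothetical stand-in, and the block data is hand-written rather than produced by any OCR or detection model.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; these keys are assumptions, not the paper's.
CONFIG = {"min_confidence": 0.5, "include_stamps": True}

# Hand-written blocks standing in for the output of OCR and layout detection.
# bbox is (left, top, right, bottom) in pixels.
BLOCKS = [
    {"type": "stamp", "bbox": (400, 320, 540, 420), "confidence": 0.77,
     "text": "RECEIVED"},
    {"type": "text", "bbox": (40, 30, 560, 60), "confidence": 0.93,
     "text": "Invoice No. 1042"},
    {"type": "text", "bbox": (40, 70, 560, 90), "confidence": 0.31,
     "text": "smudged line"},
    {"type": "table", "bbox": (40, 120, 560, 300), "confidence": 0.88,
     "rows": [["Item", "Qty"], ["Paper", "3"]]},
]


def blocks_to_xml(blocks, config):
    """Serialize layout blocks to an XML tree in top-to-bottom, left-to-right order."""
    root = ET.Element("document")
    for block in sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0])):
        # Configuration-driven filtering: drop low-confidence blocks
        # and, optionally, stamps.
        if block["confidence"] < config["min_confidence"]:
            continue
        if block["type"] == "stamp" and not config["include_stamps"]:
            continue
        left, top, right, bottom = block["bbox"]
        node = ET.SubElement(root, block["type"],
                             bbox=f"{left},{top},{right},{bottom}")
        if block["type"] == "table":
            # Tables keep their row/cell structure in the output.
            for row in block["rows"]:
                row_node = ET.SubElement(node, "row")
                for cell in row:
                    ET.SubElement(row_node, "cell").text = cell
        else:
            node.text = block["text"]
    return root


xml_text = ET.tostring(blocks_to_xml(BLOCKS, CONFIG), encoding="unicode")
print(xml_text)
```

Sorting by the top-left corner of each bounding box is a simple way to make the output follow the page layout for single-column documents; multi-column pages would need a more careful reading-order step.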

Author information

Corresponding author

Correspondence to Shivani Nigam.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nigam, S. (2023). Automated Structured Data Extraction from Scanned Document Images. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_4
