Automated Structured Data Extraction from Scanned Document Images

Nigam, Shivani

doi:10.1007/978-981-19-2600-6_4

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 137)

Included in the following conference series:

International Conference on Data Management, Analytics & Innovation

Abstract

Digital technologies are becoming part of every sector, including banking, automotive, and infrastructure. These technologies are powered by data, which raises the need to digitize documents so that data is available to drive digital transformation across sectors. Digitization requires extracting a huge amount of data from paper-based documents. Automating this extraction helps handle large volumes of data at lower cost and with less effort. This chapter proposes a solution that uses open-source components to automate data extraction from scanned documents with minimal user input. The solution generates structured output that reflects the document layout along with the document's data, and it extracts data from tables and stamps in a well-structured format. It is driven by a configuration file, which allows fine-tuning of the individual processes to improve the extracted data. For each scanned document, the solution generates an XML file that other digital processes can use to store and process the data contained in paper-based documents.
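
The abstract does not give the configuration keys or the XML schema, so the following is only a minimal sketch of the general idea: a configuration-driven step that turns already-extracted layout blocks (text, tables, stamps) into layout-preserving XML. Every name here (CONFIG, EXTRACTED, build_xml, the element and attribute names, the confidence threshold) is a hypothetical placeholder and not the chapter's actual design.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the chapter's real config keys are not
# given in the abstract.
CONFIG = {"include_stamps": True, "min_confidence": 0.5}

# Hypothetical output of an upstream OCR/layout stage (invented sample data).
EXTRACTED = {
    "page": 1,
    "blocks": [
        {"type": "text", "bbox": (40, 30, 560, 60),
         "text": "Invoice No. 1024", "conf": 0.97},
        {"type": "text", "bbox": (40, 70, 120, 90),
         "text": "smudge", "conf": 0.21},
        {"type": "table", "bbox": (40, 100, 560, 300),
         "rows": [["Item", "Qty"], ["Paper", "2"]], "conf": 0.90},
        {"type": "stamp", "bbox": (400, 320, 540, 400),
         "text": "PAID", "conf": 0.80},
    ],
}


def build_xml(extracted, config):
    """Build an XML tree that keeps block order and bounding boxes."""
    root = ET.Element("document")
    page = ET.SubElement(root, "page", number=str(extracted["page"]))
    for block in extracted["blocks"]:
        # Config-driven filtering: drop low-confidence blocks and,
        # optionally, stamps.
        if block["conf"] < config["min_confidence"]:
            continue
        if block["type"] == "stamp" and not config["include_stamps"]:
            continue
        x0, y0, x1, y1 = block["bbox"]
        node = ET.SubElement(page, block["type"], bbox=f"{x0},{y0},{x1},{y1}")
        if block["type"] == "table":
            # Tables become nested row/cell elements.
            for row in block["rows"]:
                row_node = ET.SubElement(node, "row")
                for cell in row:
                    ET.SubElement(row_node, "cell").text = cell
        else:
            node.text = block["text"]
    return root


xml_text = ET.tostring(build_xml(EXTRACTED, CONFIG), encoding="unicode")
print(xml_text)
```

Changing a value in CONFIG (for example, setting include_stamps to False) changes the output without touching the code, which is the kind of fine-tuning a configuration file is meant to allow.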

Author information

Authors and Affiliations

Tata Consultancy Services (TCS), Mumbai, India
Shivani Nigam

Authors

Shivani Nigam

Corresponding author

Correspondence to Shivani Nigam.

Editor information

Editors and Affiliations

Bangabasi Morning College, Kolkata, West Bengal, India
Saptarsi Goswami
Vara Technology, Saket, Delhi, India
Inderjit Singh Barara
Society for Data Science, Pune, Maharashtra, India
Amol Goje
National University of Singapore, Singapore, Singapore
C. Mohan
Department of Computer Science, Technion—Israel Institute of Technology, Haifa, Israel
Alfred M. Bruckstein

About this paper

Cite this paper

Nigam, S. (2023). Automated Structured Data Extraction from Scanned Document Images. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_4

DOI: https://doi.org/10.1007/978-981-19-2600-6_4
Published: 22 September 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2599-3
Online ISBN: 978-981-19-2600-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)