Abstract
Digital technologies are becoming part of every sector, including banking, automobile, and infrastructure. These technologies are powered by data, which raises the need to digitize documents in order to supply the data that drives digital transformation across sectors. Digitization requires extracting large amounts of data from paper-based documents. Automating this extraction makes it possible to handle large volumes of data at lower cost and with less effort. This paper proposes a solution that uses open-source components to automate data extraction from scanned documents with minimal user input. The solution generates structured output that reflects the document layout along with the data in the document, and it extracts data from tables and stamps in a well-structured format. It is driven by a configuration file, which allows the individual processes to be fine-tuned to improve the extracted data. For each scanned document, the solution generates an XML file that other digital processes can use to store and further process the data present in paper-based documents.
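As a rough illustration of the configuration-driven XML output described in the abstract, the following minimal sketch serializes already-extracted blocks into a layout-preserving XML document. It is not the paper's implementation: the element names, the block dictionary fields, and the configuration keys (min_confidence, include_stamps) are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; these keys are illustrative, not the paper's.
CONFIG = {"min_confidence": 0.5, "include_stamps": True}

# Hypothetical extraction results, as an upstream OCR/table stage might emit.
SAMPLE_BLOCKS = [
    {"type": "text", "x": 10, "y": 20, "confidence": 0.9, "text": "Invoice 42"},
    {"type": "table", "x": 10, "y": 60, "confidence": 0.8,
     "rows": [["Item", "Qty"], ["Pen", "2"]]},
    {"type": "stamp", "x": 300, "y": 400, "confidence": 0.3, "text": "PAID"},
]


def build_document_xml(blocks, config):
    """Serialize extracted blocks into an XML string, keeping their positions."""
    root = ET.Element("document")
    for block in blocks:
        # Skip extractions below the configured confidence threshold.
        if block["confidence"] < config["min_confidence"]:
            continue
        # Stamps can be switched off entirely through the configuration.
        if block["type"] == "stamp" and not config["include_stamps"]:
            continue
        node = ET.SubElement(
            root, block["type"], x=str(block["x"]), y=str(block["y"])
        )
        if block["type"] == "table":
            # Tables keep their row/cell structure in the output.
            for row in block["rows"]:
                row_el = ET.SubElement(node, "row")
                for cell in row:
                    ET.SubElement(row_el, "cell").text = cell
        else:
            node.text = block["text"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_document_xml(SAMPLE_BLOCKS, CONFIG))
```

In this sketch, changing a value in CONFIG (for example, lowering min_confidence) changes which blocks reach the XML without touching the code, which is the kind of fine-tuning a configuration file allows.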
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nigam, S. (2023). Automated Structured Data Extraction from Scanned Document Images. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_4
DOI: https://doi.org/10.1007/978-981-19-2600-6_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2599-3
Online ISBN: 978-981-19-2600-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)