Abstract
Tabular data is a crucial form of information expression: it organizes data in a standard structure for easy retrieval and comparison. However, in the financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) files and images, from which they are difficult to extract directly. In this paper, to facilitate deep-learning-based table extraction from unstructured digital files, we publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds together with their corresponding structure representations in JSON. In addition, we propose a novel graph-based convolutional neural network model named GFTE as a baseline for future comparison. GFTE integrates image features, position features and textual features for precise edge prediction and achieves good overall results (https://github.com/Irene323/GFTE).
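To make the "edge prediction" framing concrete, the following is a minimal sketch of how table structure recognition can be posed as a graph problem: text cells become nodes, candidate edges link nearby cells, and each edge carries a relation label that a model would learn to predict. The record fields (`id`, `text`, `box`, `row`, `col`), the k-nearest-neighbour edge construction and the three label names are illustrative assumptions. They are not the FinTab JSON schema and not the GFTE implementation.

```python
import math

# Hypothetical cell records. The field names are illustrative assumptions,
# not the actual FinTab JSON schema.
cells = [
    {"id": 0, "text": "Item",    "box": (0, 0, 50, 10),    "row": 0, "col": 0},
    {"id": 1, "text": "2019",    "box": (60, 0, 110, 10),  "row": 0, "col": 1},
    {"id": 2, "text": "Revenue", "box": (0, 20, 50, 30),   "row": 1, "col": 0},
    {"id": 3, "text": "1,200",   "box": (60, 20, 110, 30), "row": 1, "col": 1},
]


def center(box):
    """Return the center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def knn_edges(cells, k=2):
    """Link each cell to its k nearest cells (by center distance).

    Edges are undirected and returned as sorted (smaller_id, larger_id) pairs.
    """
    edges = set()
    for a in cells:
        ax, ay = center(a["box"])

        def dist(c):
            cx, cy = center(c["box"])
            return math.hypot(cx - ax, cy - ay)

        neighbours = sorted((c for c in cells if c["id"] != a["id"]), key=dist)
        for b in neighbours[:k]:
            edges.add((min(a["id"], b["id"]), max(a["id"], b["id"])))
    return sorted(edges)


def edge_label(a, b):
    """Ground-truth relation for an edge, derived from the structure annotation."""
    if a["row"] == b["row"]:
        return "same_row"
    if a["col"] == b["col"]:
        return "same_col"
    return "unrelated"


by_id = {c["id"]: c for c in cells}
edges = knn_edges(cells, k=2)
labels = {e: edge_label(by_id[e[0]], by_id[e[1]]) for e in edges}

for e in edges:
    print(e, labels[e])
```

In a learned model, the hand-written `edge_label` function would be replaced by a classifier over per-node features (image, position and text), with labels like these serving as training targets.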
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Huang, Z., Yan, J., Zhou, Y., Ye, F., Liu, X. (2021). GFTE: Graph-Based Financial Table Extraction. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12662. Springer, Cham. https://doi.org/10.1007/978-3-030-68790-8_50
DOI: https://doi.org/10.1007/978-3-030-68790-8_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68789-2
Online ISBN: 978-3-030-68790-8
eBook Packages: Computer Science, Computer Science (R0)