GFTE: Graph-Based Financial Table Extraction

  • Conference paper
Pattern Recognition. ICPR International Workshops and Challenges (ICPR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12662)

Abstract

Tabular data is a crucial form of information expression, organizing data in a standard structure for easy information retrieval and comparison. However, in the financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) files and images, from which they are difficult to extract directly. In this paper, to facilitate deep learning based table extraction from unstructured digital files, we publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds and their corresponding structure representations in JSON. In addition, we propose a novel graph-based convolutional neural network model named GFTE as a baseline for future comparison. GFTE integrates image features, position features and textual features for precise edge prediction and achieves good overall results (https://github.com/Irene323/GFTE).
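
The edge-prediction framing described in the abstract can be illustrated with a small sketch. It treats each text chunk of a table as a graph node, links every node to its nearest neighbours, and builds one fused feature vector per candidate edge for a classifier to label. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Chunk` fields, the k-nearest-neighbour graph construction, the toy `text_feature` and `image_feature` values, fusion by concatenation, and the example edge labels.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Chunk:
    """A text chunk in a table image (hypothetical structure)."""
    text: str
    x: float  # centre x, normalised to [0, 1]
    y: float  # centre y, normalised to [0, 1]
    image_feature: tuple  # stand-in for a visual feature vector


def text_feature(text):
    # Toy textual feature: length and share of digit characters.
    digits = sum(ch.isdigit() for ch in text)
    return (float(len(text)), digits / max(len(text), 1))


def node_feature(chunk):
    # Fuse position, textual and image features by concatenation (assumed).
    return (chunk.x, chunk.y) + text_feature(chunk.text) + tuple(chunk.image_feature)


def knn_edges(chunks, k):
    # Connect each chunk to its k nearest neighbours by centre distance.
    edges = set()
    for i, a in enumerate(chunks):
        others = [j for j in range(len(chunks)) if j != i]
        others.sort(key=lambda j: hypot(a.x - chunks[j].x, a.y - chunks[j].y))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def edge_features(chunks, k=2):
    # One fused vector per candidate edge; a classifier would label each
    # edge, e.g. as "same row", "same column" or "unrelated" (assumed labels).
    feats = [node_feature(c) for c in chunks]
    return {(i, j): feats[i] + feats[j] for i, j in knn_edges(chunks, k)}


chunks = [
    Chunk("Revenue", 0.2, 0.1, (0.5,)),
    Chunk("1,200", 0.8, 0.1, (0.4,)),
    Chunk("Cost", 0.2, 0.3, (0.5,)),
    Chunk("900", 0.8, 0.3, (0.4,)),
]
for edge, vector in edge_features(chunks).items():
    print(edge, len(vector))
```

In a learned model, the hand-written features would be replaced by trained encoders, and the per-edge vectors would feed a classifier rather than being printed.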


Author information

Corresponding author

Correspondence to Yiren Li.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Huang, Z., Yan, J., Zhou, Y., Ye, F., Liu, X. (2021). GFTE: Graph-Based Financial Table Extraction. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12662. Springer, Cham. https://doi.org/10.1007/978-3-030-68790-8_50

  • DOI: https://doi.org/10.1007/978-3-030-68790-8_50

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68789-2

  • Online ISBN: 978-3-030-68790-8

  • eBook Packages: Computer Science, Computer Science (R0)
