
Learning cell embeddings for understanding table layouts

  • Regular Paper
Knowledge and Information Systems

Abstract

A large amount of data on the web is in tabular form, such as Excel sheets, CSV files, and web tables. Tabular data is often meant for human consumption and uses data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model that embeds semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) that combines our pre-trained cell embeddings with the stylistic features introduced in previous work, in order to improve the performance of cell type classification in complex documents. We evaluate our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that the proposed cell vector representations, in combination with our RNN-based classification technique, significantly improve cell type classification performance.
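To make the classification step concrete, the following is a minimal sketch of the general idea only: each cell's embedding is concatenated with its stylistic features, and a simple recurrent network reads the cells of one row in order and emits a cell type per cell. It is not the architecture from the paper. The vector sizes, the plain tanh recurrence, the random weights standing in for trained parameters, and all function names are assumptions made for illustration.

```python
import math
import random

# Cell roles named in the abstract; the real label set may be larger.
CELL_TYPES = ["top attribute", "data", "metadata"]


def make_matrix(rows, cols, rng):
    """Small random weights: a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def classify_row(embeddings, style_features, params):
    """Run a simple tanh RNN left to right over the cells of one row.

    The input at each step is the cell embedding concatenated with the
    cell's stylistic features; each step yields one cell-type label.
    """
    w_in, w_rec, w_out = params
    hidden = [0.0] * len(w_rec)
    labels = []
    for emb, style in zip(embeddings, style_features):
        x = emb + style  # list concatenation: embedding + stylistic features
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(v) for v in pre]
        logits = matvec(w_out, hidden)
        labels.append(CELL_TYPES[logits.index(max(logits))])
    return labels


rng = random.Random(0)
EMB_DIM, STYLE_DIM, HIDDEN_DIM = 4, 3, 5  # hypothetical sizes
params = (
    make_matrix(HIDDEN_DIM, EMB_DIM + STYLE_DIM, rng),  # input weights
    make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng),           # recurrent weights
    make_matrix(len(CELL_TYPES), HIDDEN_DIM, rng),      # output weights
)
# Three cells in one row: made-up embeddings and one-hot style flags.
row_embeddings = [[rng.random() for _ in range(EMB_DIM)] for _ in range(3)]
row_styles = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
predicted = classify_row(row_embeddings, row_styles, params)
print(predicted)
```

Because the weights are random, the printed labels are meaningless; the sketch only shows how the two feature sources can be joined and passed through a recurrent classifier one cell at a time.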

Notes

  1. https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/.

  2. http://dbgroup.eecs.umich.edu/project/sheets/datasets.htm.

  3. https://ucr.fbi.gov/crime-in-the-u.s.

  4. Data and code: github.com/majidghgol/TabularCellTypeClassification.

  5. https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/.

  6. https://github.com/elviskoci/XCellAnnotator.

Acknowledgements

This research is supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under Contract Number FA8650-17-C-7715. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. We thank the anonymous reviewers for their helpful comments on the paper.

Author information

Corresponding author

Correspondence to Majid Ghasemi-Gol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ghasemi-Gol, M., Pujara, J. & Szekely, P. Learning cell embeddings for understanding table layouts. Knowl Inf Syst 63, 39–64 (2021). https://doi.org/10.1007/s10115-020-01508-6

  • DOI: https://doi.org/10.1007/s10115-020-01508-6
