Learning cell embeddings for understanding table layouts

Ghasemi-Gol, Majid; Pujara, Jay; Szekely, Pedro

doi:10.1007/s10115-020-01508-6

Regular Paper
Published: 07 September 2020

Volume 63, pages 39–64, (2021)

Knowledge and Information Systems

Abstract

A large amount of data on the web is in tabular form, such as Excel sheets, CSV files, and web tables. Tabular data is often meant for human consumption and uses data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model that embeds semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) that combines our pre-trained cell embeddings with the stylistic features introduced in previous work to improve cell type classification in complex documents. We evaluate our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that the proposed cell vector representations, combined with our RNN-based classification technique, significantly improve cell type classification performance.

Notes

https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/.
http://dbgroup.eecs.umich.edu/project/sheets/datasets.htm.
https://ucr.fbi.gov/crime-in-the-u.s.
Data and code: github.com/majidghgol/TabularCellTypeClassification.
https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/.
https://github.com/elviskoci/XCellAnnotator.


Acknowledgements

This research is supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under Contract Number FA8650-17-C-7715. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. We thank the anonymous reviewers for their helpful comments on the paper.

Author information

Authors and Affiliations

Information Sciences Institute, University of Southern California, Marina Del Rey, CA, 90292, USA
Majid Ghasemi-Gol, Jay Pujara & Pedro Szekely

Corresponding author

Correspondence to Majid Ghasemi-Gol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ghasemi-Gol, M., Pujara, J. & Szekely, P. Learning cell embeddings for understanding table layouts. Knowl Inf Syst 63, 39–64 (2021). https://doi.org/10.1007/s10115-020-01508-6

Received: 13 February 2020
Revised: 12 August 2020
Accepted: 17 August 2020
Published: 07 September 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s10115-020-01508-6
