Abstract
There is a large amount of data on the web in tabular form, such as Excel sheets, CSV files, and web tables. Often, tabular data is meant for human consumption, using data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify tabular cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model which can embed semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) to use our pre-trained cell embeddings, combining them with stylistic features introduced in previous work, in order to improve the performance of cell type classification in complex documents. We evaluate the performance of our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation result shows that our proposed cell vector representations in combination with our RNN-based classification technique significantly improve cell type classification performance.
Similar content being viewed by others
Notes
Data and code: github.com/majidghgol/TabularCellTypeClassification.
References
Abraham R, Erwig M (2006) Inferring templates from spreadsheets. In: Proceedings of the 28th international conference on Software engineering. ACM, pp 182–191
Adelfio MD, Samet H (2013) Schema extraction for tabular data on the web. Proc VLDB Endow 6(6):421–432
Ahsan R, Neamtu R, Rundensteiner E (2016) Towards spreadsheet integration using entity identification driven by a spatial-temporal model. In: Proceedings of the 31st annual ACM symposium on applied computing. ACM, pp 1083–1085
Azunre P, Corcoran C, Dhamani N, Gleason J, Honke G, Sullivan D, Ruppel R, Verma S, Morgan J (2019) Semantic classification of tabular datasets via character-level convolutional neural networks. arXiv preprint arXiv:1901.08456
Bhagavatula CS, Noraset T, Downey D (2015) Tabel: entity linking in web tables. In: International semantic web conference. Springer, pp 425–441
Cer D, Yang Y, Kong Sy, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C et al (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175
Chen Z, Cafarella M (2013) Automatic web spreadsheet data extraction. In: Proceedings of the 3rd international workshop on semantic search over the web. ACM, p 1
Chen Z, Cafarella M (2014) Integrating spreadsheet data via accurate and low-effort extraction. In: Proceedings of the 20th ACM SIGKDD. ACM, pp 1126–1135
Chen Z, Dadiomov S, Wesley R, Xiao G, Cory D, Cafarella M, Mackinlay J (2017) Spreadsheet property detection with rule-assisted active learning. In: Proceedings of the 2017 ACM on conference on information and knowledge management. ACM, pp 999–1008
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364
Crestan E, Pantel P (2011) Web-scale table census and classification. In: Proceedings of the fourth ACM international conference on Web search and data mining. ACM, pp 545–554
Cunha J, Saraiva J, Visser J (2009) From spreadsheets to relational databases and back. In: Proceedings of the 2009 ACM SIGPLAN workshop on partial evaluation and program manipulation. ACM, pp 179–188
Deng L, Zhang S, Balog K (2019) Table2vec: neural word and entity embeddings for table population and retrieval. arXiv preprint arXiv:1906.00041
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Dou W, Han S, Xu L, Zhang D, Wei J (2018) Expandable group identification in spreadsheets. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering. ACM, pp 498–508
Eberius J, Werner C, Thiele M, Braunschweig K, Dannecker L, Lehner W (2013) Deexcelerator: a framework for extracting relational data from partially structured documents. In: Proceedings of the 22nd ACM international conference on information and knowledge management. ACM, pp 2477–2480
Ghasemi-Gol M, Szekely P (2018) Tabvec: table vectors for classification of web tables. arXiv preprint arXiv:1802.06290
Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp. 3363–3372
Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882
Koci E, Thiele M, Lehner W, Romero O (2018) Table recognition in spreadsheets via a graph representation. In: 2018 13th IAPR international workshop on document analysis systems (DAS). IEEE, pp 139–144
Koci E, Thiele M, Romero O, Lehner W (2016) Cell classification for layout recognition in spreadsheets. In: International joint conference on knowledge discovery, knowledge engineering, and knowledge management. Springer, pp 78–100
Koci E, Thiele M, Romero Moral Ó, Lehner W (2016) A machine learning approach for layout inference in spreadsheets. In: IC3K 2016: proceedings of the 8th international joint conference on knowledge discovery, knowledge engineering and knowledge management: volume 1: KDIR. SciTePress, pp 77–88
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp 1188–1196
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Neishi M, Sakuma J, Tohda S, Ishiwatari S, Yoshinaga N, Toyoda M (2017) A bag of useful tricks for practical neural machine translation: embedding layer initialization and large batch size. In: Proceedings of the 4th workshop on Asian translation (WAT2017), pp 99–109
Nishida K, Sadamitsu K, Higashinaka R, Matsuo Y (2017) Understanding the semantic structures of tables with a hybrid deep neural network architecture. In: AAAI, pp 168–174
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp 1532–1543
Shigarov AO (2015) Table understanding using a rule engine. Expert Syst Appl 42(2):929–937
Shigarov AO, Paramonov VV, Belykh PV, Bondarev AI (2016) Rule-based canonicalization of arbitrary tables in spreadsheets. In: International conference on information and software technologies. Springer, pp 78–91
Su H, Li Y, Wang X, Hao G, Lai Y, Wang W (2017) Transforming a nonstandard table into formalized tables. In: Web information systems and applications conference, 2017 14th. IEEE, pp 311–316
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Wang X (1996) Tabular abstraction, editing, and formatting. PhD thesis, University of Waterloo
Wright P, Fox K (1970) Presenting information in tables. Appl Ergon 1(4):234–242
Wu S, Hsiao L, Cheng X, Hancock B, Rekatsinas T, Levis P, Ré C (2018) Fonduer: knowledge base construction from richly formatted data. In: Proceedings of the 2018 international conference on management of data. ACM, pp 1301–1316
Zhang S, Balog K (2018) Ad hoc table retrieval using semantic similarity. In: Proceedings of the 2018 world wide web conference, pp 1553–1562
Acknowledgements
This research is supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under Contract Number FA8650-17-C-7715. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. Reviewers for their helpful comments on the paper. We thank the anonymous reviewers for their helpful comments on the paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Ghasemi-Gol, M., Pujara, J. & Szekely, P. Learning cell embeddings for understanding table layouts. Knowl Inf Syst 63, 39–64 (2021). https://doi.org/10.1007/s10115-020-01508-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-020-01508-6