
Table Identification and Reconstruction in Spreadsheets

  • Elvis Koci
  • Maik Thiele
  • Oscar Romero
  • Wolfgang Lehner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10253)

Abstract

Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
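To make the input and output of table discovery concrete, the following is a minimal sketch and not the heuristics described in the paper. It assumes a worksheet is given as a rectangular grid of cell labels (with `None` for empty cells), groups edge-adjacent non-empty cells, and reports one bounding box per group as a table candidate. The function name `find_table_candidates` and the grid encoding are hypothetical choices made for illustration.

```python
from collections import deque


def find_table_candidates(grid):
    """Group edge-adjacent non-empty cells; return one bounding box per group.

    Assumes every row of `grid` has the same length.
    Each box is (top_row, left_col, bottom_row, right_col), 0-indexed, inclusive.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Breadth-first search over the 4-connected neighbourhood.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] is not None and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


# Two small tables separated by an empty column (labels as in the abstract).
H, D, X = "header", "data", None
sheet = [
    [H, H, X, X, X],
    [D, D, X, H, H],
    [D, D, X, D, D],
]
print(find_table_candidates(sheet))  # [(0, 0, 2, 1), (1, 3, 2, 4)]
```

This sketch ignores the label values themselves. A heuristics-based method would presumably use them, for example to decide whether a group has a header row, but those rules are not given in the abstract.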

Keywords

Spreadsheet Document Tabular Grid Table Layout Recognition Identification

Notes

Acknowledgments

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Elvis Koci (1, 2)
  • Maik Thiele (1)
  • Oscar Romero (2)
  • Wolfgang Lehner (1)

  1. Database Technology Group, Department of Computer Science, Technische Universität Dresden, Dresden, Germany
  2. Departament d’Enginyeria de Serveis i Sistemes d’Informació (ESSI), Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain
