Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets

  • Alexey O. Shigarov
  • Viacheslav V. Paramonov
  • Polina V. Belykh
  • Alexander I. Bondarev
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 639)

Abstract

Arbitrary tables presented in spreadsheets can be an important data source in business intelligence. However, many of them have complex layouts that hinder extracting, transforming, and loading their data into a database. This paper addresses rule-based transformation of data from arbitrary spreadsheet tables into a structured canonical form that can be loaded into a database by regular ETL tools. We propose a system for canonicalizing such tables as an implementation of our methodology for rule-based table analysis and interpretation. It executes rules expressed in our specialized rule language, CRL, to recover implicit relationships in a table. Our experimental results show that particular CRL programs can be developed for different sets of tables with similar features to automate table canonicalization with high accuracy.
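To make the notion of a canonical form concrete, the following is a minimal Python sketch, not the authors' system and not CRL. It assumes a deliberately simple layout (one header row, one label column) and flattens it into one record per non-empty data cell, with the row and column labels attached to each value. The function name `canonicalize`, the record keys, and the sample table are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def canonicalize(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Flatten a simple cross-tab into one record per non-empty data cell.

    Assumption: the first row holds column labels and the first column
    holds row labels; real spreadsheet tables are often far less regular.
    """
    column_labels = grid[0][1:]  # skip the empty corner cell
    records: List[Dict[str, str]] = []
    for row in grid[1:]:
        row_label, values = row[0], row[1:]
        for column_label, value in zip(column_labels, values):
            if value == "":
                continue  # empty cells carry no data entry
            records.append(
                {"row": row_label, "column": column_label, "value": value}
            )
    return records


# Hypothetical sample table, invented for this example.
table = [
    ["", "2014", "2015"],
    ["Export", "10", "12"],
    ["Import", "7", ""],
]

records = canonicalize(table)
for record in records:
    print(record)
```

Each resulting record is flat and self-describing, which is the kind of shape an ordinary ETL tool can load directly. Tables with nested headers, merged cells, or labels implied by position would need additional rules, which is the gap a rule language is meant to fill.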

Keywords

Unstructured data integration; Table understanding; Table analysis and interpretation; Spreadsheet data transformation

Notes

Acknowledgements

We thank Prof. George Nagy and all members of the TANGO research group (http://tango.byu.edu) for providing and discussing the TANGO dataset for our experiments.

This work was financially supported by the Russian Foundation for Basic Research (Grants No. 15-37-20042 and 14-07-00166) and the Council for Grants of the President of the Russian Federation (Grant No. NSh-8081.2016.9). The presented web service for table canonicalization runs on the resources of the Shared Equipment Center of the Integrated Information and Computing Network of the Irkutsk Research and Educational Complex (http://net.icc.ru).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Alexey O. Shigarov (1)
  • Viacheslav V. Paramonov (1)
  • Polina V. Belykh (1)
  • Alexander I. Bondarev (1)

  1. Matrosov Institute for System Dynamics and Control Theory of SB RAS, Irkutsk, Russia
