Converting heterogeneous statistical tables on the web to searchable databases

  • David W. Embley
  • Mukkai S. Krishnamoorthy
  • George Nagy
  • Sharad Seth
Original Paper

Abstract

Much of the world’s quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
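
To make the target representation concrete, the following is a minimal sketch of flattening a header-indexed table into category-table records, one record per data cell. It is not the authors' algorithm: it assumes the segmentation and header-factoring steps have already produced the row and column header paths, and the function name `to_category_table`, its parameters, and the sample values are all hypothetical illustrations.

```python
from itertools import product


def to_category_table(row_paths, col_paths, data):
    """Flatten a header-indexed table into one record per data cell.

    row_paths and col_paths are lists of tuples, each tuple being the
    header path that indexes one row or one column of the data region.
    Each output record is (row categories..., column categories..., value).
    Hypothetical helper for illustration only.
    """
    # Every data cell must be indexed by exactly one row path and one column path.
    if len(data) != len(row_paths) or any(len(r) != len(col_paths) for r in data):
        raise ValueError("data region does not match the header paths")

    records = []
    for (i, rpath), (j, cpath) in product(enumerate(row_paths), enumerate(col_paths)):
        records.append(tuple(rpath) + tuple(cpath) + (data[i][j],))
    return records


# Made-up sample table: one row category, two column categories.
row_paths = [("Norway",), ("Sweden",)]
col_paths = [("2014", "Male"), ("2014", "Female"),
             ("2015", "Male"), ("2015", "Female")]
data = [[1, 2, 3, 4],
        [5, 6, 7, 8]]

records = to_category_table(row_paths, col_paths, data)
for record in records:
    print(record)
```

Each record has one field per category plus the data value, so under these assumptions it could be loaded directly as a row of a relational table.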

Keywords

Document analysis; Table segmentation; Table analysis; Table header factoring; End-to-end table processing; Table headers; Queries over table data

Notes

Acknowledgments

Mukkai Krishnamoorthy acknowledges the help of Dr. Ravi Palla with Protégé. Prof. Andreas Dengel (DFKI) gave us excellent advice not only for improving the presentation but also for one of the algorithms.

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • David W. Embley (1), corresponding author
  • Mukkai S. Krishnamoorthy (2)
  • George Nagy (2)
  • Sharad Seth (3)

  1. Computer Science Department, Brigham Young University, Provo, USA
  2. Rensselaer Polytechnic Institute, Troy, USA
  3. University of Nebraska Lincoln, Lincoln, USA
