Disentangling the Structure of Tables in Scientific Literature

  • Nikola Milosevic
  • Cassie Gregson
  • Robert Hernandez
  • Goran Nenadic
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9612)

Abstract

Within the scientific literature, tables are commonly used to present factual and statistical information in a compact form that is easy for readers to digest. The ability to “understand” the structure of tables is key for information extraction in many domains. However, the complexity and variety of presentation layouts and value formats make it difficult to automatically extract the roles and relationships of table cells. In this paper, we present a model that structures tables in a machine-readable way and a methodology to automatically disentangle and transform tables into the modelled data structure. The method was tested in the domain of clinical trials: it achieved an F-score of 94.26% for cell function identification and 94.84% for identification of inter-cell relationships.
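For readers unfamiliar with the reported metric, the F-score is the harmonic mean of precision and recall. The sketch below shows how such a score could be computed for cell function identification on a toy table. The role labels ("header", "stub", "data"), the cell coordinates, and the function name `f_score` are illustrative assumptions, not taken from the paper, and the paper's actual evaluation procedure may differ.

```python
def f_score(predicted, gold):
    """F-score (harmonic mean of precision and recall) over two sets of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 2x2 table: each item is ((row, column), assigned role).
gold_roles = {
    ((0, 0), "header"),
    ((0, 1), "header"),
    ((1, 0), "stub"),
    ((1, 1), "data"),
}

# A hypothetical system output that mislabels the stub cell as a data cell.
predicted_roles = {
    ((0, 0), "header"),
    ((0, 1), "header"),
    ((1, 0), "data"),
    ((1, 1), "data"),
}

score = f_score(predicted_roles, gold_roles)
print(f"F-score: {score:.2%}")
```

Three of the four predicted labels match the gold labels, so precision and recall are both 0.75 and the F-score is 75%. The same function could be applied to inter-cell relationships by representing each relationship as a pair of cell coordinates.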

Keywords

Table mining; Text mining; Data management; Data modelling; Natural language processing

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Nikola Milosevic (1)
  • Cassie Gregson (2)
  • Robert Hernandez (2)
  • Goran Nenadic (1, 3)

  1. School of Computer Science, University of Manchester, Manchester, UK
  2. AstraZeneca Ltd, Cambridge, UK
  3. Health eResearch Centre, Manchester, UK
