Dr. Inventor Framework: Extracting Structured Information from Scientific Publications

  • Francesco Ronzano
  • Horacio Saggion
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9356)


Although research communities and publishing houses are putting increasing effort into delivering scientific articles as structured texts, a considerable part of the on-line scientific literature is still available in layout-oriented data formats, like PDF, that lack any explicit structural or semantic information. As a consequence, bootstrapping the textual analysis of scientific papers is often a time-consuming activity. We present the first version of the Dr. Inventor Framework, a publicly available collection of scientific text mining components that helps to prevent or at least mitigate this problem. Thanks to the integration and customization of several text mining tools and on-line services, the Dr. Inventor Framework is able to analyze scientific publications in both plain text and PDF format, making core aspects of their structure and semantics explicit and easily accessible. The facilities implemented by the Framework include the extraction of structured textual contents, the discursive characterization of sentences, the identification of the structural elements of both paper headers and bibliographic entries, and the generation of graph-based representations of text excerpts. The Framework is distributed as a Java library. We describe in detail the scientific mining facilities included in the Framework and present two use cases in which the Framework is exploited, respectively, to boost scientific creativity and to generate RDF graphs from scientific publications.


Scientific text mining; Scientific information extraction; Software framework


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. TALN Research Group, Universitat Pompeu Fabra, Barcelona, Spain
