Knowledge Extraction and Modeling from Scientific Publications

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9792)


During the last decade the amount of scientific articles available online has substantially grown in parallel with the adoption of the Open Access publishing model. Nowadays researchers, as well as any other interested actor, are often overwhelmed by the enormous and continuously growing amount of publications to consider in order to perform any complete and careful assessment of scientific literature. As a consequence, new methodologies and automated tools to ease the extraction, semantic representation and browsing of information from papers are necessary. We propose a platform to automatically extract, enrich and characterize several structural and semantic aspects of scientific publications, representing them as RDF datasets. We analyze papers by relying on the scientific Text Mining Framework developed in the context of the European Project Dr. Inventor. We evaluate how the Framework supports two core scientific text analysis tasks: rhetorical sentence classification and extractive text summarization. To ease the exploration of the distinct facets of scientific knowledge extracted by our platform, we present a set of tailored Web visualizations. We provide on-line access to both the RDF datasets and the Web visualizations generated by mining the papers of the 2015 ACL-IJCNLP Conference.


Scientific knowledge extraction Knowledge modeling RDF Software framework 


  1. 1.
    Munroe, R.: The rise of open access. Science 342(6154), 58–59 (2013). CrossRefGoogle Scholar
  2. 2.
    Björk, B.C., Laakso, M., Welling, P., Paetau, P.: Anatomy of green open access. J. Assoc. Inf. Sci. Technol. 65(2), 237–250 (2014)CrossRefGoogle Scholar
  3. 3.
    Solomon, D.J., Laakso, M., Björk, B.C.: A longitudinal comparison of citation rates and growth among open access journals. J. Inf. 7(3), 642–650 (2013)CrossRefGoogle Scholar
  4. 4.
    Lewis, D.W.: The inevitability of open access. Coll. Res. Libr. 73(5), 493–506 (2012)CrossRefGoogle Scholar
  5. 5.
    Huh, S.: Coding practice of the journal article tag suite extensible markup language. Sci. Editing 1(2), 105–112 (2014)CrossRefGoogle Scholar
  6. 6.
    Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the 2013 ACM Symposium on Document Engineering, pp. 177–180. ACM (2013)Google Scholar
  7. 7.
    Bohnet, B.: Very high accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 89–97. Association for Computational Linguistics (2010)Google Scholar
  8. 8.
    Tkaczyk, D., Szostek, P., Dendek, P.J., Fedoryszak, M., Bolikowski, L.: CERMINE-automatic extraction of metadata and references from scientific literature. In: 2014 11th IAPR International Workshop on Document Analysis Systems (DAS), pp. 217–221. IEEE (2014)Google Scholar
  9. 9.
    Ramakrishnan, C., Patnia, A., Hovy, E.H., Burns, G.A.: Layout-aware text extraction from full-text PDF of scientific articles. Sour. Code Biol. Med. 7(1), 7 (2012)CrossRefGoogle Scholar
  10. 10.
    Peng, F., McCallum, A.: Information extraction from research papers using conditional random fields. Inf. Process. Manage. 42(4), 963–979 (2006)CrossRefGoogle Scholar
  11. 11.
    Do, H.H.N., Chandrasekaran, M.K., Cho, P.S., Kan, M.Y.: Extracting and matching authors and affiliations in scholarly documents. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 219–228. ACM (2013)Google Scholar
  12. 12.
    Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: an open-source CRF reference string parsing package. In: LREC (2008)Google Scholar
  13. 13.
    Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly articles with rich document features. In: Multimedia Storage and Retrieval Innovations for Digital Library Systems, vol. 270 (2012)Google Scholar
  14. 14.
    Liakata, M., Saha, S., Dobnik, S., Batchelor, C., Rebholz-Schuhmann, D.: Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics 28(7), 991–1000 (2012)CrossRefGoogle Scholar
  15. 15.
    Teufel, S.: The structure of scientific articles: applications to citation indexing and summarization. Comput. Linguist. 38(2), 443–445 (2012)CrossRefGoogle Scholar
  16. 16.
    Nakov, P.I., Schwartz, A.S., Hearst, M.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR 2004 Workshop on Search and Discovery in Bioinformatics, pp. 81–88 (2004)Google Scholar
  17. 17.
    Abu-Jbara, A., Ezra, J., Radev, D.R.: Purpose and polarity of citation: towards NLP-based bibliometrics. In: HLT-NAACL, pp. 596–606 (2013)Google Scholar
  18. 18.
    Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 500–509. Association for Computational Linguistics (2011)Google Scholar
  19. 19.
    Ronzano, F., Saggion, H.: Taking advantage of citances: citation scope identification and citation-based summarization. In: Text Analytics Conference (2014)Google Scholar
  20. 20.
    Smit, E., Van Der Graaf, M.: Journal article mining: the scholarly publishers’ perspective. Learn. Publ. 25(1), 35–46 (2012)CrossRefGoogle Scholar
  21. 21.
    Ciancarini, P., Iorio, A., Nuzzolese, A.G., Peroni, S., Vitali, F.: Semantic annotation of scholarly documents and citations. In: Baldoni, M., Baroglio, C., Boella, G., Micalizio, R. (eds.) AI*IA 2013. LNCS (LNAI), vol. 8249, pp. 336–347. Springer, Cham (2013). doi: 10.1007/978-3-319-03524-6_29 CrossRefGoogle Scholar
  22. 22.
    Sateli, B., Witte, R.: What’s in this paper?: Combining rhetorical entities with linked open data for semantic literature querying. In: Proceedings of the 24th International Conference on World Wide Web Companion, pp. 1023–1028 (2015)Google Scholar
  23. 23.
    Shotton, D.: Semantic publishing: the coming revolution in scientific journal publishing. Learn. Publ. 22(2), 85–94 (2009)CrossRefGoogle Scholar
  24. 24.
    Iorio, A.D., Lange, C., Dimou, A., Vahdati, S.: Semantic publishing challenge – assessing the quality of scientific output by information extraction and interlinking. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds.) SemWebEval 2015. CCIS, vol. 548, pp. 65–80. Springer, Cham (2015). doi: 10.1007/978-3-319-25518-7_6 CrossRefGoogle Scholar
  25. 25.
    Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, Ł.: CERMINE: automatic extraction of structured metadata from scientific literature. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 317–335 (2015)CrossRefGoogle Scholar
  26. 26.
    Ronzano, F., Saggion, H.: Dr. Inventor framework: extracting structured information from scientific publications. In: Japkowicz, N., Matwin, S. (eds.) DS 2015. LNCS (LNAI), vol. 9356, pp. 209–220. Springer, Cham (2015). doi: 10.1007/978-3-319-24282-8_18 CrossRefGoogle Scholar
  27. 27.
    Cunningham, H., Tablan, V., Roberts, A., Bontcheva, K.: Getting more out of biomedical documents with GATE’s full lifecycle open source text analytics. PLoS Comput. Biol. 9(2), e1002854 (2013)CrossRefGoogle Scholar
  28. 28.
    Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, Cambridge (2002)Google Scholar
  29. 29.
    Fisas, B., Ronzano, F., Saggion, H.: On the discoursive structure of computer graphics research papers. In: The 9th Linguistic Annotation Workshop held in Conjuncion with NAACL 2015, p. 42 (2015)Google Scholar
  30. 30.
    Fisas, B., Ronzano, F., Saggion, H.: A multi-layered annotated corpus of scientific papers. In: The Language Resource and Evaluation Conference (2016)Google Scholar
  31. 31.
    Mihalcea, R.: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, p. 20. Association for Computational Linguistics (2004)Google Scholar
  32. 32.
    Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8 (2004)Google Scholar
  33. 33.
    Moro, A., Cecconi, F., Navigli, R.: Multilingual word sense disambiguation and entity linking for everybody. In: Proceedings of ISWC (P&D), pp. 25–28 (2014)Google Scholar
  34. 34.
    Saggion, H.: SUMMA: a robust and adaptable summarization tool. Traitement Automatique des Langues 49(2), 103–125 (2008)Google Scholar
  35. 35.
    Ronzano, F., Fisas, B., Bosque, G.C., Saggion, H.: On the automated generation of scholarly publishing linked datasets: the case of CEUR-WS proceedings. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds.) SemWebEval 2015. CCIS, vol. 548, pp. 177–188. Springer, Cham (2015). doi: 10.1007/978-3-319-25518-7_15 CrossRefGoogle Scholar
  36. 36.
    Peroni, S.: The semantic publishing and referencing ontologies. In: Peroni, S. (ed.) Semantic Web Technologies and Legal Scholarly Publishing. Law, Governance and Technology Series, vol. 15, pp. 121–193. Springer, Heidelberg (2014)Google Scholar
  37. 37.
    Thakker, D., Osman, T., Lakin, P.: Gate jape grammar tutorial. Nottingham Trent University, UK, Phil Lakin, UK, Version 1 (2009)Google Scholar
  38. 38.
    Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)zbMATHGoogle Scholar
  39. 39.
    O’Donoghue, D.P., Abgaz, Y., Hurley, D., Ronzano, F., Saggion, H.: Stimulating and simulating creativity with Dr. Inventor. In: The Proceedings of the International Conference on Computational Creativity (2015)Google Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. 1.Natural Language Processing Group (TALN)Universitat Pompeu FabraBarcelonaSpain

Personalised recommendations