An Automatic Workflow for the Formalization of Scholarly Articles’ Structural and Semantic Elements

  • Bahar Sateli
  • René Witte
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 641)


We present a workflow for the automatic transformation of scholarly literature into a Linked Open Data (LOD) compliant knowledge base, addressing Task 2 of the Semantic Publishing Challenge 2016. For this year's task, we extract various kinds of contextual information from full-text papers using a text mining pipeline that integrates LOD-based Named Entity Recognition (NER) and triplification of the detected entities. In our approach, we leverage an existing NER tool to ground named entities, such as geographical locations, to their LOD resources. Combining this with a rule-based approach, we demonstrate how to extract both the structural elements (e.g., floats and sections) and the semantic elements (e.g., authors and their respective affiliations) of the documents in the provided dataset. Finally, we integrate LODeXporter, our flexible exporting module, to represent the results as semantic triples in RDF format. As a result, we generate a scalable, TDB-based knowledge base that is interlinked with the LOD cloud, together with a public SPARQL endpoint for the task's queries. Our submission won second place in Task 2 of the SemPub2016 challenge, with an average F-score of 0.63.
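The triplification step described above can be illustrated with a minimal sketch. It is not the authors' LODeXporter; it only shows the general idea of turning detected entities into RDF statements serialized as N-Triples. The `Entity` record, the `triplify` function, the `http://example.org/` namespace, and the choice of `rdfs:label` and `rdfs:seeAlso` as properties are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
EX = "http://example.org/"  # hypothetical namespace, not the one used in the paper


@dataclass
class Entity:
    """A detected annotation (hypothetical record layout)."""
    doc_id: str                    # identifier of the source document
    index: int                     # position of the entity within the document
    text: str                      # surface form found in the text
    kind: str                      # entity type, e.g. "Location"
    lod_uri: Optional[str] = None  # LOD resource the entity was grounded to, if any


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def triplify(entities: List[Entity]) -> List[str]:
    """Emit one N-Triples line per statement about each entity."""
    lines = []
    for e in entities:
        subject = f"<{EX}doc/{e.doc_id}#entity{e.index}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{EX}vocab/{e.kind}> .")
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(e.text)}" .')
        if e.lod_uri:
            # Link the local entity to its grounded LOD resource.
            lines.append(f"{subject} <{RDFS_SEE_ALSO}> <{e.lod_uri}> .")
    return lines


entities = [
    Entity("paper1", 0, "Concordia University", "Affiliation"),
    Entity("paper1", 1, "Montréal", "Location", "http://dbpedia.org/resource/Montreal"),
]
triples = triplify(entities)
for line in triples:
    print(line)
```

Lines produced this way could in principle be loaded into a triple store and queried over SPARQL; the vocabulary and URI scheme used in the actual submission are not given in this abstract.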



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Semantic Software Lab, Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada
