Automatic Construction of a Semantic Knowledge Base from CEUR Workshop Proceedings

  • Conference paper
  • In: Semantic Web Evaluation Challenges (SemWebEval 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 548)

Abstract

We present an automatic workflow that performs text segmentation and entity extraction from scientific literature, primarily addressing Task 2 of the Semantic Publishing Challenge 2015. The goal of Task 2 is to extract information from full-text papers that represents the context in which a document was written, such as the affiliations of its authors and the corresponding funding bodies. Our proposed solution is composed of two subsystems: (i) a text mining pipeline, developed with the GATE framework, which extracts structural and semantic entities, such as author information and references, and produces semantic (typed) annotations; and (ii) a flexible exporting module, the LODeXporter, which translates the document annotations into RDF triples according to custom mapping rules. Additionally, we leverage existing Named Entity Recognition (NER) tools to extract named entities from text and ground them to their corresponding resources on the Linked Open Data cloud, thereby also partially covering the objectives of Task 3, which involve linking detected entities to resources in existing open datasets. The output of our system is an RDF graph stored in a scalable TDB-based triplestore with a public SPARQL endpoint for the task's queries.
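The exporting step described in the abstract, translating typed document annotations into RDF triples via mapping rules, can be sketched as follows. This is a minimal illustration only: the annotation shape, the namespace URIs, the `hasAuthor` property, and the mapping table are hypothetical assumptions for the example, not the actual LODeXporter vocabulary or mapping-rule syntax.

```python
# Illustrative sketch: turn a hypothetical typed "Author" annotation into
# RDF triples, serialized as N-Triples lines. All namespaces, property
# names, and the mapping rule below are assumptions for this example.

EX = "http://example.org/pubo/"                     # hypothetical namespace
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A typed annotation as a text mining pipeline might emit it (assumed shape)
annotation = {"type": "Author", "text": "Jane Doe", "doc": "paper_11"}

# A custom mapping rule: annotation type -> RDF class (assumed)
MAPPING = {"Author": FOAF + "Person"}

subj = f"{EX}{annotation['doc']}/author/1"
triples = [
    (subj, RDF_TYPE, MAPPING[annotation["type"]]),
    (subj, FOAF + "name", f'"{annotation["text"]}"'),
    (EX + annotation["doc"], EX + "hasAuthor", subj),
]

def ntriple(s, p, o):
    # Quoted strings are literals; everything else is a URI in angle brackets.
    obj = o if o.startswith('"') else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

for t in triples:
    print(ntriple(*t))
```

In the actual system the resulting graph would then be loaded into the TDB-backed store and queried through the SPARQL endpoint; the point of the sketch is only the separation between extracted annotations and declarative mapping rules.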


Notes

  1. Semantic Publishing Challenge 2015, https://github.com/ceurws/lod/wiki/SemPub2015.

  2. Tokens are the smallest meaningful units of text, such as words, numbers, or symbols.

  3. Argumentation Zoning (AZ) Corpus, http://www.cl.cam.ac.uk/~sht25/AZ_corpus.html.

  4. The root or lemma of a word is its canonical form, without inflectional endings.

  5. Task 2 Dataset, https://github.com/ceurws/lod/wiki/Task2#data-source.

  6. Resource Description Framework (RDF), http://www.w3.org/RDF/.

  7. Best Practices for Publishing Linked Data, http://www.w3.org/TR/ld-bp/.

  8. Discourse Elements Ontology (DEO), http://purl.org/spar/deo.

  9. PUBlication Ontology, http://lod.semanticsoftware.info/pubo/pubo.rdf.

  10. Document Components Ontology (DoCO), http://purl.org/spar/doco.

  11. Originally called the "RDF Mapper", it is now an independent open source project, available at http://www.semanticsoftware.info/lodexporter.

  12. Xpdf, http://www.foolabs.com/xpdf/.

  13. Several of our named entity extraction rules are extensions of GATE's ANNIE plugin [5].

  14. Apache Jena, http://jena.apache.org.

  15. Apache TDB, http://jena.apache.org/documentation/tdb/.

  16. Precision is the fraction of extracted annotations that are relevant.

  17. Recall is the fraction of relevant annotations that were extracted.

  18. F-measure is the harmonic mean of Precision and Recall.

  19. The zero recall for Q2.5 was due to an error in our mapping rules, where one entity was mapped to two different classes; apart from that, the annotations were correctly extracted.
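The evaluation metrics defined in notes 16 to 18 can be sketched directly from their definitions; the counts below (true positives, number of extracted annotations, number of relevant annotations) are made-up example values, not figures from the paper's evaluation.

```python
# Sketch of the metrics from notes 16-18, computed from annotation counts.
# The example counts are illustrative only.

def precision(tp: int, extracted: int) -> float:
    # Fraction of extracted annotations that are relevant
    return tp / extracted if extracted else 0.0

def recall(tp: int, relevant: int) -> float:
    # Fraction of relevant annotations that were extracted
    return tp / relevant if relevant else 0.0

def f_measure(p: float, r: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r) if (p + r) else 0.0

p = precision(8, 10)   # 8 of 10 extracted annotations are relevant -> 0.8
r = recall(8, 16)      # 8 of 16 relevant annotations were found    -> 0.5
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))
```

Note that the guarded zero-denominator cases also make the mapping-rule failure in note 19 easy to see in evaluation output: a misconfigured mapping yields zero extracted matches for a query, hence zero recall, even when the underlying annotations are correct.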

References

  1. Sateli, B., Witte, R.: What's in this paper? Combining rhetorical entities with linked open data for semantic literature querying. In: Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD 2015), Florence, Italy. ACM (2015)

  2. Constantin, A., Peroni, S., Pettifer, S., David, S., Vitali, F.: The Document Components Ontology (DoCO). The Semantic Web Journal (2015, in press). http://www.semantic-web-journal.net/system/files/swj1016_0.pdf

  3. Groza, T., Handschuh, S., Möller, K., Decker, S.: SALT - Semantically Annotated LaTeX for scientific publications. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 518-532. Springer, Heidelberg (2007)

  4. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood, M.A., Saggion, H., Petrak, J., Li, Y., Peters, W.: Text Processing with GATE (Version 6). University of Sheffield, Department of Computer Science (2011)

  5. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL 2002) (2002)

  6. Sateli, B., Witte, R.: Supporting researchers with a semantic literature management Wiki. In: The 4th Workshop on Semantic Publishing (SePublica 2014). CEUR Workshop Proceedings, vol. 1155, Anissaras, Crete, Greece. CEUR-WS.org (2014)

Author information

Corresponding author: René Witte.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Sateli, B., Witte, R. (2015). Automatic Construction of a Semantic Knowledge Base from CEUR Workshop Proceedings. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds) Semantic Web Evaluation Challenges. SemWebEval 2015. Communications in Computer and Information Science, vol 548. Springer, Cham. https://doi.org/10.1007/978-3-319-25518-7_11

  • DOI: https://doi.org/10.1007/978-3-319-25518-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25517-0

  • Online ISBN: 978-3-319-25518-7

  • eBook Packages: Computer Science (R0)
