Abstract
We introduce CoNLL-RDF, a direct rendering of the CoNLL format in RDF, accompanied by a formatter whose output mimics CoNLL’s original TSV-style layout. CoNLL-RDF represents a middle ground that accounts for the needs of NLP specialists (easy to read, easy to parse, close to conventional representations), but that also facilitates LLOD integration by applying off-the-shelf Semantic Web technology to CoNLL corpora and annotations. The CoNLL-RDF infrastructure is published as open source. We also provide SPARQL update scripts for selected use cases as described in this paper.
Notes
- 1.
As summarized by Mark Johnson in his ACL-IJCNLP 2012 keynote on the future of computational linguistics, “[s]tandard data formats (...) I’m not sure these are important: if someone can use a parser, they can probably also write a Python wrapper” [12, slide 8].
- 2.
- 3.
- 4.
- 5.
- 6.
For the sake of processability, we use only a minimal fragment of NIF. We adopt neither its full semantic model nor its URI formation constraints; yet, it is possible to transform CoNLL-RDF to NIF using SPARQL update and to provide NIF-compliant URIs if information about the original spacing (which is not preserved in CoNLL) is provided externally.
- 7.
- 8.
It should be noted that addressing elements in a ragged array requires great care, as it is not guaranteed that a given index exists for every sentence, e.g., in case of SRL annotations which differ in length per sentence. In this regard, hash maps are more permissive.
- 9.
CoNLL-RDF provides a clear representation of annotation layers: CoNLL has been described as a ‘hybrid standoff format’ [11] in the sense that every column represents a self-contained annotation layer that refers to a common segmentation (tokens).
- 10.
- 11.
- 12.
- 13.
- 14.
NLTK provides numerous corpus readers specialized for different CoNLL variants, cf. http://www.nltk.org/howto/corpus.html.
- 15.
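
Illustration for note 8: the contrast between index-based access to a ragged array and hash-map access can be sketched in Python as follows. This is a minimal sketch, not part of the CoNLL-RDF tooling; the two toy sentences, the column labels in `COLUMNS`, and all function names are assumptions made for this example only.

```python
# Two sentences in a CoNLL-style layout. The second carries one extra
# SRL column, so row length differs per sentence (a ragged array).
SENT_1 = [
    ["1", "Dogs", "NNS"],
    ["2", "bark", "VBP"],
]
SENT_2 = [
    ["1", "Cats", "NNS", "ARG0"],
    ["2", "sleep", "VBP", "PRED"],
]

# Hypothetical column labels, chosen for this illustration only.
COLUMNS = ["ID", "WORD", "POS", "SRL"]


def column_by_index(sentence, index):
    """Positional access: raises IndexError if a row lacks that column."""
    return [row[index] for row in sentence]


def rows_as_maps(sentence):
    """Turn each row into a dict keyed by column label.

    zip() stops at the shorter sequence, so columns that are absent
    from a row simply do not appear as keys.
    """
    return [dict(zip(COLUMNS, row)) for row in sentence]


def column_by_label(sentence, label):
    """Keyed access: rows without the column yield None instead of failing."""
    return [token.get(label) for token in rows_as_maps(sentence)]
```

Calling `column_by_index(SENT_1, 3)` fails because the first sentence has no fourth column, whereas `column_by_label(SENT_1, "SRL")` returns a list of `None` values. This is the sense in which hash maps are more permissive.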
References
Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle. (2014). https://www.w3.org/TR/turtle/
Brants, S., Hansen, S.: Developments in the TIGER annotation scheme and their realization in the corpus. In: LREC (2002)
Chiarcos, C., Sukhareva, M.: OLiA - Ontologies of Linguistic Annotation. Semant. Web J. 518, 379–386 (2015)
Chiarcos, C., Fäth, C., Renner-Westermann, H., Abromeit, F., Dimitrova, V.: Lin|gu|is|tik: building the linguist’s pathway to bibliographies, libraries, language resources and linked open data. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France (May 2016)
Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources, pp. 7–25. Springer, Heidelberg (2013)
Chiarcos, C., Nordhoff, S., Hellmann, S.: Linked Data in Linguistics. Springer, Heidelberg (2012)
Cimiano, P., McCrae, J., Buitelaar, P.: Lexicon model for ontologies (2016). https://www.w3.org/2016/05/ontolex/
Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language (2012). https://www.w3.org/TR/r2rml
Declerck, T., Buitelaar, P., Wunner, T., McCrae, J., Montiel-Ponsoda, E., de Cea, A.: Lemon: an ontology-lexicon model for the multilingual semantic web. In: W3C Workshop: The Multilingual Web - Where Are We? Madrid, Spain, October 2010
Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Proceedings of 12th International Semantic Web Conference, 21–25 October 2013, Sydney, Australia (2013). http://persistence.uni-leipzig.org/nlp2rdf/
Ide, N., Chiarcos, C., Stede, M., Cassidy, S.: Designing annotation schemes: from model to representation. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation: Text, Speech, and Language Technology. Springer, Dordrecht (2017, in press)
Johnson, M.: Computational linguistics. Where do we go from here? Invited talk at the 50th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2012), Jeju, Korea (2012). http://web.science.mq.edu.au/~mjohnson/papers/Johnson12next50.pdf. Accessed 13 July 2016
Lezius, W., Biesinger, H., Gerstenberger, C.: TigerXML quick reference guide (2002)
Nivre, J., Agić, Ž., Ahrenberg, L., et al.: Universal dependencies 1.4 (2016). http://hdl.handle.net/11234/1-1827
Sanderson, R., Ciccarese, P., Van de Sompel, H.: Open annotation data model (2013). http://www.openannotation.org/spec/core
Sanderson, R., Ciccarese, P., Young, B.: Web annotation data model (2017). https://www.w3.org/TR/annotation-model
Sérasset, G.: DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semant. Web J. 648 (2014). http://kaiko.getalp.org/about-dbnary/
Acknowledgments
The research of Christian Chiarcos was supported by the BMBF-funded Research Group ‘Linked Open Dictionaries (LiODi)’ (2015–2020). The research of Christian Fäth was conducted in the context of DFG-funded projects ‘Virtuelle Fachbibliothek’ (2015–2016) and ‘Fachinformationsdienst Linguistik’ (2017–2019).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chiarcos, C., Fäth, C. (2017). CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017. Lecture Notes in Computer Science, vol 10318. Springer, Cham. https://doi.org/10.1007/978-3-319-59888-8_6
DOI: https://doi.org/10.1007/978-3-319-59888-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59887-1
Online ISBN: 978-3-319-59888-8
eBook Packages: Computer Science, Computer Science (R0)