CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10318)

Abstract

We introduce CoNLL-RDF, a direct rendering of the CoNLL format in RDF, accompanied by a formatter whose output mimics CoNLL’s original TSV-style layout. CoNLL-RDF represents a middle ground that accounts for the needs of NLP specialists (easy to read, easy to parse, close to conventional representations), but that also facilitates LLOD integration by applying off-the-shelf Semantic Web technology to CoNLL corpora and annotations. The CoNLL-RDF infrastructure is published as open source. We also provide SPARQL update scripts for selected use cases as described in this paper.
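
To give a rough idea of what a direct rendering of CoNLL in RDF with a TSV-like, one-token-per-line layout might look like, here is a minimal Python sketch. It is not the CoNLL-RDF implementation: the `ex:` prefix, the `:s1_1` URI pattern, the `ex:next` link, the column names, and both function names are assumptions made for illustration, and prefix declarations are omitted.

```python
from typing import List


def escape(value: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def sentence_to_turtle(block: str, columns: List[str], sent_id: int = 1) -> List[str]:
    """Render one tab-separated CoNLL sentence as one Turtle statement per token.

    All names in the output (ex:Word, ex:next, ex:<COLUMN>) are hypothetical.
    """
    rows = [
        line.split("\t")
        for line in block.splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    ]
    statements = []
    for i, row in enumerate(rows, start=1):
        parts = [f":s{sent_id}_{i} a ex:Word"]
        # One property per column; "_" marks an empty CoNLL cell and is skipped.
        parts += [
            f'ex:{name} "{escape(value)}"'
            for name, value in zip(columns, row)
            if value != "_"
        ]
        # Link each token to its successor so that token order survives in RDF.
        if i < len(rows):
            parts.append(f"ex:next :s{sent_id}_{i + 1}")
        statements.append("; ".join(parts) + " .")
    return statements


SAMPLE = "1\tPeter\tNNP\n2\tsleeps\tVBZ\n3\t.\t_\n"
for statement in sentence_to_turtle(SAMPLE, ["ID", "WORD", "POS"]):
    print(statement)
```

Keeping each token on a single line means the output can still be inspected with line-oriented tools, while the explicit successor link records the token order that CoNLL expresses only through line position.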

Notes

  1. As summarized by Mark Johnson in his ACL-IJCNLP 2012 keynote on the future of computational linguistics, “[s]tandard data formats (...) I’m not sure these are important: if someone can use a parser, they can probably also write a Python wrapper” [12, slide 8].

  2. http://www.signll.org/conll.

  3. https://www.w3.org/TR/sparql11-overview.

  4. http://ilk.uvt.nl/conll/.

  5. http://conll.cemantix.org/2011/data.html.

  6. For the sake of processability, we use only a minimal fragment of NIF. We adopt neither its full semantic model nor its URI formation constraints; it is nevertheless possible to transform CoNLL-RDF to NIF using SPARQL update and to provide NIF-compliant URIs if information about the original spacing (which is not preserved in CoNLL) is provided externally.

  7. https://github.com/UniversalDependencies/UD_German.

  8. Addressing elements in a ragged array requires great care, because a given index is not guaranteed to exist for every sentence, e.g., for SRL annotations, whose length differs from sentence to sentence. In this regard, hash maps are more permissive.

  9. CoNLL-RDF provides a clear representation of annotation layers: CoNLL has been described as a ‘hybrid standoff format’ [11] in the sense that every column represents a self-contained annotation layer that refers to a common segmentation (tokens).

  10. http://www.openannotation.org/, https://www.w3.org/annotation/.

  11. http://purl.org/powla.

  12. https://www.w3.org/TR/r2rml, https://www.w3.org/TR/rdb-direct-mapping.

  13. https://www.w3.org/TR/csv2rdf/.

  14. NLTK provides numerous corpus readers specialized for different CoNLL variants, cf. http://www.nltk.org/howto/corpus.html.

  15. http://acoli.informatik.uni-frankfurt.de/resources.html.
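
The caution in note 8 about ragged arrays can be illustrated with a small Python sketch. The rows, the column names (`ARG1`, `ARG2`), and the helper functions `cell` and `as_record` are assumptions made for illustration; they are not taken from any particular corpus or from the CoNLL-RDF code.

```python
from typing import Dict, List, Optional


def cell(row: List[str], index: int) -> Optional[str]:
    """Return row[index], or None when that column does not exist in this row."""
    return row[index] if 0 <= index < len(row) else None


def as_record(row: List[str], columns: List[str]) -> Dict[str, str]:
    """Pair column names with values; names without a value are simply absent."""
    return dict(zip(columns, row))


# Hypothetical SRL-style rows: the number of argument columns depends on
# how many predicates the sentence contains.
COLUMNS = ["ID", "WORD", "ARG1", "ARG2"]
short_row = ["1", "Peter", "A0"]      # token from a sentence with one predicate
long_row = ["1", "Mary", "A0", "A1"]  # token from a sentence with two predicates

# Plain positional access fails on the shorter row.
try:
    short_row[3]
except IndexError:
    print("index 3 is missing in the shorter row")

# A guarded lookup or a hash map reports the gap without raising an exception.
print(cell(short_row, 3))
print(as_record(short_row, COLUMNS).get("ARG2"))
print(as_record(long_row, COLUMNS).get("ARG2"))
```

With a hash map, a missing annotation is simply an absent key, so code that iterates over sentences of differing width needs no per-sentence bounds checks.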

References

  1. Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle (2014). https://www.w3.org/TR/turtle/

  2. Brants, S., Hansen, S.: Developments in the TIGER annotation scheme and their realization in the corpus. In: LREC (2002)

  3. Chiarcos, C., Sukhareva, M.: OLiA - Ontologies of Linguistic Annotation. Semant. Web J. 518, 379–386 (2015)

  4. Chiarcos, C., Fäth, C., Renner-Westermann, H., Abromeit, F., Dimitrova, V.: Lin|gu|is|tik: building the linguist’s pathway to bibliographies, libraries, language resources and linked open data. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France (May 2016)

  5. Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources, pp. 7–25. Springer, Heidelberg (2013)

  6. Chiarcos, C., Nordhoff, S., Hellmann, S.: Linked Data in Linguistics. Springer, Heidelberg (2012)

  7. Cimiano, P., McCrae, J., Buitelaar, P.: Lexicon model for ontologies (2016). https://www.w3.org/2016/05/ontolex/

  8. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language (2012). https://www.w3.org/TR/r2rml

  9. Declerck, T., Buitelaar, P., Wunner, T., McCrae, J., Montiel-Ponsoda, E., de Cea, A.: Lemon: an ontology-lexicon model for the multilingual semantic web. In: W3C Workshop: The Multilingual Web - Where Are We? Madrid, Spain, October 2010

  10. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Proceedings of 12th International Semantic Web Conference, 21–25 October 2013, Sydney, Australia (2013). http://persistence.uni-leipzig.org/nlp2rdf/

  11. Ide, N., Chiarcos, C., Stede, M., Cassidy, S.: Designing annotation schemes: from model to representation. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation: Text, Speech, and Language Technology. Springer, Dordrecht (2017, in press)

  12. Johnson, M.: Computational linguistics. Where do we go from here? Invited talk at the 50th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2012), Jeju, Korea (2012). http://web.science.mq.edu.au/~mjohnson/papers/Johnson12next50.pdf. Accessed 13 July 2016

  13. Lezius, W., Biesinger, H., Gerstenberger, C.: TigerXML quick reference guide (2002)

  14. Nivre, J., Agić, Ž., Ahrenberg, L., et al.: Universal dependencies 1.4 (2016). http://hdl.handle.net/11234/1-1827

  15. Sanderson, R., Ciccarese, P., Van de Sompel, H.: Open annotation data model (2013). http://www.openannotation.org/spec/core

  16. Sanderson, R., Ciccarese, P., Young, B.: Web annotation data model (2017). https://www.w3.org/TR/annotation-model

  17. Sérasset, G.: DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semant. Web J. 648 (2014). http://kaiko.getalp.org/about-dbnary/

Acknowledgments

The research of Christian Chiarcos was supported by the BMBF-funded Research Group ‘Linked Open Dictionaries (LiODi)’ (2015–2020). The research of Christian Fäth was conducted in the context of DFG-funded projects ‘Virtuelle Fachbibliothek’ (2015–2016) and ‘Fachinformationsdienst Linguistik’ (2017–2019).

Author information

Corresponding author

Correspondence to Christian Chiarcos.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chiarcos, C., Fäth, C. (2017). CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017. Lecture Notes in Computer Science, vol 10318. Springer, Cham. https://doi.org/10.1007/978-3-319-59888-8_6

  • DOI: https://doi.org/10.1007/978-3-319-59888-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59887-1

  • Online ISBN: 978-3-319-59888-8

  • eBook Packages: Computer Science, Computer Science (R0)
