Treating Dictionaries as a Linked-Data Corpus

  • Peter Bouda
  • Michael Cysouw


In this paper we describe a practical approach to the challenge of linguistic retrodigitization. We propose to distinguish strictly between a base digitization and a separate interpretation of the sources. The base digitization includes only a literal electronic transcript of the source. All sources are thus treated simply as strings of characters, i.e. as unstructured corpora. The often complex structure found in many dictionaries and grammars is added subsequently (and possibly much later) as Linked Data in the form of standoff annotation. An advantage of this approach is that the complete digitization and interpretation can be performed collaboratively without a complex organizational superstructure.
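The separation between base digitization and interpretation can be illustrated with a minimal sketch. It is not the authors' implementation: the sample entries, the `Annotation` class, the labels, and the `http://example.org/` address are hypothetical, and the character-range fragment is only one possible way to make a span addressable as a Linked Data resource.

```python
from dataclasses import dataclass

# Hypothetical base digitization: a literal transcript kept as a plain,
# unmodified string. No structure is embedded in the text itself.
BASE_TEXT = "casa  n. house\nperro  n. dog\n"

# Hypothetical address of the transcript (an assumption for illustration).
BASE_URI = "http://example.org/source.txt"


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: an interpretation attached to a character range."""
    start: int  # inclusive character offset into the base text
    end: int    # exclusive character offset into the base text
    label: str  # the interpretation, e.g. "headword" or "translation"

    def text(self, base: str) -> str:
        # The annotated string is looked up in the base text, never copied into it.
        return base[self.start:self.end]

    def uri(self) -> str:
        # One possible identifier for the span, using a character-range fragment.
        return f"{BASE_URI}#char={self.start},{self.end}"


def annotate(base: str, substring: str, label: str) -> Annotation:
    """Create an annotation for the first occurrence of `substring` in `base`."""
    start = base.index(substring)
    return Annotation(start, start + len(substring), label)


# Interpretation layer: added later, stored separately from the transcript.
annotations = [
    annotate(BASE_TEXT, "casa", "headword"),
    annotate(BASE_TEXT, "house", "translation"),
    annotate(BASE_TEXT, "perro", "headword"),
    annotate(BASE_TEXT, "dog", "translation"),
]


def select(base: str, annots: list, label: str) -> list:
    """Return the annotated strings carrying the given label."""
    return [a.text(base) for a in annots if a.label == label]


headwords = select(BASE_TEXT, annotations, "headword")
```

Because the annotations only point into the transcript, several contributors could add or revise interpretations independently while the base text stays untouched.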





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Research Unit “Quantitative Language Comparison”, Ludwig Maximilians University, Munich, Germany
