By all these lovely tokens... Merging conflicting tokenizations


Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text is a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.



  1.

    German double surname consisting of Herzog and von der Heide.

  2.

    Alternatively, transformation rules to map annotations from tok 2 to tok 1 would have to be developed. Unfortunately, this does not guarantee information preservation, and, additionally, it requires manual work, as such transformations are annotation-specific. Thus, it is not an option for the fully automated merging of tokenizations.

  3.

    For a more detailed description of PAULA, see∼d1/paula/doc.

  4.

    In principle, the use of XLinks and XPointers in PAULA 1.0 makes it possible to represent multiple tokenizations by preparing separate files for the individual tokenizations (this would correspond to the ‘no merging’ strategy described in Sect. 2). The requirement for a unique layer of totally ordered tokens in PAULA 1.0 arises from its integration in a larger architecture that also comprises the converter framework SaltNPepper (Zipser and Romary 2010) and the linguistic database ANNIS (Chiarcos et al. 2008; Zeldes et al. 2009). In ANNIS, for example, corpus queries for precedence, token distance and co-extensionality of nodes from different layers of annotation are defined with reference to the token layer. This is necessary because simple references to the primary data may not be sufficient to determine the relative position of two elements in the annotation if multiple layers of primary data exist. Further extensions of ANNIS and PAULA may allow for multiple layers (and types) of primary data of this kind, e.g., textual data alongside an audio or video stream (multi-media corpora), multiple layers of textual data that represent translations of each other (parallel corpora), or temporally overlapping utterances produced by different speakers (dialogue corpora).

  5.

    Although PAULA supports overlapping markables within a single layer, even with identical extension, this is a reasonable assumption: in practice, such overlap occurs extremely rarely, whereas for larger sequences of primary data, no markables are defined at all.

  6.

    Again, this is a practical simplification. Theoretically, the number of layers is infinite.

  7.

    We use TIGER-QL (Lezius 2002) for the examples. Compared to our query language ANNIS-QL, it is slightly less verbose, but it does not support queries across multiple layers of annotation.

  8.

    If the merging is performed in such a way that the resulting tokenization is identical to the source tokenization, there is, of course, no difference. But even if this is not the case, all information from the original annotation is preserved, so that it is still possible to reproduce the results obtained for the original tokenization, although with slightly more complicated queries. For example, the original TIGER-QL query #x .3 #y, which retrieves pairs of nodes separated by 3 tokens, can be reproduced on the merged annotation project with explicit temporary variables for the intermediate nodes (assuming that the original tokenization is identified by cat=/ptb_tok/):

    #x . #tmp1: [cat=/ptb_tok/] & #tmp1 . #tmp2: [cat=/ptb_tok/] & #tmp2 . #tmp3: [cat=/ptb_tok/] & #tmp3 . #y

  9.

    Of course, the original token layer is preserved. The TIGER-QL query #x > #y: [cat=/ptb_tok|ptb_cat/] would retrieve the same results as the query #x > #y on the unmerged PTB annotation, where #x and #y are variables for nodes, cat=/ptb_cat/ designates PTB node labels, and cat=/ptb_tok/ identifies the representation of the original token layer in the merged annotation project.

  10.

    As the examples in footnotes 8 and 9 show, such equivalent queries on the merged project may be somewhat more complicated than queries over the unmerged annotation projects. However, this is a small price to pay for the ability to formulate queries across multiple layers of annotation that were simply impossible prior to the merging. Alternatively, it is also possible to adopt one of the original tokenizations as the privileged tokenization just by renaming the layers, even after the merging has been performed.

  11.

    Similarly, phonological units that are not expressed in the primary data can be subject to annotation, e.g., short e and o in various Arabic-based orthographies such as the Ajami orthography of Hausa. A term with zero extension at the position of a short vowel can be annotated as having the phonological value e or o without having character status.

  12. , August 6, 2011.

  13.

    This can be compensated by marking the base segmentation differently from alternative segmentations. At the moment, it is, however, not clear to us how this would be represented in the XML format, as segmentations are not specified within GrAF, but defined separately from the annotations. A consistent conception would encode structural information on the structural level, and only linguistic annotation and metadata on the contents level, but it is not yet clear whether LAF/GrAF dummy nodes provide such a clear conceptual separation.
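
The query rewriting described in footnote 8 is mechanical, so it can be generated rather than written by hand. The following is a minimal sketch, not part of the original tooling: the function name `expand_distance_query` and its parameters are hypothetical, and it follows the footnote's reading in which `#x .3 #y` corresponds to three intermediate token nodes. It only builds the query string and does not evaluate it.

```python
def expand_distance_query(x: str, y: str, n_between: int,
                          tok_label: str = "ptb_tok") -> str:
    """Build a chain of direct-precedence constraints between #x and #y.

    Each intermediate node is bound to a temporary variable restricted to
    the original token layer (cat=/<tok_label>/), mirroring the pattern
    shown in footnote 8. `n_between` is the number of intermediate tokens.
    """
    if n_between < 0:
        raise ValueError("n_between must be non-negative")

    constraints = []
    prev = x
    for i in range(1, n_between + 1):
        tmp = f"tmp{i}"
        # Introduce the next temporary variable on the original token layer.
        constraints.append(f"#{prev} . #{tmp}: [cat=/{tok_label}/]")
        prev = tmp
    # Close the chain with the target variable.
    constraints.append(f"#{prev} . #{y}")
    return " & ".join(constraints)


if __name__ == "__main__":
    print(expand_distance_query("x", "y", 3))
```

Called with `n_between=3`, the sketch reproduces the expanded query given in footnote 8; passing a different `tok_label` would select another preserved token layer, assuming it is exposed through `cat` in the same way.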


The title of our paper is taken from the poem September by Helen Hunt Jackson. The poem provides us not only with a nice title, but also with a number of typical tokenization issues, e.g., the tokenization of golden-rod (in some versions actually spelled goldenrod or golden rod), brook-side and meadow-nook (with analogous spelling alternatives), the genitives gentian’s and grapes’, as well as the short form 'T is (or 'Tis) for it is. Our research was conducted in the context of the Collaborative Research Center (SFB) 632 “Information Structure” (Potsdam/Berlin), funded by the Deutsche Forschungsgemeinschaft (DFG). This paper has benefited greatly from helpful pointers to problems and examples provided by Ekaterina Buyko, Timo Baumann, Stavros Skopeteas, Pavel Logačev, Elena Karvovskaya, and Halyna Finzen. We would also like to thank the attendees and the program committee of the Third Linguistic Annotation Workshop, two anonymous reviewers, and our colleagues Amir Zeldes and Florian Zipser for their comments and feedback.

