Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences from a corpus-linguistic perspective.
German double surname consisting of Herzog and von der Heide.
Alternatively, transformation rules to map annotations from tok 2 to tok 1 would have to be developed. Unfortunately, this does not guarantee information preservation, and, additionally, it requires manual work, as such transformations are annotation-specific. Thus, it is not an option for the fully automated merging of tokenizations.
For a more detailed description of PAULA, see http://www.sfb632.uni-potsdam.de/~d1/paula/doc.
In principle, the use of XLinks and XPointers in PAULA 1.0 makes it possible to represent multiple tokenizations by preparing separate files for the individual tokenizations (this would correspond to the ‘no merging’ strategy described in Sect. 2). The requirement for a unique layer of totally ordered tokens in PAULA 1.0 arises from its integration into a larger architecture that also comprises the converter framework SaltNPepper (Zipser and Romary 2010) and the linguistic database ANNIS (Chiarcos et al. 2008; Zeldes et al. 2009). In ANNIS, for example, corpus queries for precedence, token distance and co-extensionality of nodes from different layers of annotation are defined with reference to the token layer. This is necessary because simple references to the primary data may not be sufficient to determine the relative position of two elements in the annotation if multiple layers of primary data exist. Further extensions of ANNIS and PAULA may allow for multiple layers (and types) of primary data of this kind, e.g., textual data alongside an audio or video stream (multi-media corpora), or multiple layers of textual data that represent translations of each other (parallel corpora) or temporally overlapping utterances produced by different speakers (dialogue corpora).
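To illustrate why a single, totally ordered token layer is convenient for such queries, the following minimal sketch models nodes from different annotation layers as spans over token indices. The class `Markable` and the three helper functions are hypothetical names chosen for this illustration; they are not part of PAULA, SaltNPepper or ANNIS, and the definitions of precedence and distance are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Markable:
    """A node from some annotation layer, anchored in the shared token layer."""
    layer: str   # name of the annotation layer (illustrative only)
    first: int   # index of the first covered token
    last: int    # index of the last covered token (inclusive)


def precedes(a: Markable, b: Markable) -> bool:
    """True if every token of a comes before every token of b."""
    return a.last < b.first


def token_distance(a: Markable, b: Markable) -> int:
    """Number of tokens strictly between a and b (assumes a precedes b)."""
    if not precedes(a, b):
        raise ValueError("a must precede b")
    return b.first - a.last - 1


def coextensive(a: Markable, b: Markable) -> bool:
    """True if a and b cover exactly the same tokens."""
    return (a.first, a.last) == (b.first, b.last)


# Nodes from three different layers, all resolved against one token layer.
np_node = Markable("syntax", 0, 1)
verb = Markable("pos", 2, 2)
topic = Markable("information_structure", 0, 1)

print(precedes(np_node, verb))        # relative order across layers
print(token_distance(np_node, verb))  # tokens between the two nodes
print(coextensive(np_node, topic))    # same extension on different layers
```

Because all three relations reduce to integer comparisons on token indices, they remain well defined even when the nodes originate from unrelated annotation tools, provided that all layers refer to the same token layer.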
Although PAULA supports overlapping markables within a single layer, even with identical extension, this is a reasonable assumption: in practice, such overlap occurs extremely rarely, whereas for larger sequences of primary data, no markables are defined at all.
Again, this is a practical simplification. Theoretically, the number of layers is infinite.
We use TIGER-QL (Lezius 2002) for the examples. Compared to our query language ANNIS-QL, it is slightly less verbose, but it does not support queries across multiple layers of annotation.
If the merging is performed in such a way that the resulting tokenization is identical to the source tokenization, there is, of course, no difference. But even if this is not the case, all information from the original annotation is preserved, so that it is still possible to reproduce the results obtained for the original tokenization, albeit with slightly more complicated queries. For example, the original TIGER-QL query #x .3 #y, which retrieves pairs of nodes separated by 3 tokens, can be reproduced on the merged annotation project with explicit temporary variables for the intermediate nodes (assuming that the original tokenization is identified by cat=/ptb_tok/):
#x . #tmp1: [cat=/ptb_tok/] & #tmp1 . #tmp2: [cat=/ptb_tok/] & #tmp2 . #tmp3: [cat=/ptb_tok/] & #tmp3 . #y
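The rewriting shown in the query above is mechanical, so it could be generated automatically. The following sketch builds the expanded query string for an arbitrary number of intermediate tokens. The function name `expand_precedence` and its parameters are hypothetical and introduced only for illustration; the sketch produces query text and does not evaluate it against a corpus.

```python
def expand_precedence(left: str, right: str, n_between: int,
                      tok_label: str = "ptb_tok") -> str:
    """Chain direct-precedence constraints through n_between temporary
    nodes, each restricted to the original token layer (cat=/tok_label/)."""
    if n_between < 0:
        raise ValueError("n_between must be non-negative")
    tmp_vars = [f"#tmp{i}" for i in range(1, n_between + 1)]
    chain = [left] + tmp_vars + [right]
    clauses = []
    for i, (a, b) in enumerate(zip(chain, chain[1:])):
        if i < n_between:
            # b is a temporary node: constrain it to the original token layer
            clauses.append(f"{a} . {b}: [cat=/{tok_label}/]")
        else:
            # last link of the chain reaches the right-hand variable
            clauses.append(f"{a} . {b}")
    return " & ".join(clauses)


query = expand_precedence("#x", "#y", 3)
print(query)
```

With three intermediate nodes and the default label ptb_tok, the generated string is identical to the expanded query given above.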
Of course, the original token layer is preserved. The TIGER-QL query #x >#y: [cat=/ptb_tok|ptb_cat/] would retrieve the same results as the query #x > #y on the unmerged PTB annotation, if #x and #y represent variables for nodes, cat=/ptb_cat/ designates PTB node labels and cat=/ptb_tok/ identifies the representation of the original token layer in the merged annotation project.
As the examples in footnotes 2 and 4 show, such equivalent queries on the merged project may be somewhat more complicated than queries over the unmerged annotation projects. However, this is a small price to pay for the ability to formulate queries across multiple layers of annotation, which was simply impossible prior to the merging. Alternatively, it is also possible to adopt one of the original tokenizations as the privileged tokenization just by renaming the layers, even after the merging has been performed.
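One way to picture how a merged project can keep every original token layer intact is sketched below. It assumes that each tokenization is given as a list of half-open character offsets into the same primary text, and it derives the finest common segmentation from the union of all token boundaries. This is an illustrative simplification under those assumptions, not the merging procedure or the PAULA serialization itself; the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def merge_segments(tok_a: List[Span], tok_b: List[Span]) -> List[Span]:
    """Finest common segmentation: cut at every boundary of either input."""
    all_tokens = tok_a + tok_b
    bounds = sorted({offset for span in all_tokens for offset in span})
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        # keep only stretches that lie inside some original token
        # (this drops whitespace gaps between tokens)
        if any(ts <= start and end <= te for ts, te in all_tokens):
            segments.append((start, end))
    return segments


def covering_segments(token: Span, segments: List[Span]) -> List[int]:
    """Indices of the merged segments that an original token spans."""
    return [i for i, (s, e) in enumerate(segments)
            if token[0] <= s and e <= token[1]]


# "golden-rod" as one token vs. "golden", "-", "rod"
tok_a = [(0, 10)]
tok_b = [(0, 6), (6, 7), (7, 10)]
segments = merge_segments(tok_a, tok_b)
print(segments)
print(covering_segments(tok_a[0], segments))
```

In this picture, each original token is recoverable as a span over the merged segments, which is why queries on the original tokenization can still be reproduced, and why choosing a different privileged tokenization amounts to relabelling layers.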
Similarly, phonological units that are not expressed in the primary data can be subject to annotation, for instance short e and o in various Arabic-based orthographies such as the Ajami orthography of Hausa. A term with zero extension at the position of a short vowel can be annotated as having the phonological value e or o without having character status.
This can be compensated by marking the base segmentation differently from alternative segmentations. At the moment, it is, however, not clear to us how this would be represented in the XML format, as segmentations are not specified within GrAF, but defined separately from the annotations. A consistent conception would encode structural information on the structural level, and only linguistic annotation and metadata on the contents level, but it is not yet clear whether LAF/GrAF dummy nodes provide such a clear conceptual separation.
Brants, T. (2000). TnT—A statistical part-of-speech tagger. In Proceedings of the sixth applied natural language processing (ANLP-2000), Seattle, WA, pp. 224–231.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543–565.
Burnard, L. (2007). Reference guide for the British national corpus (XML Edition). http://www.natcorp.ox.ac.uk/XMLedition/URG/bnctags.html. Accessed 6 August 2011.
Carletta, J., Evert, S., Heid, U., Kilgour, J., Robertson, J., & Voormann, H. (2003). The NITE XML toolkit: Flexible annotation for multi-modal language data. Behavior Research Methods, Instruments, and Computers, 35(3), 353–363.
Carlson, L., Marcu, D., & Okurowski, M. E. (2003). Building a discourse-tagged corpus in the framework of rhetorical structure theory. In J. van Kuppevelt & R. W. Smith (Eds.), Current and new directions in discourse and dialogue, text, speech, and language technology; 22 (pp. 85–112). Dordrecht: Kluwer.
Cheng, L., & Demirdache, H. (1990). Superiority violations. In L. Cheng & H. Demirdache (Eds.), Papers on Wh-movement, MIT working papers in linguistics; 13, MITWPL, pp. 27–46.
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., & Stede, M. (2008). A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues), 49(2), 217–246.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Proceedings of the 3rd conference on computational lexicography and text research (COMPLEX 94), Budapest, Hungary, pp. 23–32.
Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: An architecture for development of robust HLT applications. In Proceedings of the 40th anniversary meeting of the association for computational linguistics (ACL-2002), Philadelphia, Pennsylvania, pp. 168–175.
Dipper, S. (2005). XML-based stand-off representation and exploitation of multi-level linguistic annotation. In Proceedings of Berliner XML Tage 2005 (BXML 2005), Berlin, Germany, pp. 39–50.
Dipper, S., & Götze, M. (2005). Accessing heterogeneous linguistic data – Generic XML-based representation and flexible visualization. In Proceedings of the 2nd language and technology conference (L&T’05), Poznan, Poland, pp. 23–30.
Ferrucci, D., & Lally, A. (2004). UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3/4), 327–348.
Guo, J. (1997). Critical tokenization and its properties. Computational Linguistics, 23(4), 569–596.
Henderson, J. C. (2000). A DTD for reference key annotation of EDT entities and RDC relations in the ACE evaluations (v. 5.2.0, 2000/01/05). http://projects.ldc.upenn.edu/ace/annotation/apf.v5.2.0.dtd. Accessed 6 August 2011.
Heycock, C. (1995). Asymmetries in reconstruction. Linguistic Inquiry, 26(4), 547–570.
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the human language technology conference of the NAACL (HLT 2006), New York City, USA, pp. 57–60.
Ide, N. (2008). The American national corpus: Then, now and tomorrow. Keynote paper presented at the HCSNet workshop on designing the Australian national corpus, 4–5 December, UNSW, Sydney, Australia.
Ide, N., & Suderman, K. (2007). GrAF: A graph-based format for linguistic annotations. In Proceedings of the linguistic annotation workshop (LAW) 2007, Prague, Czech Republic, pp. 1–8.
Jiampojamarn, S., & Kondrak, G. (2009). Online discriminative training for grapheme-to-phoneme conversion. In Proceedings of the 10th annual conference of the international speech communication association (Interspeech 2009), Brighton, pp. 1303–1306.
Junghanns, U., & Zybatow, G. (1995). Fokus im Russischen. In Proceedings of the Göttingen focus workshop at the 17th annual conference of the German linguistic society (DGfS 1995), Göttingen, Germany, pp. 113–136.
Kaplan, R., & Newman, P. (1997). Lexical resource reconciliation in the Xerox linguistic environment. In Proceedings of the ACL’97 workshop on computational environments for grammar development and linguistic engineering, Madrid, Spain, pp. 54–61.
Kingsbury, P., & Palmer, M. (2002). From TreeBank to PropBank. In Proceedings of the third international conference on language resources and evaluation (LREC 2002), Las Palmas, Spain, pp. 1989–1993.
Kohler, K. (1996). Labelled data bank of spoken standard German. The Kiel Corpus of read/spontaneous speech. In Proceedings of the fourth international conference on spoken language processing (ICSLP’96), Philadelphia, pp. 1938–1941.
König, E., & Lezius, W. (2000). A description language for syntactically annotated corpora. In Proceedings of the 18th international conference on computational linguistics (COLING 2000), Saarbrücken, Germany, pp. 1056–1060.
Lezius, W. (2002). TIGERSearch. Ein Suchwerkzeug für Baumbanken. In Proceedings of the 6th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), Saarbrücken, Germany, pp. 107–114.
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge: MIT Press.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.
Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zilinska, V., & Young, B. (2004). The NomBank project: An interim report. In HLT-NAACL workshop on frontiers in corpus Annotation, Boston, Massachusetts, pp. 24–31.
Müller, S. (2005). Zur Analyse der scheinbar mehrfachen Vorfeldbesetzung. Linguistische Berichte, 203, 297–330.
Müller, C., & Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In S. Braun, K. Kohn, & J. Mukherjee (Eds.), Corpus technology and language pedagogy: New resources, new tools, new methods (pp. 197–214). Frankfurt, Germany: Peter Lang.
Poesio, M., & Artstein, R. (2008). Anaphoric annotation in the ARRAU corpus. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the sixth international conference on language resources and evaluation (LREC 2008), Marrakech, Morocco.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008). The Penn Discourse TreeBank 2.0. In Proceedings of the sixth international conference on language resources and evaluation (LREC 2008), Marrakech, Morocco.
Pustejovsky, J., Hanks, P., Saurí, R., See, A., Gaizauskas, R., Setzer, A., Radev, D., Sundheim, B., Day, D., Ferro, L., & Lazo, M. (2003). The TIMEBANK corpus. In Corpus linguistics, pp. 647–656.
Rehm, G., Schonefeld, O., Witt, A., Chiarcos, C., & Lehmberg, T. (2008). SPLICR: A sustainability platform for linguistic corpora and resources. In A. Storrer, A. Geyken, A. Siebert, & K. M. Würzner (Eds.), Text resources and lexical knowledge (pp. 85–96). Berlin, Germany: Mouton de Gruyter.
Sampson, G. R. (1999). CHRISTINE corpus, stage I: Documentation. http://www.grsampson.net/ChrisDoc.htm.
Schmidt, T. (2004). Transcribing and annotating spoken language with EXMARaLDA. In Proceedings of the LREC 2004 workshop on XML based richly annotated corpora, Lisboa, Portugal.
Sekerina, I. (1997). The syntax and processing of scrambling constructions in Russian. PhD thesis, The City University of New York.
Stede, M., Bieler, H., Dipper, S., & Suriyawongkul, A. (2006). Summar: Combining linguistics and statistics for text summarization. In Proceedings of the 17th European conference on artificial intelligence (ECAI-06), Riva del Garda, Italy, pp. 827–828.
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., & Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In MUC6: Proceedings of the 6th conference on message understanding, Morristown, NJ, USA, pp. 45–52.
Wolf, F., & Gibson, E. (2005). Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2), 249–287.
Wu, D. (1998). A position statement on Chinese segmentation. In Proceedings of the Chinese language processing workshop, University of Pennsylvania, Philadelphia, Pennsylvania.
Yamamoto, K., Kudo, T., Konagaya, A., & Matsumoto, Y. (2003). Protein name tagging for biomedical annotation in text. In Proceedings of the ACL 2003 workshop on natural language processing in biomedicine, Morristown, NJ, USA, pp. 65–72.
Zeldes, A., Ritz, J., Lüdeling, A., & Chiarcos, C. (2009). ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of corpus linguistics 2009, Liverpool, UK.
Zipser, F., & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010), Valletta, Malta.
The title of our paper is taken from the poem September by Helen Hunt Jackson. The poem provides us not only with a nice title but also with a number of typical tokenization issues, e.g., the tokenization of golden-rod (in some versions actually spelled goldenrod or golden rod), brook-side and meadow-nook (with analogous spelling alternatives), the genitives gentian’s and grapes’, as well as the short form 'T is (or 'Tis) for it is. Our research was conducted in the context of the Collaborative Research Center (SFB) 632 “Information Structure” (Potsdam/Berlin), funded by the Deutsche Forschungsgemeinschaft (DFG). This paper has benefited greatly from helpful pointers to problems and examples provided by Ekaterina Buyko, Timo Baumann, Stavros Skopeteas, Pavel Logačev, Elena Karvovskaya, and Halyna Finzen. We would also like to thank the attendees and the program committee of the Third Linguistic Annotation Workshop, two anonymous reviewers, and our colleagues Amir Zeldes and Florian Zipser for their comments and feedback.
Cite this article
Chiarcos, C., Ritz, J. & Stede, M. By all these lovely tokens... Merging conflicting tokenizations. Lang Resources & Evaluation 46, 53–74 (2012). https://doi.org/10.1007/s10579-011-9161-0
- Linguistic annotation
- Multi-layer annotation
- Conflicting tokenizations
- Tokenization alignment
- Corpus linguistics