You Don’t Have to Think Twice if You Carefully Tokenize
Most tokenizers in current use only segment a text into tokens and group those tokens into sentences. We do not think this is how a tokenizer should work. We believe that a tokenizer should also support the subsequent analysis components as well as it can.
We present a tokenizer with a strong focus on transparency. First, the tokenizer's decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and, as a consequence, the faster creation of better tokenizer versions. Second, all detected information that might be relevant for subsequent analysis components is made transparent by XML tags and special information codes for each token. Third, doubtful decisions are also marked by XML tags. This is helpful for off-line applications such as corpus building, where it seems more appropriate to spend a few minutes checking doubtful decisions manually than to work with incorrect data for years.
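The three properties above can be illustrated with a minimal Python sketch. It is not the tokenizer described here: the names `Token`, `tokenize`, `reconstruct` and `to_xml`, the `<tok>` element, the `doubtful` attribute, and the abbreviation heuristic are all hypothetical and chosen only for illustration. The sketch assumes that keeping the whitespace after each token is enough to reconstruct the input, and that a period directly following a short alphabetic token is worth flagging for manual review.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape

# A token is a run of word characters or one non-space symbol,
# followed by whatever whitespace came after it in the source.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


@dataclass
class Token:
    text: str
    space_after: str
    doubtful: bool = False


def tokenize(text):
    """Return (leading_whitespace, tokens) without discarding any character."""
    lead = re.match(r"\s*", text).group(0)
    tokens = [Token(m.group(1), m.group(2))
              for m in TOKEN_RE.finditer(text, len(lead))]
    # Hypothetical heuristic: a period glued to a short alphabetic token
    # may end an abbreviation rather than a sentence, so flag it.
    for prev, cur in zip(tokens, tokens[1:]):
        if (cur.text == "." and prev.space_after == ""
                and prev.text.isalpha() and len(prev.text) <= 3):
            cur.doubtful = True
    return lead, tokens


def reconstruct(lead, tokens):
    """Rebuild the original text from the stored token decisions."""
    return lead + "".join(t.text + t.space_after for t in tokens)


def to_xml(tokens):
    """Emit one illustrative <tok> element per token, marking doubtful ones."""
    lines = []
    for t in tokens:
        attr = ' doubtful="yes"' if t.doubtful else ""
        lines.append(f"<tok{attr}>{escape(t.text)}</tok>")
    return "\n".join(lines)
```

Calling `tokenize("Dr. Smith left.")` in this sketch flags the period after `Dr` and leaves the final period unflagged, and `reconstruct` returns the input unchanged. A reviewer could then search the XML output for `doubtful="yes"` instead of rereading the whole corpus.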