Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

In this work, we investigate the use of sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has been traditionally approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple CRF binary sequence labelers and obtain substantial reductions in tokenization error rate over off-the-shelf standard tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism to resource creation, i.e. error detection in annotated corpora. To investigate the limits of our approach, we study an interpretation of the tokenization problem that shows stark contrasts to ‘classic’ schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting use of punctuation and multi-word lexical units). In this setup, we also look at partial disambiguation by presenting a token lattice to downstream processing.
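The abstract frames tokenization as binary sequence labeling without spelling out the label encoding. As a minimal sketch of one way to set this up, assume character-level labels where 1 marks a character that ends a token and 0 marks any other character, and assume that tokens contain no whitespace. The helper names `tokens_to_labels` and `labels_to_tokens` are hypothetical and are not taken from the paper; a trained labeler such as a CRF would predict the label sequence that the first function derives from gold-standard tokens.

```python
from typing import List


def tokens_to_labels(text: str, tokens: List[str]) -> List[int]:
    """Derive training labels: 1 on the last character of each gold token, else 0.

    Assumes the tokens appear in order in `text` and contain no whitespace.
    """
    labels = [0] * len(text)
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found in text after offset {pos}")
        pos = start + len(tok)
        labels[pos - 1] = 1
    return labels


def labels_to_tokens(text: str, labels: List[int]) -> List[str]:
    """Rebuild tokens from a label sequence (gold or predicted).

    In this sketch, whitespace always ends a token and is never part of one.
    """
    tokens: List[str] = []
    current: List[str] = []
    for ch, lab in zip(text, labels):
        if not ch.isspace():
            current.append(ch)
        if (lab == 1 or ch.isspace()) and current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "Don't panic, Mr. Smith."
gold = ["Do", "n't", "panic", ",", "Mr.", "Smith", "."]
labels = tokens_to_labels(text, gold)
roundtrip = labels_to_tokens(text, labels)
```

Under this encoding, a different tokenization convention for the same text changes only the label sequence, which is why one labeler architecture can be retrained to replicate different schemes. The whitespace assumption would not hold for a scheme with multi-word lexical units such as the one the abstract mentions.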

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fares, M., Oepen, S., Zhang, Y. (2013). Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_19

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science (R0)
