Large Scale Syntactic Annotation of Written Dutch: Lassy

  • Gertjan van Noord
  • Gosse Bouma
  • Frank Van Eynde
  • Daniël de Kok
  • Jelmer van der Linde
  • Ineke Schuurman
  • Erik Tjong Kim Sang
  • Vincent Vandeghinste
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

This chapter presents the Lassy Small and Lassy Large treebanks, as well as related tools and applications. Lassy Small is a corpus of written Dutch texts (1,000,000 words) which has been syntactically annotated with manual verification and correction. Lassy Large is a much larger corpus (over 500,000,000 words) which has been syntactically annotated fully automatically. In addition, various browse and search tools for syntactically annotated corpora have been developed and made available. Their potential for applications in corpus linguistics and information extraction has been illustrated and evaluated in a series of case studies.

Copyright information

© The Author(s) 2013

Open Access. This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  • Gertjan van Noord (1)
  • Gosse Bouma (1)
  • Frank Van Eynde (2)
  • Daniël de Kok (1)
  • Jelmer van der Linde (1)
  • Ineke Schuurman (2)
  • Erik Tjong Kim Sang (1)
  • Vincent Vandeghinste (2)

  1. University of Groningen, Groningen, The Netherlands
  2. KU Leuven, Leuven, Belgium
