The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch

  • Nelleke Oostdijk
  • Martin Reynaert
  • Véronique Hoste
  • Ineke Schuurman
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme. Such a corpus, sampling texts from conventional and new media, is invaluable for scientific research and application development. The present chapter describes how the Dutch reference corpus was developed in two consecutive STEVIN-funded projects, D-Coi and SoNaR. The construction of the corpus has been guided by (inter)national standards and best practices. At the same time, the achievements and experience gained in the D-Coi and SoNaR projects have contributed to the further advancement and dissemination of these standards and practices.

Notes

Acknowledgements

Thanks are due to our collaborators in these projects (in random order): Paola Monachesi, Gertjan van Noord, Franciska de Jong, Roeland Ordelman, Vincent Vandeghinste, Jantine Trapman, Thijs Verschoor, Lydia Rura, Orphée De Clercq, Wilko Apperloo, Peter Beinema, Frank Van Eynde, Bart Desmet, Gert Kloosterman, Hendri Hondorp, Tanja Gaustad van Zaanen, Eric Sanders, Maaske Treurniet, Henk van den Heuvel, Arjan van Hessen, and Anne Kuijs.

Copyright information

© The Author(s) 2013

Open Access. This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  • Nelleke Oostdijk (1)
  • Martin Reynaert (2)
  • Véronique Hoste (3)
  • Ineke Schuurman (4)

  1. Radboud University Nijmegen, Nijmegen, The Netherlands
  2. Tilburg University, Tilburg, The Netherlands
  3. University College Ghent and Ghent University, Ghent, Belgium
  4. KU Leuven, Leuven, Belgium
