Research on Language and Computation, Volume 6, Issue 2, pp 113–137

On Detecting Errors in Dependency Treebanks

  • Adriane Boyd
  • Markus Dickinson
  • W. Detmar Meurers

Abstract

Dependency relations between words are increasingly recognized as an important level of linguistic representation that is close to the data and at the same time to the semantic functor-argument structure as a target of syntactic analysis and processing. Correspondingly, dependency structures play an important role in parser evaluation and for the training and evaluation of tools based on dependency treebanks. Gold standard dependency treebanks have been created for some languages, most notably Czech, and annotation efforts for other languages are under way. At the same time, general techniques for detecting errors in dependency annotation have not yet been developed. We address this gap by exploring how a technique proposed for detecting errors in constituency-based syntactic annotation can be adapted to systematically detect errors in dependency annotation. Building on an analysis of key properties and differences between constituency and dependency annotation, we discuss results for dependency treebanks for Swedish, Czech, and German. Complementing the focus on detecting errors in dependency treebanks to improve these gold standard resources, the discussion of dependency error detection for different languages and annotation schemes also raises questions of standardization for some aspects of dependency annotation, in particular regarding the locality of annotation, the assumption of a single head for each dependency relation, and phenomena such as coordination.

Keywords

Corpus annotation, Dependency grammar, Error detection, Prague Dependency Treebank, Talbanken Dependency Treebank, Tiger Dependency Bank

Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  • Adriane Boyd (1)
  • Markus Dickinson (2)
  • W. Detmar Meurers (3)
  1. Department of Linguistics, The Ohio State University, Columbus, USA
  2. Department of Linguistics, Indiana University, Bloomington, USA
  3. Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, Germany