Building a Parallel Bilingual Syntactically Annotated Corpus

  • Jan Cuřín
  • Martin Čmejrek
  • Jiří Havelka
  • Vladislav Kuboň
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)


This paper describes the process of building a bilingual syntactically annotated corpus, the PCEDT (Prague Czech-English Dependency Treebank). The corpus is being created at Charles University, Prague, and its release as a Linguistic Data Consortium data collection is scheduled for the spring of 2004. The paper discusses the important decisions made before the project started and gives an overview of the kinds of resources included in the PCEDT.


Keywords (added by machine, not by the authors): Institutional Investor; Machine Translation; Mathematical Linguistics; Dependency Tree; Annotation Scheme





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jan Cuřín (2)
  • Martin Čmejrek (2)
  • Jiří Havelka (1, 2)
  • Vladislav Kuboň (1)

  1. Institute of Formal and Applied Linguistics, Charles University in Prague
  2. Center for Computational Linguistics, Charles University in Prague
