Sinica Treebank

  • Chu-Ren Huang
  • Keh-Jiann Chen


Sinica Treebank is both the first Chinese treebank (released in 2000, simultaneously with the Penn Chinese Treebank) and the first treebank fully annotated with thematic role information. As such, its construction addresses both theoretical and modeling issues in innovative ways. It also addresses the challenges posed by the lack of conventions for marking word breaks and sentence boundaries in Chinese texts. The solution was based on maximal resource sharing: the Sinica Treebank is built upon the PoS-tagged Sinica Corpus and relies heavily on the grammatical information of the CKIP lexicon, encoded in Information-based Case Grammar (ICG). We discuss the annotation guidelines of the Sinica Treebank as well as its three design criteria: Maximal Resource Sharing, Minimal Structural Complexity, and Optimal Semantic Information.


Chinese; Sinica Treebank; Thematic role annotation; Information-based Case Grammar



Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. The Hong Kong Polytechnic University, Kowloon, Hong Kong
  2. Academia Sinica, Taipei, Taiwan
