Sinica Treebank

  • Chapter
Handbook of Linguistic Annotation

Abstract

Sinica Treebank is both the first Chinese treebank (released in 2000, simultaneously with the Penn Chinese Treebank) and the first treebank fully annotated with thematic role information. Its construction therefore addresses both theoretical and modeling issues in innovative ways, including the challenges posed by the lack of conventions for marking word breaks and sentence boundaries in Chinese texts. The solution is based on maximal resource sharing: the Sinica Treebank is built upon the PoS-tagged Sinica Corpus and relies heavily on the grammatical information of the CKIP lexicon, encoded in Information-based Case Grammar (ICG). We discuss the annotation guidelines of the Sinica Treebank and its three design criteria: Maximal Resource Sharing, Minimal Structural Complexity, and Optimal Semantic Information.

Author information

Corresponding author

Correspondence to Chu-Ren Huang.

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Huang, CR., Chen, KJ. (2017). Sinica Treebank. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_23

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_23

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences, Social Sciences (R0)
