Abstract
In this chapter, we discuss some of the basic challenges involved in analyzing sentences and designing a scheme for syntactic annotation. We define the basic concept of syntactic annotation and comment on its nature, method, and function in a language. Next, we consider the goals and purposes behind developing a syntactic annotation tool for a language. Guidelines and instructions for developing such a tool already exist for some advanced languages, and we do not revisit those issues and strategies here. Rather, we focus on the theoretical and practical importance of syntactic annotation in the process of extracting syntactic information from a sentence. During syntactic annotation, we supply a natural-language sentence to a machine as input and instruct it to identify phrases and mark their grammatical-cum-syntactic roles in the sentence. This implies that the machine has to learn how phrases are formed and organized so that it understands how a sentence is to be analyzed and interpreted from the perspective of the syntactic function and semantic information of words and phrases. The machine also needs to learn how the syntactic-cum-semantic roles of various syntactic units are controlled by their lexical associations and morphological functions when information embedded within a sentence is retrieved. We address all these issues in this chapter and present some ideas and processes that are normally used in syntactic annotation.
In the course of formulating the basic ideas, we refer to the rules of context-free grammars and show how the outputs generated from a syntactically annotated corpus can be used to describe the grammar of a language more accurately, to teach its grammatical forms with better information and analysis, to understand how the human brain applies syntactic rules to form sentences, to design syntactic rules for training a computer, and to develop language-related applications with proper syntactic information.
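As a rough illustration of what identifying phrases and marking their roles with context-free rules can look like, the sketch below runs a small CYK-style chart parser over a toy grammar and returns a labelled bracketing of the input sentence. The grammar, the lexicon, the tag names (`S`, `NP`, `VP`, `Det`, `N`, `V`), and the function name `annotate` are all assumptions made for this example; they are not the annotation scheme or tool described in the chapter.

```python
# Toy context-free grammar in Chomsky normal form.
# Every rule, tag, and word here is an illustrative assumption.
BINARY_RULES = {
    ("NP", "VP"): "S",    # sentence -> noun phrase + verb phrase
    ("Det", "N"): "NP",   # noun phrase -> determiner + noun
    ("V", "NP"): "VP",    # verb phrase -> verb + noun phrase
}
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "chased": "V", "saw": "V",
}


def annotate(sentence):
    """Return a labelled bracketing of the sentence, or None if no parse exists."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)] maps a phrase label to the bracketed string covering words[i:j]
    chart = {}
    for i, word in enumerate(words):
        cell = {}
        if word in LEXICON:
            tag = LEXICON[word]
            cell[tag] = f"[{tag} {word}]"
        chart[(i, i + 1)] = cell
    # Build longer spans from pairs of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for left_label, left_tree in chart[(i, k)].items():
                    for right_label, right_tree in chart[(k, j)].items():
                        parent = BINARY_RULES.get((left_label, right_label))
                        if parent and parent not in cell:
                            cell[parent] = f"[{parent} {left_tree} {right_tree}]"
            chart[(i, j)] = cell
    return chart.get((0, n), {}).get("S")


if __name__ == "__main__":
    print(annotate("The dog chased a cat"))
```

For the sentence in the usage line, the function yields a bracketing in which the subject noun phrase and the verb phrase are marked as separate constituents under `S`. A sentence that the toy grammar cannot derive returns `None`, which is one simple way such a tool can flag input for manual annotation.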
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S. (2021). Syntactic Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_10
DOI: https://doi.org/10.1007/978-981-16-2960-0_10
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)