Abstract
This article describes the creation of a collection of sentences manually annotated with respect to their sentence structure. We show that linear segments (linguistically motivated units that can easily be detected automatically) serve as a good basis for identifying clauses in Czech. The segment annotation captures relationships such as subordination, coordination, apposition and parenthesis; individual clauses forming a complex sentence are then identified on the basis of segmentation charts. Annotating sentence structure enriches a dependency-based framework with explicit syntactic information on relations among complex units such as clauses. We have gathered a collection of 3,444 sentences from the Prague Dependency Treebank and annotated them with respect to their sentence structure (these sentences comprise 10,746 segments forming 6,341 clauses). The main purpose of the project is to obtain development data: promising results have already been reported for Czech NLP tools (such as a dependency parser or a machine translation system for related languages) that adopt the idea of clause segmentation. The collection of sentences with annotated sentence structure makes further improvement of such tools possible.
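To give a rough sense of scale, the counts reported above can be turned into average densities. The short sketch below only does arithmetic on the three figures from the abstract; the variable and function names are illustrative and do not come from the article.

```python
# Corpus counts as stated in the abstract.
SENTENCES = 3444
SEGMENTS = 10746
CLAUSES = 6341


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to two decimal places."""
    return round(numerator / denominator, 2)


# Average densities derived from the reported counts.
segments_per_sentence = ratio(SEGMENTS, SENTENCES)
clauses_per_sentence = ratio(CLAUSES, SENTENCES)
segments_per_clause = ratio(SEGMENTS, CLAUSES)

print(f"segments per sentence: {segments_per_sentence}")
print(f"clauses per sentence:  {clauses_per_sentence}")
print(f"segments per clause:   {segments_per_clause}")
```

On average, then, an annotated sentence contains roughly three segments and just under two clauses, so a typical clause spans more than one segment.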
Notes
E.g., in the experiments reported by Lopatková and Holan (2009), the correct level of embedding was assigned to only approx. 75% of segments.
In Czech, a subordinate clause representing the object must be separated by a comma and introduced by a subordinating conjunction, as in Řekla, že přijde ('She said that she would come').
We consider main clauses to be those clauses that are syntactically/formally independent; see also Section 3.
This decision enables us to speed up the annotation and to avoid undesirable overlapping or repeated annotation: the analytical layer of the PDT already contains information on syntactic functions (such as predicate, subject, object, nominal predicate, attribute, or adverbial), while detailed semantic classification pertains to the tectogrammatical layer of the PDT.
Quotation marks marking direct speech have to be combined with another boundary in Czech, primarily with a comma. This rule makes it possible to reliably distinguish direct speech from cases where quotation marks are used for other purposes, e.g., for emphasizing individual words; the latter type gets the same level of embedding as its neighbors.
In the PDT, a coordination of sentence members and a coordination of clauses are not distinguished (at the analytical layer).
The reason for this decision lies in the verb-centric character of dependency syntax traditionally used for Czech.
At the a-layer, the ellipsis of a predicate is marked by a special analytical function; at the t-layer, ellipsis is restored (as a node of a tree).
We have focused on the sentences from the data/full/amw/train2 portion of the PDT data, i.e., one of the eight directories with the standard PDT training data annotated on both the m- and a-layers; the number of annotated sentences is approximately the same as the number of sentences in the development data set from this portion of the PDT.
References
Abney, S. P. (1991). Parsing by chunks. In R. Berwick, S. Abney, & C. Tenny (Eds.). Principle-based parsing (pp. 257–278). Dordrecht: Kluwer Academic Publishers.
Abney, S. P. (1995). Partial parsing via finite-state cascades. Journal of Natural Language Engineering 2(4), 337–344.
Ciravegna, F., & Lavelli, A. (1999). Full text parsing using cascades of rules: An information extraction procedure. In Proceedings of EACL’99 (pp. 102–109). University of Bergen, Bergen.
Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech). Prague: Karolinum Press.
Hajič, J., Panevová, J., Buráňová, E., Urešová, Z., Bémová, A., Štěpánek, J., et al. (2004). Anotace na analytické rovině. Návod pro anotátory. UFAL/CKL technical report no. 2004/TR-2004-23, ÚFAL/CKL MFF UK.
Hajič, J., Hajičová, E., Panevová, J., Sgall, P., Pajas, P., Štěpánek, J., et al. (2006). Prague dependency treebank 2.0. Philadelphia: Linguistic Data Consortium.
Holan, T., & Žabokrtský, Z. (2006). Combining Czech dependency parsers. In Proceedings of TSD 2006 (pp. 95–102). Springer, LNAI, Vol. 4188.
Homola, P., & Kuboň, V. (2010). Exploiting charts in the MT between related languages. International Journal of Computational Linguistics and Applications 1(1–2), 185–199.
Jones, B. E. M. (1994). Exploiting the role of punctuation in parsing natural text. In Proceedings of COLING’94 (pp. 421–425).
Krůza, O., & Kuboň, V. (2009). Automatic extraction of clause relationships from a treebank. In Computational linguistics and intelligent text processing. Proceedings of CICLing 2009 (pp. 195–206). Springer, LNCS, Vol. 5449.
Kuboň, V. (2001). Problems of robust parsing of Czech. PhD thesis, Faculty of Mathematics and Physics, Charles University in Prague, Prague.
Kuboň, V., Lopatková, M., Plátek, M., & Pognan, P. (2007). A linguistically-based segmentation of complex sentences. In D. Wilson & G. Sutcliffe (Eds.). Proceedings of FLAIRS conference (pp. 368–374). Menlo Park, CA: AAAI Press.
Lopatková, M., & Holan, T. (2009). Segmentation charts for Czech—Relations among segments in complex sentences. In A. H. Dediu, A. M. Ionescu, & C. Martín-Vide (Eds.). Proceedings of LATA 2009 (Vol. 5457, pp. 542–553). New York: Springer, LNCS.
Lopatková, M., & Kljueva, N. (2010). Anotace segmentů. (Anotanční příručka) (in manuscript).
Marinčič, D., Šef, T., & Gams, M. (2010). Intraclausal coordination and clause detection as a preprocessing step to dependency parsing. In V. Matoušek & P. Mautner (Eds.). Proceedings of TSD 2009 (Vol. 5729, pp. 147–153). Springer, LNAI, New York.
Ohno, T., Matsubara, S., Kashioka, H., Maruyama, T., & Inagaki, Y. (2006). Dependency parsing of Japanese spoken monologue based on clause boundaries. In Proceedings of COLING and ACL, ACL (pp. 169–176).
Šmilauer, V. (1969). Novočeská skladba (New Czech syntax). PhD thesis, Praha: Státní pedagogické nakladatelství.
Zeman, D. (2004). Parsing with a statistical dependency model. PhD thesis, Prague: Charles University in Prague.
Acknowledgments
The article presents the results of a project supported by grant No. 405/08/0681 and partially by grant No. P202/10/1333 of the Grant Agency of the Czech Republic. The authors are also grateful to the anonymous reviewers for their valuable suggestions.
Cite this article
Lopatková, M., Homola, P. & Klyueva, N. Annotation of sentence structure. Lang Resources & Evaluation 46, 25–36 (2012). https://doi.org/10.1007/s10579-011-9162-z
DOI: https://doi.org/10.1007/s10579-011-9162-z