Annotation of sentence structure

Lopatková, Markéta; Homola, Petr; Klyueva, Natalia

doi:10.1007/s10579-011-9162-z

Capturing the relationship between clauses in Czech sentences

Original paper
Published: 28 August 2011

Language Resources and Evaluation, volume 46, pages 25–36 (2012)

Markéta Lopatková¹,
Petr Homola¹ &
Natalia Klyueva¹

Abstract

The focus of this article is on the creation of a collection of sentences manually annotated with respect to their sentence structure. We show that the concept of linear segments (linguistically motivated units that can easily be detected automatically) serves as a good basis for the identification of clauses in Czech. The segment annotation captures relationships such as subordination, coordination, apposition and parenthesis; based on segmentation charts, the individual clauses forming a complex sentence are identified. The annotation of sentence structure enriches a dependency-based framework with explicit syntactic information on relations among complex units such as clauses. We have gathered a collection of 3,444 sentences from the Prague Dependency Treebank and annotated them with respect to their sentence structure (these sentences comprise 10,746 segments forming 6,341 clauses). The main purpose of the project is to obtain development data: promising results have already been reported for Czech NLP tools (such as a dependency parser or a machine translation system for related languages) that adopt the idea of clause segmentation. The collection of sentences with annotated sentence structure makes further improvement of such tools possible.
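
To make the notion of automatically detectable linear segments more concrete, the following is a minimal sketch and not the authors' tool. It assumes that segment boundaries are punctuation marks only, and it uses a small hypothetical list of Czech subordinating conjunctions to guess a level of embedding (0 for a main segment, 1 for a segment opened by a subordinator). The names `split_segments`, `embedding_levels`, `BOUNDARY` and `SUBORDINATORS` are illustrative assumptions; the actual annotation scheme is richer and was applied manually.

```python
import re

# Assumption: a segment boundary is a punctuation mark (comma, semicolon,
# colon, or bracket). The real scheme also recognises other boundary types.
BOUNDARY = re.compile(r"\s*[,;:()]\s*")

# Hypothetical, deliberately tiny list of Czech subordinating conjunctions.
SUBORDINATORS = {"že", "aby", "protože", "když", "pokud", "jestliže"}


def split_segments(sentence):
    """Split a sentence into linear segments at punctuation boundaries."""
    body = sentence.strip().rstrip(".!?")
    return [seg for seg in BOUNDARY.split(body) if seg]


def embedding_levels(segments):
    """Guess a level of embedding per segment (illustrative heuristic only).

    A segment that starts with a known subordinating conjunction gets
    level 1; every other segment gets level 0.
    """
    levels = []
    for seg in segments:
        first_word = seg.split()[0].lower()
        levels.append(1 if first_word in SUBORDINATORS else 0)
    return levels


if __name__ == "__main__":
    segments = split_segments("Řekla, že přijde.")
    print(list(zip(segments, embedding_levels(segments))))
    # prints [('Řekla', 0), ('že přijde', 1)]
```

The example sentence Řekla, že přijde ("She said that she would come") yields two segments, the second of which is introduced by the conjunction že and is therefore treated as subordinate in this sketch. A heuristic of this kind cannot handle nested or coordinated clauses reliably, which is one reason manually annotated data are useful.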

Notes

We adopt the basic idea of segments introduced and used by Kuboň (2001) and Kuboň et al. (2007). We slightly modify it for the purposes of the annotation task.
http://ufal.mff.cuni.cz/pdt2.0/.
E.g., in experiments reported by Lopatková and Holan (2009), a correct level of embedding was assigned only to approx. 75% of segments.
In Czech, a subordinate clause representing the object must be separated by a comma and introduced by a subordinating conjunction, as in Řekla, že přijde.
We consider main clauses to be those clauses that are syntactically/formally independent, see also Section 3.
This decision enables us to speed up the annotation as well as to avoid undesired overlapped/repeated annotation: The analytical layer of the PDT already contains the information on syntactic functions (like predicate, subject, object, nominal predicate, attribute, or adverbial); detailed semantic classification pertains to the tectogrammatical layer of the PDT.
Quotation marks marking direct speech have to be combined with another boundary in Czech, primarily with a comma. This rule serves for reliably distinguishing direct speech from the cases when quotation marks are used, e.g., for emphasizing individual words—the latter type gets the same level of embedding as its neighbors.
In the PDT, a coordination of sentence members and a coordination of clauses are not distinguished (at the analytical layer).
The reason for this decision lies in the verb-centric character of dependency syntax traditionally used for Czech.
At the a-layer, the ellipsis of a predicate is marked by a special analytical function; at the t-layer, ellipsis is restored (as a node of a tree).
We have focused on the sentences from the data/full/amw/train2 portion of the PDT data, i.e., one (out of eight) of the directories with the PDT standard training data annotated on both the m- and a-layers; the number of annotated sentences is approximately the same as the number of sentences in the development data set from this portion of the PDT.

References

Abney, S. P. (1991). Parsing by chunks. In R. Berwick, S. Abney, & C. Tenny (Eds.). Principle-based parsing (pp. 257–278). Dordrecht: Kluwer Academic Publishers.
Abney, S. P. (1995). Partial parsing via finite-state cascades. Journal of Natural Language Engineering 2(4), 337–344.
Ciravegna, F., & Lavelli, A. (1999). Full text parsing using cascades of rules: An information extraction procedure. In Proceedings of EACL’99 (pp. 102–109). University of Bergen, Bergen.
Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech). Prague: Karolinum Press.
Hajič, J., Panevová, J., Buráňová, E., Urešová, Z., Bémová, A., Štěpánek, J., et al. (2004). Anotace na analytické rovině. Návod pro anotátory. UFAL/CKL technical report no. 2004/TR-2004-23, ÚFAL/CKL MFF UK.
Hajič, J., Hajičová, E., Panevová, J., Sgall, P., Pajas, P., Štěpánek, J., et al. (2006). Prague dependency treebank 2.0. Philadelphia: Linguistic Data Consortium.
Holan, T., & Žabokrtský, Z. (2006). Combining Czech dependency parsers. In Proceedings of TSD 2006 (pp. 95–102). Springer, LNAI, Vol. 4188.
Homola, P., & Kuboň, V. (2010). Exploiting charts in the MT between related languages. International Journal of Computational Linguistics and Applications 1(1–2), 185–199.
Jones, B. E. M. (1994). Exploiting the role of punctuation in parsing natural text. In Proceedings of COLING’94 (pp. 421–425).
Krůza, O., & Kuboň, V. (2009). Automatic extraction of clause relationships from a treebank. In Computational linguistics and intelligent text processing. Proceedings of CICLing 2009 (pp. 195–206). Springer, LNCS, Vol. 5449.
Kuboň, V. (2001). Problems of robust parsing of Czech. PhD thesis, Faculty of Mathematics and Physics, Charles University in Prague, Prague.
Kuboň, V., Lopatková, M., Plátek, M. & Pognan, P. (2007). A linguistically-based segmentation of complex sentences. In D. Wilson & G. Sutcliffe (Eds.). Proceedings of FLAIRS conference (pp. 368–374). Menlo Park, CA: AAAI Press.
Lopatková, M. & Holan, T. (2009). Segmentation charts for Czech—Relations among segments in complex sentences. In A. H. Dediu, A. M. Ionescu, & C. Martín-Vide (Eds.). Proceedings of LATA 2009 (Vol. 5457, pp. 542–553). New York: Springer, LNCS.
Lopatková, M., & Kljueva, N. (2010). Anotace segmentů. (Anotanční příručka) (in manuscript).
Marinčič, D., Šef, T., & Gams, M. (2010). Intraclausal coordination and clause detection as a preprocessing step to dependency parsing. In V. Matoušek, & P. Mautner (Eds.) Proceedings of TSD 2009 (Vol. 5729, pp. 147–153). Springer, LNAI, New York.
Ohno, T., Matsubara, S., Kashioka, H., Maruyama, T., & Inagaki, Y. (2006). Dependency parsing of Japanese spoken monologue based on clause boundaries. In Proceedings of COLING and ACL (pp. 169–176). ACL.
Šmilauer, V. (1969). Novočeská skladba (New Czech syntax). PhD thesis, Praha: Státní pedagogické nakladatelství.
Zeman, D. (2004). Parsing with a statistical dependency model. PhD thesis, Prague: Charles University in Prague.

Acknowledgments

The article presents the results of a project supported by grant No. 405/08/0681 and partially by grant No. P202/10/1333 of the Grant Agency of the Czech Republic. The authors are also grateful to the anonymous reviewers for their valuable suggestions.

Author information

Authors and Affiliations

Charles University in Prague, Faculty of Mathematics and Physics, Prague, Czech Republic
Markéta Lopatková, Petr Homola & Natalia Klyueva

Corresponding author

Correspondence to Markéta Lopatková.

About this article

Cite this article

Lopatková, M., Homola, P. & Klyueva, N. Annotation of sentence structure. Lang Resources & Evaluation 46, 25–36 (2012). https://doi.org/10.1007/s10579-011-9162-z

Published: 28 August 2011
Issue Date: March 2012
DOI: https://doi.org/10.1007/s10579-011-9162-z
