Abstract
We present a feature-rich system for the automatic evaluation of surface text coherence in Czech essays written by native and non-native speakers. In addition to basic features covering spelling, vocabulary, morphology and syntax, the EVALD system rests on two main pillars, discourse relations and coreference, whose features are closely related to surface coherence. We now add a third pillar: features targeting topic–focus articulation (sentence information structure). To this end, we propose and implement a procedure that reveals topic–focus articulation by automatically marking contextual boundness in the text. The experiments show that EVALD enriched with topic–focus articulation features outperforms the original system. Further experiments show that the system for essays written by non-native speakers behaves differently from the system for native speakers with respect to the importance of individual feature sets and the size of the training data.
Supported by the Ministry of Culture of the Czech Republic (project No. DG16P02B016 Automatic Evaluation of Text Coherence in Czech). This work used language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).
Notes
1. The Prague Dependency Treebank (for its latest version, 3.5, see [6]) is a corpus of Czech newspaper texts (containing almost 50 thousand sentences) with multi-layer annotation in the theoretical framework of the Functional Generative Description, see [26, 27]. It contains manual annotation on the morphological, surface syntactic and deep semantico-syntactic (tectogrammatical) layers. On top of the dependency trees of the tectogrammatical layer, the PDT also contains manual annotation of coreference, discourse relations (including annotation of discourse connectives), and topic–focus articulation.
2. As a theoretical background for capturing discourse relations in text, the approach described in [19] is used. It is based on the approach used for the annotation of the Penn Discourse Treebank 2.0 (PDTB; [20]). Both approaches are lexically based and aim at capturing local discourse relations between clauses, sentences, or short spans of text.
3. Weka toolkit ver. 3.8.0, downloaded from http://www.cs.waikato.ac.nz/ml/weka/.
4.
5. In general, the topic/focus categories do not need to overlap with given/new information (see [5]).
6. With some exceptions, see below.
7. Some types of tectogrammatical nodes, such as coordinating nodes, are not TFA-relevant, i.e. they are not annotated as contextually bound or non-bound.
8. All data of the Prague Dependency Treebank have been manually annotated for contextual boundness, see [18].
9. The authors of [16] distinguish pronoun features from coreference features. Here, we denote the union of these two sets as coreference features.
10. Apart from L1 texts, they also evaluate the system on L2 essays manually annotated with a mark for “text organization”, although not as a classification task but, again, only as pairwise ranking.
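As an illustration of the kind of quantity that contextual boundness annotation makes available (notes 7 and 8), the following minimal sketch computes the share of contextually bound nodes among the TFA-relevant nodes of a sentence. It is an assumption made for illustration only and is not taken from the EVALD feature set: the function name `bound_ratio` and the example sentence are hypothetical, and the labels follow the PDT convention of `t` (contextually bound), `c` (contrastive contextually bound) and `f` (contextually non-bound), with `None` standing for nodes that carry no TFA value.

```python
from typing import Iterable, Optional

# PDT-style values of the tfa attribute: 't' = contextually bound,
# 'c' = contrastive contextually bound, 'f' = contextually non-bound.
BOUND = {"t", "c"}
NON_BOUND = {"f"}


def bound_ratio(tfa_values: Iterable[Optional[str]]) -> Optional[float]:
    """Share of contextually bound nodes among TFA-relevant nodes.

    Nodes whose value is None (e.g. coordinating nodes) are skipped.
    Returns None when the input contains no TFA-relevant node.
    """
    bound = 0
    relevant = 0
    for value in tfa_values:
        if value in BOUND:
            bound += 1
            relevant += 1
        elif value in NON_BOUND:
            relevant += 1
    return bound / relevant if relevant else None


# Hypothetical sentence: four annotated nodes and one coordinating node.
sentence = ["t", "f", None, "c", "f"]
ratio = bound_ratio(sentence)
```

A per-sentence value of this kind could, for example, be averaged over an essay to obtain one document-level feature; whether the actual system uses such a feature is not stated here.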
References
1. Boyd, A., et al.: The MERLIN corpus: learner language and the CEFR. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 1281–1288. European Language Resources Association, Reykjavík (2014)
2. Feng, V.W., Lin, Z., Hirst, G.: The impact of deep hierarchical discourse structures in the evaluation of text coherence. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 940–949 (2014)
3. Hajičová, E., Havelka, J., Veselá, K.: Corpus evidence of contextual boundness and focus. In: Danielsson, P. (ed.) Proceedings of the Corpus Linguistics Conference Series, vol. 1, pp. 1–9. University of Birmingham, Birmingham (2005)
4. Hajičová, E., Sgall, P., Partee, B.: Topic-Focus Articulation, Tripartite Structures, and Semantic Content. Kluwer, Dordrecht (1998). ISBN 0-7923-5289-0
5. Hajičová, E., Mírovský, J.: Topic/focus vs. given/new: information structure and coreference relations in an annotated corpus. In: 51st Annual Meeting of the Societas Linguistica Europaea, Book of Abstracts, Tallinn, Estonia (in press)
6. Hajič, J., et al.: Prague Dependency Treebank 3.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University (2018). http://hdl.handle.net/11234/1-2621
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The Weka data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)
8. Hancke, J., Meurers, D.: Exploring CEFR classification for German based on rich linguistic modeling. Learner Corpus Research, pp. 54–56 (2013)
9. Joshi, A.K., Weinstein, S.: Control of inference: role of some aspects of discourse structure-centering. In: IJCAI, pp. 385–387 (1981)
10. Lin, Z., Ng, H.T., Kan, M.Y.: Automatically evaluating text coherence using discourse relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 997–1006. Association for Computational Linguistics (2011)
11. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text-Interdisc. J. Study Discourse 8(3), 243–281 (1988)
12. Miltsakaki, E., Kukich, K.: Evaluation of text coherence for electronic essay scoring systems. Nat. Lang. Eng. 10(1), 25–55 (2004)
13. Mírovský, J., Rysová, K., Rysová, M., Hajičová, E.: (Pre-)annotation of topic-focus articulation in Prague Czech-English Dependency Treebank. In: Proceedings of the 6th International Joint Conference on Natural Language Processing, pp. 55–63. Asian Federation of Natural Language Processing, Nagoya (2013)
14. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0, data/software. ÚFAL MFF UK, Prague, Czechia (2017)
15. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0 for Foreigners, data/software. ÚFAL MFF UK, Prague, Czechia (2017)
16. Novák, M., Rysová, K., Rysová, M., Mírovský, J.: Incorporating coreference to automatic evaluation of coherence in essays. In: Camelin, N., Estève, Y., Martín-Vide, C. (eds.) SLSP 2017. LNCS (LNAI), vol. 10583, pp. 58–69. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68456-7_5
17. Östling, R., Smolentzov, A., Hinnerich, B.T., Höglin, E.: Automated essay scoring for Swedish. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 42–47 (2013)
18. Panevová, J., Böhmová, A., Hajičová, E., Sgall, P., Ceplová, M., Řezníčková, V.: A manual for tectogrammatical tagging of the Prague Dependency Treebank. Technical report TR-2000-09 (2000)
19. Poláková, L., Mírovský, J., Nedoluzhko, A., Jínová, P., Zikánová, Š., Hajičová, E.: Introducing the Prague Discourse Treebank 1.0. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 91–99. Asian Federation of Natural Language Processing, Nagoya (2013)
20. Prasad, R., et al.: The Penn Discourse Treebank 2.0. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pp. 2961–2968. European Language Resources Association, Marrakech (2008)
21. Rysová, K., Mírovský, J., Hajičová, E.: On an apparent freedom of Czech word order. A case study. In: 14th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 93–105. IPIPAN, Warszawa (2015)
22. Rysová, K., Rysová, M., Mírovský, J.: Automatic evaluation of surface coherence in L2 texts in Czech. In: Proceedings of the 28th Conference on Computational Linguistics and Speech Processing ROCLING XXVIII (2016), pp. 214–228. National Cheng Kung University, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taipei (2016)
23. Rysová, K., Rysová, M., Mírovský, J., Novák, M.: Introducing EVALD - software applications for automatic evaluation of discourse in Czech. In: Angelova, G., Boncheva, K., Mitkov, R., Nikolova, I., Temnikova, I. (eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 634–641. Bulgarian Academy of Sciences, INCOMA Ltd., Šumen (2017)
24. Šebesta, K., Bedřichová, Z., Šormová, K., et al.: AKCES 5 (CzeSL-SGT), data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2014)
25. Šebesta, K., Goláňová, H., Letafková, J., et al.: AKCES 1, data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2016)
26. Sgall, P.: Generativní systémy v lingvistice [Generative systems in linguistics]. Slovo a slovesnost 25(4), 274–282 (1964)
27. Sgall, P.: Generativní popis jazyka a česká deklinace [Generative Description of Language and Czech Declension]. Academia, Prague (1967)
28. Vajjala, S., Loo, K.: Automatic CEFR level prediction for Estonian learner text. In: Proceedings of the Third Workshop on NLP for Computer-Assisted Language Learning, pp. 113–127 (2014)
29. Volodina, E., Pilán, I., Alfter, D.: Classification of Swedish learner essays by CEFR levels. In: CALL Communities and Culture-Short Papers from EUROCALL 2016, pp. 456–461 (2016)
30. Žabokrtský, Z.: Treex - an open-source framework for natural language processing. In: Information Technologies - Applications and Theory, vol. 788, pp. 7–14. Univerzita Pavla Jozefa Šafárika v Košiciach, Košice (2011)
31. Zesch, T., Wojatzki, M., Scholten-Akoun, D.: Task-independent features for automated essay grading. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 224–232 (2015)
32. Zikánová, Š., et al.: Discourse and coherence. From the sentence structure to relations in text. Studies in Computational and Theoretical Linguistics, ÚFAL, Praha (2015)
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Novák, M., Mírovský, J., Rysová, K., Rysová, M. (2018). Topic–Focus Articulation: A Third Pillar of Automatic Evaluation of Text Coherence. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_8
DOI: https://doi.org/10.1007/978-3-030-04497-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04496-1
Online ISBN: 978-3-030-04497-8
eBook Packages: Computer Science (R0)