Topic–Focus Articulation: A Third Pillar of Automatic Evaluation of Text Coherence

Novák, Michal; Mírovský, Jiří; Rysová, Kateřina; Rysová, Magdaléna

doi:10.1007/978-3-030-04497-8_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

We present a feature-rich system for automatic evaluation of surface text coherence in Czech essays written by native and non-native speakers. In addition to basic features covering spelling, vocabulary, morphology and syntax, the EVALD system stands on two main pillars representing features closely related to surface coherence: discourse relations and coreference. We now add a third pillar: features targeting topic–focus articulation (sentence information structure). To this end, we propose and implement a procedure that discloses topic–focus articulation by automatically marking contextual boundness in the text. The experiments show that EVALD enriched with topic–focus articulation features outperforms the original system. Further experiments show that the system for essays written by non-native speakers behaves differently from the system for native speakers with respect to the importance of individual feature sets and the size of the training data.
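As a rough illustration of what features derived from contextual boundness might look like, the following minimal sketch computes a few document-level ratios from words already labelled for boundness. The labels follow the Prague Dependency Treebank convention (`t` for contextually bound, `c` for contrastive contextually bound, `f` for contextually non-bound). The function name `tfa_features`, the three feature names and the toy essay are hypothetical and are not taken from EVALD; the actual feature set may well differ.

```python
from collections import Counter

# PDT-style contextual boundness labels (assumed input format):
# "t" = contextually bound, "c" = contrastive contextually bound,
# "f" = contextually non-bound.
BOUND = {"t", "c"}


def tfa_features(sentences):
    """Compute illustrative TFA ratios.

    sentences: list of sentences, each a list of (word, label) pairs
    in surface word order.
    """
    counts = Counter(label for sent in sentences for _, label in sent)
    total = sum(counts.values())
    non_empty = [sent for sent in sentences if sent]
    # Number of sentences that open with a contextually bound word.
    bound_first = sum(1 for sent in non_empty if sent[0][1] in BOUND)
    return {
        "bound_ratio": (counts["t"] + counts["c"]) / total if total else 0.0,
        "contrastive_ratio": counts["c"] / total if total else 0.0,
        "bound_first_ratio": bound_first / len(non_empty) if non_empty else 0.0,
    }


# Hypothetical two-sentence essay with hand-assigned labels.
essay = [
    [("Petr", "f"), ("koupil", "f"), ("auto", "f")],
    [("Auto", "t"), ("bylo", "f"), ("drahé", "f")],
]
features = tfa_features(essay)
```

Ratios like these could be appended to the other feature groups before training a classifier; whether they help would have to be checked experimentally.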

Supported by the Ministry of Culture of the Czech Republic (project No. DG16P02B016 Automatic Evaluation of Text Coherence in Czech). This work has used language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Notes

1. The Prague Dependency Treebank (for its latest version, 3.5, see [6]) is a corpus of Czech newspaper texts (containing almost 50 thousand sentences) with multi-layer annotation in the theoretical framework of the Functional Generative Description, see [26, 27]. It contains manual annotation on the morphological, surface syntactic and deep semantico-syntactic (tectogrammatical) layers. On top of the dependency trees of the tectogrammatical layer, the PDT also contains manual annotation of coreference, discourse relations (including annotation of discourse connectives), and topic–focus articulation.
2. As a theoretical background for capturing discourse relations in text, the approach described in [19] is used. It is based on the approach used for the annotation of the Penn Discourse Treebank 2.0 (PDTB; [20]). Both approaches are lexically based and aim at capturing local discourse relations between clauses, sentences, or short spans of text.
3. Weka toolkit ver. 3.8.0, downloaded from http://www.cs.waikato.ac.nz/ml/weka/.
4. A similar phenomenon based on the Centering Theory [9] and its theoretical contribution to the evaluation of text coherence (especially of so-called “rough shifts”) was studied in [12].
5. The topic/focus categories in general do not need to overlap with given/new information (see [5]).
6. With some exceptions, see below.
7. Some types of tectogrammatical nodes, such as coordinating nodes, are not TFA-relevant, i.e. not annotated as contextually bound or non-bound.
8. All data in the Prague Dependency Treebank have been manually annotated for contextual boundness, see [18].
9. In [16], they distinguish pronoun and coreference features. Here, we denote the union of these two sets as coreference features.
10. Apart from L1 texts, they also evaluate the system on L2 essays manually annotated with a mark for “text organization”, though not as a classification task but again only as pairwise ranking.

References

1. Boyd, A., et al.: The MERLIN corpus: learner language and the CEFR. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 1281–1288. European Language Resources Association, Reykjavík (2014)
2. Feng, V.W., Lin, Z., Hirst, G.: The impact of deep hierarchical discourse structures in the evaluation of text coherence. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 940–949 (2014)
3. Hajičová, E., Havelka, J., Veselá, K.: Corpus evidence of contextual boundness and focus. In: Danielsson, P. (ed.) Proceedings of the Corpus Linguistics Conference Series, vol. 1, pp. 1–9. University of Birmingham, Birmingham (2005)
4. Hajičová, E., Sgall, P., Partee, B.: Topic-Focus Articulation, Tripartite Structures, and Semantic Content. Kluwer, Dordrecht (1998). ISBN 0-7923-5289-0
5. Hajičová, E., Mírovský, J.: Topic/focus vs. given/new: information structure and coreference relations in an annotated corpus. In: 51st Annual Meeting of the Societas Linguistica Europaea, Book of Abstracts, Tallinn, Estonia (in press)
6. Hajič, J., et al.: Prague Dependency Treebank 3.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University (2018). http://hdl.handle.net/11234/1-2621
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The Weka data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)
8. Hancke, J., Meurers, D.: Exploring CEFR classification for German based on rich linguistic modeling. Learner Corpus Research, pp. 54–56 (2013)
9. Joshi, A.K., Weinstein, S.: Control of inference: role of some aspects of discourse structure-centering. In: IJCAI, pp. 385–387 (1981)
10. Lin, Z., Ng, H.T., Kan, M.Y.: Automatically evaluating text coherence using discourse relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 997–1006. Association for Computational Linguistics (2011)
11. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text-Interdisc. J. Study Discourse 8(3), 243–281 (1988)
12. Miltsakaki, E., Kukich, K.: Evaluation of text coherence for electronic essay scoring systems. Nat. Lang. Eng. 10(1), 25–55 (2004)
13. Mírovský, J., Rysová, K., Rysová, M., Hajičová, E.: (Pre-)annotation of topic-focus articulation in Prague Czech-English Dependency Treebank. In: Proceedings of the 6th International Joint Conference on Natural Language Processing, pp. 55–63. Asian Federation of Natural Language Processing, Nagoya (2013)
14. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0, data/software. ÚFAL MFF UK, Prague, Czechia (2017)
15. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0 for Foreigners, data/software. ÚFAL MFF UK, Prague, Czechia (2017)
16. Novák, M., Rysová, K., Rysová, M., Mírovský, J.: Incorporating coreference to automatic evaluation of coherence in essays. In: Camelin, N., Estève, Y., Martín-Vide, C. (eds.) SLSP 2017. LNCS (LNAI), vol. 10583, pp. 58–69. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68456-7_5
17. Östling, R., Smolentzov, A., Hinnerich, B.T., Höglin, E.: Automated essay scoring for Swedish. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 42–47 (2013)
18. Panevová, J., Böhmová, A., Hajičová, E., Sgall, P., Ceplová, M., Řezníčková, V.: A manual for tectogrammatical tagging of the Prague Dependency Treebank. Technical report TR-2000-09 (2000)
19. Poláková, L., Mírovský, J., Nedoluzhko, A., Jínová, P., Zikánová, Š., Hajičová, E.: Introducing the Prague Discourse Treebank 1.0. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 91–99. Asian Federation of Natural Language Processing, Nagoya (2013)
20. Prasad, R., et al.: The Penn Discourse Treebank 2.0. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pp. 2961–2968. European Language Resources Association, Marrakech (2008)
21. Rysová, K., Mírovský, J., Hajičová, E.: On an apparent freedom of Czech word order. A case study. In: 14th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 93–105. IPIPAN, Warszawa (2015)
22. Rysová, K., Rysová, M., Mírovský, J.: Automatic evaluation of surface coherence in L2 texts in Czech. In: Proceedings of the 28th Conference on Computational Linguistics and Speech Processing ROCLING XXVIII (2016), pp. 214–228. National Cheng Kung University, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taipei (2016)
23. Rysová, K., Rysová, M., Mírovský, J., Novák, M.: Introducing EVALD - software applications for automatic evaluation of discourse in Czech. In: Angelova, G., Boncheva, K., Mitkov, R., Nikolova, I., Temnikova, I. (eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 634–641. Bulgarian Academy of Sciences, INCOMA Ltd., Šumen (2017)
24. Šebesta, K., Bedřichová, Z., Šormová, K., et al.: AKCES 5 (CzeSL-SGT), data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2014)
25. Šebesta, K., Goláňová, H., Letafková, J., et al.: AKCES 1, data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2016)
26. Sgall, P.: Generativní systémy v lingvistice [Generative systems in linguistics]. Slovo a slovesnost 25(4), 274–282 (1964)
27. Sgall, P.: Generativní popis jazyka a česká deklinace [Generative Description of Language and Czech Declension]. Academia, Prague (1967)
28. Vajjala, S., Loo, K.: Automatic CEFR level prediction for Estonian learner text. In: Proceedings of the Third Workshop on NLP for Computer-Assisted Language Learning, pp. 113–127 (2014)
29. Volodina, E., Pilán, I., Alfter, D.: Classification of Swedish learner essays by CEFR levels. In: CALL Communities and Culture-Short Papers from EUROCALL 2016, pp. 456–461 (2016)
30. Žabokrtský, Z.: Treex - an open-source framework for natural language processing. In: Information Technologies - Applications and Theory, vol. 788, pp. 7–14. Univerzita Pavla Jozefa Šafárika v Košiciach, Košice (2011)
31. Zesch, T., Wojatzki, M., Scholten-Akoun, D.: Task-independent features for automated essay grading. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 224–232 (2015)
32. Zikánová, Š., et al.: Discourse and coherence. From the sentence structure to relations in text. Studies in Computational and Theoretical Linguistics, ÚFAL, Praha (2015)

Author information

Authors and Affiliations

Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Malostranské náměstí 25, 11800, Prague 1, Czech Republic
Michal Novák, Jiří Mírovský, Kateřina Rysová & Magdaléna Rysová

Corresponding author

Correspondence to Michal Novák.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Ildar Batyrshin
Universidad Panamericana, Mexico City, Mexico
María de Lourdes Martínez-Villaseñor
Faculty of Engineering, Universidad Panamericana, Mexico City, Mexico
Hiram Eredín Ponce Espinosa

Cite this paper

Novák, M., Mírovský, J., Rysová, K., Rysová, M. (2018). Topic–Focus Articulation: A Third Pillar of Automatic Evaluation of Text Coherence. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_8

DOI: https://doi.org/10.1007/978-3-030-04497-8_8
Published: 03 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04496-1
Online ISBN: 978-3-030-04497-8
eBook Packages: Computer Science, Computer Science (R0)
