Topic–Focus Articulation: A Third Pillar of Automatic Evaluation of Text Coherence

  • Conference paper
Advances in Computational Intelligence (MICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)

Abstract

We present a feature-rich system for automatic evaluation of surface text coherence in Czech essays written by native and non-native speakers. In addition to basic features covering spelling, vocabulary, morphology and syntax, the EVALD system stands on two main pillars, each representing features closely related to surface coherence: discourse relations and coreference. We now add a third pillar: features targeting topic–focus articulation (sentence information structure). To this end, we propose and implement a procedure that reveals topic–focus articulation by automatically marking contextual boundness in the text. The experiments show that EVALD enriched with topic–focus articulation features outperforms the original system. Further experiments show that the system for essays written by non-native speakers behaves differently from the system for native speakers with respect to the importance of individual feature sets and the size of the training data.

Supported by the Ministry of Culture of the Czech Republic (project No. DG16P02B016 Automatic Evaluation of Text Coherence in Czech). This work has used language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Notes

  1. The Prague Dependency Treebank (PDT; for its latest version, 3.5, see [6]) is a corpus of Czech newspaper texts (containing almost 50 thousand sentences) with multi-layer annotation in the theoretical framework of the Functional Generative Description, see [26, 27]. It contains manual annotation on the morphological, surface syntactic and deep semantico-syntactic (tectogrammatical) layers. On top of the dependency trees of the tectogrammatical layer, the PDT also contains manual annotation of coreference, discourse relations (including annotation of discourse connectives), and topic–focus articulation.

  2. As a theoretical background for capturing discourse relations in text, the approach described in [19] is used. It is based on the approach used for the annotation of the Penn Discourse Treebank 2.0 (PDTB; [20]). Both these approaches are lexically based and aim at capturing local discourse relations between clauses, sentences, or short spans of text.

  3. Weka toolkit ver. 3.8.0, downloaded from http://www.cs.waikato.ac.nz/ml/weka/.

  4. A similar phenomenon based on Centering Theory [9], and its theoretical contribution to the evaluation of text coherence (especially that of so-called “rough shifts”), was studied in [12].

  5. The topic/focus categories in general do not need to overlap with given/new information (see [5]).

  6. With some exceptions, see below.

  7. Some types of tectogrammatical nodes, such as coordinating nodes, are not TFA-relevant, i.e. they are not annotated as contextually bound or non-bound.

  8. All data in the Prague Dependency Treebank have been manually annotated for contextual boundness, see [18].

  9. The authors of [16] distinguish pronoun features and coreference features. Here, we denote the union of these two sets as coreference features.

  10. Apart from L1 texts, they also evaluate the system on L2 essays manually annotated with a mark for “text organization”; however, this is again done only as pairwise ranking, not as a classification task.

References

  1. Boyd, A., et al.: The MERLIN corpus: learner language and the CEFR. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 1281–1288. European Language Resources Association, Reykjavík (2014)

  2. Feng, V.W., Lin, Z., Hirst, G.: The impact of deep hierarchical discourse structures in the evaluation of text coherence. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 940–949 (2014)

  3. Hajičová, E., Havelka, J., Veselá, K.: Corpus evidence of contextual boundness and focus. In: Danielsson, P. (ed.) Proceedings of the Corpus Linguistics Conference Series, vol. 1, pp. 1–9. University of Birmingham, Birmingham (2005)

  4. Hajičová, E., Sgall, P., Partee, B.: Topic-Focus Articulation, Tripartite Structures, and Semantic Content. Kluwer, Dordrecht (1998). ISBN 0-7923-5289-0

  5. Hajičová, E., Mírovský, J.: Topic/focus vs. given/new: information structure and coreference relations in an annotated corpus. In: 51st Annual Meeting of the Societas Linguistica Europaea, Book of Abstracts, Tallinn, Estonia (in press)

  6. Hajič, J., et al.: Prague Dependency Treebank 3.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University (2018). http://hdl.handle.net/11234/1-2621

  7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The Weka data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)

  8. Hancke, J., Meurers, D.: Exploring CEFR classification for German based on rich linguistic modeling. Learner Corpus Research, pp. 54–56 (2013)

  9. Joshi, A.K., Weinstein, S.: Control of inference: role of some aspects of discourse structure-centering. In: IJCAI, pp. 385–387 (1981)

  10. Lin, Z., Ng, H.T., Kan, M.Y.: Automatically evaluating text coherence using discourse relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 997–1006. Association for Computational Linguistics (2011)

  11. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text-Interdisc. J. Study Discourse 8(3), 243–281 (1988)

  12. Miltsakaki, E., Kukich, K.: Evaluation of text coherence for electronic essay scoring systems. Nat. Lang. Eng. 10(1), 25–55 (2004)

  13. Mírovský, J., Rysová, K., Rysová, M., Hajičová, E.: (Pre-)annotation of topic-focus articulation in Prague Czech-English Dependency Treebank. In: Proceedings of the 6th International Joint Conference on Natural Language Processing, pp. 55–63. Asian Federation of Natural Language Processing, Nagoya (2013)

  14. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0, data/software. ÚFAL MFF UK, Prague, Czechia (2017)

  15. Novák, M., Rysová, K., Mírovský, J., Rysová, M., Hajičová, E.: EVALD 2.0 for Foreigners, data/software. ÚFAL MFF UK, Prague, Czechia (2017)

  16. Novák, M., Rysová, K., Rysová, M., Mírovský, J.: Incorporating coreference to automatic evaluation of coherence in essays. In: Camelin, N., Estève, Y., Martín-Vide, C. (eds.) SLSP 2017. LNCS (LNAI), vol. 10583, pp. 58–69. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68456-7_5

  17. Östling, R., Smolentzov, A., Hinnerich, B.T., Höglin, E.: Automated essay scoring for Swedish. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 42–47 (2013)

  18. Panevová, J., Böhmová, A., Hajičová, E., Sgall, P., Ceplová, M., Řezníčková, V.: A manual for tectogrammatical tagging of the Prague Dependency Treebank. Technical report TR-2000-09 (2000)

  19. Poláková, L., Mírovský, J., Nedoluzhko, A., Jínová, P., Zikánová, Š., Hajičová, E.: Introducing the Prague Discourse Treebank 1.0. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 91–99. Asian Federation of Natural Language Processing, Nagoya (2013)

  20. Prasad, R., et al.: The Penn Discourse Treebank 2.0. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pp. 2961–2968. European Language Resources Association, Marrakech (2008)

  21. Rysová, K., Mírovský, J., Hajičová, E.: On an apparent freedom of Czech word order. A case study. In: 14th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 93–105. IPIPAN, Warszawa (2015)

  22. Rysová, K., Rysová, M., Mírovský, J.: Automatic evaluation of surface coherence in L2 texts in Czech. In: Proceedings of the 28th Conference on Computational Linguistics and Speech Processing ROCLING XXVIII (2016), pp. 214–228. National Cheng Kung University, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taipei (2016)

  23. Rysová, K., Rysová, M., Mírovský, J., Novák, M.: Introducing EVALD - software applications for automatic evaluation of discourse in Czech. In: Angelova, G., Boncheva, K., Mitkov, R., Nikolova, I., Temnikova, I. (eds.) Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 634–641. Bulgarian Academy of Sciences, INCOMA Ltd., Šumen (2017)

  24. Šebesta, K., Bedřichová, Z., Šormová, K., et al.: AKCES 5 (CzeSL-SGT), data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2014)

  25. Šebesta, K., Goláňová, H., Letafková, J., et al.: AKCES 1, data/software. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic (2016)

  26. Sgall, P.: Generativní systémy v lingvistice [Generative systems in linguistics]. Slovo a slovesnost 25(4), 274–282 (1964)

  27. Sgall, P.: Generativní popis jazyka a česká deklinace [Generative Description of Language and Czech Declension]. Academia, Prague (1967)

  28. Vajjala, S., Loo, K.: Automatic CEFR level prediction for Estonian learner text. In: Proceedings of the Third Workshop on NLP for Computer-Assisted Language Learning, pp. 113–127 (2014)

  29. Volodina, E., Pilán, I., Alfter, D.: Classification of Swedish learner essays by CEFR levels. In: CALL Communities and Culture-Short Papers from EUROCALL 2016, pp. 456–461 (2016)

  30. Žabokrtský, Z.: Treex - an open-source framework for natural language processing. In: Information Technologies - Applications and Theory, vol. 788, pp. 7–14. Univerzita Pavla Jozefa Šafárika v Košiciach, Košice (2011)

  31. Zesch, T., Wojatzki, M., Scholten-Akoun, D.: Task-independent features for automated essay grading. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 224–232 (2015)

  32. Zikánová, Š., et al.: Discourse and coherence. From the sentence structure to relations in text. Studies in Computational and Theoretical Linguistics, ÚFAL, Praha (2015)

Author information

Corresponding author

Correspondence to Michal Novák.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Novák, M., Mírovský, J., Rysová, K., Rysová, M. (2018). Topic–Focus Articulation: A Third Pillar of Automatic Evaluation of Text Coherence. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_8

  • DOI: https://doi.org/10.1007/978-3-030-04497-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04496-1

  • Online ISBN: 978-3-030-04497-8

  • eBook Packages: Computer Science, Computer Science (R0)
