Abstract
This paper presents an approach for evaluating coherence in Chinese middle school student essays, addressing the challenges of time-consuming and inconsistent essay assessment. Previous approaches focused on linguistic features, but coherence, crucial for essay organization, has received less attention. Recent works utilized neural networks, such as CNNs, LSTMs, and transformers, achieving good performance with labeled data. However, labeling coherence manually is costly and time-consuming. To address this, we propose a method that pretrains RoBERTa with whole word masking (WWM) on a low-resource dataset of middle school essays, followed by finetuning for coherence evaluation. The WWM pretraining is unsupervised and captures general characteristics of the essays, adding little cost in the low-resource setting. Experimental results on Chinese essays demonstrate that this strategy improves coherence evaluation compared to naive finetuning on limited data. We also explore variants of our method, including pseudo labeling and additional neural networks, providing insights into potential performance trade-offs. The contributions of this work include the collection and curation of a substantial dataset, the proposal of a cost-effective pretraining method, and the exploration of alternative approaches for future research.
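The whole word masking objective described in the abstract can be sketched in a few lines. The sketch below is illustrative, not the paper's exact setup: it assumes the sentence is already word-segmented (e.g. by a Chinese word segmenter) and uses an arbitrary mask rate; when any character of a word is selected, every character of that word is masked, and the original characters become the prediction targets.

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Whole word masking over a pre-segmented sentence.

    If a word is chosen for masking, ALL of its characters are
    replaced by [MASK]; the original characters are returned as
    labels so a masked-language model can predict them.
    """
    rng = rng or random.Random(0)
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if rng.random() < mask_prob:
            # whole-word masking: mask every character of the word
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)                # MLM targets
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))  # not predicted
    return tokens, labels
```

In contrast to character-level masking, a word like 连贯 ("coherence") is always masked as a unit, which forces the model to recover whole words from context rather than completing a half-masked word from its visible character.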
Acknowledgement
This work is supported by the National Natural Science Foundation of China (62076008) and the Key Project of Natural Science Foundation of China (61936012).
Appendix
While working to improve overall accuracy, we also experimented with several new model architectures, including the Cross Task Grader model described below.
1.1 PFT+HAN
We propose a multi-layer coherence evaluation model, depicted in Fig. 1, which first uses pre-trained RoBERTa to extract features from the essays, followed by an attention pooling layer. We then concatenate punctuation-level embeddings with the paragraph representations and pass them through another attention pooling layer. Finally, a classifier produces the coherence score.
Pre-trained Encoder. A sequence of words \(s_i=\{w_1,w_2,\ldots ,w_m\}\) is encoded with the pre-trained RoBERTa, yielding hidden states \(h_1,\ldots ,h_m\).
Paragraph Representation Layer. An attention pooling layer applied to the output of the pre-trained encoder layer is designed to capture the paragraph representations and is defined as follows:

\(m_i = \tanh (W_m h_i)\), \(u_i = \mathrm{softmax}_i (w_u^{\top } m_i)\), \(p = \sum _{i} u_i h_i\)

where \(h_i\) is the encoder output for the i-th word, \(W_m\) is a weight matrix, \(w_u\) is a weight vector, \(m_i\) is the attention vector for the i-th word, \(u_i\) is the attention weight for the i-th word, and \(p\) is the paragraph representation.
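A minimal NumPy sketch of this attention pooling step follows; the dimensions and random initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def attention_pool(H, W_m, w_u):
    """Attention pooling over a sequence of hidden states.

    H   : (m, d)  encoder outputs, one row per word
    W_m : (d, d)  weight matrix producing attention vectors m_i
    w_u : (d,)    weight vector scoring each attention vector
    Returns the pooled paragraph representation p of shape (d,).
    """
    M = np.tanh(H @ W_m)               # attention vectors m_i
    scores = M @ w_u                   # unnormalized scores
    u = np.exp(scores - scores.max())  # numerically stable softmax
    u = u / u.sum()                    # attention weights u_i
    return u @ H                       # p = sum_i u_i * h_i
```

Because the weights \(u_i\) are a softmax, the output is a convex combination of the word vectors: words the scorer deems important dominate the paragraph representation.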
Essay Representation Layer. We incorporate punctuation representations to enhance the model's performance. We encode the punctuation information of each paragraph to obtain a punctuation representation \(pu_i\), and concatenate it with the content representation \(p_i\) of each paragraph:

\(c_i = [p_i; pu_i]\)

where \(c_i\) is the representation of the concatenated i-th paragraph. Next, we use another attention pooling layer to obtain the representation of the entire essay, defined as follows:

\(a_i = \tanh (W_a c_i)\), \(v_i = \mathrm{softmax}_i (w_v^{\top } a_i)\), \(E = \sum _{i} v_i c_i\)

where \(W_a\) is a weight matrix, \(w_v\) is a weight vector, \(a_i\) is the attention vector for the i-th paragraph, \(v_i\) is the attention weight for the i-th paragraph, and \(E\) is the essay representation.
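The essay-level step can be sketched the same way; here the paragraph and punctuation dimensions are hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

def essay_representation(P, PU, W_a, w_v):
    """Second-level attention pooling over paragraphs.

    P   : (n, d_p)  content representations p_i of the n paragraphs
    PU  : (n, d_u)  punctuation representations pu_i
    W_a : (d_p + d_u, k)  weight matrix producing attention vectors a_i
    w_v : (k,)      weight vector scoring each attention vector
    Concatenates c_i = [p_i; pu_i], then pools the c_i into a
    single essay representation E of shape (d_p + d_u,).
    """
    C = np.concatenate([P, PU], axis=1)    # c_i = [p_i; pu_i]
    A = np.tanh(C @ W_a)                   # attention vectors a_i
    s = A @ w_v                            # unnormalized scores
    v = np.exp(s - s.max())
    v = v / v.sum()                        # attention weights v_i
    return v @ C                           # E = sum_i v_i * c_i
```

Punctuation enters only through the concatenation, so the pooling weights can attend to paragraphs whose punctuation patterns (e.g. run-on sentences) signal poor coherence.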
1.2 Cross Task Grader
We also used Multi-task Learning (MTL) in our experiments, as depicted in Fig. 2.
We used both the target data and some pseudo-labeled essays from various grades, and created a separate PFT+HAN model for each. To facilitate multi-task learning, we adopted the Hard Parameter Sharing approach, sharing the pre-trained encoder layer and the first attention pooling layer among all the models. Additionally, we added a cross attention layer before the classifier.
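Hard parameter sharing amounts to every task model holding the same encoder object, while each keeps its own head. The structural sketch below uses hypothetical class and method names; the placeholder `encode` stands in for the shared RoBERTa encoder and first attention pooling layer.

```python
class SharedEncoder:
    """Stands in for the shared pre-trained encoder plus the first
    attention pooling layer; every task model holds the SAME instance,
    so its parameters receive gradients from all tasks."""
    def encode(self, essay: str):
        # placeholder: a real model would return an essay representation
        return [float(len(essay))]

class TaskModel:
    """One PFT+HAN model per task (target grade or pseudo-labeled set)."""
    def __init__(self, shared: SharedEncoder, name: str):
        self.shared = shared   # hard parameter sharing: not a copy
        self.name = name       # task-specific classifier would go here

def build_cross_task_grader(task_names):
    shared = SharedEncoder()
    return [TaskModel(shared, n) for n in task_names]
```

Sharing the lower layers lets the limited target data benefit from the pseudo-labeled essays, while the per-task heads keep grade-specific scoring criteria separate.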
Cross Attention Layer. After obtaining the essay representations, we add a cross attention layer to learn the connections between different essays, defined as follows:

\(\alpha ^{i}_{j} = \mathrm{softmax}_j (E_i^{\top } A_j)\), \(P_i = \sum _{j} \alpha ^{i}_{j} A_j\), \(y_i = [E_i; P_i]\)

where \(A = [E_1, E_2, \ldots , E_N]\) is the concatenation of the representations for each task and \(\alpha ^{i}_{j}\) is the attention weight of task i over task j. We then calculate the attention vector \(P_i\) as the weighted sum of the \(A_j\), and the final representation \(y_i\) is the concatenation of \(E_i\) and \(P_i\).
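A NumPy sketch of this cross attention layer follows. The dot-product similarity used for the scores is an assumption for illustration; the layer's exact scoring function may differ.

```python
import numpy as np

def cross_attention(E):
    """Cross attention over task representations.

    E : (N, d)  essay representation E_i from each of the N task models.
    For each task i, attends over all tasks j with softmax-normalized
    dot-product scores, producing P_i = sum_j alpha_ij * E_j, and
    returns y_i = [E_i; P_i], stacked into an (N, 2d) array.
    """
    scores = E @ E.T                                   # (N, N) similarities
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # attention weights
    P = alpha @ E                                      # attention vectors P_i
    return np.concatenate([E, P], axis=1)              # y_i = [E_i; P_i]
```

Concatenating \(E_i\) with \(P_i\) preserves each task's own representation while letting the classifier see how the essay relates to the other tasks' essays.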
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper
Wang, Z., Lee, S., Cai, Y., Wu, Y. (2023). Task-Related Pretraining with Whole Word Masking for Chinese Coherence Evaluation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14304. Springer, Cham. https://doi.org/10.1007/978-3-031-44699-3_28