Abstract
This paper presents an approach for evaluating coherence in Chinese middle school student essays, addressing the challenges of time-consuming and inconsistent essay assessment. Previous approaches focused on linguistic features, but coherence, crucial for essay organization, has received less attention. Recent works utilized neural networks, such as CNNs, LSTMs, and transformers, achieving good performance with labeled data. However, labeling coherence manually is costly and time-consuming. To address this, we propose a method that pretrains RoBERTa with whole word masking (WWM) on a low-resource dataset of middle school essays, followed by finetuning for coherence evaluation. The WWM pretraining is unsupervised and captures general characteristics of the essays, adding little cost in the low-resource setting. Experimental results on Chinese essays demonstrate that this strategy improves coherence evaluation compared to naive finetuning on limited data. We also explore variants of our method, including pseudo labeling and additional neural networks, providing insights into potential performance trade-offs. The contributions of this work include the collection and curation of a substantial dataset, the proposal of a cost-effective pretraining method, and the exploration of alternative approaches for future research.
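The whole word masking objective described in the abstract can be sketched in a few lines. The sketch below is illustrative, not the paper's exact setup: it assumes the sentence is already word-segmented (e.g. by a Chinese word segmenter) and uses an arbitrary mask rate; when any character of a word is selected, every character of that word is masked, and the original characters become the prediction targets.

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Whole word masking over a pre-segmented sentence.

    If a word is chosen for masking, ALL of its characters are
    replaced by [MASK]; the original characters are returned as
    labels so a masked-language model can predict them.
    """
    rng = rng or random.Random(0)
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if rng.random() < mask_prob:
            # whole-word masking: mask every character of the word
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)                # MLM targets
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))  # not predicted
    return tokens, labels
```

In contrast to character-level masking, a word like 连贯 ("coherence") is always masked as a unit, which forces the model to recover whole words from context rather than completing a half-masked word from its visible character.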
Acknowledgement
This work is supported by the National Natural Science Foundation of China (62076008) and the Key Project of Natural Science Foundation of China (61936012).
Appendix
While working to improve overall accuracy, we also experimented with several new model architectures, including the Cross Task Grader model described below.
1.1 PFT+HAN
We propose a multi-layer coherence evaluation model, depicted in Fig. 1, which first uses pre-trained RoBERTa to extract features from the essays, followed by an attention pooling layer. We then concatenate punctuation-level embeddings with the paragraph representations and pass them through another attention pooling layer. Finally, a classifier produces the coherence score.
Pre-trained Encoder. A sequence of words \(s_i=\{w_1,w_2,\ldots ,w_m\}\) is encoded with the pre-trained RoBERTa, yielding hidden states \(h_1,\ldots ,h_m\).
Paragraph Representation Layer. An attention pooling layer applied to the output of the pre-trained encoder layer is designed to capture the paragraph representations and is defined as follows:

\(m_i = \tanh (W_m h_i)\), \(u_i = \mathrm{softmax}_i (w_u^{\top } m_i)\), \(p = \sum _{i} u_i h_i\)

where \(h_i\) is the encoder output for the i-th word, \(W_m\) is a weight matrix, \(w_u\) is a weight vector, \(m_i\) is the attention vector for the i-th word, \(u_i\) is the attention weight for the i-th word, and \(p\) is the paragraph representation.
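A minimal NumPy sketch of this attention pooling step follows; the dimensions and random initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def attention_pool(H, W_m, w_u):
    """Attention pooling over a sequence of hidden states.

    H   : (m, d)  encoder outputs, one row per word
    W_m : (d, d)  weight matrix producing attention vectors m_i
    w_u : (d,)    weight vector scoring each attention vector
    Returns the pooled paragraph representation p of shape (d,).
    """
    M = np.tanh(H @ W_m)               # attention vectors m_i
    scores = M @ w_u                   # unnormalized scores
    u = np.exp(scores - scores.max())  # numerically stable softmax
    u = u / u.sum()                    # attention weights u_i
    return u @ H                       # p = sum_i u_i * h_i
```

Because the weights \(u_i\) are a softmax, the output is a convex combination of the word vectors: words the scorer deems important dominate the paragraph representation.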
Essay Representation Layer. We incorporate punctuation representations to enhance the model's performance. We encode the punctuation information of each paragraph to obtain a punctuation representation \(pu_i\), and concatenate it with the content representation \(p_i\) of each paragraph:

\(c_i = [p_i; pu_i]\)

where \(c_i\) is the representation of the concatenated i-th paragraph. Next, we use another attention pooling layer to obtain the representation of the entire essay, defined as follows:

\(a_i = \tanh (W_a c_i)\), \(v_i = \mathrm{softmax}_i (w_v^{\top } a_i)\), \(E = \sum _{i} v_i c_i\)

where \(W_a\) is a weight matrix, \(w_v\) is a weight vector, \(a_i\) is the attention vector for the i-th paragraph, \(v_i\) is the attention weight for the i-th paragraph, and \(E\) is the essay representation.
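The essay-level step can be sketched the same way; here the paragraph and punctuation dimensions are hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

def essay_representation(P, PU, W_a, w_v):
    """Second-level attention pooling over paragraphs.

    P   : (n, d_p)  content representations p_i of the n paragraphs
    PU  : (n, d_u)  punctuation representations pu_i
    W_a : (d_p + d_u, k)  weight matrix producing attention vectors a_i
    w_v : (k,)      weight vector scoring each attention vector
    Concatenates c_i = [p_i; pu_i], then pools the c_i into a
    single essay representation E of shape (d_p + d_u,).
    """
    C = np.concatenate([P, PU], axis=1)    # c_i = [p_i; pu_i]
    A = np.tanh(C @ W_a)                   # attention vectors a_i
    s = A @ w_v                            # unnormalized scores
    v = np.exp(s - s.max())
    v = v / v.sum()                        # attention weights v_i
    return v @ C                           # E = sum_i v_i * c_i
```

Punctuation enters only through the concatenation, so the pooling weights can attend to paragraphs whose punctuation patterns (e.g. run-on sentences) signal poor coherence.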
1.2 Cross Task Grader
We also used Multi-task Learning (MTL) in our experiments, as depicted in Fig. 2.
We used both the target data and some pseudo-labeled essays from various grades, and created a separate PFT+HAN model for each. To facilitate multi-task learning, we adopted the Hard Parameter Sharing approach, sharing the pre-trained encoder layer and the first attention pooling layer among all the models. Additionally, we added a cross attention layer before the classifier.
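Hard parameter sharing amounts to every task model holding the same encoder object, while each keeps its own head. The structural sketch below uses hypothetical class and method names; the placeholder `encode` stands in for the shared RoBERTa encoder and first attention pooling layer.

```python
class SharedEncoder:
    """Stands in for the shared pre-trained encoder plus the first
    attention pooling layer; every task model holds the SAME instance,
    so its parameters receive gradients from all tasks."""
    def encode(self, essay: str):
        # placeholder: a real model would return an essay representation
        return [float(len(essay))]

class TaskModel:
    """One PFT+HAN model per task (target grade or pseudo-labeled set)."""
    def __init__(self, shared: SharedEncoder, name: str):
        self.shared = shared   # hard parameter sharing: not a copy
        self.name = name       # task-specific classifier would go here

def build_cross_task_grader(task_names):
    shared = SharedEncoder()
    return [TaskModel(shared, n) for n in task_names]
```

Sharing the lower layers lets the limited target data benefit from the pseudo-labeled essays, while the per-task heads keep grade-specific scoring criteria separate.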
Cross Attention Layer. After obtaining the essay representations, we add a cross attention layer to learn the connections between different essays, defined as follows:

\(\alpha ^{i}_{j} = \mathrm{softmax}_j (E_i^{\top } A_j)\), \(P_i = \sum _{j} \alpha ^{i}_{j} A_j\), \(y_i = [E_i; P_i]\)

where \(A = [E_1, E_2, \ldots , E_N]\) is the concatenation of the representations for each task and \(\alpha ^{i}_{j}\) is the attention weight of task i over task j. We then calculate the attention vector \(P_i\) as the weighted sum of the \(A_j\), and the final representation \(y_i\) is the concatenation of \(E_i\) and \(P_i\).
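A NumPy sketch of this cross attention layer follows. The dot-product similarity used for the scores is an assumption for illustration; the layer's exact scoring function may differ.

```python
import numpy as np

def cross_attention(E):
    """Cross attention over task representations.

    E : (N, d)  essay representation E_i from each of the N task models.
    For each task i, attends over all tasks j with softmax-normalized
    dot-product scores, producing P_i = sum_j alpha_ij * E_j, and
    returns y_i = [E_i; P_i], stacked into an (N, 2d) array.
    """
    scores = E @ E.T                                   # (N, N) similarities
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # attention weights
    P = alpha @ E                                      # attention vectors P_i
    return np.concatenate([E, P], axis=1)              # y_i = [E_i; P_i]
```

Concatenating \(E_i\) with \(P_i\) preserves each task's own representation while letting the classifier see how the essay relates to the other tasks' essays.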
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper
Wang, Z., Lee, S., Cai, Y., Wu, Y. (2023). Task-Related Pretraining with Whole Word Masking for Chinese Coherence Evaluation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14304. Springer, Cham. https://doi.org/10.1007/978-3-031-44699-3_28