Introduction

Automatic Readability Assessment (ARA) is a well-established area of research that seeks to automatically determine the level of difficulty a written text might pose to a reader. Given the importance of fully comprehending a text in countless situations, studies have been conducted from multiple perspectives and contexts, such as the health sector (Basch et al., 2020), education (Vajjala & Lučić, 2018) or even computer science (Scalabrino et al., 2018), among others.

Over the years, researchers have proposed diverse approaches to develop ARA models. Traditional readability formulas, such as Flesch-Kincaid and Gunning (see Tekfi, 1987 for a full review), have long been used in education-related areas and beyond (François & Miltsakaki, 2012; Martinc et al., 2021). More recently, in an attempt to capture a broader set of characteristics that may affect how easily a reader understands a text, researchers have explored feature-based Natural Language Processing (NLP) approaches with considerable success (Vajjala & Lučić, 2018; Bengoetxea et al., 2020). In the last couple of years, neural approaches have also entered the picture (Azpiazu & Pera, 2019; Schicchi et al., 2020). The results of the NLP approaches have shown great potential, but there is still ample room for exploration, as studies have so far mainly focused on English and the domain-adaptability of the models has not been thoroughly examined.

In this work, we seek to test the applicability of NLP approaches to ARA for educational material in Obligatory Secondary Education (ESO for its acronym in Spanish) in the Basque Autonomous Community (BAC). ESO encompasses a total of four grades and covers the ages between 12 and 16. In the majority of classrooms in the BAC, two official languages coexist, namely Basque (the minority language and main language of instruction) and Spanish (the majority language), together with English as a foreign language.

The Basque education curriculum promotes the implementation of the Integrated Treatment of Languages (Bikandi & Valls, 2008) and Project-based learning (Hung et al., 2008) approaches. In brief, this means that learners address challenges that involve a number of science, technology, engineering and mathematics (STEM) subjects and develop language skills in line with such challenges in all three languages. To properly implement and combine the teaching approaches, teachers are faced with the arduous task of gathering textual material about the topic of the project in three languages.

In this context, our ultimate goal is to create an ARA model for a multilingual context that will help teachers decide on the adequacy of a text for a particular secondary education student group. Specifically, we focus on predicting the readability level of STEM subject texts in Basque, Spanish and English, which remains a largely under-researched area. What is more, we aim to do so for complete documents (see Section 3).

To build a model based on NLP approaches, we require not only efficient learning algorithms but also annotated science text corpora. To the best of our knowledge, no publicly accessible, domain-specific, graded corpus exists in Basque, Spanish, or English at secondary education level. Therefore, as a first step toward our objective, we present the compilation process of annotated document-level corpora for Basque (BasqueARA), Spanish (Agrega2-Es) and English (Agrega2-En+), which will be released upon the acceptance of the paper. Given the nature of our context, the first two corpora consist of texts for native speakers while the English corpus comprises texts created for non-native learners. With regard to NLP learning techniques, we explore the behaviour of Machine Learning (ML) approaches using the Support Vector Machine (SVM) algorithm and Deep Learning (DL) approaches using transformer architectures (cf. Sections 2 and 4). Previous studies have identified SVM classification as the most effective among supervised machine learning algorithms, showcasing substantial improvements in classification results compared to other methods (Liu et al., 2010). Additionally, SVM demonstrates a strong ability to capture the inherent characteristics of the data (Bhavani & Kumar, 2021). On the other hand, DL approaches have also shown improved accuracy in readability tasks (Azpiazu & Pera, 2019; Imperial, 2021).

The remainder of this paper is structured as follows: Section 2 describes related research, from early work in ARA up to the newest DL methods; Section 3 presents the compilation process and characteristics of our Basque, Spanish and English educational science text corpora; Section 4 presents the main architectures used in the experiments and Section 5 the developed feature-based ML and DL models; Section 6 explores the generalizability of the models and, finally, Section 7 outlines the conclusions.

Related Work

Early research in ARA focused on the use of mathematical equations that calculated the readability of a text based on shallow features such as the number of words, sentences and difficult words, among others. The reliability of those formulas has been questioned despite their broad use (Davison & Kantor, 1982; Si & Callan, 2001; Vajjala & Meurers, 2012). Authors claim that the formulas ignore important aspects such as word order, content and purpose of the text (Klare, 1963). Another widely-accepted drawback of these formulas is that they are mostly developed for English (Vajjala, 2021), even though there have been a number of initiatives to create similar methods for other languages, such as the Fernández Huerta formula for Spanish (Fernández Huerta, 1959). Compared to English and even Spanish, research on Basque in this area emerged rather late, when NLP approaches were already being tested. For this reason, to our knowledge, no traditional formula has been developed for Basque.

Thanks to the advances in computational capacity and the development of ML approaches, researchers started exploring data-driven approaches for ARA as well. As the quality of the texts is thought to be one of the key factors determining the performance of a model, in recent years various corpora have been compiled and made available for further research in the area. Among the corpora developed for English, WeeBit (Vajjala & Meurers, 2012), OneStopEnglish (Vajjala & Lučić, 2018) and Newsela (Xu et al., 2015) have been commonly used. While WeeBit comprises educational articles targeting different age groups and school grades, OneStopEnglish and Newsela consist of news articles for adult learners of English as a second language and for children of different grade levels, respectively. Nadeem and Ostendorf (2018) presented a corpus consisting of texts from the Siyavula project, an educational initiative with the aim of creating natural sciences and mathematics materials for high school students. The compiled English corpus consists of short science texts from 44 science and 11 history and social sciences textbooks developed for native speakers of English. Recently, Crossley et al. (2022) introduced the CommonLit Ease of Readability (CLEAR) corpus, which comprises 5000 English informational and literature texts rated by humans.

The resources available for Spanish are scarcer. Azpiazu and Pera (2019) introduced the VikiWiki dataset, which comprises texts in 6 languages, including Spanish and English. In this case, the texts, which come from Vikidia and Wikipedia, were annotated as simple or complex rather than according to academic grades. Interestingly, Lee and Vajjala (2022) published the Spanish version of the Newsela corpus. For Basque, we were able to identify a single corpus, the one compiled by Gonzalez-Dios et al. (2014), consisting of texts classified as simple or complex.

In reference to learning algorithms, feature-based supervised approaches were the first to be explored. ARA has been treated as a regression problem (Feng et al., 2010), a classification task (Vajjala & Meurers, 2013), or a ranking problem (Xia et al., 2016), with a growing tendency to use SVM classifiers for text classification. In an experiment on the WeeBit corpus, Xia et al. (2016) showed that their SVM classifier obtained 80.3% accuracy under 5-fold cross-validation using traditional, lexico-syntactic, language modeling and discourse features. Vajjala and Lučić (2018) conducted a classification experiment by training an SVM using Sequential Minimal Optimization (SMO) on OneStopEnglish, and achieved an accuracy of 78.13%. Similarly, in a 3-level classification scenario also using SMO, Bengoetxea and Gonzalez-Dios (2021) obtained 90.09% accuracy for the OneStopEnglish test set when using the 50 most predictive features. Crossley et al. (2022) reported 0.726 RMSE (Root Mean Square Error) with a linear model using 107 linguistic features obtained from various extraction tools.

Even though feature-based models continue to be researched, the latest neural network-based approaches incorporate language models that obtain higher accuracy rates in ARA tasks (Lee et al., 2021). For example, predictions of textual embeddings such as those of the HAN and BERT models have been used as additional features in SVM models and evaluated on WeeBit and Newsela (Deutsch et al., 2020). Imperial (2021) explores the concatenation of BERT embeddings and handcrafted linguistic features to be used in ML algorithms for English and Filipino. A range of neural architectures have also been explored. Nadeem and Ostendorf (2018) test GRU and hierarchical RNN architectures on the WeeBit corpus and successfully train a paragraph-level ARA model on Siyavula. Azpiazu and Pera (2019) use multi-attentive RNNs on the VikiWiki dataset with an accuracy of 84.7%. Lee and Vajjala (2022) propose a neural pairwise ranking model and obtain a zero-shot cross-lingual ranking accuracy of over 80% for Spanish when trained on English data from Newsela.

Given this background, in this paper we explore the options that feature-based ML and DL approaches offer in our science educational context to work with Basque, Spanish and English corpora.

Scientific Discourse and Readability

From our review of ARA research, we infer that readability measurements have been used to assess the suitability of texts for specific academic grades with little to no adaptation to the characteristics of the texts involved in terms of genre, type or topic. Given the differences between general domain texts and scientific discourse, we wonder whether the generic approaches studied so far are indeed useful to evaluate the difficulty of science textbooks.

As posited by Franco Aixelá (2015) based on the principles outlined by Castellví (2004), scientific texts have a particular set of cognitive, linguistic and pragmatic characteristics. Just in terms of purpose, and therefore text type, scientific discourse seeks to communicate knowledge by demonstrating theories, proving hypotheses, and explaining objective phenomena (Shishkova & Popok, 1989).

In terms of linguistic features, which, we could argue, are the ones considered to a greater or lesser extent by ARA approaches, the difference lies in that, in contrast to general texts, scientific texts tend to have a more rigid structure, a higher presence of terminology that is less accessible the more specialised the field, a simpler syntax, and a consistently formal register (Franco Aixelá, 2015). Roméu Escobar (2002) presents an elaborate list of linguistic features of scientific texts that encompasses lexical, morphological and syntactic traits.

While counterintuitive at first sight, it is accepted that narrative texts, often found in everyday discourse and literature, are easier to understand than informative-argumentative texts, the core text types within scientific discourse (Pérez Zorrilla, 2005), even when the latter are syntactically simpler and more repetitive (Muñoz Calvo et al., 2013). This is mainly because scientific texts refer to and associate complex events that we cannot always relate to our life experiences, and they require a higher level of abstraction and linguistic competence (Guevara Benítez et al., 2015).

If we consider scientific articles, Plavén-Sigray et al. (2017) claim that the readability of scientific texts is steadily declining, the main reason being the increased use of scientific jargon. Ball (2017) adds that it is not only scientific jargon that makes scientific texts difficult to read, but also the use of multi-syllable everyday words that are difficult for the reader to process. Interestingly, Ehara (2022) points out that 10%-30% of the scientific texts available are not readable to intermediate ESL learners.

This being the case, and given that, as stated by Gutiérrez Rodilla (1998), understanding scientific discourse, in contrast to everyday discourse, requires high linguistic awareness and therefore specific training, authors such as Uribe (2007) have gone as far as proposing that science teachers are also language teachers (Sutton, 2003).

All in all, in STEM, scientific discourse is considered to be an essential part of the learning process and, as such, teachers are encouraged to use texts specific to the discipline in classrooms (Daugherty et al., 2017). Nevertheless, as mentioned, these texts can be difficult to process for different student profiles (Arfé et al., 2018). Yet, most readability studies carried out so far have focused on evaluating the suitability of science textbooks by assuming that the readability approaches used are accurate (Chiang-Soong & Yager, 1993; Gyasi, 2013; Nwafor et al., 2022; Hu et al., 2021), and little has been done to investigate their performance in this distinct scenario.

Corpora for Training Models

Training corpora have been identified as a key factor in ensuring the accuracy of readability models. However, to the best of our knowledge, no publicly available, domain-specific, graded corpus exists for science texts for secondary education in Basque, Spanish and English. This is not surprising, as the difficulty of finding corpora for readability assessment has been mentioned in previous studies (Petersen & Ostendorf, 2009). Therefore, our first aim was to compile an open, context-specific corpus that would allow us to test the performance of different NLP approaches to ARA for our intended educational setting.

In our search for adequate material for our experimental context, we encountered the Agrega2 project, a Spanish national initiative co-funded by the European Union (FEDER) aimed at creating a unified online repository of teaching materials that cover all educational stages and official languages involved in the Spanish education curriculum. The project repository hosts a wide range of materials (learning objects, teaching sequences, learning programs, etc.) developed by the autonomous communities within Spain and covers the different subjects and languages involved.

It is important to note that while classroom documents in Basque and Spanish are primarily directed at native speakers, materials in English are for non-native learners, following the main profile of students. Consequently, it is expected that materials for a particular grade will vary in textual complexity to fit the language competence of the learners (Xia et al., 2016), that is, for a particular grade, English texts are expected to display simpler vocabulary and structure than Spanish and Basque texts.

Interestingly, all materials in Agrega2 are labeled according to the grade for which they are intended, and thus constitute a reliable gold-standard source for ARA research. The Agrega2 repository served as our main source of texts: we extracted all classroom-ready texts for the four grades of ESO (covering ages between 12 and 16) that belonged to the category of natural sciences and that were accessible between September and November 2021. As we will see in the description of each corpus, however, the available volume of text was not always sufficient for our experiments, and therefore additional data was collected to extend the corpora. The vast majority of the material stored in the Agrega2 repository is free.

Regardless of the source of the texts, the procedure to prepare them for inclusion in our corpora was the same. As we are interested in providing teachers with a tool to identify the science texts that can be useful for the classroom, we limited the content of our corpus to materials with science reading passages and discarded the rest (exercises, for example). We define a document as a reading passage that is self-contained and useful to address a specific topic, definition, theory, fact, formula, or similar in a science classroom. The documents were obtained by dividing didactic units into sections guided by each unit's headings and subheadings, that is, we created a document with the text included under heading 1 up until subheading 1.1, then another with the text included within subheading 1.1 up until subheading 1.2, and so on. For deeper subheadings such as 1.1.1 and 1.1.2, the decision to merge or separate their text was made based on their length: if the sections were considerably short (one paragraph) and continued with a related topic, they were fused into a single document. Exceptions occurred where these rules could not be applied. As a final step, each document was assigned a level 1-4 according to the original ESO grade for which the material was developed. Note that we discarded duplicated documents and created separate files for each document and its metadata (grade, keywords, topic, subject and source).
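To make the segmentation procedure concrete, the sketch below (in Python) splits a plain-text didactic unit into documents guided by numbered headings and merges very short sections with the preceding one. The heading pattern and the length threshold are illustrative assumptions; the actual segmentation followed the manual rules described above.

```python
import re

HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")  # e.g. "1 Cells", "1.1 The nucleus"
MIN_WORDS = 80  # assumed threshold for a "considerably short" section

def split_unit(lines):
    """Split a didactic unit (list of text lines) into documents per (sub)heading."""
    docs, current = [], None
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            if current:
                docs.append(current)
            current = {"heading": match.group(2), "text": []}
        elif current:
            current["text"].append(line)
    if current:
        docs.append(current)

    # Merge very short documents into the previous one (simplified merging rule).
    merged = []
    for doc in docs:
        words = sum(len(l.split()) for l in doc["text"])
        if merged and words < MIN_WORDS:
            merged[-1]["text"].extend(doc["text"])
        else:
            merged.append(doc)
    return merged
```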

The Basque Corpus: BasqueARA

Basque science texts were the most difficult to gather. In fact, we were only able to collect 17 documents from Agrega2. Given this highly limited number of documents, we resorted to a website and science textbooks for additional text. The final BasqueARA corpus includes a total of 329 documents across the 4 ESO levels (see Table 1) (see the science topics covered in each level in Appendix A). It is an unbalanced corpus where ESO-1 has 102 documents, ESO-2 has 90, ESO-3 has 58 and ESO-4 has 79. The corpus includes a total of 8,311 unique lemmas, 52,459 words, 6,269 sentences and 4,238 paragraphs. We observe that, rather counterintuitively, the average number of words per document is highest for ESO-1, drops to its lowest for ESO-2, and increases again for ESO-3 and ESO-4. A similar pattern is observed in the count of unique lemmas.

Table 1 Quantitative information for the BasqueARA corpus, where #docs refers to the number of documents, #lemmas to the number of unique lemmas, #words to the number of words, w. avg. to the average number of words per document and w. st.dev. to the standard deviation

To get to know the linguistic content of our corpus better, and more concretely, to examine whether texts with higher ESO grade levels displayed features that indicate more complexity, we analyzed several linguistic features using MultiAzterTest (Bengoetxea & Gonzalez-Dios, 2021). MultiAzterTest is a tool that provides indices of the linguistic and discourse representations of a text. Specifically, it extracts well over 100 features for Basque, Spanish and English (125, 141 and 163, respectively). We automatically obtained all indices for the texts per ESO grade and examined trends. For enhanced comparison and readability, we present values only for the ESO1 and ESO4 levels.
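As an illustration of how such trends can be inspected, the following sketch assumes the MultiAzterTest indices have been exported to a CSV file with one row per document, a grade column and one column per index; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical export: one row per document, a "grade" column (1-4)
# and one numeric column per MultiAzterTest index.
features = pd.read_csv("basqueara_multiaztertest.csv")

# Average value of every index per ESO grade.
per_grade = features.groupby("grade").mean(numeric_only=True)

# Compare ESO-1 against ESO-4 and flag indices that increase with grade.
eso1, eso4 = per_grade.loc[1], per_grade.loc[4]
trend = pd.DataFrame({"ESO1": eso1, "ESO4": eso4, "increases": eso4 > eso1})
print(trend.sort_values("increases", ascending=False).head(20))
```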

Results did not show clear uniform progressions of all indices from ESO-1 to ESO-4. We did observe, for example, that word length increases with grade (ESO1: 6.32 - ESO4: 6.96), which is a sign of complexity. However, other indices were not in line with this trend. To mention a few, rare adjectives (ESO1: 0.99 - ESO4: 0.79) and adverbs (ESO1: 0.45 - ESO4: 0.25) are mostly present in ESO-1 documents.

The Spanish Corpus: Agrega2-Es

For the Spanish corpus, we were able to create a total of 401 documents from material extracted from the Agrega2 repository (see Table 2). In reference to the number of documents assigned to each level, we see that the corpus is not completely balanced: it contains 102 documents for ESO-1, drops to 78 for ESO-2, and then increases to 94 for ESO-3 and 127 for ESO-4. The corpus includes a total of 12,438 unique lemmas, 171,857 words, 12,058 sentences and 7,186 paragraphs (see the science topics covered in each level in Appendix A). The average number of words per document is highest for ESO-1 and lowest for ESO-3. Notably, the number of unique lemmas decreases up to ESO-3 and increases again at ESO-4.

Table 2 Quantitative information for the Agrega2-Es corpus, where #docs refers to the number of documents, #lemmas to the number of unique lemmas, #words to the number of words, w. avg. to the average number of words per document and w. st.dev. to the standard deviation

The linguistic analysis carried out based on the features provided by MultiAztertest revealed that our Agrega2-Es corpus has both regular and irregular progressions with respect to difficulty, and the scores do not systematically assign a higher level of complexity to documents in the higher ESO grades, as one would expect. For example, at the syntactic level, the average value of left embeddedness (ESO1: 2.74 - ESO4: 4.12), that is, the number of words before the main verb, increases as the grades progress, which can be taken as an indication that documents in higher grades are more difficult. At the semantic level, the polysemic index (ESO1: 4.90 - ESO4: 5.41), the average of the polysemy values of nouns and verbs calculated according to WordNet entries, also increases as the ESO level increases. However, hypernymy values also show that there are more general terms in higher levels (ESO1: 6.39 - ESO4: 6.71), which would indicate that the concentration of more general words is higher in the upper levels of ESO.

Given that traditional readability formulas are available for Spanish, we examined how they would classify the documents in our corpus. We used the Fernández Huerta (1959) metric, an adaptation of the Flesch-Kincaid formula for Spanish. According to the results, all the documents in the Agrega2-Es corpus belong to the 8th and 9th grades in the USA school system (see Fig. 1). In other words, all the documents can be easily understood by 13 to 15 year-old students. This allocation is broadly in line with the actual grade assignment obtained from our source: Agrega2-Es material is intended for 12-16 year-old students in Spain, but the differences among ESO levels are not clearly captured by this metric.
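For reference, a minimal sketch of the Fernández Huerta score is given below. It follows one common formulation of the formula (206.84 - 0.60*P - 1.02*F, with P the number of syllables and F the number of sentences per 100 words) and uses a very rough Spanish syllable counter, so exact values may differ from those produced by dedicated tools.

```python
import re

def count_syllables_es(word):
    """Very rough Spanish syllable count: each group of vowels counts as one."""
    return max(1, len(re.findall(r"[aeiouáéíóúü]+", word.lower())))

def fernandez_huerta(text):
    """One common formulation: 206.84 - 0.60*P - 1.02*F, where P is the number
    of syllables and F the number of sentences per 100 words. Higher = easier."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"\w+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables_es(w) for w in words)
    p = 100 * syllables / n_words   # syllables per 100 words
    f = 100 * sentences / n_words   # sentences per 100 words
    return 206.84 - 0.60 * p - 1.02 * f
```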

Fig. 1 Box plot of Fernández Huerta scores for the Agrega2-Es corpus documents per grade

Table 3 Quantitative information for the Agrega2-En+ corpus, where #docs refers to the number of documents, #lemmas to the number of unique lemmas, #words to the number of words, w. avg. to the average number of words per document and w. st.dev. to the standard deviation

The English Corpus: Agrega2-En+

For English, we gathered a total of 251 documents, 97 of which came from the Agrega2 project. This material was extracted from science units in English. The Agrega2 project is the source of 60 documents in ESO-1, 2 in ESO-2, and 35 in ESO-4. In order to conduct the experiments, we incorporated texts from other web and proprietary sources into the pool of documents. We assigned the new documents to the ESO 1-4 levels according to the indications in the books. In total, our Agrega2-En+ corpus consists of 251 documents (see Table 3): 81 documents for ESO-1, 61 for ESO-2, 52 for ESO-3 and 57 for ESO-4. The corpus includes a total of 5,792 unique lemmas, 76,802 words, 7,099 sentences and 4,042 paragraphs (see the science topics covered in each level in Appendix A). ESO-4 has the highest average number of words per document while ESO-2 has the lowest. In contrast, unique lemmas are most abundant at ESO-1 and least abundant at ESO-3.

We also conducted the linguistic analysis on the Agrega2-En+ corpus. We found even fewer complexity-related patterns across ESO levels in this corpus, and noticed that it was often the documents in the ESO-3 category that displayed a different behaviour. There is a higher presence of passive voice verb phrases as the ESO levels increase (ESO1: 13.29 - ESO4: 13.05). However, no obvious patterns emerge for the remaining syntactic features. For left embeddedness, for example, the highest and lowest values belong to the ESO-1 and ESO-4 levels (ESO1: 3.63 - ESO4: 3.46), respectively.

According to Flesch-Kincaid scores, the Agrega2-En+ corpus contains texts ranging from the 8th to the 12th grade in the USA school system (see Fig. 2). That is, some groups of documents can be easily understood while others are fairly difficult to read. However, this distinction does not correlate with the ESO levels.
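A corresponding sketch for the Flesch-Kincaid grade level is shown below; the syllable counter is again naive, so the values are indicative only.

```python
import re

def count_syllables_en(word):
    """Naive English syllable count based on vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text) or ["placeholder"]
    syllables = sum(count_syllables_en(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```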

Fig. 2 Box plot of Flesch-Kincaid scores for the Agrega2-En+ corpus documents per grade

To sum up, BasqueARA, Agrega2-Es, and Agrega2-En+ are entirely independent collections comprising 329, 401, and 251 documents, respectively. The corpora were compiled with material originally written in the respective language and they are not translations of one another. Moreover, Agrega2-En+ consists of documents tailored to the proficiency levels and language learning requirements of non-native English speakers. The linguistic analysis carried out revealed few patterns across grades. This suggests that the benefit of these particular linguistic features might be limited for ARA classification models.

Model Architecture

In our work, we approached ARA as a document level classification task. While multiple algorithms are available to train the models, we opted to study the behaviour of feature-based ML and DL models. In this section, we present the main architectures used in the experiments for Basque, Spanish and English.

Feature-Based Models

SVM has been the prevailing approach over all alternatives in the studies carried out in the last few years. Therefore, we follow the same approach in our experiments for classifying our data. Concretely, we used WEKA (Hall et al., 2016) to build an SVM classifier using SMO for each of our languages. SMO is an optimization method used to solve the quadratic optimization problems that arise during SVM training. We extracted the features to train the models using MultiAzterTest (Bengoetxea & Gonzalez-Dios, 2021), a tool that calculates linguistic features from texts in Basque, Spanish and English. The linguistic features are not limited to readability formulas, namely the Flesch-Kincaid Grade Level and the Simple Measure of Gobbledygook (SMOG), but include a wide set of linguistic phenomena that might help the identification of textual complexity. The features capture descriptive characteristics, lexical diversity, readability, word frequency, vocabulary knowledge, word information, syntactic complexity, word semantic information, referential cohesion, semantic overlap and connective elements. Linguistic features with no equivalents in Basque and Spanish are not calculated for these languages; these language-specific features include agentless passive voice incidence and gerund incidence. Note that the readability features include the Flesch-Kincaid grade level formula and the SMOG index, which are also not available for Basque. In total, the tool provided us with 125 features for Basque, 141 features for Spanish and 163 features for English.
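Our classifiers were built in WEKA with SMO. Purely as an illustration, an analogous pipeline can be sketched in Python with scikit-learn (a different SVM implementation, so results will not match exactly), assuming the MultiAzterTest features and the ESO labels are available as arrays.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X, y):
    """X: (n_docs, n_features) MultiAzterTest feature matrix; y: ESO levels 1-4.
    A linear-kernel SVM is used here as a stand-in for WEKA's SMO classifier."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")  # 10-fold CV
    clf.fit(X, y)
    return clf, scores.mean()
```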

Deep Learning Models

Researchers have reported improved accuracy with the advent of transformer-based DL models over feature-based ML methods. These models have shown great capacity to be fine-tuned for particular tasks after being pre-trained as language models. Instead of training a model from scratch, fine-tuning leverages the knowledge gained from the pre-trained language model and applies it to a more specialized task, such as ARA, making the model more effective in this domain. This transfer learning strategy has been successfully applied to a variety of NLP tasks with language models such as BERT (Devlin et al., 2019). One of the main limitations of this type of model when working with document-level texts is that the input cannot exceed 512 byte-pair tokens. Until recently, the options available to work with longer texts involved either using only the first 512 tokens of the document or splitting it into groups of 512 tokens, classifying each of them separately, and then combining the results. However, newer architectures such as Longformer (Beltagy et al., 2020) deal with sequences of up to 16 K tokens.
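The sketch below makes the two simple strategies concrete using the Hugging Face tokenizer API: truncating a document to its first 512 tokens, or splitting it into consecutive 512-token chunks. The model checkpoint is only an example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

def truncate_to_512(text):
    """Keep only the first 512 byte-pair tokens of the document."""
    return tokenizer(text, truncation=True, max_length=512)

def split_into_chunks(text, size=512):
    """Split a long document into consecutive chunks of at most `size` tokens,
    which could then be classified separately and their predictions combined."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```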

The average number of words per document in BasqueARA is 159.44, with per-level standard deviations between 85 and 148; in the Agrega2-Es corpus, the average is 428 with standard deviations between 167 and 440; and the Agrega2-En+ corpus has an average of 305.98 words per document with standard deviations between 120 and 196. As a result, the full content of our documents cannot always be represented using a BERT-style language model. In this exploratory work, we decided to test both types of architecture and therefore ran our experiments with both kinds of transformer-based pre-trained models, BERT-based and Longformer-based. Overall, our aim is to take a pre-trained language model and fine-tune it with our data for ARA.

BERT-Based Models

Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model for NLP tasks such as text classification, question answering or sentiment analysis (Devlin et al., 2019). BERT is based on a transformer architecture and trained on a plain text corpus. While it was originally trained to represent contextual relations between words in English, models trained on Basque and Spanish data have also emerged. The English model, BERT, was pre-trained on the BooksCorpus (Zhu et al., 2015) (800 million words) and English Wikipedia (2,500 million words). BERTeus (Agerri et al., 2020), the model for Basque, was trained using the Basque Media Corpus, which contains 189.6 million tokens from crawled news articles and 35 million from the Basque Wikipedia. The Spanish model, BETO (Cañete et al., 2020), was trained on 3 billion words from different sources: all the data from the Spanish Wikipedia and the OPUS Project (Tiedemann, 2012), which includes journals, TED talks, subtitles, news stories and more. All three models have similar configurations, containing 12 transformer encoder layers and 110 million parameters (for further details see Devlin et al. (2019); Agerri et al. (2020); Cañete et al. (2020)). In our experiments, the three models have a maximum length limit and truncate longer sequences automatically, taking the first 512 tokens of each document in the corpus as input.
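As a hedged sketch of the fine-tuning step, the snippet below wraps a pre-trained encoder as a 4-class sequence classifier with the Hugging Face Trainer. The checkpoint name, hyperparameters and dataset preparation are indicative only; the published identifiers of the BERTeus and BETO checkpoints are not reproduced here.

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def fine_tune(train_dataset, eval_dataset, model_name="bert-base-cased", lr=3e-5):
    """Fine-tune a pre-trained encoder as a 4-level ARA classifier.
    train_dataset / eval_dataset: tokenized documents (max 512 tokens)
    with the ESO level (0-3) stored in a `labels` field."""
    model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                               num_labels=4)
    args = TrainingArguments(
        output_dir="ara-model",
        num_train_epochs=20,           # maximum number of epochs explored
        learning_rate=lr,              # one of 1e-5, 3e-5, 5e-5
        per_device_train_batch_size=8,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer
```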

Longformer-Based Models

Beltagy et al. (2020) introduced Longformer, a new architecture with an attention mechanism that scales linearly with sequence length, in contrast to previous approaches whose self-attention scaled quadratically. As a consequence, this architecture makes it easy to process long documents. Beltagy et al. (2020) also offer a pre-trained language model that can process English sequence lengths of up to 4,096 tokens. Similarly, XLM-R Longformer (XLM-Long for short) (Sagen, 2021) is an extension of the XLM-R multilingual model (Conneau et al., 2020) which was pre-trained using the Longformer scheme only on the English WikiText-103 corpus. As a result, XLM-Long offers the option to work with long texts in several languages, including Basque and Spanish. In our experiments, Longformer-based models take as input the whole documents from the training corpora.

Science Models

In this section we present the feature-based ML models and the DL models we developed using the BasqueARA, Agrega2-Es and Agrega2-En+ corpora. We first describe the experimental set-up and then report the evaluation results to determine the level of reliability of our models, trained with context-specific corpora, for our multilingual application scenario.

Experimental Set-up

We trained feature-based ML models and DL models for each of the three languages involved in our target educational setting using the model architectures described in Section 4.

A review of the literature showed that the higher the number of classes to be predicted, the less accurate ARA models tend to be (Ma et al., 2012). Our corpora are relatively small and this, in itself, might increase the difficulty of the task. To test for differences in behaviour, we trained models on the four original ESO levels as well as on a modified 2-level setup, where levels 1 and 2, and 3 and 4 were merged, reflecting the two stages comprised by ESO. Additionally, as described in Section 3, our corpora are considerably unbalanced and this might bias the models during training. To avoid this, we also trained the models on the full unbalanced corpora as well as on a balanced sample obtained by randomly selecting the common minimum number of documents available for all levels, that is, 58 for Basque, 78 for Spanish and 52 for English.
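The two data configurations can be illustrated with the following sketch, assuming the corpus is loaded into a pandas DataFrame with a level column holding the ESO grade (1-4); the column name and the random seed are assumptions made for illustration.

```python
import pandas as pd

def to_two_levels(df):
    """Merge ESO grades 1-2 and 3-4 into the two stages comprised by ESO."""
    out = df.copy()
    out["level"] = out["level"].map(lambda g: 1 if g <= 2 else 2)
    return out

def balance(df, seed=0):
    """Downsample every level to the size of the smallest one."""
    n = df["level"].value_counts().min()
    return (df.groupby("level", group_keys=False)
              .apply(lambda g: g.sample(n, random_state=seed)))
```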

As mentioned, we provided the feature-based ML model with linguistic features extracted using MultiAzterTest (Bengoetxea & Gonzalez-Dios, 2021). Next, we used the SVM classifier with 10-fold cross-validation on the balanced and unbalanced configurations of the BasqueARA, Agrega2-Es and Agrega2-En+ corpora for the 4-level and 2-level setups. We report accuracy and F1 score metrics to evaluate the models.

The pre-trained language models (BERT-based and Longformer-based ones) were also fine-tuned to classify the documents in the 4-level and 2-level setups. We fine-tuned for a maximum of 20 epochs with various learning rates (1e-5, 3e-5 and 5e-5) in unbalanced and balanced corpora. As the number of training examples is not high, we followed the cross-validation approach also in this experimental setting. We split the corpora into 10 stratified training and validation folds, created a model for each fold, obtained their accuracy values per epoch and set as best model the one that obtained the best accuracy values on average. It is worth mentioning here that DL models fine-tuned on BasqueARA and Agrega2-Es with XLM-Longformer scored very poorly for all configurations and setups, with the best model achieving an accuracy of 30%, and were therefore discarded for further analysis. We only report the results for the original Longformer model which works for English.
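The fold construction and best-model selection for the DL models can be sketched as follows; the per-fold fine-tuning call is abstracted away and the bookkeeping is simplified with respect to the actual training scripts.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, fine_tune_and_score, n_epochs=20):
    """Stratified 10-fold CV. `fine_tune_and_score(train_idx, val_idx)` is
    expected to return the validation accuracy after each training epoch."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    per_epoch = np.zeros((10, n_epochs))
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        per_epoch[fold] = fine_tune_and_score(train_idx, val_idx)
    mean_acc = per_epoch.mean(axis=0)        # average accuracy per epoch
    best_epoch = int(mean_acc.argmax())      # epoch with best average accuracy
    return best_epoch, mean_acc[best_epoch]
```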

Table 4 Accuracy and weighted F1 score results for the 4-level ARA models trained with BasqueARA, Agrega2-Es and Agrega2-En+, where SVM refers to the ML approach used and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value for each language is marked in bold
Table 5 Accuracy and weighted F1 score results for the 2-level ARA models trained with BasqueARA, Agrega2-Es and Agrega2-En+, where SVM refers to the ML approach used and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value for each language is marked in bold

Results

Tables 4 and 5 present the results of our experiments for the 4-level and 2-level setting, respectively. Each row of the tables displays the accuracy and weighted F1 score results obtained for a specific combination of the settings presented in the experimental set-up (corpus, learning approach, balance). For the DL models, we report the results obtained with the learning rate leading to the highest accuracy only (indicated in parenthesis). Please note that the tables with the precision, recall, F1 scores and weighted scores for the 4-level setup are provided in Appendix B.

For the 4-level scenario, results show that our SVM models do not perform well for any of the languages involved. Concretely, the best SVM model for Basque obtains an accuracy of 52.58%, and for Spanish we obtain an accuracy of 51.87%. The English models reach the highest accuracy, with 62.94% and 61.05% for the unbalanced and balanced configurations, respectively. As can be observed, F1 scores closely follow those results.

The DL counterparts achieve considerably better results. Using BERTeus on the BasqueARA corpus, we obtain an accuracy of 74.63% for Basque. We obtain a similar result for Spanish, an accuracy of 73.75%, using BETO on the Agrega2-Es corpus. Again, we obtain the best results for English: the model built using Longformer on the Agrega2-En+ corpus obtains an accuracy of 87.33%. Interestingly, all the DL models showed an improvement of over 20 points with respect to the SVM models.

Let us consider the 4-level configuration in more detail. For the ML model, overall, the results for the unbalanced and balanced settings are similar across languages. It is only the Basque model that obtains a noticeable increase of 0.1 for the F1 score and 12% for accuracy. For this language, in the unbalanced setting, accuracy and F1 score results show better results for categories with a higher number of examples: the difference between ESO-1 and ESO-3 categories is almost 0.5 points. This trend is also present in the balanced setting, where better results (ranging from 0.414-0.521) are obtained for the ESO-1, 2 and 4, but ESO-3 scores are considerably lower with a 0.226 F1 score. Such a trend is more blurred for Spanish. The two categories with the highest number of documents obtain the best results but it is not the category with the lowest number of documents that performs worst. In fact, for both the unbalanced and the balanced settings it is for ESO-3 that performance is the worst. For English the trend is not apparent. The fluctuation of the scores is more limited and no correlation between the number of documents and accuracy and F1 score results is present.

Within the DL models, the effect of balancing the training corpora differs across languages for those trained with the BERT architecture. The outcome is very positive for Basque, but not for Spanish and English. No correlation seems to exist between the number of documents in a category and the results of the models for any of the languages. What we see is that the range of accuracy and F1 score results across categories decreases for the balanced setting, especially for Basque.

Not surprisingly, the models perform substantially better for both the feature-based ML approach and the DL approach for the 2-level setup. SVM models obtain an improvement of over 20 points for Basque and over 30 points for Spanish and English. The accuracy of the DL models increases over 15 points. For this simpler set-up, all approaches score over 60% in accuracy. Yet, DL models prevail over SVM models, the best obtaining 81%, 89% and 86.95% accuracy for Basque, Spanish and English, respectively.

With respect to corpus balance, we see that results do not vary considerably for the ML models. For Basque, we obtain slightly better results for the unbalanced setting, while for Spanish and English a balanced scenario proves better. We do observe that for all languages the balanced setting results in a smaller fluctuation of results across categories. The DL models follow the same trend for Spanish and English. However, BERTeus performs otherwise: while the unbalanced setting obtains a similar score to the SVM model (considerably lower than that for Spanish and English), the balanced setting performs significantly better, with an improvement of 15 points, much closer to the performance obtained for Spanish and English.

Overall, the results show that our feature-based ML models lag behind the DL models regardless of the language, corpus and configuration tested (the balanced and unbalanced corpus, the 2-level and 4-level set-up). The best-scoring feature-based model achieves a relatively low accuracy of 62.94% for the 4-level setup and a more favourable 77.06% for the 2-level setup for English. In turn, all DL models score over 60%.

We further inspected the results of the models to pinpoint the level-pairs with which models struggle the most. A closer look at the confusion matrices for the 4-level models showed interesting results. We observed that, for the three languages in the 4-level setup, both the feature-based ML and the DL models struggle the most when deciding whether a document belongs to level 1 or level 2. Additionally, the Basque model struggles to identify level 1 and level 3 too. No considerable differences or patterns can be observed among the remaining level-pairs for any of the models.

In an attempt to identify the features that might be misleading the feature-based models and making them classify a document as pertaining to a level to which it does not belong, we compared the average values of correctly and incorrectly classified instances for the 25 most relevant features. No specific set of features can be said to be causing the errors across level-pairs. Interestingly, level 3 of Agrega2-En+, which is the class with the highest proportion of documents collected from external proprietary resources, displayed the highest number of statistically different average values with respect to the other levels, in particular level 1. This leads us to think that while most documents allocated to level 3 share characteristics similar to those of the rest of Agrega2-En+, inconsistent instances are apparent.

Automatic Evaluation

Having trained various ARA models for Basque, Spanish and English based on context-specific corpora (see Section 5), in this section we aim to gauge their generalizability. To this end, we apply the best performing models for each of the languages to unseen data. We follow the same procedure described in Section 3 to compile the new test corpora and create three additional resources: IkasElkar for Basque, Intef for Spanish and NSC for English. It is important to note that these corpora, except Intef, come from proprietary sources.

In the next paragraphs, we first describe the new test corpora and explain their characteristics in Section 6.1 before we present the results in Section 6.2.

Corpora for Automatic Evaluation

Let us start by presenting the Basque test corpus. IkasElkar is a 4-level Basque corpus that consists of science documents collected from a private source (see details in Table 6). Specifically, the texts belong to a private publishing company that produces textbooks for the BAC context. We obtained 18 books in total: 3 books each for levels 1 and 2, and 6 books each for levels 3 and 4. In total we were able to extract 259 documents for our Basque test corpus: 45 for level 1, 27 for level 2, 82 for level 3 and 105 for level 4 (see the science topics covered in each level in Appendix A). Note that the average word count per document is higher in the IkasElkar corpus (except for level 1) than in the BasqueARA corpus, while the standard deviation ranges are quite similar. However, in IkasElkar the average number of words decreases slightly and rather regularly as the ESO levels go up, whereas in BasqueARA there is no clear order.

Table 6 Quantitative information for the IkasElkar, Intef and NSC corpora, where #docs refers to the number of documents, #words to the number of words, w. avg. to the average number of words per document and w. st.dev. to the standard deviation

For Spanish, we compiled the Intef test corpus. Intef is the unit of the Spanish Ministry of Education and Vocational Training responsible for the integration of information and communications technology and teacher training in all educational stages (except university education). Intef provides educational materials for different levels of education. The corpus was compiled with material extracted from their website during November 2022. We obtained science education texts for ESO level students and followed the procedure explained in Section 3 to compile our document-level corpus. In total, the Spanish test corpus consists of 421 documents: 74 for ESO-1, 71 for ESO-2, 148 for ESO-3 and 128 for ESO-4 (see the science topics covered in each level in Appendix A). Compared to Agrega2-Es, the average number of words per document is lower, with an average of 236 words against 428 in Agrega2-Es. The standard deviation is also smaller, especially for levels 1 and 2.

For English, we compiled the new test corpus, the Natural Science Corpus (NSC), from materials from private publishers during November 2022 (see Table 6). In total we collected 183 documents: 33 for ESO-1, 74 for ESO-2, 26 for ESO-3 and 50 for ESO-4 (see the science topics covered in each level in Appendix A). Unlike in the Agrega2-En+ corpus, in NSC the average word count increases regularly as the ESO level goes up. The standard deviation, however, is similar for both corpora.

In addition to confirming that the topics covered per level in the training and test corpora overlapped (almost all topics present in the test corpora were also present in the training corpora), we further inspected their differences and similarities in terms of linguistic features. We extracted the 25 most relevant features according to the ML models using MultiAzterTest (Bengoetxea & Gonzalez-Dios, 2021) and calculated their t-test values. The examination of the features yielded valuable insights into these datasets.
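A sketch of the train/test feature comparison is given below, assuming both corpora's MultiAzterTest indices are available as DataFrames with matching columns (the file names are hypothetical); Welch's t-test is used here, as the exact test variant is an implementation assumption.

```python
import pandas as pd
from scipy.stats import ttest_ind

train = pd.read_csv("basqueara_features.csv")    # hypothetical file names
test = pd.read_csv("ikaselkar_features.csv")

# Placeholder for the 25 most relevant features according to the ML models.
top_features = train.select_dtypes(include="number").columns[:25]

for feat in top_features:
    stat, p = ttest_ind(train[feat], test[feat], equal_var=False)  # Welch's t-test
    if p < 0.05:
        print(f"{feat}: train mean={train[feat].mean():.2f}, "
              f"test mean={test[feat].mean():.2f}, p={p:.3f}")
```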

For Basque, features with a statistically significant difference consistently displayed lower means in BasqueARA compared to the IkasElkar test corpus. The BasqueARA corpus exhibited lower average counts of letters in lemmas, logical connectives, adverbs, indicative verbs, propositions and words, a lower average number of levels within the dependency tree per sentence, and a reduced standard deviation of sentence length.

For Spanish, among the features with significant differences, Agrega2-Es presented a higher mean for metrics including the count of verbs in infinitive form per 1000 words, the occurrence of proper nouns per 1000 words, the proportion of proper nouns relative to all nouns, the average number of sentences per paragraph, and the similarity between adjacent sentences. On the other hand, the Intef test corpus showed a higher mean for metrics such as the count of adversative connectives, low-frequency words, mean number of sentences per paragraph, and the semantic similarity among all possible sentence pairs within a paragraph.

Numerous attributes showed a statistically significant difference between the Agrega2-En+ and NSC datasets. Notably, within the Agrega2-En+ corpus, attributes such as left embeddedness, mean verb phrases per sentence, and average polysemy values of nouns and verbs displayed higher means. Conversely, the NSC corpus showed higher means for the standard deviation of lemma length, noun density, noun count, syllable count in words, and word length.

Results of Automatic Evaluation

Results show that, in general, our science models do not perform well on unseen data. Let us review their performance per language.

Table 7 Accuracy and weighted F1 score results for the ARA models tested on IkasElkar, Intef and NSC for 4 levels. SVM refers to the ML approach and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value for each language is marked in bold
Table 8 Accuracy and weighted F1 score results for the ARA models tested on IkasElkar, Intef and NSC for 2 levels. SVM refers to the ML approach and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value for each language is marked in bold

For Basque, the feature-based SVM model displays a decrease in both accuracy and F1 score results when tested on the IkasElkar corpus. For the 4-level scenario (Table 7) the best accuracy of the model is 35.90% (0.34 F1 score) on the balanced corpus configuration. In turn, the DL model does not show a significant improvement, with 38.61% accuracy (0.36 F1 score) on the balanced corpus. In the 2-level scenario (Table 8), the results improve considerably, over 30 points compared to the 4-level scenario. However, the DL models’ accuracy drops from the 80.60% obtained in cross-validation to 70.65% for the balanced corpus configuration and to 44.40% for the unbalanced configuration.

The Spanish models trained on Agrega2-Es show a similar behaviour. The accuracy of the SVM models drops more than 10 points when tested on Intef. In turn, the DL models also perform much worse, with their accuracy decreasing over 30 points. Both approaches perform similarly: the accuracy of the best SVM model is 37.29% while the accuracy of the best DL model is 37.52% (note that the F1 score assigns 0.38 to the former and 0.37 to the latter). The models perform better in the 2-level configuration, even if we observe a sharp drop with respect to the cross-validation results. Here the SVM models perform better than the DL models: the best SVM model obtains an accuracy of 61.28% and the best DL model 45.84% (0.62 and 0.39 F1 scores, respectively).

The English models, trained on Agrega2-En+, show the highest drop when compared to previous experiments. The best SVM model obtains an accuracy of 36.61%, as does the best DL model. This contrasts sharply with the results of the previous experiments, where the best model achieved an accuracy of 86.01% and even the worst model obtained 61.05% (F1 scores indicate the same trend). The results improve for the 2-level scenario but, again, only one of the Longformer-based models reaches an accuracy of 73.77% (0.73 F1 score).

If we focus on the effect of corpus balance, we observe that, following the trend of the cross-validation results for the 4-level setting in Section 5.2, the ML models for Spanish and English obtain better results with the balanced configuration. For Basque, the apparent benefit of the unbalanced model from the cross-validation results disappears and both models obtain very similar overall results, with a slightly better accuracy for the balanced model (though not for the F1 score). The DL approach trained with the corresponding BERT architectures shows similar trends for Basque and Spanish, but not for English. Basque results are better when using the balanced model, but the improvement is not as high as we observed in the cross-validation experiment. Spanish does not benefit from the balanced setting, but the decrease is slightly smaller compared to the cross-validation experiment. For English, using the balanced corpus leads to improved results, which was not the case in the cross-validation experiment or with the Longformer model.

Table 9 Quantitative information for the Siyavula corpus, where #docs refers to the number of documents, #words to the number of words, w. avg. to the average number of words per document and w. st.dev. to the standard deviation

As expected, once again, the models perform substantially better for both the feature-based ML approach and the DL approach in the 2-level setup. SVM models obtain an improvement of around 30 points for Basque and Spanish, and 15 for English. The results of the DL models increase around 15 points for Basque and English, but not for Spanish. In fact, the balance (or lack of it) of the corpus yields divergent results for the DL models. For example, for Basque, balancing the corpus increases the results of the BERTeus model by almost 25 points, but the results are very similar for Spanish and English. For the Longformer model in English, the balanced setting results in a decrease of 18 points. These findings emphasize the necessity for a more comprehensive data analysis, and the behaviour of the models calls for further investigation with a larger and more extensive dataset.

Table 10 Accuracy and weighted F1 score results of the 4-level ARA models for English trained with texts for native learners (Siyavula corpus) and tested on texts for non-native learners (NSC corpus), where SVM refers to the ML approach and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value is marked in bold

Models Trained on Data for Native Learners

Our experiments point to data scarcity and domain specificity, among other factors, as obstacles to building high-performing ARA models. A review of available science educational data revealed very limited resources, and our efforts to gather resources for our specific multilingual context also yielded little data. Adding to this sparseness issue, we note that the target users of the collected documents are not always the same: the educational English resources used in the literature were created with native English learners in mind, whereas our English corpora contain texts for non-native English speakers. In this scenario, we performed yet another experiment to check whether documents for our ESO grade students could be successfully classified using a model trained on data for native speakers.

To do so, we ran an experiment with texts from the Siyavula project, an educational initiative with the aim of creating natural sciences and mathematics materials for high school students. Originating within the South African curriculum, it provides online materials and open access textbooks primarily for native speakers of English. Siyavula proved successful for training a paragraph-level ARA model in a previous study (Nadeem & Ostendorf, 2018). We therefore analyze whether the same corpus can be useful for our non-native context and at document level. To prepare the corpus, we downloaded the natural science books for grades 7-10, which coincide with the learner ages of the four ESO grades we are focusing on. We then followed the same steps explained in Section 3 to divide the text into documents.

The compiled corpus consists of 297 documents (212,999 words) distributed across 4 levels (see Table 9). Once again, the number of documents allocated to each level varies, with level 4 gathering the highest number of instances (133) and level 1 the fewest (42).

We trained the feature-based ML and the DL models following the same procedure as in Section 5. Again, for the feature-based ML model we extracted linguistic features using MultiAzterTest, and to build the DL models we fine-tuned BERT and Longformer-based models. In the cross-validation evaluation, once again, the DL models obtain considerably better results in both setups, with all the models obtaining accuracy values ranging from 77% to 88%. The best accuracy values for the SVM models range from 64.98% to 87.87%.

We tested our Siyavula models on the NSC test corpus. Results show a dramatic drop in accuracy and weighted F1 score results for all models (see Tables 10 and 11). For the 4-level setup, the Longformer model (lr 1e-5) scores best with an accuracy of 28.41%. The results for the 2-level setup are better, as happened in previous experiments, but still low, as none of the models reach an accuracy of 50%. The best performing model is the feature-based SVM model, which obtains an accuracy of 47.54%. These results are in line with the results obtained when testing the models trained with our context-specific data for non-native learners of English. In both cases, the scores are low, and in this particular experiment, even slightly lower.

Table 11 Accuracy and weighted F1 score results of the 2-level ARA models for English trained with texts for native learners (Siyavula corpus) and tested on texts for non-native learners (NSC corpus), where SVM refers to the ML approach and BERTeus, BERT, BETO and Longformer to the approaches used for the DL models. The highest value is marked in bold

Conclusions and Future Work

In this paper we have explored the performance of feature-based ML and DL models to predict the educational grade of science texts for the Basque curriculum, that is, a multilingual educational context where students require texts in Basque, Spanish and English. To that end, we first compiled three graded corpora for secondary education learners consisting of science documents extracted from material created for our particular context: BasqueARA for Basque, Agrega2-Es for Spanish and Agrega2-En+ for English. Each document was assigned a level 1-4 according to the original ESO grade. Next, we developed feature-based ML models (SVM models) and DL models (BERT-based and Longformer-based models) to test their performance in our context and on unseen data. Additionally, we also tested the performance of an English model trained on texts directed at native English learners (the Siyavula corpus) to classify unseen data for our non-native English learners.

The within-corpus experiments showed promising results. For the 4-level classification, cross-validation experiments for Basque, on the BasqueARA corpus, yielded an accuracy of 74.63%; for Spanish, on the Agrega2-Es corpus, the models reached an accuracy of 73.75%; and for English, on the Agrega2-En+ corpus, 87.33%. The best scoring models were DL models in all cases. The accuracy and F1 score results were even higher for the 2-level classification task, with the best scoring models reaching well above 80% accuracy for all languages.

However, results on unseen data showed that the science models created using our context-specific corpora do not generalize well. When tested on the new context-specific corpora, accuracy and F1 score results were around 36-38% for the 4-level setup. Interestingly, the scores for the ML and DL models did not differ much. The scores were higher for the 2-level setup. The best performing language was Basque, with the best performing model (a DL model based on BERTeus) obtaining an accuracy of 71.13%. For Spanish, SVM models scored about 15 points higher than the DL models, with an accuracy of 61.28%. The English models obtained the worst results, with the best model (SVM) yielding 50.81%.

The cross-corpus experiments performed using the Siyavula corpus, in which models trained on data directed at native speakers were used to classify data directed at non-native learners, also showed poor results. The results of the models are lower than those obtained with training data directed at non-native speakers. In this final experiment, the best scoring models were a Longformer DL model, which obtained an accuracy of 28.41% for the 4-level setup, and the SVM model, which obtained 47.54% for the 2-level setup.

The results highlight the importance of data availability when building ARA models. The training sets available to perform the task at hand are small, which affects the performance of the models negatively. To overcome this issue, it is essential to carefully consider strategies to enlarge the training corpora. Automatic translation of existing data is a promising avenue that deserves exploring. However, it is important to consider potential disadvantages: even though today MT models can provide translations of rather good quality, the generated texts might not always be entirely appropriate, that is, they might include errors. Research would have to be performed to determine the trade-off between quality and quantity of the training corpora used by ARA models.

In addition, our experiments seem to indicate that the task itself is not an easy one, in the sense that it is not yet clear what role linguistic features and scientific content play in differentiating science documents by grade. A more thorough analysis of the characteristics of the texts might shed some light on this issue. It is possible that additional factors, such as content, can help to determine the readability level of scientific documents. Finally, it might be worth testing multilingual DL models, which exploit information from different languages to perform the classification, as well as recently published large pre-trained models such as FLAN (Wei et al., 2022) or PaLM (Chowdhery et al., 2022), as they have shown good results in a variety of NLP-related tasks when only few examples are available for fine-tuning.