Introduction

In this paper, we investigate crosslingual methods for automated content scoring, a scenario in which scoring models trained on data in a source language are applied to data in a different target language. In content scoring, students answer a prompt in written form, typically with one or a few sentences. These answers are scored on the basis of their content rather than their form: it only matters that the student understood the right concepts, while form aspects such as orthography and grammar are not considered for the final score. In a real-life educational setting, such a procedure can discriminate against non-native speaking students who might conceptually understand the topic in question but are unable to express their understanding in the language of instruction. One solution to this problem could be to allow students to answer a question in a language they are proficient in. As only the content matters, the form, including the language, is unimportant. This setting, however, would require that a teacher scoring an item is also proficient in the language used by the student, which would still restrict the available language options. We therefore investigate the feasibility of crosslingual automated scoring, where training data in one language is used to predict scores for test items in a different language.

Besides fostering educational equality, crosslingual scoring can also help to overcome data sparsity: available data in one language can be used to score the same item in a new language without the need to collect and manually annotate large amounts of new training data in the target language.

Our work builds on earlier work by Horbach et al. (2018), who collected ASAP\(_{de}\), a German version of the established English ASAP dataset (henceforth called ASAP\(_{orig}\)), and conducted crosslingual scoring experiments with the help of automatic machine translation, finding a substantial performance loss compared to monolingual scoring.

We extend this work in several respects. First, we extend the crosslingual setup to include a broader variety of languages. We collected new versions of the ASAP dataset in Spanish and French through crowd-sourcing studies and include a recently collected Chinese version (Ding et al., 2020). While German and English are closely related Germanic languages for which machine translation usually works very well, including these additional languages allows us to examine the influence of language proximity on crosslingual scoring performance. Second, previous work (Horbach et al., 2018) speculated that part of the performance drop reported in a crosslingual setting could be attributed to the data collection setup rather than to the language difference, as ASAP\(_{orig}\) was collected from high school students while ASAP\(_{de}\) was acquired through crowdsourcing. To address this question, we also collected a crowdsourced ASAP version in English, ASAP\(_{en}\). For each of these setups (ASAP\(_{es}\), ASAP\(_{fr}\), ASAP\(_{en}\)), we collected about 300 learner answers for each of the three individual prompts already used in ASAP\(_{de}\). These datasets were manually double-annotated by trained annotators. Figure 1 shows, as an example, the English version of the prompt material for prompt 1. Table 1 presents example answers from the different language-specific datasets in response to that prompt. Note that the shown answers from different datasets are conceptually very similar and thus all received the same score. We take this as a hint that the annotators, who followed the scoring instructions published with ASAP\(_{orig}\), indeed scored consistently across datasets.

Fig. 1 Prompt 1 from the original English ASAP dataset

Table 1 Example answers in different languages for prompt 1

Third, we methodologically extend existing crosslingual work to include recent developments in neural content scoring as well as crosslingual text classification. We follow Horbach et al. (2018) in using machine translation to automatically translate either the training data into the language of the test data or vice versa as a baseline condition. As machine translation risks introducing noise into the data, especially in the case of learner data with misspellings and other language problems, we investigate ways to circumvent the machine translation step through the use of a multilingual BERT classification model (Devlin et al., 2019) capable of dealing with multilingual input. We use such a model as another experimental condition and explore how the two approaches, machine translation and multilingual transformer models, can be combined. (See Fig. 2 for a visualization of the different variants of the scoring process.) In a domain adaptation approach, we also investigate how smaller amounts of in-domain data in the same language as the test data can be used to fine-tune a model trained on larger amounts of out-of-domain data, i.e., in our case, the large amounts of English answers from the original dataset (but always using training data from the same prompt as the test data).

Our paper thus makes the following contributions:

  • We compare and discuss five additional language variants of the ASAP dataset, three of which we have newly collected.Footnote 1

  • We present crosslingual experiments methodologically covering approaches based on (a) machine translation, (b) multilingual transformer models and (c) a combination thereof, showing that the latter works best in most setups.

  • We show that results can be further improved in a two-step training process, where a model is first trained on large amounts of data in a source language and then fine-tuned on the smaller available amounts of target language training data.

The remainder of the paper is organized as follows. We discuss related work in “Related Work”. In “Datasets”, we discuss and analyze the multilingual short-answer datasets used in our study. “Experimental Studies” presents three experimental studies: the first establishes monolingual baselines for comparison, the second uses crosslingual scoring with machine translation, multilingual transformer models, or both to cross the language boundary, and the third combines pretraining on source-language data with fine-tuning on target-language data. We further analyze the datasets and classification results in “Further Analyses” before concluding in “Conclusion and Future Work” and discussing ethical considerations in “Ethical Considerations”.

Fig. 2 Visualization of the crosslingual scoring process, using English as source and German as target language

Related Work

In the following, we first describe how NLP tasks in general can be placed in a crosslingual setting. We then review the automated scoring literature with respect to multi- and crosslingual setups.

Crosslingual Methods

Most NLP applications have been developed in a monolingual setting, which makes sense given that any machine learning task works best if the training data is as similar as possible to the actual test data at application time. However, since the acquisition of new training data is time-consuming, crosslingual methods have been investigated as one way of performing domain transfer, the language of the data being one domain aspect in which two datasets can differ. In this paper, we use the term crosslingual NLP task for any NLP task with training data in one source language and test data in a different target language.

There are several ways in which the language boundary can be crossed, the most straightforward one being machine translation. In information retrieval, for example, queries have been automatically translated from English to Spanish and vice versa in a crosslingual information retrieval setup (Ballesteros and Croft, 1996). Another approach for crosslingual text classification consists in translating a learned n-gram model, as in Shi et al. (2010), who accounted for the fact that individual words can have multiple translations by learning feature translation probabilities. Avoiding machine translation, Prettenhofer and Stein (2010) proposed a structural correspondence learning method for text classification, where large amounts of unlabeled data and some bilingual word correspondence lists are used to learn structural correspondences between two languages, i.e., essentially a task-specific machine translation model.

In the context of recent neural network advances, crosslingual word embeddings have become popular, for example in work by Klementiev et al. (2012), who learn representations of words in different languages in the same vector space, such that dog, chien, Hund and perro are represented by similar feature vectors. Finally, large pre-trained language models such as BERT (Devlin et al., 2019), which we employ in our work, have also been trained on documents in a multitude of languages, making them multilingual. Such approaches thus seemingly remove the need to translate training or test data.

Crosslingual Automated Scoring

The existence of available data in more than one language is a necessary prerequisite for multi- and crosslingual scoring. However, the main body of work in the area of automated scoring concerns monolingual setups in English. This holds both for automated content scoring, where only conceptual correctness is scored (Horbach and Zesch, 2019), and for essay scoring, where both content and linguistic form are evaluated (Klebanov and Madnani, 2021). We therefore first provide pointers to approaches for languages other than English, before discussing the few instances of truly multi- or crosslingual scoring approaches that have been developed so far.

Our work is an instance of automated content scoring, with an ample body of work for monolingual English scenarios and a fair number of established English-language datasets (e.g., Bailey and Meurers (2008); Basu et al. (2013); Dzikovska et al. (2013); Mohler and Mihalcea (2009)). In contrast, approaches and datasets for languages other than English are much scarcer, especially when it comes to approaches with datasets that are freely available for researchFootnote 2. They include languages such as Arabic (Abdul Salam et al., 2022; Ouahrani and Bennouar, year), German (Meurers et al., 2011; Pado and Kiefer, 2015; Sawatzki et al., 2021), Hebrew (Ariely et al., 2023), Indonesian (Herwanto et al., 2018; Wijaya, 2021), Japanese (Funayama et al., 2023), Portuguese (Galhardi et al., 2018; Gomes et al., 2021), Punjabi (Walia et al., 2019), Swedish (Weegar and Idestam-Almquist, 2023) and Turkish (Çınar et al., 2020)Footnote 3.

The English ASAP-SAS datasetFootnote 4 used as one of the datasets in this paper has received substantial attention in the content scoring community (e.g., Heilman and Madnani (2015); Higgins et al. (2014); Kumar et al. (2019)). Crosslingual content scoring approaches, in contrast, are rare, with Horbach et al. (2018) as one of the first instances. Recently, Schlippe and Sawatzki (2022) proposed a method for crosslingual scoring, unfortunately on datasets which are not publicly available. Similar to our approach, they also make use of multilingual transformer models. The work by Camus and Filighera (2020), while not presenting genuine crosslingual experiments, investigates the influence of automatically translating whole content scoring datasets into a foreign language. Note that for content scoring in general, most approaches follow either an instance-based or a similarity-based design (see Horbach and Zesch (2019)). The majority of current approaches belong to the first category, where a classifier learns the properties of a correct or incorrect answer from a set of labeled learner answers. An alternative is the similarity-based approach, where learner answers are compared to one or several reference answers to determine their label. Both approaches have been found to reach similar performance (Bexte et al., 2022, 2023). In this paper, we stick to the instance-based approach, as we extend prior, also instance-based, crosslingual approaches. A similarity-based crosslingual scenario opens up a vast additional parameter space: Do we fine-tune a monolingual or crosslingual similarity metric? Should we translate the answers used to train the metric, the reference answers, or the learner answers to be scored? A thorough investigation of these additional questions goes beyond the scope of this paper and will be tackled in future work.

The situation regarding data availability is similar for the essay scoring task. An abundance of work has been published for the English ASAP-AES datasetFootnote 5, a de-facto standard in automated essay scoring. Work with publicly available datasets in other languages mostly comes from a foreign-language learning perspective and provides data for CEFR classification, such as the Swedish SWELL corpus (Volodina et al., 2016), the German Falko corpus (Lüdeling et al., 2008) and the Portuguese COPLE2 corpus (Mendes et al., 2016). While most of these datasets are monolingual, the Merlin corpus (Boyd et al., 2014) is an exception in that it covers three languages, Czech, German and Italian, and has thus allowed for crosslingual CEFR classification experiments (Vajjala and Rama, 2018). They use a set of essay scoring features that generalize across languages, such as features based on part-of-speech tags and Universal Dependency relations. Such features are suitable for the domain of essay scoring and especially proficiency classification, where linguistic form is important, but less so for content scoring.

An important research branch also concerns large-scale assessment. This branch is not discussed in detail here because the respective datasets can typically not be made public. Yet PISA (Peña-López et al., 2012) and similar international standardized assessments of educational attainment would be ideal datasets for crosslingual scoring studies, as the data often comes in a wide variety of languages targeting exactly the same prompts. While there is research targeting the (semi-)automatic assessment of such data, e.g. by means of clustering approaches (Andersen et al., 2023; Zehner et al., 2016) for the German PISA subset, there has, to the best of our knowledge, not been a systematic investigation of crosslingual scoring for this data.

Datasets

In this section, we describe the data sets used in our study. We first discuss requirements for crosslingual data sets. Then we describe the data collection and annotation process. In order to assess how similar the new datasets are to each other and to the original English ASAP data, we analyze and compare them regarding answer length, label distribution and linguistic variance in the data.

Requirements for Crosslingual Data

When working on crosslingual data, language should be the only or, at least, the main factor in which datasets differ from each other. Horbach et al. (2018) propose a set of requirements for crosslingual datasets, including that prompts should be independent of a specific culture, language or curriculum. Especially when extending an already existing dataset, the guidelines used to score the original dataset should be available to and applicable by new annotators. Following this rationale, we extend the multilingual ASAP datasets already used in their study by integrating a Chinese dataset by Ding et al. (2020) and three new dataset variants: English, French and Spanish. As mentioned above, although the original ASAP data is in English, we collected new English answers via crowd-sourcing in order to keep the population providing the answers similar across all languages in the multilingual dataset and to be able to compare ecologically valid educational data with crowd-sourced data in the same language.

Data Collection

Prompts 1, 2 and 10 in ASAP are science-related tasks without a strong cultural background and are therefore considered appropriate to be transferred to other languages and learner populations. In addition, earlier work found that annotators were unable to apply the annotation guidelines successfully for the other prompts (Horbach et al., 2018).

Therefore, we manually translated the original prompt material and collected answers in all languages for these three prompts.

The French data was collected by distributing the prompts via different platforms and groups (e.g. Facebook). The majority of answers were obtained from Amazon Mechanical Turk (Nguimkeng, 2021). In total, 1,070 answers were collected, of which 672 were usable (after manually eliminating nonsense answers or those given in another language). According to their self-report, the participants came from many different countries, predominantly from France and the USA, followed by Brazil, England, India and Italy. The new English data, as well as the Spanish data, was collected via Amazon Mechanical Turk only, with the majority of participants coming from the US and (for Spanish) from Venezuela. To determine payment for crowd-workers, we tested how long it takes on average to read the instructions and answer the questions and calculated per-item payments such that crowd-workers would earn the same as a student assistant at a university. Those participants in the French data collection not recruited via AMT were unpaid volunteers.

For the Chinese data, Ding et al. (2020) collected 314 answers per prompt from high school students in grades 9-12, a population comparable in age and educational background to that of the original ASAP-SAS dataset. The answers were manually transcribed from handwriting into digital form. The German dataset by Horbach et al. (2018) consists of 301 crowdsourced answers per prompt.

Table 2 shows key statistics for the different data sets. Apart from the original ASAP data set, all other data sets come with comparable amounts of answers per prompt.

Table 2 Key statistics for the datasets used in our study

Dataset Annotation

For the German, French, Spanish and Chinese datasets, two native speakers of the respective language scored the answers based on the original scoring guidelines. The newly collected English dataset was scored by two native speakers of German with advanced English language skills. As defined in the original scoring guidelines, prompts 1 and 2 were rated on a scale from 0 to 3 points and prompt 10 on a scale from 0 to 2 points. All annotators were first trained on a set of answers from the original English ASAP-SAS dataset before they scored the new data. The inter-annotator agreement measured in quadratically weighted kappa is shown in Table 2.Footnote 6 The final gold standard was created by having the annotators discuss and agree on a final label. Note that for the original English data, IAA only refers to the training set consisting of about 75% of the answers per prompt. IAA results for the other datasets are lower than for the original ASAP data; however, Higgins et al. (2014) have already questioned whether these extremely high reported agreement values are really the result of completely independent ratings, and Shermis (2014) described kappa values between .62 and .85 as “a typical range for human rater performance in statewide high-stakes testing programs” (p. 58).

Table 3 Label distribution for the original ASAP dataset and the different language versions

Dataset Analysis

We further analyze properties of our datasets, assuming that any differences between datasets can be due either to the difference in language or to the difference in learner population. By also including a crowdsourced version of the English data, we aim to decouple these two factors. We investigate label distribution, answer length and linguistic diversity operationalized by type-token ratio.

Fig. 3 Distribution of answer length in characters for all answers with a certain score (left part) and for all answers of a certain prompt (right part). We show results on the original datasets (upper part) and their English translations (lower part)

Label Distribution

We evaluate the relative frequency of each label per prompt and dataset, comparing the original ASAP dataset and the additional language versions (see Table 3). All crowd-sourced datasets have a higher proportion of low-scoring answers than the original ASAP data. We attribute this to the lower intrinsic motivation of crowd-workers compared to the high school students who produced the original ASAP data, and to the fact that the original tasks were probably aligned with the students' science classes, even though we tried to select prompts that were not particularly curriculum-dependent. These high proportions of low-scoring answers also lead to more imbalanced datasets, which risk being harder to score automatically. The prompts also appear to differ in difficulty across populations: prompt 10 seems to be particularly difficult for Chinese respondents, while prompt 1 has high proportions of low-scoring answers among the crowdsourced English and German populations.

Answer Length

We measure answer length in characters and report for every dataset the distribution of lengths for all answers per score across all prompts (Fig. 3, left part) and for all answers per prompt across all scores (Fig. 3, right part).Footnote 7 We observe that across all datasets, better answers, i.e. answers with a higher score, tend to be longer than answers with a lower score. We also see differences between prompts, with prompt 2 eliciting the longest answers. Effects of the data collection context are visible in that English crowd-worker answers are shorter than answers from the original ASAP dataset. To factor out the presumed language influence in the length comparison (most pronounced for the Chinese dataset), we also compare a version of the data with all answers automatically translated to English (Fig. 3, lower part) using DeepL (as we do later in the crosslingual scoring experiments using machine translation). One can see that even when translated to English, the answers in the Chinese dataset tend to be shorter than those in the other datasets. Inspecting the data, we found that the Chinese answers often take the form of bullet points. Instead of writing, e.g., “After reading this procedure, I have noticed that I would need to know the pH of the vinegar” as in ASAP\(_{orig}\), answers like “(1) pH of the vinegar sample” are often found in the Chinese dataset. Across all datasets, we observe that crowdsourced answers tend to be shorter than the original data.

Linguistic Diversity

Variance in learner answers has been identified as one parameter influencing the scoreability of datasets (Horbach and Zesch, 2019). One example of such variance is linguistic diversity. Intuitively, a very repetitive data set is much easier to score than one with a broad range of different words and word combinations.

We measure linguistic diversity by type-token-ratio (TTR), i.e. the number of unique words of a text divided by the overall number of words. We use TTR of unlemmatized word-forms treating each dataset as an individual text. Type-token ratio is known to be influenced by text length, so that a comparison between TTR values is only possible if texts of the same length are compared. To overcome this problem, we use a method similar to the moving average TTR (Covington and McFall, 2010) and randomly draw subsamples of a fixed size from the datasets and report averages over the individual values.
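To make this concrete, the following minimal sketch illustrates a subsampled TTR of this kind. It assumes whitespace-tokenized, lowercased answers and a sample size of 1,000 tokens; both are illustrative assumptions, not the exact configuration used in our experiments.

```python
import random


def subsampled_ttr(answers, sample_size=1000, n_samples=100, seed=42):
    """Average type-token ratio over random fixed-size token samples.

    `answers` is a list of learner answers (strings); the whole dataset is
    treated as one text, as described above. Whitespace tokenization and the
    default sample size are simplifying assumptions for this sketch.
    """
    rng = random.Random(seed)
    tokens = [tok.lower() for answer in answers for tok in answer.split()]
    ttrs = []
    for _ in range(n_samples):
        sample = rng.sample(tokens, sample_size)   # draw without replacement
        ttrs.append(len(set(sample)) / sample_size)
    return sum(ttrs) / len(ttrs)

# Example usage (toy data, tiny sample size just for illustration):
# print(subsampled_ttr(["the vinegar sample", "measure the mass"], sample_size=3))
```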

For Chinese, TTR can be calculated either on the token level or on the character level (Cui et al., 2022). Here, we rely on a character-based representation because, in our content scoring scenario, tokenization accuracy is lower, since most of the answers are not complete sentences. From the results in Fig. 4 we see that most datasets fall into the same TTR range after translation to English, with the exception of Chinese, which displays a lower TTR than the other datasets, indicating slightly less lexical variance in the Chinese data.

Fig. 4 Type-token-ratio for the six datasets computed for answers from a certain prompt (a) on the original version of each dataset and (b) on the version translated to English

Experimental Studies

In the following, we first describe our experimental setup, followed by three experimental studies. In Study 1, we establish baselines for each dataset by cross-validating on every dataset separately, either in its original version or translated into the other languages. In Study 2, we score crosslingually by translating training or test data into the respective other language or by using an inherently multilingual transformer model (either with or without additional translating datasets). In Study 3, we explore the effects of pretraining a model on a larger amount of English data and then fine-tuning on a smaller amount of data in the target language.

Experimental Setup

Classifiers and Features

For the shallow learning baseline we use a logistic regression classifier provided through sklearn (Pedregosa et al., 2011)Footnote 8. For all European languages, we use word uni- to trigrams and character bi- to 5-grams as features. In the case of Chinese, tokenization of words can be a non-trivial problem as the Chinese language does not separate words by whitespaces. Therefore, we follow Ding et al. (2020) in using only character uni- to 7-grams.
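A minimal sketch of such a shallow baseline, assuming answers are given as plain strings and using sklearn's built-in vectorizers (the exact preprocessing in our experiments may differ), could look as follows:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Word uni- to trigrams plus character bi- to 5-grams, fed into logistic
# regression. For Chinese, one would instead use a single character-level
# vectorizer with ngram_range=(1, 7), following Ding et al. (2020).
features = FeatureUnion([
    ("word_ngrams", CountVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 5))),
])
clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# train_answers: list of answer strings; train_scores: list of integer scores
# clf.fit(train_answers, train_scores)
# predicted = clf.predict(test_answers)
```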

For deep learning, we use a standard transformer architecture with a pretrained multilingual BERT model (bert-base-multilingual-uncased) with a sequence classification head, as shown in Fig. 5, and, for comparison, also a monolingual English BERT model (bert-base-uncased), as monolingual models are known to exhibit stronger performance (Artetxe et al., 2023). We fine-tune for six epochs using the AdamW optimizer and cross-entropy loss.

Fig. 5 Our neural architecture, using a multilingual BERT model with a sequence classification head

Note that our main interest lies in comparing different crosslingual scoring models, not in finding an optimal hyperparameter setup. We therefore hold hyperparameters fixed and also choose a fixed number of epochs instead of using validation data to select the best epoch.
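The following is a minimal sketch of this fine-tuning setup using the Hugging Face transformers library; batch size, learning rate and the data handling are illustrative assumptions rather than a report of our exact configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-uncased"
NUM_LABELS = 4  # scores 0-3 for prompts 1 and 2; prompt 10 uses three labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)


def fine_tune(model, answers, scores, epochs=6, lr=2e-5, batch_size=16):
    """Fine-tune the model with a sequence classification head on scored answers."""
    enc = tokenizer(answers, truncation=True, padding=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(scores))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)  # cross-entropy loss computed internally
            out.loss.backward()
            optimizer.step()
    return model

# fine_tune(model, train_answers, train_scores)
```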

Datasplit and Evaluation Metric

For the ASAP\(_{orig}\) dataset, we use the datasplit provided in the original data. Due to the small size of the other datasets, we use 10-fold cross-validation for these datasets. We always train separate models for each of the three prompts. Again, we do not use a separate validation dataset for deep learning.

We evaluate our experiments using quadratically weighted kappa (QWK) (Cohen, 1960).
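QWK can be computed directly with sklearn; a small sketch with toy labels:

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: gold scores vs. predicted scores for a prompt rated 0-3.
gold = [0, 1, 2, 3, 2, 1]
pred = [0, 1, 2, 2, 2, 0]
qwk = cohen_kappa_score(gold, pred, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```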

Table 4 Performance in QWK for monolingual baseline experiments

Experimental Study 1: Within-Dataset Baselines

In our first experimental study, we establish baseline scoring performance per dataset, i.e. we train and test on each dataset individually as described in the previous section.

Scoring each Dataset in its Original Language

Table 4 shows the performance per language and prompt for both the shallow learning and the deep learning condition. The original ASAP dataset (ASAP\(_{orig}\)) reaches the highest scoring performance, which is not surprising given that this dataset contains a much larger amount of training data. Therefore, we also down-sampled the dataset to 300 instances per prompt, roughly matching the training sizes of the other datasets (ASAP\(_{orig\_300}\)). In this scenario, results are more on par with the other datasets, suggesting that the other datasets might also benefit from more training data than is available in the monolingual setup, a scenario we test later in Study 3 when combining different datasets as training data.

We further observe that for most datasets, prompt 2 is the hardest of the three prompts to score (with the exception of ASAP\(_{orig}\) and ASAP\(_{fr}\)), while prompt 1 is, on average and across both shallow and neural approaches, the easiest. This is reflected in substantially lower inter-annotator agreement for prompts 2 and 10 compared to prompt 1 in all but these two datasets, hinting that most annotators found these two prompts generally harder to score accurately. We also see that in most cases the BERT model outperforms logistic regression (the highest performance per dataset averaged across the three prompts is marked in bold in the table). Performance on the Chinese data is slightly worse than for the other datasets despite high inter-annotator agreement. The values that we find in our experiments are, however, comparable to those reported by Ding et al. (2020) for Chinese and Horbach et al. (2018) for English and German.

Comparison between Monolingual and Multilingual BERT Models

We wanted to keep the pre-trained transformer model fixed across experiments and therefore chose to use the same multilingual BERT model throughout. However, monolingual models are known to exhibit higher performance. Thus, in order to assess the potential performance drop from using an inherently multilingual pre-trained model in cases where a monolingual model could be used, we compare the two alternatives on the English data.

Table 5 Performance in QWK for monolingual baseline experiments comparing a monolingual transformer model (left) to a multilingual transformer model (right) for English

Table 5 repeats the monolingual baseline results from the previous experiment, which used a multilingual transformer model, for the three English datasets (right) and contrasts them with the same experiments using a monolingual English transformer model (left). Results are mixed, but we see a tendency for the monolingual BERT model to have a slight advantage over the multilingual model (.671 for the monolingual versus .660 for the multilingual model). As the improvements are rather small, and in order to use the same model in all experiments, we use the M-BERT model from here on.

Scoring Translated Datasets

As a second monolingual baseline, we use DeepLFootnote 9 to translate both the test and the training data into the same other language, using each of the other languages represented in our datasets as the target language. That is, we still score in a monolingual setting, but both training and test data have been machine-translated.
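For illustration, translating a set of answers with the official deepl Python client might look roughly as follows; the authentication key and the surrounding data handling are placeholders, and we make no claim that this mirrors our exact translation pipeline.

```python
import deepl

# Hypothetical authentication key; DeepL requires a valid API key.
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")


def translate_answers(answers, target_lang):
    """Translate a list of learner answers into the given target language,
    e.g. "DE", "FR", "ES", "ZH" or "EN-US"."""
    results = translator.translate_text(answers, target_lang=target_lang)
    return [r.text for r in results]

# german_answers = translate_answers(english_answers, target_lang="DE")
```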

Table 6 Performance in QWK for monolingual baseline experiments using the logistic regression model (left) and the neural model (right)

Results are shown in Table 6, with the values highlighted in blue on the main diagonal representing the untranslated baseline condition. In most cases, results produced on translated datasets are in the same ballpark as those on the original data, sometimes even slightly higher, especially with Chinese as the original language. This points to a normalizing effect that machine translation can have, for example on spelling errors. It can also mean that a multilingual model does not perform equally well for all language pairs (Pires et al., 2019).

We manually checked a number of misspelled sentences and indeed found that spelling variants of a word, such as vinagar instead of vinegar or expirement instead of experiment, were often translated correctly. Only occasionally was such a misspelled word not translated at all, as if it were a named entity. We also encountered examples of non-word spelling errors being translated to a different word in the target language than the correct form, such as English vinager being translated to German Wein (wine). These probes support the idea of normalization through translation: the translated version can provide more informative embeddings or n-grams than the original.

The case is somewhat different for Chinese, as potential misspellings in Chinese were already removed by the digitization process (i.e., when typing the handwritten answers, a non-existing character had to be entered as an existing one), so that non-word errors in Chinese are rare and real-word errors often led to problematic translations. For example, some Chinese answers contain a compound that combines the two real words for rinse and wash in an unusual order, where the intended compound was presumably the word for wash up. This misspelled compound was translated as shabu-shabu, because the machine translation system associates the rinse character with shabu-shabu, a Japanese hotpot.

Looking at the average over all datasets translated into the same language (the last line of Table 6), we see that the values are quite similar per machine learning method, indicating that both logistic regression and M-BERT are able to handle data in different languages and that the differences observed in Table 4 can mainly be attributed to the dataset, not the language.

Crosslingual Scoring within ASAP\(_{orig}\)

Our ultimate goal is to score data from one dataset crosslingually with a model trained on a different dataset in a different language. In the previous step, we have established monolingual baseline performance per dataset and have also established that there is hardly any loss from applying machine translation to a dataset. Next, we investigate the influence of applying a multilingual transformer model to training and test data in different languages but from the same dataset, in order to separate dataset effects from language effects. Therefore, as an additional third baseline, we translate either the training data (MT-train) or the test data (MT-test) from ASAP\(_{orig}\) into each of the other four languages and then train a model on one language and apply it to another language (where one of those languages is always English).

Table 7 Performance in QWK for experiments on ASAP\(_{orig}\) where either the training or the test data has been translated to a different target language

From the results in Table 7 we see that there is a performance loss of up to 12 percentage points for German, Spanish and French compared to the English baseline, where both training and test data are the original untranslated ASAP dataset. For Chinese, performance drops by over 50%. We expect these values to be an upper bound for the performance we can expect in true crosslingual setups without machine translation (at least when one language is English), where the effect of different datasets adds to that of different languages.

Experimental Study 2: Crosslingual Scoring using Machine Translation and Pretrained Multilingual Models

The monolingual baselines established in Study 1 can serve as an upper bound for the performance we hope to reach using crosslingual methods. We compare them against two ways of crossing the language barrier. One is machine translation, translating either the training or the test data into the respective other language. As a second method, similar to the previous study, we use the M-BERT model in a zero-shot fashion, so that neither test nor training data is translated and both are instead represented in a shared embedding space.

Table 8 presents these crosslingual results grouped by the test data in different languages. In every block, we train on the different available training datasets (one training dataset per line) and apply the model to the same test data, making results directly comparable. We compare the different methods per column: we present results for logistic regression with translated training data (LR MT-train) or test data (LR MT-test) and the same for the neural model (M-BERT MT-train and M-BERT MT-test). Finally, M-BERT zero-shot relies solely on multilingual representations without any machine translation.

When comparing the three methods in Table 8, logistic regression with MT, M-BERT with MT and M-BERT in a zero-shot setting, we see that the M-BERT setting almost always outperforms logistic regression and that machine translation is beneficial (comparing M-BERT MT with zero-shot). The best crosslingual method per block (highlighted in bold print) is often M-BERT with MT, with no clear tendency as to whether translating training or test data is more beneficial. For all languages, the best setup among those with 300 training instances, i.e. excluding ASAP\(_{orig}\), remains below the monolingual baseline, with the smallest gap between monolingual baseline and best crosslingual setup for French (.726 monolingual vs. .698 crosslingual when trained on Spanish data) and the largest gap for Chinese (.550 vs. .457). The zero-shot setup also sometimes yields surprisingly good results, for example for the transfer from German or Spanish to French. On the basis of the comparison between a monolingual and a multilingual transformer model in Experimental Study 1, we assume that replacing the multilingual BERT model with monolingual versions would benefit both the monolingual baseline and the MT\(_{train}\) and MT\(_{test}\) variants, as both could then be scored with a monolingual model.

Overall, we observe that the transfer often works well for closely related languages, such as French and Spanish, while for Chinese, which in our setup is phylogenetically farthest from the other languages, the transfer does not work nearly as well.

Table 8 Performance in QWK for crosslingual experiments either using Logistic Regression (LR) or a multilingual BERT model (M-BERT) either with machine translation (MT) or relying only on the multilingual transformer (zero-shot)

Experimental Study 3: Pretraining on Larger Amounts of Data in another Language

So far, we have examined domain transfer scenarios where only data in another language, but no target-language data, is available for training, and compared them to a within-domain approach with (only) data in the target language. One can, however, imagine a real-life scenario where larger amounts of data are available in a majority language (such as the original English ASAP dataset) and only limited amounts of data in a new application language. To simulate such a use case, we further pre-train the M-BERT model on the original English ASAP dataset and then fine-tune it on a specific language using cross-validation (similar to the monolingual experiments in Experimental Study 1) in order to see whether we benefit from English pretraining. That is, each model in Study 3 has been trained on about 2,000 English answers and about 270 target-language answers.

We compare several conditions. In the pretrain zero-shot condition, we pretrain on the original (untranslated) English ASAP dataset before fine-tuning on the (untranslated) target language dataset. In pretrain MT-train we translate the original ASAP data into the respective target language before pretraining. In pretrain MT-test, we translate the in-domain data (for both fine-tuning and testing, i.e. the whole cross-validation procedure) from the target language into English.
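Conceptually, this two-step procedure simply chains two fine-tuning runs of the same model, as in the sketch below; `fine_tune` refers to a training routine like the one sketched in the Experimental Setup section above, and the dataset variables are placeholders.

```python
# Two-step training sketch (reusing the fine_tune routine sketched earlier):
# 1) further pre-train on the large English ASAP_orig training data,
# 2) then fine-tune on the ~270 target-language answers of the current
#    cross-validation fold.
# In the pretrain MT-train condition, asap_orig_answers would first be
# machine-translated into the target language; in pretrain MT-test, the
# target-language data would instead be translated to English.

model = fine_tune(model, asap_orig_answers, asap_orig_scores, epochs=6)
model = fine_tune(model, target_train_answers, target_train_scores, epochs=6)
# Predictions are then obtained on the held-out fold of the target-language data.
```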

For comparison, we report three baseline values: monolingual CV without pretraining, i.e. the results from Study 1, as well as model performance when training only on ASAP\(_{orig}\) without further fine-tuning on the in-domain data, in two versions: one where the original ASAP data has been translated into the respective language and one where the test data has been translated to English, i.e. the results from Study 2. (We do not show the zero-shot condition from Study 2 as another baseline, as it was clearly outperformed by the two variants using machine translation.) As our method in this study is essentially always a combination of two baseline setups, we expect results to be above the maximum of these baselines.

Table 9 shows results for different test data (per line) and different experimental conditions (per column): the three baselines on the left, followed by the three options for combining both training data sources in the right part of the table.

We see that in all cases results outperform all three baselines. The improvement over the monolingual baseline indicates that we indeed benefit from more training data than the approximately 300 training instances in each dataset, even if it is not in the target language. The improvement over the ASAP\(_{orig}\) MT baselines shows the importance of in-domain training data, given that a modest increase in training data size (adding less than 300 instances to an existing dataset of 2000 answers) results in a relatively large performance gain.

When comparing the three experimental conditions, there is no clear winner, and the zero-shot condition performs surprisingly well. We take this as an indicator that fine-tuning on the target language can compensate for pre-training in a non-matching language. (Note that the three identical results in the first line for ASAP\(_{en}\) result from source and target language both being English, so translating any dataset has no effect.) Overall, for Spanish and French the results in this study are in the same region as the monolingual scoring performance on the full original English ASAP dataset (with much less additional annotation effort in the new target language), and they are at least substantially improved over the baselines for crowd-sourced English, German and Chinese.

Table 9 Performance in QWK for an M-BERT model pretrained on the original ASAP data and fine-tuned on each of the other datasets (right part), where either training or test data is translated or neither (zero-shot)

Further Analyses

Our experiments showed that crosslingual scoring with training and test data from different datasets yields a substantial performance loss compared to monolingual scoring where only one dataset is used, even if the data has been translated into a different language. This gap only becomes smaller when genuine target-language data is used in addition to the data from a different dataset. We therefore conclude that part of the drop in performance is due to the nature of the different datasets, i.e. that different user populations think, or at least write, differently about the same prompt. In the following, we provide additional analyses intended to shed light on these differences. We do this in two ways: by comparing the semantic similarity between answers across datasets and by examining high-frequency lexical material per dataset.

Similarity between Datasets

As a first analysis step, we measure the similarity between datasets. We operationalize this by measuring pairwise similarity between answers for the same prompt.

Table 10 Distribution of maximum similarity between an answer from the dataset specified in the row label and any answer from the dataset specified in the column label

We are first and foremost interested in semantic similarity, not similarity on the surface. We measure semantic similarity by encoding every answer with a multilingual SBERT model (distiluse-base-multilingual-cased-v1) and calculate its cosine similarity with every other answer from either the same or a different dataset (only within the same prompt). For each answer in dataset A, we record the maximum similarity to any answer in dataset B, with A and B being either different or the same dataset. The idea is that if different user populations answer a prompt in conceptually different ways, we would see a low maximum answer similarity between these datasets. Likewise, we expect the answer similarities to be higher within a dataset. The reason why we look at maximum similarity only is that we want to see if a scoring model would have had the chance to see a similar answer in the training data or not.
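A minimal sketch of this similarity computation with the sentence-transformers library; the variable names for the two answer lists are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")


def max_similarities(answers_a, answers_b):
    """For each answer in dataset A, return the maximum cosine similarity
    to any answer in dataset B (answers for the same prompt)."""
    emb_a = model.encode(answers_a, convert_to_tensor=True)
    emb_b = model.encode(answers_b, convert_to_tensor=True)
    sims = util.cos_sim(emb_a, emb_b)          # shape: (len(A), len(B))
    # For within-dataset comparisons, the answer itself would additionally
    # be excluded (not shown in this sketch).
    return sims.max(dim=1).values.tolist()

# max_sims = max_similarities(german_answers_prompt1, chinese_answers_prompt1)
```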

Table 10 presents the results. Every histogram shows the distribution of the maximum similarity of every answer in the dataset specified in the row label with the answers in the dataset specified in the column label (measured within the same prompt). The diagonal from the upper left to the lower right corner contains the results for similarities within one dataset. We see that ASAP\(_{orig\_300}\) is rather homogeneous in that most answers have a matching answer with a very high similarity, i.e. the histogram has a clear peak towards the right. Also when compared to answers in the other datasets (see first row), ASAP\(_{orig\_300}\) mostly has matching answers with high similarities, with the Chinese dataset behaving a bit differently. For the Chinese dataset itself (see last row), we see that the answers are rather similar to each other within the dataset but much less so when compared to answers from any of the other datasets. This indicates either that the Chinese data is indeed rather dissimilar to the other datasets, which is in line with the observation that Chinese was hardest to score in a crosslingual setting, or that the multilingual SBERT model is not able to capture semantic similarity across languages for Chinese. Most of the other histograms for same-dataset vs. different-dataset comparisons look rather similar to each other, indicating that the conceptual differences between datasets are not more pronounced than the conceptual differences found within a dataset.

Fig. 6 Top 10 word-level trigrams in different datasets (character-level trigrams for Chinese). (The x-axes are not on the same scale because of the different size of data sets.)

Most Frequent Terms and N-Grams

To further investigate the idea that different learner or crowd-worker populations simply talk about different concepts when answering the same questions, we inspect the most frequent n-grams per dataset and prompt as an easy way to gain insight into which concepts are frequently mentioned.

We focus on content words and thus exclude n-grams that consist only of function words, such as ‘and for the’. As for some of the previous analyses, we do this in two variants: we first analyze each dataset in its original language, and then re-run the analyses on all datasets translated into English. In doing so, we want to make the analyses more accessible to readers unfamiliar with those languages, but also to see potentially detrimental effects of machine translation when a term is consistently translated into an English term that does not occur in the English dataset.
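For illustration, the most frequent content-bearing trigrams per dataset could be extracted roughly as follows; the small function-word list is a placeholder for the actual stopword handling.

```python
from collections import Counter

# Placeholder function-word list; the actual analysis uses a fuller list.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "to", "for",
                  "in", "is", "are", "i", "would", "you"}


def top_trigrams(answers, n=10):
    """Count word-level trigrams and drop those consisting only of function words."""
    counts = Counter()
    for answer in answers:
        tokens = answer.lower().split()
        for i in range(len(tokens) - 2):
            trigram = tuple(tokens[i:i + 3])
            if not all(tok in FUNCTION_WORDS for tok in trigram):
                counts[trigram] += 1
    return counts.most_common(n)

# print(top_trigrams(english_answers_prompt1))
```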

Figures 6 and 7 show, as an example, the most frequent trigrams for prompt 1 in all languages. We see that terms like vinegar, sample or experiment appear frequently in all datasets. Many answers in German refer to the factor of time, since terms corresponding to at the beginning and at the end are listed in the top 10 only for this dataset. For answers in German and Chinese, unlike the other languages, phrases that embed the answer in a whole sentence by repeating material from the question, such as need to know or I would need, are rare. This confirms our earlier observation that answers in German and Chinese are often given as bullet points. It is interesting to see that the terms PH value and acid rain appear in the top 10 list only for Chinese, whereas answers in other languages rather talk about the amount of vinegar; it remains an open question whether such variation reflects cultural differences or different approaches to a reading comprehension task, as the term acid rain appeared only in the title of the prompt but not in the task itself. This supports our assumption that the Chinese data is indeed conceptually different from the other datasets.

Fig. 7 Top 10 word-level 3-grams in different datasets translated to English

Conclusion and Future Work

In this paper, we have presented new datasets in three languages addressing the same content scoring prompts as a starting point for exploring crosslingual content scoring. We used machine translation, multilingual transformer models, and a combination of the two to assess the feasibility of using training data in one language to score test data in another. In concordance with earlier findings, Study 1 showed that machine translation by itself does not introduce noise that makes automated scoring harder. Study 2 revealed that truly crosslingual experiments with training data in a source language and test data in a different target language unsurprisingly do not work as well as a monolingual baseline, and that the extent of the performance decrease depends on the language pair considered. Closely related languages (such as the Spanish-French pair) are often easier than pairs that are further apart (such as Chinese paired with any other language in our study). When comparing different methods, a semantic representation works better than one relying on surface information such as n-grams. The best performance was achieved when using a multilingual transformer model but still translating either the training or the test data into the respective other language. In Study 3, we explored a scenario where a transformer model was trained sequentially on larger amounts of English data and smaller amounts of in-domain target-language data. This setup yielded overall the best results, coming close to the performance of large in-domain setups in English.

We postulated that differences in behaviour between datasets can have two sources: language differences and differences between learner populations. One aspect of such population effects is the difference between real learners, such as high school students, and paid crowd-workers. In addition, other aspects we could not address here might play a role, such as the cultural background of a learner population. To understand these influences better, more research, ideally under more controlled conditions, would be necessary.

Ethical Considerations

The benefits and risks of automated scoring in general have been extensively debated in the literature (see, for example, Loukina et al. (2019)). One important argument in favor of automated scoring is that normally all learner answers are scored by the same algorithm. If we use crosslingual scoring where one scoring model is trained per language, learners will be scored by different models depending on their choice of language and thus be subjected to different model biases. A more advisable scenario would therefore be to automatically translate all learner answers into the language in which a single model has been trained, but again with the caveat that machine translation quality differs across source languages, and learners writing in a less-resourced minority language might be discriminated against by the automated scoring process. Similarly, multilingual transformer models work differently well for different languages and language combinations. We would therefore consider crosslingual scoring at the moment unsuitable for high-stakes assessment.