Introduction

Recently, there has been a growing interest in the development of digital technologies that offer adaptivity and personalization as a way of supporting and enhancing language learning (Kerr 2015). Among these technologies are intelligent language tutoring systems (ILTSs), which use artificial intelligence to capture and analyze learner data and make appropriate adjustments to the instructional process (Shute and Zapata-Rivera 2012; Slavuj et al. 2016). The language proficiency of the individual learner, including type and level of knowledge, skills and misconceptions, typically represents a central source of adaptation. Natural language processing techniques, for instance, make it increasingly possible to perform fine-grained error analysis of learner input and to deliver error-specific feedback and remedial activities (e.g. E-Tutor (Heift 2016) and Tagarela (Amaral and Meurers 2011)). The intelligent selection, sequencing and mode of presentation of learning content may also be based on long-term performance and can be influenced by further factors such as the learner’s age, linguistic background, goals and styles, affective states, disabilities or indeed the learning context itself (Brusilovsky and Millán 2007; Slavuj et al. 2016). This paper focuses on a form of adaptive sequencing sensitive to the developing language ability of the individual learner and achieved by dynamically matching the difficulty of new or remedial content to the learner’s current level of ability (Wauters et al. 2010). This process is henceforth referred to as dynamic difficulty adaptation (DDA) and the work reported here is primarily concerned with the development of learning materials, and especially exercise items, which lend themselves to DDA.

This paper presents a pilot study conducted in preparation for an ILTS for practicing the English tenses – a grammatical area notoriously challenging for learners – which can be used as a complement to classroom instruction, as well as for collecting data on the development of L2 grammatical ability and for researching the effectiveness of adaptive tutoring, including DDA. The pilot study has two main objectives regarding the development of the exercise item pool the ILTS will operate with. The first objective involves calibrating the difficulty of an initial tense exercise pool consisting of cued gap-filling items (CGFIs) using item response theory (IRT). Difficulty calibration is not only crucial for DDA but can also help assess the appropriateness of the initial item pool for the target population (currently 9th and 10th grade learners in Germany) and identify where item pool extensions are necessary.

Motivated by related research in psychological and educational measurement (e.g. Gorin and Embretson 2006; Embretson 1983; Hartig et al. 2012), the second and main study objective is to assess whether – and how well – various observable CGFI features can be used to predict item difficulty. This approach has two potential advantages in the context of ILTSs. First, reliable item difficulty models could replace prior calibration based on expensive pilot testing or intuitive ratings, which is commonplace in current ILTSs, and instead aid in the automatic or semi-automatic scoring and generation of unlimited new items with desirable linguistic and psychometric properties (cf. Embretson 1998; Gierl and Haladyna 2012). The second advantage is that such models can provide valuable insights regarding the relative difficulty contribution of learning targets and various other item features. In the future, such data-driven insights can inform the overall structuring of learning content and adaptive sequencing and help predict individual learners’ difficulties. More generally, the methodology and findings of this study will also be of interest to test and exercise developers, as well as to researchers of second language acquisition (SLA).

Research Context

Current Approaches to DDA

To date, DDA has been applied predominantly in ILTSs targeting vocabulary and reading skills (e.g. Dynamic e-book guidance system (Sung and Wu 2017), MEL-Enhanced (Sandberg et al. 2014), PIMS (Chen and Hsu 2008), REAP (Heilman et al. 2010), U-Reading (Wu et al. 2011), Chen and Chung 2008). In contrast, DDA in the area of grammar appears to have been implemented only in two ILTSs, Moeyaert et al.’s (2016) system and English ABLE (Zapata-Rivera et al. 2007), both of which focus on formal accuracy only. The former offers exercises targeting a single learning dimension (French verb conjugation), while the latter treats several grammatical features such as subject-verb agreement or pronoun form, each with its own set of (error-correction) exercises.

DDA has its origins in computer adaptive testing (CAT), where test item difficulty is adapted to quickly and accurately assess test takers’ ability level (cf. Van der Linden and Glas 2010). Its purpose in ILTSs goes beyond just assessment and extends to the promotion of learning and motivation (Eggen 2012; Shute et al. 2007). Timms illustrates the reasoning behind this approach as follows:

[T]he difficulty of the problem has a large effect on how productive the interaction between the student and the learning materials will be. If the student finds the problem too easy, little learning will occur. In contrast, if the problem is too difficult […], they will learn nothing and may also become discouraged. (Timms 2007, p. 213)

Arguably, therefore, the optimal adaptive selection and sequencing of exercise items should ensure that learners are challenged yet capable of succeeding. This idea has been influential in learning theory for some time now, notably in Vygotsky’s (1978) zone of proximal development theory, flow theory (Csikszentmihalyi 1991/2008), self-determination theory (Deci and Ryan 1985) and Krashen’s (1985) input hypothesis. It has also found support in recent CAT studies which show that DDA can lead to higher achievement, test-relevant motivation and engagement, as well as to more positive subjective test experiences and lower anxiety levels than non-adaptive tests (Fritts and Marszalek 2010; Martin and Lazendic 2018; but see also Ling et al. 2017). There has, however, been very little research on the precise effects of DDA in digital learning environments and results have been mixed. The only controlled study in the area of language learning was conducted on an ILTS targeting a single dimension, French verb conjugation (Moeyaert et al. 2016). Using IRT to estimate item difficulty and learner ability, the study tested five DDA algorithms, each selecting exercise items with a specific success probability range (from 40–50% to 80–90%). Results indicate that DDA did not affect learning and motivation significantly in any of the conditions and, furthermore, did not differ from random sequencing, regardless of learners’ proficiency level. At the same time, a handful of studies investigating the impact of DDA on learning in intelligent tutors/instructional games outside language learning have reported more positive results (Camp et al. 2001 and Salden et al. 2004 on air traffic control training; Kalyuga and Sweller 2005 on algebra; Yuksel et al. 2016 on music instruction), although some studies point in the opposite direction. Orvis et al. (2008) on military training and Shute et al. (2007) on algebra, for instance, found no correlation between DDA and increased learning outcomes. In addition, there is some indication that there may not be a one-size-fits-all approach to DDA. Thus, Mitrovic and Martin’s (2004) study on SQL programming shows that the positive impact of DDA may be mediated by factors such as learners’ proficiency, with advanced learners benefitting the most (see also Orvis et al. 2008, and the CAT studies cited above). Given the lack of unanimity in previous studies and their different domain and task-type foci, it is also possible that DDA may be effective in some learning (sub-)domains or task types but not in others.

While the identification of optimal DDA algorithms – possibly tailored to different learner profiles and especially in ILTSs targeting multidimensional learning domains such as the English tense system – urgently requires more empirical research that would also be of relevance for existing learning theories, this paper concentrates on the more basic issue of implementing DDA in digital learning environments. As Wauters et al. (2012) note, a prerequisite for DDA is having learning materials with a known difficulty level. Yet, the measurement of difficulty in existing ILTSs has some limitations. In some systems, difficulty is evaluated by human raters (system designers, educators, learners) (cf. REAP), potentially introducing subjectivity and bias (Impara and Plake 1998; Wauters et al. 2012) and increasing costs. Other systems implement observable item attributes as predictors but with little or no empirical validation (MEL-Enhanced; PIMS; U-Reading; Chen and Chung 2008). When more objective methods (e.g. based on classical test theory or IRT) are used, items require calibration using large numbers of real learners prior to or during system use (Wauters et al. 2012; e.g. Dynamic e-book guidance system; English ABLE; MEL-Enhanced; Moeyaert et al. 2016). This may quickly become infeasible when a subject domain like the English tense system requires a large pool of exercise items.

In light of these challenges, we propose predictive difficulty modelling as a more objective and economical alternative for item pool generation and calibration in DDA-enabling ILTSs. We also argue that this method can inform curriculum design and adaptivity in our ILTS, including DDA at the level of learning targets, which seems to be lacking in most existing ILTSs. We turn to these and related issues next.

Measuring and Predicting Item Difficulty

As mentioned in the “Current Approaches to DDA” section, there are different methods for estimating item difficulty. In psychological and educational measurement, IRT has long been influential. Psychometric models within IRT assume that a person’s response to an item depends on qualities of both the person and the item (cf. Embretson and Reise 2000). The simplest model, the one-parameter Rasch model, describes the probability π of a correct answer (\(y_{pi}=1\)) by person p to item i as a logistic function of the difference between the person’s ability parameter \(\theta_p\) and the item difficulty parameter \(\beta_i\):

$$ \pi\left(y_{pi}=1 \mid \theta_p, \beta_i\right) = \frac{\exp\left(\theta_p - \beta_i\right)}{1+\exp\left(\theta_p - \beta_i\right)} $$
(1)

This makes it possible to map the ability and difficulty parameters on a common scale and estimate a probability of success for each item and person. For example, if a person with an estimated ability level of 1 logit receives an item with a difficulty of 1 logit, then according to the model, that person has a 50% chance of solving the item correctly. For an item with a difficulty that is 1 logit above the person’s ability, the probability of success falls to 27%, while an item with a difficulty 2 logits below the person’s ability increases the probability to 88%.
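
For illustration, the following minimal Python sketch implements Eq. 1 and reproduces the success probabilities just mentioned; the function name is our own and not part of any of the software packages used later in the paper.

```python
import math

def rasch_success_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model (Eq. 1)."""
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))

# Reproducing the examples above for a person with ability theta = 1 logit:
print(round(rasch_success_probability(1.0, 1.0), 2))   # 0.5  (difficulty equals ability)
print(round(rasch_success_probability(1.0, 2.0), 2))   # 0.27 (item 1 logit harder)
print(round(rasch_success_probability(1.0, -1.0), 2))  # 0.88 (item 2 logits easier)
```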

Typically, reliable item difficulty estimates are obtained by piloting items with a large sample of persons (recommendations vary between 50 and 1000; cf. Linacre 1994; Tsutakawa and Johnson 1990). These estimates, together with a choice of a desired probability of success (e.g. 50%, considered neither too easy nor too difficult), can help assess the ability of new persons and serve as the basis for computer-based DDA. While the identification of an optimal probability of success for specific learner groups remains an empirical question (see the preceding section), it should be pointed out that item calibration can also inform item pool development in ILTSs, an issue currently rarely discussed in ILTS literature, which has so far concentrated on technological issues (cf. Vajjala and Meurers 2012). Specifically, if the population sampled for calibration is also the ILTS’s target population, the appropriateness of the calibrated item pool can be determined, that is, whether the item pool contains enough items for the range of learner abilities found in that population. This paper addresses this process to some extent in the “Difficulty Calibration” section below, leaving a deeper investigation for future work (for a CAT example, see Reckase 2010).

In psychology and educational measurement, IRT has been instrumental in the study of construct representation and validity (Embretson 1983). It is assumed that if it is possible to formulate hypotheses identifying the cognitive constructs (knowledge, skills and other cognitive characteristics) involved in successful task performance and describe how they are represented by features of individual task items, these hypotheses can be tested empirically. If differences in item difficulties are indeed explained by item features, then empirical support is provided for assumptions about construct representation (e.g. Embretson 1998; Freedle and Kostin 1993; Gorin and Embretson 2006). This is usually done by regressing item difficulties on item features.

The present study also seeks to model the relationship between item features and item difficulty. As detailed in the “Candidate Predictors” section, we consider a range of potential predictors specific to CGFIs targeting the English tenses. However, our long-term goal is not to study construct representation but rather a) to inform curriculum design and adaptivity in our ILTS and b) to predict the difficulty of new exercise items.

Regarding the first goal, some of the features considered, such as the tense and voice prompted by an item, represent primary tutoring targets of our ILTS, while others relate to contexts of use (e.g. conditionals and reported speech) or the overall syntactic and vocabulary complexity of a CGFI (see the “Candidate Predictors” section for details). If, as we expect, the primary targets and context types are influential correlates of item difficulty, this may have implications for the structuring and adaptivity of the learning content. First, data-driven insights may be useful for informing general content ordering. For example, insofar as it also makes pedagogical sense, practice material can be organized according to the relative difficulty contribution of each tense, concomitant grammatical features (e.g. active/passive voice, (ir)regular morphology) and other exercise characteristics causing difficulty. Second, an IRT-based sequencing algorithm operating over this general content structure would enable the system to present exercise materials whose difficulty is adapted to learners’ ability – both within and across learning targets – or to anticipate where additional scaffolding is necessary (e.g. more or less detailed explanations and hints).
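
As a rough illustration of the kind of IRT-based sequencing just described, the sketch below selects, from a calibrated pool, the item whose estimated success probability for the current learner lies closest to a target value. The data structures, function names and the fixed target probability are illustrative assumptions, not a description of the implemented system; as noted above, the optimal target probability remains an empirical question.

```python
import math
from typing import List, Tuple

def success_probability(theta: float, beta: float) -> float:
    # Rasch model (Eq. 1)
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))

def select_next_item(theta: float,
                     item_pool: List[Tuple[str, float]],
                     target_p: float = 0.5) -> str:
    """Pick the item whose predicted success probability is closest to target_p.

    item_pool: list of (item_id, calibrated difficulty in logits).
    target_p:  desired success probability (here arbitrarily set to 50%).
    """
    return min(item_pool,
               key=lambda item: abs(success_probability(theta, item[1]) - target_p))[0]

# Hypothetical example: learner ability of 0.8 logits, three calibrated items.
pool = [("item_01", -0.5), ("item_02", 0.9), ("item_03", 2.3)]
print(select_next_item(0.8, pool))  # item_02 (closest to a 50% success probability)
```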

The second, and more immediate, goal is to build a model that can predict the difficulty of future CGFIs based on their features. A further step would be to train the system to identify relevant features in new exercise material and rate difficulty automatically. As a consequence, the system would also be able to generate or recommend exercise items with feature constellations specifically tailored to the current needs of the learner.

These goals, however, extend beyond this pilot study. Here, we focus on difficulty prediction alone and evaluate the generalizability of several different item difficulty models using statistical cross-validation. Details on the methods employed are presented in the “Data and Methods” section below. First, however, a description of the item pool and potential difficulty predictors is provided.

CGFIs Targeting the English Tenses

Exercise Format

This paper presents a model for predicting the difficulty of CGFIs targeting the English tenses. The term CGFI echoes Purpura’s (2004) cued gap-filling tasks, in which learners read a short text and fill in the gaps using cues, usually consisting of a single word, which must be transformed to fit the context. The CGFIs in this study are shorter (spanning two sentences on average) and contain a single gap. As the following examples show, each gap is followed by a bracketed cue, typically a single lexical verb in the infinitive but sometimes also a subject pronoun, an adverb and/or the negative particle not.

Figure a (example CGFIs with bracketed cues)

Though the primary focus of these items is on the form and meaning of the English tenses, the examples above show that a number of epiphenomenal features, including voice, polarity, person/number inflection, word order and irregular morphology, are also targeted (see the “Candidate Predictors” section).

CGFIs belong to a group of limited-production exercise formats that have been criticized for their repetitiveness, artificiality and unsuitability for the promotion of broader communicative skills. At the same time, however, even a cursory overview of standard textbooks, practice grammars and online platforms shows that they represent a common and valued exercise format. Probable reasons for this are their relatively straightforward design and utility in targeted grammar practice and assessment (Purpura 2004). From a pedagogical perspective, they can be particularly effective in focusing learners’ attention on specific linguistic forms and form-meaning relationships. This approach is seen as especially advantageous for the development of explicit, declarative knowledge, which plays an important role in early second language acquisition (cf. e.g. DeKeyser 2005; Ellis 2012; Norris and Ortega 2000; Schmidt 1995). Another distinguishing feature of CGFIs is that they target both receptive and productive skills. Compared to multiple-choice or error-recognition tasks, for instance, which require recognition or recall of grammatical form and meaning, CGFIs require not only the ability to reconstruct contextually implied grammatical meaning and retrieve the necessary linguistic form, but also to produce this form accurately (Purpura 2004, p. 127). Thus, successful exercise completion depends on the development of each of these abilities. Learner ability, however, is not the only determinant of success. Given that some CGFIs are easy for most learners, while others represent a challenge even for the strongest ones, it follows that variation in item difficulty must also be considered. We discuss CGFI features that may be implicated in this variability next.

Candidate Predictors

To our knowledge, there have been no previous attempts to identify linguistic features affecting or correlating with CGFI difficulty (or cued gap-filling tasks with multiple gaps). To address this problem, we considered relevant candidates in the SLA and psycholinguistic literature, as well as features known to affect the difficulty of related task types (see especially Beinborn 2016, Beinborn et al. 2014, and Svetashova 2015 on C- and X-tests, as well as multiple-choice cloze tests). To systematize the investigation, we distinguish three feature categories: gap-level, context and item-level features. These are discussed next and listed in Table 4 in the appendix.

Gap-Level Features

Gap-level features refer to linguistic properties of the gap solution. The most obvious and presumably strongest candidate predictor is tense. As Table 4 shows, not all English tenses were considered in this study. Due to the practical difficulty of collecting sufficient amounts of relevant learner data at this stage, both future tense forms (i.e. tenses with will/shall) and future meanings (e.g. the future meaning of the simple present) were excluded, as was the rare conditional perfect progressive (would have been V-ing). However, the semi-modals used to and was/were going to, which express past habituality and past intention/predictability respectively (Declerck et al. 2006), were included on a par with the tenses. Despite extensive SLA literature on tense and aspect in L2 English, it is difficult to anticipate the relative effect of these tenses and semi-modals on item difficulty. First, although it is well known that progressive and perfect tenses are generally acquired later than simple tenses (Bardovi-Harlig 2000) and are especially challenging for (German) learners, usage patterns encompass both under- and overuse (cf. e.g. Axelsson and Hahn 2001; Davydova 2011), with overuse entailing that CGFIs targeting other presumably more straightforward tenses (e.g. simple tenses) may be solved incorrectly due to perfect/progressive tense misconceptions. Second, since this study distinguishes between correct and incorrect solutions only to enable the implementation of the Rasch model (see the “Answer Coding” section), it is at this stage impossible to differentiate tense misuses from mere errors in form, a distinction usually made in SLA studies.

As stated in the “Exercise Format” section, the item pool also targeted a number of epiphenomenal grammatical features, including voice, polarity, subject-verb agreement, word order, adverb placement and (ir)regular lexical verb morphology. These features may be correlated with item difficulty since they require distinctive morphological and/or syntactic knowledge and skills. In line with Eckman (1977) and White (1989), we hypothesize that marked realizations (e.g. passive voice, negative polarity, marked person/number morphology, etc.) are more challenging than unmarked or non-realizations.

Because CGFI difficulty may be affected by the interaction between the tense/semi-modal and one or more of the epiphenomenal features described above, two gap-level measures, morpho-syntactic edit distance (MSED) and cue size, were adopted as simple proxies. MSED refers to the number of syntactic and morphological transformations necessary to arrive at a target form or construction. In the present study, the initial stimulus is the bracketed material after the gap, the target may or may not be grammatically composite and MSED in the data ranges from 0 to 7. To illustrate, ‘cue: go → solution: go’ requires no transformations, while ‘cue: still, consider → solution: is still being considered’ requires seven. Since the second target solution is morpho-syntactically more complex than the first, the chances of committing an error may be higher. Hence, we hypothesize that MSED may also predict CGFI difficulty. Cue size is a more rudimentary measure, referring to the number of words in the cue and reflecting increases in syntactic complexity. In the present data, it ranges from 1 to 3, including the lexical verb and, optionally, an adverb, the negative particle not and/or a pronoun.

Lastly, previous studies on the difficulty of related text completion tasks, such as C-tests and cloze tests, have hypothesized that successful gap-filling depends partly on word familiarity (e.g. Beinborn et al. 2014; Svetashova 2015). Since word familiarity cannot be measured directly, word frequency in language corpora is usually employed as a proxy and has been shown to positively correlate with task and/or gap-item difficulty. Unlike cloze and C-tests, CGFIs provide the lexical material needed to fill the gap explicitly and so frequency could be deemed unrelated to CGFI difficulty. Despite this, lexical verb frequency measures were included, since semantic familiarity arguably plays a role in the reconstruction of the propositional meaning expressed by the gapped sentence, including its temporal, aspectual and voice meanings. Frequency information comes from the spoken and written components of the British National Corpus (henceforth BNC-S and BNC-W; Hoffmann et al. 2008) and SUBTLEXus (Brysbaert and New 2009). Several related measures were employed (features 37–54 in Table 4). For the BNC, these include frequencies of the verb lemma and the specific verb form required, as well as the verb form percentage. The latter two measures were included to capture the likelihood of having encountered a morphological form. The SUBTLEXus measures are similar, except that lemma frequencies were replaced with type frequencies. The number/proportion of SUBTLEXus documents containing a type was also considered (Brysbaert and New 2009).
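
To make the frequency-based measures more concrete, the sketch below computes a lemma frequency, a verb-form frequency and one plausible operationalisation of the verb form percentage (the form’s share of the lemma’s occurrences) from a hypothetical frequency table. The counts and the dictionary layout are invented for illustration and do not reproduce the BNC or SUBTLEXus queries used in the study.

```python
# Hypothetical per-form frequency counts for the lemma CONSIDER (invented numbers).
form_counts = {"consider": 21000, "considers": 4200, "considered": 18500, "considering": 6100}

def lemma_frequency(counts: dict) -> int:
    """Lemma frequency = summed frequency of all word forms of the lemma."""
    return sum(counts.values())

def form_percentage(counts: dict, form: str) -> float:
    """Share (in %) of the lemma's occurrences realised by the required verb form."""
    return 100 * counts[form] / lemma_frequency(counts)

print(lemma_frequency(form_counts))                        # 49800
print(round(form_percentage(form_counts, "considered"), 1))  # 37.1
```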

Context Features

Several features were defined to examine whether the type of context in which a gap appears affects item difficulty. A first feature group refers to clause type and is categorized according to whether the gapped clause is a simple sentence, part of a compound sentence, superordinate and/or subordinate. The last two contain further grammatical and semantic subcategories:

  • superordinate clause: conditional consequent; head of a temporal clause; miscellaneous

  • subordinate clause: conditional antecedent; object of the verb wish; reported clause; relative clause; temporal clause; miscellaneous

These features were included under the heading of context based on the fact that different clause types (e.g. central conditional antecedents and temporal clauses vs. main clauses) tend to select for different tense/aspect combinations (cf. e.g. Haegeman 2006). We therefore hypothesized that clause type (and the nature of neighboring clauses, if any) can function as a kind of contextual cue to the solution of an item. The classification above is still fairly coarse, however, partly due to the small number of items in the current dataset. Future work should address this.

The second group refers to gap position within the CGFI (beginning, middle or end). Gap position has been used in studies on cloze and C-tests (e.g. Beinborn et al. 2014; Svetashova 2015) on the hypothesis that a gap’s difficulty increases with the number of preceding gaps. For the present CGFIs, the reverse could be true: the later the gap appears in the CGFI, the more contextual information will be available to solve it.

Finally, since the item pool contains multiple items representing dialogic exchanges, we included a binary dialogic/monologic feature to assess its effect.

Item-Level Features

Item-level features describe global syntactic and lexical characteristics that may affect CGFI readability and hence difficulty.

Syntactic measures include the number of a) sentences, b) clauses and c) dependent (finite) clauses within a CGFI, as well as the ratio between c) and b). Total CGFI length and mean sentence length in words were also calculated. These features have been found to be good predictors of text readability and C-test difficulty (cf. Beinborn 2016; Svetashova 2015; Vajjala and Meurers 2012).

Several features describe CGFI vocabulary. Psycholinguistic research shows that word frequency plays an important role in language comprehension, with high-frequency vocabulary enabling faster lexical access and therefore increasing readability (Brysbaert and New 2009). To test this hypothesis, we calculate for each CGFI the mean word type frequency and corpus range in SUBTLEXus. This is done separately for content words, function words and all words per CGFI (features 83–94). A related measure is McDonald and Shillcock’s (2001) contextual distinctiveness score, which represents the co-occurrence probability of a word with 500 highly frequent lemmas in the BNC. We include average CGFI scores over all words for this measure.

Age of acquisition (AoA) is another possible predictor that has been shown to explain variance in word recognition and reading (Weekes et al. 2006) and in SLA experiments (Izura et al. 2011). We calculate average AoA of CGFI vocabulary using Kuperman et al.’s (2012) database of informant ratings for 30,000 English words. Another measure associated with lexical processing behavior is Brysbaert et al.’s (2014) vocabulary concreteness, i.e. the degree to which a word refers to easily perceptible entities. We obtain scores for content, function and all words from Brysbaert et al.’s database containing ratings of 40,000 words (features 95–97).

Finally, we adopt two lexical features previously tested in readability studies (cf. Vajjala and Meurers 2012): lexical density (percentage of content words) and mean word length in characters. To examine the effect of word length variation, we also include the word length standard deviation.
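
These surface measures are straightforward to compute. The sketch below illustrates lexical density, mean word length and its standard deviation for a tokenised CGFI; the small function-word list is invented for the example, whereas a real implementation would rely on part-of-speech tagging (or tools such as TAALES, as described in the “Feature Extraction” section).

```python
import statistics

# Illustrative function-word list; a real implementation would use POS tags.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "in",
                  "on", "at", "by", "and", "but", "it", "he", "she", "they",
                  "still", "being"}

def lexical_density(tokens):
    """Percentage of content words (tokens not in the function-word list)."""
    content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    return 100 * len(content) / len(tokens)

def word_length_stats(tokens):
    """Mean word length in characters and its (sample) standard deviation."""
    lengths = [len(t) for t in tokens]
    return statistics.mean(lengths), statistics.stdev(lengths)

tokens = "The proposal is still being considered by the city council".split()
print(round(lexical_density(tokens), 1))                        # 40.0
print(tuple(round(x, 2) for x in word_length_stats(tokens)))    # (4.9, 2.69)
```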

Having introduced the three categories of features potentially associated with CGFI difficulty, we next provide details of the data and methods employed in this study, followed by the results.

Data and Methods

To estimate item difficulties in the CGFI pool and build a predictive model, a paper-and-pencil test was conducted with a sample of German 9th and 10th grade learners of English. In Germany, students begin learning English as a foreign language in the 3rd grade or earlier and are expected to reach proficiency levels A2 to B1 by the end of the 9th and 10th grades respectively (cf. e.g. Niedersächsisches Kultusministerium 2015a, b). At this stage, all tenses and semi-modals targeted by the initial item pool have been introduced and are reviewed and practiced extensively. The next subsections detail the item pool development, the administration and scoring of the test, the feature extraction procedure and the statistical methods employed.

Item Pool Development and Test Administration

To ensure item pool appropriateness for the target learner group, CGFIs were collected from current print and digital EFL materials for the 9th and 10th grades by the three major education publishers in Germany (Cornelsen, Diesterweg, Klett) and practice grammars for intermediate students. Item quality and possible solutions were evaluated by four native speaker experts and items were modified or discarded if they were topically or stylistically awkward or ambiguous in terms of the range of acceptable solutions (ignoring non-standard/marginal alternatives). In total, 330 items were selected for the test.

After obtaining all necessary permissions and informed consent, the test was administered to 787 9th and 10th graders in two preparatory high schools (Gymnasium) and two integrated comprehensive schools (Integrierte Gesamtschule) in Lower Saxony. After discarding empty and aborted tests, the number was reduced to 689. Table 1 shows a relatively equal split between school grades but not school forms: approximately 73% of the participants were preparatory high school students.

Table 1 Test participants according to school type and grade

Test instructions were provided prior to administration, including an example item and answer. Participants were not informed which specific tenses the test would cover but were told that the semi-modals going to and used to were admissible. Forty minutes were provided to complete the test.

Due to the impracticality of asking students to solve all 330 items, a matrix design consisting of 90 unique booklets containing a subset of 44 items in random order was used. Despite our efforts, 38 items were retrospectively found to permit more than one solution and had to be omitted from the analysis. Each of the remaining 292 items was seen by a mean of 91.68 students (SD = 3.86), with each student working on a mean of 38.86 items (SD = 2.42).

Answer Coding

Test data collection produced 26,772 data points and the coding process distinguished between ‘correct’ and ‘incorrect’ answers. Correct answers include a) complete matches (N = 7788) and b) correct answers with a spelling mistake unrelated to irregular morphology (e.g. *tought vs. thought or *finaly vs. finally; N = 94). Incorrect answers include a) morpho-syntactic inaccuracies (N = 17,865), b) unfilled gaps (N = 842) and c) illegible, struck-through or unserious responses (N = 183). Two authors and four student assistants participated in the coding. Interrater reliability was maintained via a coding manual and a workshop, joint discussions of uncertainties and frequent sample checks by one author.

Feature Extraction

The linguistic features tested in the present study were extracted as follows. Features 37–46 were extracted using BNCweb (Hoffmann et al. 2008). Features 47–54 were extracted from a SUBTLEXus wordlist available at https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus. Features 73–76 and 78–79 were extracted with WordSmith Tools 6 (Scott 2011). Features 77 and 83–99 were extracted using TAALES 2.2 (Kyle and Crossley 2015). All remaining features were coded and double-checked manually by two authors.

Statistical Analysis

Difficulty calibration was performed based on the Rasch model (Eq. 1) as implemented in the R package TAM. Four items were excluded from the analysis due to low informativeness (they were solved by all or none of the participants). The remaining 288 items were rescaled and their difficulties were subsequently used in the prediction analysis (see the “Difficulty Calibration” section for details).

To model CGFI difficulty, several ridge regression models with different CGFI feature sets were built using the scikit-learn library for Python. Ridge regression was chosen over other approaches to avoid overfitting and tackle multicollinearity. As a pre-processing step, continuous attributes were scaled to the interval [0,1], while categorical ones were binarized. This resulted in 99 individual features tested in the prediction experiments. No regression intercept was included so as to obtain difficulty estimates for all features. Nested cross-validation was used for hyper-parameter setting and prediction performance evaluation (five-fold cross-validation repeated 10 times). The following evaluation criteria were used: the average root mean squared error (RMSE) and the Pearson correlation coefficient r between predicted and observed difficulties. Finally, prediction intervals were calculated for the best performing model to estimate its precision on future data.
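
The following scikit-learn sketch mirrors the general setup described above (ridge regression without an intercept, nested cross-validation with five outer folds repeated ten times, RMSE and Pearson’s r as criteria). The file names, the candidate alpha grid and the random seed are our own assumptions, and the actual feature matrix is not shown.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_validate

# X: items x 99 pre-processed features (continuous features scaled to [0, 1],
#    categorical features binarized); y: Rasch difficulties from the calibration.
X = np.load("cgfi_features.npy")       # hypothetical file names
y = np.load("cgfi_difficulties.npy")

rmse = make_scorer(lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
                   greater_is_better=False)
pearson = make_scorer(lambda yt, yp: pearsonr(yt, yp)[0])

# Inner loop: tune the regularization strength; no intercept, as in the study.
inner = GridSearchCV(Ridge(fit_intercept=False),
                     param_grid={"alpha": [0.01, 0.1, 1, 10, 100]},
                     scoring=rmse, cv=5)

# Outer loop: five-fold cross-validation repeated 10 times.
outer = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = cross_validate(inner, X, y, cv=outer,
                        scoring={"rmse": rmse, "r": pearson})

print("mean RMSE:", -scores["test_rmse"].mean())   # scorer returns negated RMSE
print("mean Pearson r:", scores["test_r"].mean())
```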

Results

Difficulty Calibration

The results of the item pool calibration are displayed in Fig. 1. The overall reliability of the items in measuring tense-related grammatical ability was high (RelEAP = 0.88). However, eight items did not fit the model well (infit mean square > 1.33; cf. Wilson 2004) and were excluded from subsequent analysis.

Fig. 1 Wright map displaying the distribution of estimated test participants’ abilities (left) and item difficulties (right). Misfitting items (N = 8), excluded from subsequent analysis, are marked in dark grey

For the remaining 280 items, difficulties ranged from −2.56 to +5.78 logits. As visible from Fig. 1, the overlap between student abilities (M = 0.00, SD = 1.30) and item difficulties (M = 1.51, SD = 1.78) is not ideal. Overall, the latter significantly exceed the former (two-sample t = 14.645, df = 967, p < 0.0001). It is also noticeable that the item pool contains a number of items that were too difficult for most students, and very few items suitable for learners at the bottom of the scale. This suggests that the item pool tends to be overly demanding for German 9th and 10th graders and that the most pressing need for additional items is at the lower end of the scale. These findings, however, do not reveal what item content characteristics affect item difficulty and, hence, offer no guidelines for designing future item pool extensions. In order to begin bridging this gap, we next examine the ability of a number of CGFI features to predict item difficulty.

Difficulty Prediction

The prediction experiments are based on the CGFI difficulty estimates obtained from the calibration and the three feature groups described earlier: gap-level, context and item-level features. We compare the predictive performance of several feature combinations to a baseline model (a featureless model always predicting the overall mean difficulty) to identify the best approach.

Table 2 shows that feature models B–K all represent an improvement over the baseline. Model B, which contains only the 11 tenses and two semi-modals, checks how well these features can predict item difficulty by themselves, given that they represent core targets of the CGFIs. Indeed, this model clearly outperforms the baseline, with a considerably lower RMSE and a stronger correlation between predicted and observed item difficulties. Adding the remaining gap-level features improves the results somewhat (Model C). Still, this improvement is fairly small relative to the much larger number of features, suggesting that most of the additional features have low predictive power. The same holds for context and item-level features, which perform poorly both on their own and in combination (D, E and H), yielding RMSE values close to the baseline. However, the moderate correlation with observed difficulties means that they capture some variability in the data.

Table 2 Cross-validation results for eleven predictive models with different feature combinations.

To better understand the usefulness of the three feature categories, we next assess the performance of different feature combinations. The first, F, contains all 99 features. As shown in Table 2, the full model delivers considerably better results than C (an 18% decrease in RMSE and a 6% correlation increase), confirming that context and item-level features hold some promise. We also check whether a sparser model can be obtained. Models G–I each omit a different category. G, containing gap-level and context features, is the best of the three. Interestingly, it delivers results equivalent to those of the full model but with almost 30% fewer features. It also represents a considerable improvement over gap-level features alone. H and I are substantially worse, with H performing slightly better than gap-level features alone and I being the weakest by far. Taken together, these findings suggest that gap-level features have the strongest predictive power, followed by context and item-level features.

Removing entire blocks of features as described above, however, may obscure the predictive power of individual features. To examine this possibility, we perform recursive feature elimination of the least influential features. Table 2 reports two of the obtained models, J and K, with 56 and 36 features respectively. Model J delivers the best results, reducing the RMSE by 5% and increasing the Pearson correlation by 1% compared to F and G. Model K is the sparsest model within a 3% tolerance of Model J. It performs virtually as well as F and G but with far fewer features.

The results of Model K are displayed in Table 3, which shows that features from all three categories contribute to CGFI difficulty/ease. However, the gap-level category is clearly the most important, accounting for 72% of all features in the model. As expected, the type of tense/semi-modal required has a major impact (apart from the past progressive and past perfect). The effect of epiphenomenal grammatical features is confirmed too, though of these, subject-verb agreement appears to play no role whatsoever. Interestingly, the selection of MSED and cue size indicates that there are important interactions between (and within) tenses/semi-modals and epiphenomenal features. Finally, the frequency of the lexical verb form, lemma or type also appears influential, suggesting that our CGFIs require not only grammatical but also lexical knowledge.

Table 3 Features selected in Model K (36 features; RMSE = 0.77, r = 0.91), ranked according to their impact on item difficulty (+1 = highest positive impact; −1 = highest negative impact)

Further influence is exerted by clause type and especially certain subordinate clauses associated with special tense constraints (e.g. reported clauses, objects of wish). Gap position and dialogic/monologic context are deselected, although they are present in the larger Model J. Item-level features include syntactic embedding (number of dependent clauses), word length and word length variation, AoA, concreteness and mean document range. The latter five suggest once again that various aspects of vocabulary quality play a role in CGFI processing.

Lastly, we examine the predictive precision of Model K to assess its potential utility in the context of DDA. Here, we estimate the prediction interval in which individual future observations are expected to fall with a 90% probability via the formula \( \widehat{\beta}_i \pm 1.64 \times \mathrm{RMSE} \). Thus, for example, the actual value of an item with a predicted difficulty of 1 logit may fall anywhere between −0.22 and 2.22 logits. For a DDA setting, this means that a student with an ability level of 1 logit would have an estimated probability of success on this item ranging between 22% and 78% (using the formula \( \pi = \exp(\theta - \beta)/\left(1+\exp(\theta - \beta)\right) \)). The more accurate but larger Model J would lead to almost the same margin (π = 50% ± 27%). With this, we turn to the discussion of the results.
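
The calculation behind these figures can be sketched as follows. The example below uses a predicted difficulty of 1 logit and the RMSE of Model K reported in Table 3; the exact bounds naturally depend on which RMSE estimate is inserted, and the function names are ours.

```python
import math

def rasch_p(theta: float, beta: float) -> float:
    # Rasch success probability (Eq. 1)
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))

def prediction_interval(beta_hat: float, rmse: float, z: float = 1.64):
    """Approximate 90% prediction interval: beta_hat +/- 1.64 * RMSE."""
    return beta_hat - z * rmse, beta_hat + z * rmse

beta_hat, rmse = 1.0, 0.77            # predicted difficulty; RMSE of Model K (Table 3)
low, high = prediction_interval(beta_hat, rmse)
print(round(low, 2), round(high, 2))  # interval bounds in logits

# Success probability band for a learner whose ability equals the prediction (theta = 1):
print(round(rasch_p(1.0, high), 2), round(rasch_p(1.0, low), 2))  # roughly 0.22 to 0.78
```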

Discussion and Conclusion

One aim of this pilot study was to estimate the difficulty of a set of cued gap-filling exercise items to be used in an ILTS for practicing the English tenses/semi-modals. The IRT calibration showed that the large majority of items measure the same set of abilities required for successful completion and can be ordered on a joint scale. In the future, these items can therefore be deployed in DDA directly. Somewhat surprisingly, the overall difficulty of the item pool tended to surpass the range of abilities found in the learner population (9th and 10th grade students in Germany), even though the items were sourced largely from learning materials intended for their level. This could be explained by the fact that individual exercises in the learning materials typically targeted a small set of tenses explicitly, whereas the participants in this study received no such indication (except for the possibility of using the semi-modals was/were going to and used to). Furthermore, the pool included items from 10th grade textbooks and intermediate practice grammars which might have been too challenging for the 9th graders. In any case, the mismatch highlights that item pool appropriateness relative to a well-defined target group of learners should be taken seriously in the development of pedagogically sound ILTSs. Thus, in our case, there is a clear need for additional items at the lower end of the difficulty/ability scale.

The second study aim was to attain a better understanding of the factors that affect CGFI difficulty in order to enable a difficulty-oriented item design which could eliminate the need for independent calibration via subjective ratings or costly pilot testing in the future. The prediction results show that CGFI difficulty is associated with a range of SLA, psycholinguistic and some specially formulated features concerning the linguistic properties of the required solution, the context surrounding the gap and, more globally, the syntactic and lexical characteristics of the CGFI as a whole. The results also show that of these, gap-level features are the most predictive, echoing previous findings on related text-completion formats (Beinborn 2016; Beinborn et al. 2014). All grammatical features at the gap level except for subject-verb agreement were shown to be useful predictors. Interestingly, MSED and cue size were also found to be influential, suggesting that item difficulty is affected by the interaction between grammatical categories, which may increase the processing load and lead to more errors. In the future, larger, more balanced datasets will make it possible to explore these interactions further. Finally, even though our CGFIs have been described as limited-production tasks strictly targeting grammatical ability (cf. Purpura 2004), it appears that this is not all they do. In particular, several lexical measures capturing various lexical verb or global vocabulary qualities are relevant for CGFI processing. This is hardly surprising given that inferring the intended grammatical meaning of the gapped verb requires the ability to understand the context. Thus, it is possible that CGFIs targeting the same grammatical features may vary in difficulty due to lexical (and possibly other) factors. This means that, on the one hand, lexical features need to be controlled for when score interpretations refer to grammatical phenomena. On the other hand, their significant effects present the opportunity to include lexical ability in score interpretations as well, which would require the systematic inclusion of such features in item construction.

That said, the cross-validation results of the regression experiments are not entirely satisfactory from a DDA perspective. The size of the prediction intervals of our best models cannot guarantee exact ability-difficulty matching. However, the predictions could, upon additional validation, enable a form of DDA in which students are provided with items ranging from moderately easy to moderately difficult. Since the probability of success increases/decreases exponentially with the size of the difficulty/ability mismatch, a much greater number of overly easy and overly difficult items would effectively be filtered out. We consider these results encouraging and possibly even practical, given the dearth of research on what forms of DDA work best (e.g. exact ability/difficulty matching at some to-be-established success probability level or indeed alternating between moderately easy and difficult items). The results should also be seen in the light of state-of-the-art studies on related text-completion formats which report comparatively higher RMSE and lower Pearson correlation values (e.g. Beinborn 2016; Beinborn et al. 2014; Svetashova 2015).

The pilot study was constrained primarily by the amount of data available, notably with regard to the coverage of some binary CGFI features and feature combinations. Despite the use of a matrix design, logistical limitations and the dangers of test fatigue prohibited the inclusion of more test participants and items. In follow-up work, this problem can be addressed easily with an anchor item design, in which a small set of previously calibrated items is incorporated into subsequent tests covering underrepresented features and feature combinations. A simple linking transformation can then be used to express the new item difficulty scale in terms of the existing one (Wauters et al. 2012). Larger datasets will also allow for further statistical validation and model adjustment. Furthermore, it is possible to include additional features in the analysis. NLP tools such as L2SCA (Lu 2010) and TAASSC (Kyle 2016) offer a large number of additional morphological, syntactic and lexical measures that can be tested in the future. On the semantic-pragmatic side, the polyfunctionality/polysemy of the tense forms and semi-modals and the availability of contextual cues pointing to the correct solution, for instance, likely hold high predictive power. The inclusion of the former was constrained by insufficient variability regarding some tenses and will be addressed when more data is available. Incorporating the latter, in contrast, requires additional research into what actually counts as a cue from a learner perspective.
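
One simple way to implement the linking step mentioned above is mean/mean linking, in which the difficulty scale of a new calibration is shifted by the average difference observed on the shared anchor items. The sketch below assumes that the anchor items’ difficulties are available on both scales; it is an illustration of the general idea rather than a description of a specific linking package.

```python
import numpy as np

def mean_mean_link(anchor_old: np.ndarray, anchor_new: np.ndarray,
                   new_items: np.ndarray) -> np.ndarray:
    """Shift newly calibrated Rasch difficulties onto the existing scale.

    anchor_old: anchor item difficulties from the original calibration (logits)
    anchor_new: the same anchor items as estimated in the new calibration
    new_items:  difficulties of the new items on the new scale
    """
    shift = anchor_old.mean() - anchor_new.mean()
    return new_items + shift

# Hypothetical anchors drifting by about +0.3 logits in the new calibration:
old = np.array([-0.5, 0.2, 1.1])
new = np.array([-0.2, 0.5, 1.4])
print(mean_mean_link(old, new, np.array([0.0, 2.0])))   # [-0.3  1.7]
```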

Several directions for future research can be identified. First, as noted above, more data is required to evaluate and improve our prediction models and to obtain a larger item pool that is more balanced and covers a wider range of abilities in the target population. Achieving this would, on the one hand, enable (semi-)automatic evaluation and manipulation of the difficulty of newly generated items for use in the ILTS. On the other hand, it would permit us to investigate the precise contribution of grammatical, contextual and other features of CGFIs on item difficulty and its implications for the content structure and adaptivity of the ILTS. Subsequently, different content structure and DDA scenarios can be constructed and tested in order to assess the precise benefits of DDA on learning outcomes and motivation. To our knowledge, no research exists in these areas with regard to learning environments targeting multidimensional content areas.

Second, as Moeyaert et al. (2016) caution, future research should also consider that simple feedback indicating the correct response is likely not enough to promote learning and that the effect of different types of corrective feedback should also be taken into account.

Third, this study focused on a single exercise format and involved administering a test with a fixed length and a uniform set of instructions. From a pedagogical and language-learning perspective, an ILTS should ideally offer a range of different exercises and exercise variants, involving, for instance, different scaffolding techniques, in order to provide learners with more varied practice opportunities. One avenue for future research, therefore, would be to compare multiple exercise types, as well as variations in exercise instructions, modes of presentation and item selections. Indeed, doing this would offer a window into a whole host of non-linguistic, task-related item characteristics that could have an effect on item difficulty. Here, it should also be mentioned that the items studied in this paper were piloted with a paper-and-pencil test for logistical reasons. In subsequent test administrations, a computer-based setting should be preferred in order to better approximate an ILTS environment.

And fourth, it must be noted that the present analysis employed a two-stage approach in which item difficulties and the effects of item characteristics were estimated separately. It is technically possible to include the item characteristics directly in the difficulty measurement model. However, Hartig et al. (2012) have shown that this approach delivers practically identical results. We chose the two-stage approach since the direct approach is limited with respect to cross-validation techniques. Nevertheless, including the item characteristics within the measurement model (e.g. a multifaceted Rasch model) is a promising perspective for future studies.

To conclude, this pilot study provides initial evidence regarding the possible features of CGFIs targeting the English tenses that affect exercise item difficulty. Despite the limited scope of the study, this approach has much potential in the context of (semi-)automatic difficulty scoring and manipulation for DDA in ILTSs and CAT in general. We believe that it will also be useful for educational content designers and empirical SLA researchers interested in understanding the factors that underlie learner performance.