Introduction

Learning analytics is a recent field in educational research that aims to use the data collected in learning systems in order to improve learning. Massive open online courses (MOOCs) receive hundreds of thousands of users who arrive with knowledge acquired from different backgrounds. Their performance on the assessments provided by the platform is recorded, so it is natural to ask how to use it to provide personalized assessments to new users. So far, adaptive tests have mainly been designed for standardized tests, but recent developments have enabled their use in MOOCs (Rosen et al. 2017).

Whenever a new learner arrives on a MOOC, the platform knows nothing about the knowledge they have acquired in the past: this problem has been coined the learner cold-start problem (Thai-Nghe et al. 2011). It is therefore important to be able to elicit their knowledge in a few questions, in order to prevent boredom or frustration, and possibly recommend lessons they need to master before they can follow the course (Lynch and Howlin 2014). Such learner profiling is a crucial task, and raises the following question: how can we efficiently sample questions from an item bank for a newcomer, so as to best explore their knowledge and provide them with valuable feedback?

In this paper, we present a new algorithm for selecting the questions to ask a newcomer. Our strategy is based on determinantal point processes, a diversity-based sampling technique that has recently been applied to large-scale machine-learning problems such as document summarization or clustering of thousands of image search results (Kulesza and Taskar 2012). We apply it to the automatic generation of low-stakes practice worksheets by sampling diverse questions from an item bank. Our method is fast and only relies on a measure of similarity between questions, so it can be applied in any environment that provides tests, such as large residential courses or serious games.

This article is organized as follows. First, we review the background related to this work: student modeling in assessment, cold-start in multistage testing, and determinantal point processes. Then, we clarify our research context and explain our strategy, InitialD, for tackling the cold-start problem. Finally, we present our experiments, results and discussion.

Background Studies

Student Modeling in Assessment

Student models infer latent variables of the learners from their answers throughout a test, and can potentially predict future performance. Several models have been developed; we describe them below under three categories: Item Response Theory, Knowledge Tracing and Cognitive Diagnosis.

Item Response Theory (IRT)

In assessments such as standardized tests, learners are usually modelled by one static latent variable – the Rasch model (Hambleton and Swaminathan 1985) – or several latent variables – Multidimensional Item Response Theory, aka MIRT (Reckase 2009; Chalmers 2016). Based on the binary outcomes (right or wrong) of the learners over the questions, an estimate of their latent variables is computed. This category of models therefore allows summative assessment, e.g., scoring and ranking examinees. MIRT models are said to be hard to train (Desmarais and Baker 2012). However, with the help of Markov chain Monte Carlo methods such as Metropolis-Hastings Robbins-Monro (Cai 2010a), efficient techniques have been proposed and implemented in ready-to-use tools (Chalmers 2016) to train MIRT models in a reasonable time. Variants have been designed to incorporate several attempts by the learners on the same question (Colvin et al. 2014) or evolution over time (Wilson et al. 2016).

Knowledge Tracing

In this family of models, learners are modelled by a latent state composed of several binary variables, one per knowledge component (KC) involved in the assessment. One particular advantage of the Bayesian Knowledge Tracing model (BKT) is that the latent state of the learner can change after every question they attempt to solve. Questions are tagged with only one KC. Based on the outcomes of the learner, the KCs they may master are inferred. Some variants include guess and slip parameters, i.e., probabilities of answering a question correctly while the corresponding skill is not mastered, or answering a question incorrectly while the skill is mastered. At the end of the test, feedback can be provided to the learner, based on the KCs that seem to be mastered and those that do not. This category of models therefore allows formative assessment, which benefits learning (Dunlosky et al. 2013), and detection of learners that need further instruction. More recently, Deep Knowledge Tracing (DKT) models have been developed (Piech et al. 2015), based on neural networks, outperforming the simplest BKT models at predicting student performance. Quite surprisingly, a study (Wilson et al. 2016) has even shown that a variant of unidimensional IRT performed better than DKT. Presumably, this is because IRT models measure a latent variable shared across all questions, while in BKT a question only gives information about the KC it involves. Also, IRT models are simpler, therefore less prone to overfitting than DKT models.

Cognitive Diagnosis

This family of models is similar to Knowledge Tracing. Learners are modelled by a static latent state composed of K binary variables, where K is the number of KCs involved in the assessment. Cognitive diagnostic models allow mapping questions to the several KCs involved in their resolution. For example, the DINA model (De La Torre 2009) requires every KC involved in a question to be mastered by a learner for them to answer it correctly. DINA also considers slip and guess parameters. Given the answers of learners, it is possible to infer their mastery or non-mastery of each KC, according to some cognitive diagnostic model (Desmarais and Baker 2012). At the end of the test, the estimated latent state of the learner can be provided as feedback, in order to name their strong and weak points. This category of models therefore allows formative assessment. In order to reduce the complexity of 2^K possible latent states, some variants of these models allow specifying a dependency graph over KCs (Leighton et al. 2004). Lynch and Howlin (2014) have developed such an adaptive test for newcomers in a MOOC that relies on a dependency graph of knowledge components to learn, but they do not consider slip and guess parameters, i.e., their method is not robust to careless errors from the examinees.

Adaptive and Multistage Testing

Multistage testing (MST) (Zheng and Chang 2014) is a framework that allows asking questions in an adaptive, tree-based way (see Fig. 1): according to their performance, examinees are routed towards new questions. Learners therefore follow a path in a tree, each node of which contains a pool of questions, and different learners may receive different questions, tailored to their level. Computerized adaptive testing (CAT) is a special case of multistage testing in which every node contains a single question.

While it is possible to assemble MST tests manually, automated test assembly algorithms save human labor and allow specifying statistical constraints (Zheng and Chang 2014; Yan et al. 2014). Three approaches are mainly used: exact methods that solve optimization problems but are usually slow to compute, greedy heuristics that do not guarantee optimal subsets, and Monte Carlo methods that rely on random samples.

Such CAT systems have been successfully used in standardized tests such as the GMAT and the GRE, administered to hundreds of thousands of students. The idea of selecting the next question based on the previous outcomes has recently been embedded in MOOCs (Rosen et al. 2017), where it has proven successful in eliciting the knowledge of the students with fewer questions. In the literature, some approaches assume that the level of the learners can change between questions, such as Rosen et al. (2017), while others do not.

Fig. 1

Example of multistage test. Alice and Bob do not receive the same set of questions during the test

CAT systems rely on a student model that can predict student performance, such as the ones presented above. Based on this model, a CAT system can infer the student parameters and ask questions accordingly, e.g., ask harder questions if the estimated latent ability of the student is high. Lan et al. (2014) have developed a framework called Sparse Factor Analysis (SPARFA), similar to MIRT, and have shown that adaptive strategies provided better results for SPARFA than non-adaptive ones for the same number of questions asked. Cheng (2009) has applied adaptive strategies to the DINA model. Such cognitive-diagnostic computerized adaptive tests (CD-CAT) have been successfully applied in Chinese classrooms (Chang 2014; Zheng and Chang 2014). Vie et al. (2016b) have applied adaptive strategies to the General Diagnostic Model (Davier 2005), another cognitive diagnostic model, and shown that it outperformed adaptive tests based on the DINA model at predicting student performance. For a review of recent models in adaptive assessment, see Vie et al. (2016a).

Cold-Start

As we want to predict student performance, we try to infer the outcomes of the learner over new questions, based on their current parameter estimates, themselves computed from their previous outcomes. But at the very beginning of the test, we have no data regarding the learner's knowledge. This is the learner cold-start problem, a term usually coined for recommender systems (Anava et al. 2015). In an educational setting, to the best of our knowledge, the best reference to cold-start is Thai-Nghe et al. (2011), who note that the cold-start problem is not as harmful as in an e-commerce platform where new products and users appear every day. But with the advent of MOOCs, the cold-start problem has regained interest.

CAT systems usually compute an estimate of the student parameters at some point during the test. But when few questions have been asked, such an estimate may not exist or may be biased (Chalmers 2012; Lan et al. 2014), for example when we want to compute a maximum-likelihood estimate and all outcomes provided by the learner are correct (or all incorrect). This is why the choice of the very first questions is important. Chalmers (2016) suggests asking a group of questions before starting the adaptive process, and this is the setting we focus on in this paper.

Determinantal Point Processes

Determinantal Point Processes (DPP) rely on the following idea: if the objects we try to sample from can be represented in a way that allows computing a similarity value over any pair of elements (a kernel), then it is possible to devise an algorithm that efficiently samples a subset of elements that are diverse, i.e., far from each other. DPPs have been successfully applied to a variety of large-scale scenarios, such as sampling key headlines from a set of newspaper articles or summarizing image search results (Kulesza and Taskar 2012), but to the best of our knowledge, not to cold-start scenarios.

In this paper, we apply this technique to the sampling of questions for the initial stage of the test. Indeed, questions are usually modelled by parameter vectors, be it the KCs that are required, some difficulty parameters, or some tag information. A question measures the knowledge of the learner in the direction of its parameter vector, and questions that are close in parameter space measure similar constructs. If we want to reduce the number of questions asked, we should avoid selecting questions that measure similar knowledge, which is why it is natural to rely on a measure of diversity.

Research Context and Notations

Data

We assume we have access to a list of questions from a test, called the item bank. This item bank has a history: learners have answered some of these questions in the past.

We only consider dichotomous answers of the learners, i.e., raw data composed of lines of the following form:

  • a student_id i ∈ {1,…,m};

  • a question_id j ∈ {1,…,n};

  • whether student i got question j right (1) or wrong (0).

The response pattern of a student i is a sequence (r_{i1},…,r_{in}) where r_{ij} ∈ {0,1,NA} denotes the correctness of student i's answer to question j. NA means that learner i did not try to answer question j. Such response patterns enable us to calibrate a student model.

As the assessment we are building should be formative, we assume we have access to a mapping of each question to the different knowledge components (KC) involved in its resolution: it is a binary matrix called a q-matrix, of which the (j,k) entry q_{jk} is 1 if knowledge component k is involved in question j, and 0 otherwise. Thanks to this structure, we can name the KCs.
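For illustration, a toy q-matrix for 4 questions and 3 knowledge components might look as follows (the tags are made up for this example):

    import numpy as np

    # q_matrix[j, k] = 1 if knowledge component k is involved in question j.
    q_matrix = np.array([
        [1, 0, 0],   # question 1 involves KC 1 only
        [1, 1, 0],   # question 2 involves KC 1 and KC 2
        [0, 1, 0],   # question 3 involves KC 2 only
        [0, 1, 1],   # question 4 involves KC 2 and KC 3
    ])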

Student Model: General Diagnostic Model

Student models rely on different components:

  • features for every question and learner (the learner features will be called ability features), in the form of real-valued vectors;

  • a probability model that a certain learner answers a certain question correctly;

  • a training step in order to extract question and learner features from a history of answers;

  • an update rule of the learner parameters based on the learner's answers.

The probability model relies solely on the features, i.e., for a fixed student, questions with close features will have a similar probability of being solved correctly. Also, students with similar ability features will produce similar response patterns.

In order to extract question and learner features for our purposes, we choose to calibrate a General Diagnostic Model, because it predicts student performance better than other cognitive diagnosis models (Vie et al. 2016b):

  • K is the number of knowledge components of the q-matrix.

  • Learners i are modelled by ability vectors θ_i = (θ_{i1},…,θ_{iK}) of size K. When estimated, any component θ_{ik} can be interpreted as the strength of student i over KC k.

  • Questions j are modelled by discrimination vectors d_j = (d_{j1},…,d_{jK}) of size K and an easiness parameter δ_j. When estimated, any component d_{jk} can be interpreted as how much question j discriminates over KC k. Such vectors can be specified by hand, or directly learned from data (Vie et al. 2016b).

  • According to the q-matrix (q_{jk})_{j,k}, some parameters are fixed at 0: ∀j,k, (q_{jk} = 0 ⇒ d_{jk} = 0).

Thus, the probability that a learner answers a question correctly is given by the formula:

$$ Pr(\text{``Student}\ i\ \text{answers question}\ j\ \text{correctly''}) = {\Phi}\left( \sum\limits_{k = 1}^{K} \theta_{ik} q_{jk} d_{jk} + \delta_{j}\right) = {\Phi}(\boldsymbol{\theta}_{i} \cdot \mathbf{d}_{j} + \delta_{j}) $$
(1)

where Φ : x ↦ 1/(1 + e^{-x}) is the logistic function.
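As an illustration, Eq. (1) can be computed directly. This is a minimal sketch; the function name and all numerical values are made up for the example:

    import numpy as np

    def success_probability(theta, d, delta, q_row):
        """Probability of a correct answer under Eq. (1), for a learner with ability
        vector theta and a question with discrimination vector d, easiness delta
        and q-matrix row q_row."""
        return 1.0 / (1.0 + np.exp(-(np.dot(theta * q_row, d) + delta)))

    theta = np.array([0.8, -0.3])  # strong on KC 1, weaker on KC 2
    d = np.array([1.2, 0.7])       # question discriminates mostly on KC 1
    q_row = np.array([1, 1])       # both KCs are involved in the question
    delta = 0.5                    # easiness parameter
    print(success_probability(theta, d, delta, q_row))  # ≈ 0.78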

We assume that the level of the learner does not change while they give their answers. This is reasonable because the learner only knows their results at the end of the sequence of questions, not after every answer they give. For the very first selected pool of questions we can therefore make this assumption, which allows us to prefer IRT or Cognitive Diagnosis models over Knowledge Tracing models.

In a cold-start scenario, we do not have any data about a newcomer. But we do have discrimination vectors for the questions, calibrated using the response patterns of the student history. These allow us to sample k questions from the item bank of n questions and ask them to the newcomer. According to their answers, we can infer the learner's ability features, and compute their performance over the remaining questions using the response model.

Diversity Model

Let us call V the matrix of question feature vectors, of which the j-th row is d_j = (d_{j1},…,d_{jK}). Recall that such features tell us whether a question measures one knowledge component more than another. A measure of diversity for a subset A ⊂ {1,…,n} of questions is the volume of the parallelotope formed by their feature vectors (Kulesza and Taskar 2012):

$$ Vol({\{\mathbf{d}_{\mathbf{j}}\}}_{j \in A}) = \sqrt{\det V_A V_A^T} $$
(2)

where V_A denotes the submatrix of V that only keeps the rows indexed by A. Indeed, if two vectors are collinear, the volume of the corresponding parallelotope will be 0, whereas if k vectors form an orthonormal basis, the volume of the corresponding parallelotope will be 1.

Testing every possible subset would not be feasible, as there are \(\binom{n}{k}\) of them, and computing each volume has complexity O(k^2 K + k^3). Furthermore, determining the best subset is NP-hard, therefore intractable in large-scale scenarios (Kulesza and Taskar 2012).

In order to implement our strategy, we also need a kernel that provides a similarity value for each pair of questions. It can be represented by a positive semidefinite matrix L such that L_{ij} = ⟨d_i, d_j⟩. For our experiments, we choose the linear kernel ⟨d_i, d_j⟩ = d_i · d_j (Kulesza and Taskar 2012).
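To make this concrete, here is a small NumPy sketch with hypothetical 2-dimensional question features, showing that det L_A equals the squared volume of Eq. (2):

    import numpy as np

    # Hypothetical question feature vectors (one row per question, K = 2 KCs).
    V = np.array([
        [0.2, 1.0],   # question 1: mostly KC 2
        [0.3, 0.9],   # question 2: mostly KC 2
        [1.0, 0.3],   # question 3: mostly KC 1
    ])
    L = V @ V.T       # linear kernel: L[i, j] = <d_i, d_j>

    def volume(A):
        """Volume of the parallelotope spanned by the feature vectors indexed by A (Eq. 2)."""
        V_A = V[list(A), :]
        return np.sqrt(np.linalg.det(V_A @ V_A.T))

    print(volume([0, 1]))   # ≈ 0.12: questions 1 and 2 are nearly collinear
    print(volume([0, 2]))   # ≈ 0.94: questions 1 and 3 measure different KCs
    A = [0, 2]
    print(np.isclose(np.linalg.det(L[np.ix_(A, A)]), volume(A) ** 2))  # True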

Our Sampling Strategy: InitialD

The main idea of our strategy is the following: if we want to ask as few questions as possible, we want to minimize redundancy in the learner's answers, and sample questions that measure different knowledge components. In other words, we want to sample vectors that are as uncorrelated as possible, therefore as diverse as possible. For example, in Fig. 2, three questions are embedded in a 2-dimensional space. Questions 1 and 2 measure the second knowledge component more than the first one, while conversely, question 3 measures KC 1 more than KC 2. If only two questions are to be sampled, questions 1 and 3 will be more informative (less correlated) than questions 1 and 2.

Fig. 2

Example of question feature vectors

Determinantal Point Processes

A probability distribution over subsets of {1,…,n} is called a determinantal point process if any subset Y drawn from this distribution verifies, for every subset A ⊂ {1,…,n}:

$$ Pr(A \subset Y) \propto \det L_A $$
(3)

where L_A is the square submatrix of L indexed by the elements of A in rows and columns.

In other words, subsets of diverse questions are more likely to be drawn than subsets of correlated questions. Indeed, if V_A is the submatrix of V that only keeps the rows indexed by A, then L_A = V_A V_A^T because the kernel is linear, so Pr(A ⊂ Y) is proportional to det L_A = Vol({d_j}_{j∈A})^2.

Determinantal point processes have a useful property: k elements among n can be drawn with complexity O(nk^3), after a unique preprocessing step in O(n^3) (Kulesza and Taskar 2012). This costly step consists in computing the eigendecomposition of the kernel matrix, which can be reduced to O(nK^2) in the case of the linear kernel, as the question features span only K ≪ n dimensions.

After the preprocessing step, sampling 10 questions from an item bank of size 10000 takes about 10 million operations, which makes the method suitable for large-scale data.
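To make the sampling step concrete, here is a minimal NumPy sketch of the standard k-DPP sampling algorithm of Kulesza and Taskar (2012): eigendecomposition of the kernel, selection of exactly k eigenvectors via elementary symmetric polynomials, then one item drawn per selected eigenvector. Function and variable names are ours; this is an illustrative implementation, not the exact code used in the experiments.

    import numpy as np

    def elementary_symmetric_polynomials(eigvals, k):
        """E[l, n] = e_l(lambda_1, ..., lambda_n), computed by the usual recursion."""
        N = len(eigvals)
        E = np.zeros((k + 1, N + 1))
        E[0, :] = 1.0
        for l in range(1, k + 1):
            for n in range(1, N + 1):
                E[l, n] = E[l, n - 1] + eigvals[n - 1] * E[l - 1, n - 1]
        return E

    def sample_k_dpp(L, k, rng=None):
        """Draw a subset of exactly k item indices from a k-DPP with kernel L.
        Assumes k is at most the rank of L (at most K for the linear kernel)."""
        rng = np.random.default_rng() if rng is None else rng
        eigvals, eigvecs = np.linalg.eigh(L)      # preprocessing step, O(n^3)
        eigvals = np.clip(eigvals, 0.0, None)     # guard against numerical noise
        N = len(eigvals)
        E = elementary_symmetric_polynomials(eigvals, k)

        # Phase 1: choose exactly k eigenvectors, scanning eigenvalues from last to first.
        J, l = [], k
        for n in range(N, 0, -1):
            if l == 0:
                break
            if rng.random() < eigvals[n - 1] * E[l - 1, n - 1] / E[l, n]:
                J.append(n - 1)
                l -= 1
        V = eigvecs[:, J]                         # columns: selected eigenvectors

        # Phase 2: pick one item per selected eigenvector.
        Y = []
        while V.shape[1] > 0:
            # Item i is chosen with probability proportional to the squared norm
            # of the i-th row of the current orthonormal basis.
            probs = np.sum(V ** 2, axis=1)
            i = rng.choice(N, p=probs / probs.sum())
            Y.append(i)
            # Project the basis onto the subspace orthogonal to e_i, then re-orthonormalize.
            j = np.argmax(np.abs(V[i, :]))        # a column with nonzero i-th entry
            Vj = V[:, j].copy()
            V = np.delete(V, j, axis=1)
            V = V - np.outer(Vj, V[i, :] / Vj[i])
            if V.shape[1] > 0:
                V, _ = np.linalg.qr(V)
        return sorted(Y)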

Sampling Strategy

InitialD (for Initial Determinant) samples k questions from the item bank according to a determinantal point process, using the question feature vectors. It selects the first bulk of questions in a test, before any adaptation can be made.

After those questions are asked to the newcomer, a first estimate of the learner ability features can be computed. As the knowledge components are known, such an estimate can provide useful feedback for the learner, specifying which KCs need further work.

Sampling several times will give different subsets. This balances the load over the item bank, so that not every newcomer receives the same bulk of questions, which is interesting for security purposes and allows calibrating various items from the bank (Zheng and Chang 2014).
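As a usage illustration, the strategy reduces to building the linear kernel from the calibrated question features and drawing a subset, e.g., with the sample_k_dpp sketch above (V and the sampling function are the hypothetical names introduced earlier):

    # V: calibrated question feature matrix, one row d_j per question.
    L = V @ V.T                        # linear kernel over question feature vectors
    worksheet = sample_k_dpp(L, k=10)  # indices of the first 10 questions to ask
    # A second draw generally yields a different, still diverse, worksheet:
    other_worksheet = sample_k_dpp(L, k=10)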

Validation

We compared the following strategies for asking the first k questions to a newcomer.

Random

Sample k questions from the item bank at random.

Uncertainty

Sample the k questions for which the probability that the learner will answer them correctly is closest to 0.5 (that is, questions of average difficulty). This method is known as uncertainty sampling in the context of active learning (Settles 2010).

InitialD

Sample k questions according to a determinantal point process, as described above.
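For reference, the two baselines admit very short implementations. The sketch below uses our own naming and assumes success_probs contains the predicted probability of success for each question, e.g., Φ(δ_j) for a newcomer whose ability features are set to 0:

    import numpy as np

    def random_strategy(n, k, rng=None):
        """Random baseline: k question indices drawn uniformly without replacement."""
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(n, size=k, replace=False)

    def uncertainty_strategy(success_probs, k):
        """Uncertainty sampling: the k questions whose predicted probability of
        success is closest to 0.5, i.e., questions of average difficulty."""
        return np.argsort(np.abs(np.asarray(success_probs) - 0.5))[:k]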

For our experiments, we used the following datasets.

TIMSS

The Trends in International Mathematics and Science Study (TIMSS) organizes a standardized test in mathematics. The collected data is freely available to researchers on their website, in SPSS and SAS formats. This dataset comes from the 2003 edition of TIMSS. It is a binary matrix of size 757 × 23 that stores the results of 757 8th-grade learners over 23 questions in mathematics. The q-matrix was specified by experts from TIMSS and has 13 knowledge components, described in Su et al. (2013). It was available in the R package CDM (Robitzsch et al. 2014).

Fraction

This dataset contains the results of 536 middle school students over 20 questions about fraction subtraction. Items and the corresponding q-matrix over 8 knowledge components are described in DeCarlo (2010).

Cross Validation

Such a benchmark is performed using cross-validation. 80% of the learners from the history are assumed to be known, while 20% are kept to simulate a cold-start situation. The cross-validation is stratified, i.e., the distribution of students is similar between the train set and the test set.

Model Calibration for Feature Extraction

Using the history of answers, we want to extract the question and learner features that best explain the learner data. For this, we minimize the log-loss between the true response patterns and those computed by the General Diagnostic Model:

$$ \text{logloss}(y, y^{*}) = -\frac{1}{n} \sum\limits_{i = 1}^{n} \log (1 - |y_{i} - y^{*}_{i}|). $$
(4)
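For reference, a direct NumPy transcription of Eq. (4), with a small clipping term added (our own guard) to avoid evaluating log(0):

    import numpy as np

    def logloss(y_true, y_pred, eps=1e-12):
        """Log-loss of Eq. (4): y_true holds binary outcomes (0/1), y_pred the
        predicted probabilities of a correct answer."""
        p = np.clip(1.0 - np.abs(np.asarray(y_true) - np.asarray(y_pred)), eps, 1.0)
        return -np.mean(np.log(p))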

This optimization problem used to be difficult to solve for MIRT models and the General Diagnostic Model, but since Cai (2010a, 2010b), a Metropolis-Hastings Robbins-Monro algorithm has been available, allowing faster calibration. It is implemented in the R package mirt (Chalmers 2012), which we used.

Experimental Protocol

The experimental protocol was implemented according to the following algorithm.

Algorithm: TrainingStep on the train set, then, for each simulated newcomer: PriorInitialization, FirstBundle, EstimateParameters, PredictPerformance, EvaluatePerformance (each step is detailed below).
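A pseudocode-level sketch of the protocol, in Python-like form; the step functions are named after the subsections below and are not implemented here:

    def run_benchmark(D, I_train, I_test, k):
        # TrainingStep: calibrate question features on the known learners.
        V, delta = TrainingStep(D[I_train])
        results = []
        for learner in I_test:                       # simulated newcomers
            theta = PriorInitialization()            # ability features set to 0
            Y = FirstBundle(V, delta, k)             # Random, Uncertainty or InitialD
            answers = {j: D[learner, j] for j in Y}  # outcomes taken from the history
            theta = EstimateParameters(V, delta, Y, answers)
            preds = PredictPerformance(V, delta, theta)
            results.append(EvaluatePerformance(D[learner], preds, theta))
        return results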

TrainingStep: Feature Extraction

According to the available response patterns of the train set D[I_train], question feature vectors are extracted, as described above.

PriorInitialization: Initialization of a Newcomer

At the beginning of the test, the newcomer's ability features are set to 0, which means the probability that they answer question j correctly is Φ(δ_j), i.e., it depends solely on the easiness parameter (or bias).

FirstBundle: Sampling the First Questions

This is where strategies Random, Uncertainty or InitialD are applied. They return a subset of k questions Y ⊂{1,…,n}.

EstimateParameters: Estimate Learner Ability Features Based on the Collected Answers

Given the answers (r_i)_{i∈Y}, the ability features θ of the learner are inferred using logistic regression. If the answers are all correct or all incorrect, other algorithms are used, defined in Chalmers (2012), Lan et al. (2014) and Magis (2015).

As the q-matrix has been specified by an expert, the k-th component of the ability feature 𝜃 can be interpreted as a degree of mastery of the learner over the k-th knowledge component.
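The paper relies on logistic regression and on the routines of Chalmers (2012), Lan et al. (2014) and Magis (2015) for the degenerate cases; as a minimal stand-in, the maximum-likelihood estimate of θ under Eq. (1) can be computed with a generic optimizer. The small L2 penalty below is our own choice, added so the estimate stays finite even for all-correct or all-incorrect answer patterns:

    import numpy as np
    from scipy.optimize import minimize

    def estimate_ability(answers, d_vectors, deltas, reg=1e-3):
        """Penalized MLE of the ability vector theta, given binary answers r_j to the
        asked questions, their discrimination vectors d_j and easiness parameters delta_j."""
        answers = np.asarray(answers, dtype=float)

        def neg_log_likelihood(theta):
            z = d_vectors @ theta + deltas
            p = np.clip(1.0 / (1.0 + np.exp(-z)), 1e-12, 1 - 1e-12)
            ll = np.sum(answers * np.log(p) + (1 - answers) * np.log(1 - p))
            return -ll + reg * theta @ theta

        theta0 = np.zeros(d_vectors.shape[1])
        return minimize(neg_log_likelihood, theta0, method="L-BFGS-B").x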

PredictPerformance: Based on the Learner Ability Estimate, Compute Performance Over the Remaining Questions

Knowing the learner ability features and the question feature vectors, we can compute the probability that the learner will answer each question correctly by applying formula (1).

EvaluatePerformance: Performance Metrics

We compute two performance metrics: the log-loss, as defined in (4), and the distance d(θ, θ*) = ‖θ − θ*‖ to the final diagnosis θ*, that is, the learner ability estimate when all questions have been asked (Lan et al. 2014).

Adding a CAT to the Benchmark

We also added a regular computerized adaptive test to the benchmark, based on GenMA, a model developed in Vie et al. (2016b) and shown to predict performance better than the Rasch and DINA models on a variety of datasets. At each step, it picks the question that maximizes the determinant of the Fisher information (the so-called D-rule, see Chalmers 2016).

Results

Results are given in Figs. 3 and 4, where the x-axis represents the number of questions asked to tackle the cold-start problem, and the y-axis either the log-loss or the distance to the final diagnosis, for each of the 3 cold-start strategies described above and the CAT model as a baseline. Therefore, lower is better in all figures. In Tables 1, 2, 3 and 4, the best values are denoted in bold, and the percentage of correct predictions is given in parentheses.

Fig. 3

Log-loss of the predictions (left) and distance to the final diagnosis (right) after a group of questions has been asked for different strategies for the dataset Fraction

Fig. 4

Log-loss of the predictions (left) and distance to the final diagnosis (right) after a group of questions has been asked for different strategies for the dataset TIMSS

Table 1 Log-loss values obtained (and precision rates) for the dataset Fraction
Table 2 Distance to the final diagnosis obtained for the dataset Fraction
Table 3 Log-loss values obtained (and precision rates) for the dataset TIMSS
Table 4 Distance to the final diagnosis obtained for the dataset TIMSS

Fraction

In Fig. 3, InitialD performs better than the other strategies, with a narrow confidence interval. 8 questions seem enough to correctly reconstruct 82% of the answers and converge to a minimal log-loss. For k ≤ 9, InitialD converges faster towards the final diagnosis, while for k ≥ 14, CAT converges faster, showing a benefit of adaptation in later stages of the process.

TIMSS

In Fig. 4, InitialD performs better than Random, CAT and Uncertainty. From the very first question, InitialD clearly has a lower log-loss in response pattern reconstruction. This happens because the question of biggest “volume” is the one whose vector has the highest norm, i.e., the most discriminative question, while the other strategies pick a question of average difficulty.

Asking 7 questions out of 23 using the InitialD strategy leads on average to the same estimation error as asking 12 questions at random, or asking 19 questions using a traditional adaptive test. InitialD converges towards the final diagnosis faster than the other strategies. Using our method, 12 questions seem enough to get a diagnosis that correctly predicts 70% of the outcomes.

Discussion

On every dataset we tried, InitialD performed better, and with a narrower confidence interval, than the other strategies. On the TIMSS dataset, the Random strategy performs well compared to a regular adaptive test. This may be because the test is already well balanced, so random questions have a high probability of being diverse.

If the number of questions to ask (k), the number of questions in the bank (n) and the number of measured knowledge components (K) are low, it is possible to enumerate every subset of k questions among n. However, in practice, question banks on platforms will be large, so InitialD's complexity, O(nk^3) after a preprocessing step in O(n^3), will be an advantage. In this paper, we tested our method on datasets of up to 23 questions, but the exact determinantal point process sampling algorithm has already been tried on databases of thousands of items (Kulesza and Taskar 2012). Please note that this work naturally extends to the question cold-start problem: given a new question on the platform, how to identify a group of students to ask it to, in order to estimate its discrimination parameters over all knowledge components.

InitialD could be improved by sampling several subsets of questions and keeping the best of them. Sampling a subset of k questions has complexity O(nk^3), and computing the volume of a sampled subset, in order to keep the one achieving the biggest volume, has complexity O(k^3). Drawing several subsets increases the chance of finding the best subset to ask first.
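A possible sketch of this refinement, reusing the sample_k_dpp function from the earlier sketch (our naming):

    import numpy as np

    def best_of_several(L, k, n_draws=10, rng=None):
        """Draw several k-DPP samples and keep the subset with the largest det(L_A),
        i.e., the largest squared volume."""
        draws = [sample_k_dpp(L, k, rng) for _ in range(n_draws)]
        return max(draws, key=lambda A: np.linalg.det(L[np.ix_(A, A)]))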

Conclusion and Further Work

We showed, using real data, that our strategy InitialD, based on determinantal point processes, performed better than other cold-start strategies at predicting student performance. As it is fast, this method can be applied to the generation of several diverse worksheets from the item bank of an educational platform: a learner can request a worksheet of k questions, attempt to solve them, receive their ability features as feedback (strong and weak points), then ask for another worksheet. Items already administered to a learner can be removed from their view of the item bank, so that they do not get the same exercise in two consecutive worksheets.

As further work, we would like to check whether sampling according to a determinantal point process is still useful in later stages of a multistage test, after a first learner ability estimate has been computed.

InitialD relies solely on pairwise similarities between questions: it can be used in conjunction with other response models, using other feature extraction techniques that allow better vector representations of the questions. For example, one could use various information at hand, such as a bag-of-words representation of the problem statement or extra tags specified by a teacher, in order to improve the embedding of items. Such extra information would improve the selection of questions, with the same algorithm InitialD. In this paper, we used a linear kernel for the predictions and for the student model, but nonlinear kernels could be used, possibly performing better at the cost of interpretability.

For the interpretation of KCs, a q-matrix is useful. Koedinger et al. (2012) have shown that it is possible to combine q-matrices by crowdsourcing in order to improve student models. We would like to see whether this also applies to the General Diagnostic Model, and whether observing the discrimination parameters can help us determine better q-matrices.

Our strategy allows a fast and precise estimate of the learner's knowledge, which can be revealed to them in order to help them progress. If the lessons in the course are also mapped to KCs, recommendations of lessons could be made based on this initial evaluation, for example. We aim to apply InitialD to the automatic generation of worksheets at the beginning of a MOOC, in order to provide low-stakes formative assessments, and to evaluate them on real students.