Automated Test Assembly for Handling Learner Cold-Start in Large-Scale Assessments
Abstract
In large-scale assessments such as the ones encountered in MOOCs, a lot of usage data is available because of the number of learners involved. Newcomers who have just arrived on a MOOC have various backgrounds in terms of knowledge, but the platform hardly knows anything about them. Therefore, it is crucial to elicit their knowledge fast, in order to personalize their learning experience. This problem has been called learner cold-start. We present in this article an algorithm for sampling a group of initial, diverse questions for a newcomer, based on a method recently used in machine learning: determinantal point processes. We show, using real data, that our method outperforms existing techniques such as uncertainty sampling, and can provide useful feedback to the learner on their strong and weak points.
Keywords
Cold-start, Test-size reduction, Learning analytics, Determinantal point processes, Multistage testing, Cognitive diagnosis
Introduction
Learning analytics is a recent field in educational research that aims to use the data collected in learning systems in order to improve learning. Massive open online courses (MOOCs) receive hundreds of thousands of users who have acquired knowledge from different backgrounds. Their performance in the assessments provided by the platform is recorded, so it is natural to wonder how to use it to provide personalized assessments for new users. So far, adaptive tests have mainly been designed for standardized tests, but recent developments have enabled their use in MOOCs (Rosen et al. 2017).
Whenever a new learner arrives on a MOOC, the platform knows nothing about the knowledge they have acquired in the past: this problem has been called learner cold-start (Thai-Nghe et al. 2011). Thus, it is important to be able to elicit their knowledge in few questions, in order to prevent boredom or frustration, and possibly recommend to them lessons they need to master before they can follow the course (Lynch and Howlin 2014). Such learner profiling is a crucial task, and raises the following question: how can we efficiently sample questions from an item bank for a newcomer, in order to explore their knowledge as well as possible and provide them with valuable feedback?
In this paper, we present a new algorithm for selecting the questions to ask a newcomer. Our strategy is based on a model of diversity called determinantal point processes that has recently been applied to large-scale machine-learning problems, such as document summarization or clustering of thousands of image search results (Kulesza and Taskar 2012). We apply it to the automatic generation of low-stakes practice worksheets by sampling diverse questions from an item bank. Our method is fast, and only relies on a measure of similarity between questions, so it can be applied in any environment that provides tests, such as large residential courses or serious games.
This article is organized as follows. First, we present the background studies related to this work: student modeling in assessment, cold-start in multistage testing, and determinantal point processes. Then, we clarify our research context. We then explain our strategy, InitialD, for tackling the cold-start problem. Finally, we present our experiments, results and discussion.
Background Studies
Student Modeling in Assessment
Student models make it possible to infer latent variables about the learners based on their answers throughout a test, and potentially to predict future performance. Several models have been developed, which we describe below under three categories: Item Response Theory, Knowledge Tracing and Cognitive Diagnosis.
Item Response Theory (IRT)
In assessments such as standardized tests, learners are usually modelled by one static latent variable, as in the Rasch model (Hambleton and Swaminathan 1985), or by several latent variables, as in Multidimensional Item Response Theory, aka MIRT (Reckase 2009; Chalmers 2016). Based on the binary outcomes (right or wrong) of the learners over the questions, an estimate of their latent variables is devised. Therefore, this category of models allows summative assessment, e.g., scoring and ranking examinees. MIRT models are said to be hard to train (Desmarais and Baker 2012). However, with the help of Markov chain Monte Carlo methods such as Metropolis-Hastings Robbins-Monro (Cai 2010a), efficient techniques have been proposed and implemented in ready-for-use tools (Chalmers 2016) to train MIRT models in a reasonable time. Variants have been designed in order to incorporate several attempts from the learners over the same question (Colvin et al. 2014) or evolution over time (Wilson et al. 2016).
Knowledge Tracing
In this family of models, learners are modelled by a latent state composed of several binary variables, one per knowledge component (KC) involved in the assessment. One particular advantage of the Bayesian Knowledge Tracing model (BKT) is that the latent state of the learner can change after every question they attempt to solve. Questions are tagged with only one KC. Based on the outcomes of the learner, the KCs they may master are inferred. Some variants include guess and slip parameters, i.e., the probability of answering a question correctly while the corresponding skill is not mastered, and the probability of answering a question incorrectly while the skill is mastered. At the end of the test, feedback can be provided to the learner, based on the KCs that seem to be mastered and those that do not. Therefore, this category of models allows formative assessment that benefits learning (Dunlosky et al. 2013) and the detection of learners who need further instruction. More recently, Deep Knowledge Tracing (DKT) models have been developed (Piech et al. 2015), based on neural networks, outperforming the simplest BKT models at predicting student performance. Quite surprisingly, a study (Wilson et al. 2016) has even shown that a variant of unidimensional IRT performed better than DKT. Presumably, this is because IRT models measure a latent variable shared across all questions while in BKT, a question only gives information about the KC it involves. Also, IRT models are simpler, and therefore less prone to overfitting than DKT models.
Cognitive Diagnosis
This family of models is similar to Knowledge Tracing. Learners are modelled by a static latent state composed of K binary variables, where K is the number of KCs involved in the assessment. Cognitive diagnostic models allow mapping questions to several KCs involved in their resolution. For example, the DINA model (De La Torre 2009) requires that every KC involved in a question be mastered by a learner for them to answer it correctly. DINA also considers slip and guess parameters. Given the answers of learners, it is possible to infer their mastery or non-mastery of each KC, according to some cognitive diagnostic model (Desmarais and Baker 2012). At the end of the test, the estimated latent state of the learner can be provided as feedback, in order to name their strong and weak points. Therefore, this category of models allows formative assessment. In order to reduce the complexity of 2^{K} possible latent states, some variants of these models allow specifying a dependency graph over KCs (Leighton et al. 2004). Lynch and Howlin (2014) have developed such an adaptive test for newcomers in a MOOC that relies on a dependency graph of the knowledge components to learn, but they do not consider slip and guess parameters, i.e., their method is not robust to careless errors from the examinees.
Adaptive and Multistage Testing
Multistage testing (MST) (Zheng and Chang 2014) is a framework that allows asking questions in an adaptive, tree-based way (see Fig. 1): according to their performance, examinees are routed towards new questions. Therefore, learners follow a path in a tree, of which every node contains a pool of questions, and different learners may receive different questions, tailored to their level. Computerized adaptive testing (CAT) is a special case of multistage testing in which every node contains a single question.
While it is possible to assemble MST tests manually, automated test assembly algorithms save human labor and allow specifying statistical constraints (Zheng and Chang 2014; Yan et al. 2014). Three approaches are mainly used: exact methods that solve optimization problems but usually take long to compute, greedy heuristics that do not guarantee optimal subsets, and Monte Carlo methods that rely on random samples.
CAT systems rely on a student model that can predict student performance, such as the ones presented above. Based on this model, a CAT system can infer the student parameters, and ask questions accordingly, e.g., ask harder questions if the estimated latent ability of the student is high. Lan et al. (2014) have developed a framework called Sparse Factor Analysis (SPARFA), similar to MIRT, and they have shown that adaptive strategies provided better results for SPARFA than non-adaptive ones for the same number of questions asked. Cheng (2009) has applied adaptive strategies to the DINA model. Such cognitive-diagnostic computerized adaptive tests (CD-CAT) have been successfully applied in Chinese classrooms (Chang 2014; Zheng and Chang 2014). Vie et al. (2016b) have applied adaptive strategies to the General Diagnostic Model (Davier 2005), which is another cognitive diagnostic model, and shown that it outperformed adaptive tests based on the DINA model at predicting student performance. For a review of recent models in adaptive assessment, see Vie et al. (2016a).
Cold-Start
As we want to predict student performance, we try to infer the outcome of the learner over new questions, based on their current parameter estimates, which are themselves derived from their previous outcomes. But at the very beginning of the test, we have no data regarding the learner's knowledge. This is the learner cold-start problem, a term usually used for recommender systems (Anava et al. 2015). In an educational setting, to the best of our knowledge, the main reference to cold-start is Thai-Nghe et al. (2011), who state that the cold-start problem is not as harmful as in an e-commerce platform where new products and users appear every day. But with the advent of MOOCs, the cold-start problem regains some interest.
CAT systems usually compute an estimate of the student parameters at some point during the test. But when few questions have been asked, such an estimate may not exist or may be biased (Chalmers 2012; Lan et al. 2014), for example if all outcomes provided by the learner are correct (or all incorrect) and we want to compute a maximum-likelihood estimate. This is why the choice of the very first questions is important. Chalmers (2016) suggests asking a group of questions before starting the adaptive process, and this is the setting we focus on in this paper.
Determinantal Point Processes
Determinantal Point Processes (DPP) rely on the following idea: if the objects we try to sample from can be represented in such a way that we can compute a similarity value over any pair of elements (a kernel), then it is possible to devise an algorithm that efficiently samples a subset of elements that are diverse, i.e., far from each other. DPPs have been successfully applied to a variety of large-scale scenarios, such as sampling key headlines from a set of newspaper articles or summarizing image search results (Kulesza and Taskar 2012), but to the best of our knowledge, not to cold-start scenarios.
In this paper, we apply this technique to the sampling of questions for the initial stage of the test. Indeed, questions are usually modelled by parameter vectors, be it the KCs that are required, some difficulty parameters, or some tag information. A question measures the knowledge of the learner in the direction of its parameter vector, and questions that are close in parameter space measure similar constructs. If we want to reduce the number of questions asked, we should avoid selecting questions that measure similar knowledge, which is why it is natural to rely on a measure of diversity.
Research Context and Notations
Data
We assume we have access to a list of questions from a test, called an item bank. This item bank has a history: learners have answered some of those questions in the past. Each entry of this history consists of:
a student_id i ∈ {1,…,m};
a question_id j ∈ {1,…,n};
whether student i got question j right (1) or wrong (0).
The response pattern of a student i is a sequence (r_{i1},…,r_{in}) ∈ {0,1,NA}^{n} where r_{ij} denotes the correctness of student i's answer to question j. NA means that learner i did not try to answer question j. Such response patterns enable us to calibrate a student model.
As the assessment we are building should be formative, we assume we have access to a mapping of each question to the different knowledge components (KC) that are involved in its resolution: it is a binary matrix called the q-matrix, of which the (j,k) entry q_{jk} is 1 if knowledge component k is involved in question j, and 0 otherwise. Thanks to this structure, we can name the KCs.
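As an illustration of these notations, the following minimal sketch builds a response matrix with NA entries from a list of history records, alongside a toy q-matrix. The function name `build_response_matrix`, the record layout and all values are assumptions made for the example, not taken from the datasets used later.

```python
import numpy as np

def build_response_matrix(records, m, n):
    """Return an m x n array holding 1.0 / 0.0 for answers and NaN for NA."""
    R = np.full((m, n), np.nan)
    for student_id, question_id, correct in records:
        # Identifiers are 1-based in the text, so shift to 0-based indices.
        R[student_id - 1, question_id - 1] = float(correct)
    return R

# Toy history: 3 students, 4 questions (illustrative values only).
records = [(1, 1, 1), (1, 2, 0), (2, 1, 1), (2, 4, 1), (3, 3, 0)]
R = build_response_matrix(records, m=3, n=4)

# Toy q-matrix over 2 knowledge components:
# q[j, k] = 1 if KC k is involved in question j, 0 otherwise.
q = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 1]])
```

Row i of `R` is then the response pattern of student i+1, with `NaN` playing the role of NA.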
Student Model: General Diagnostic Model

features for every question and learner (the learner features will be called ability features), under the form of vectors with real values;

probability model that a certain learner answers a certain question correctly;

training step in order to extract question and learner features from an history of answers;

update rule of learner parameters based on learner answers.
The probability model relies solely on the features, i.e., for a fixed student, questions with close features will have a similar probability of being solved correctly. Also, students with similar ability features will produce similar response patterns.

K is the number of knowledge components of the qmatrix.

Learners i are modelled by ability vectors 𝜃_{i} = (𝜃_{i1},…,𝜃_{iK}) of size K. When estimated, any component 𝜃_{ik} can be interpreted as the strength of the student over KC k.

Questions j are modelled by discrimination vectors d_{j} = (d_{j1},…,d_{jK}) of size K and an easiness parameter δ_{j}. When estimated, any component d_{jk} can be interpreted as how much question j discriminates over KC k. Such vectors can be specified by hand, or directly learned from data (Vie et al. 2016b).

According to the qmatrix (q_{jk})_{j,k}, some parameters are fixed at 0: ∀j,k,(q_{jk} = 0 ⇒ d_{jk} = 0).
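The response formula referred to as (1) in the experimental protocol does not appear in this text. A form consistent with the parameters listed above, and with the probability Φ(δ_{j}) that the protocol assigns to a newcomer whose ability features are 0, is sketched below. Taking Φ to be the logistic function is an assumption, suggested by the use of logistic regression for ability estimation.

```latex
\Pr(\text{learner } i \text{ answers question } j \text{ correctly})
  = \Phi\left(\theta_i \cdot d_j + \delta_j\right)
  = \Phi\Big(\sum_{k=1}^{K} \theta_{ik}\, d_{jk} + \delta_j\Big),
\qquad
\Phi(x) = \frac{1}{1 + e^{-x}}
\tag{1}
```

Because d_{jk} = 0 whenever q_{jk} = 0, only the KCs involved in question j contribute to the sum.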
We assume that the level of the learner does not change after they give an answer. This is reasonable because the learner only sees their results at the end of the sequence of questions, not after every answer they give. This assumption holds for the very first selected pool of questions, and allows us to prefer IRT models or Cognitive Diagnosis models over Knowledge Tracing models.
In a cold-start scenario, we do not have any data about a newcomer. But we do have discrimination vectors for the questions, calibrated using the response patterns of the student data. These allow us to sample k questions from the item bank of n questions and ask them to the newcomer. From their answers, we can infer the learner ability features, and compute their predicted performance over the remaining questions using the response model.
Diversity Model
Testing every possible subset would not be feasible, as there are \(\binom{n}{k}\) of them, and computing the volume of each has complexity O(k^{2}K + k^{3}). Furthermore, determining the best subset is NP-hard, and therefore intractable in large-scale scenarios (Kulesza and Taskar 2012).
In order to implement our strategy, we also need a kernel κ that computes a similarity value for each pair of questions. It can be represented by a positive semidefinite matrix L such that L_{ij} = κ(d_{i},d_{j}). For our experiment, we choose the linear kernel κ(d_{i},d_{j}) = d_{i} ⋅ d_{j} (Kulesza and Taskar 2012).
Our Sampling Strategy: InitialD
Determinantal Point Processes
In other words, subsets of diverse questions are more likely to be drawn than subsets of correlated questions. Indeed, if V is the matrix whose rows are the question vectors d_{j} and V_{A} is the submatrix of V that only keeps the rows indexed by A, then L_{A} = V_{A}V_{A}^{T} because the kernel is linear, so Pr(A ⊂ Y) is proportional to det L_{A} = Vol({d_{j}}_{j∈A})^{2}.
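The link between the determinant and diversity can be checked on a small example. The sketch below builds a linear kernel from four hypothetical discrimination vectors (illustrative values, not calibrated on any dataset) and compares the squared volume of a subset of nearly collinear questions with that of a subset of orthogonal questions.

```python
import numpy as np

# Hypothetical discrimination vectors for 4 questions over K = 2 KCs.
V = np.array([
    [1.0, 0.0],   # question 0: KC 0 only
    [0.9, 0.1],   # question 1: nearly the same direction as question 0
    [0.0, 1.0],   # question 2: KC 1 only
    [0.5, 0.5],   # question 3: both KCs
])

L = V @ V.T  # linear kernel: L[i, j] = d_i . d_j

def squared_volume(L, subset):
    """det(L_A): squared volume spanned by the vectors indexed by `subset`."""
    idx = np.asarray(subset)
    return float(np.linalg.det(L[np.ix_(idx, idx)]))

similar = squared_volume(L, [0, 1])  # two nearly collinear questions
diverse = squared_volume(L, [0, 2])  # two orthogonal questions
```

Here `diverse` is much larger than `similar`, so under a determinantal point process the pair of orthogonal questions is far more likely to be drawn.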
Determinantal point processes have a useful property: k elements among n can be drawn with complexity O(nk^{3}), after a single preprocessing step in O(n^{3}) (Kulesza and Taskar 2012). This costly step consists in computing the eigendecomposition of the kernel matrix, and its cost can be reduced to O(nK^{2}) in the case of the linear kernel, as the question features span only K << n dimensions.
After the preprocessing step, sampling 10 questions from an item bank of size 10000 takes about nk^{3} = 10000 × 10^{3} = 10 million operations, which makes it suitable for large-scale data.
Sampling Strategy
InitialD (for Initial Determinant) samples k questions from the item bank according to a determinantal point process, using the question feature vectors. It selects the first bulk of questions, in a test, before any adaptation can be made.
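A sampler of this kind can be sketched as follows. This is not the authors' implementation: it is a minimal version of the standard two-phase k-DPP sampling procedure described by Kulesza and Taskar (2012), written with NumPy, and it uses the plain O(n^{3}) eigendecomposition rather than the faster route available for linear kernels. The function names and the toy feature matrix are assumptions, and k should not exceed the rank of the kernel.

```python
import numpy as np

def elementary_symmetric(eigvals, k):
    """E[l, m] = l-th elementary symmetric polynomial of the first m eigenvalues."""
    n = len(eigvals)
    E = np.zeros((k + 1, n + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for m in range(1, n + 1):
            E[l, m] = E[l, m - 1] + eigvals[m - 1] * E[l - 1, m - 1]
    return E

def sample_k_dpp(L, k, rng):
    """Draw a subset of exactly k indices from the k-DPP with kernel L."""
    eigvals, eigvecs = np.linalg.eigh(L)      # preprocessing step
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    E = elementary_symmetric(eigvals, k)

    # Phase 1: select k eigenvectors, going from the last eigenvalue to the first.
    chosen, remaining = [], k
    for m in range(len(eigvals), 0, -1):
        if remaining == 0:
            break
        if rng.random() < eigvals[m - 1] * E[remaining - 1, m - 1] / E[remaining, m]:
            chosen.append(m - 1)
            remaining -= 1
    V = eigvecs[:, chosen]

    # Phase 2: draw one item per selected eigenvector.
    items = []
    while V.shape[1] > 0:
        probs = np.sum(V ** 2, axis=1)
        probs /= probs.sum()
        i = int(rng.choice(len(probs), p=probs))
        items.append(i)
        # Project the basis onto the subspace orthogonal to item i.
        j = int(np.argmax(np.abs(V[i, :])))
        pivot = V[:, j].copy()
        V = V - np.outer(pivot, V[i, :] / pivot[i])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)            # re-orthonormalize
    return sorted(items)

# Toy item bank: 20 questions with random feature vectors over K = 5 KCs.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))
subset = sample_k_dpp(features @ features.T, k=3, rng=rng)
```

Calling `sample_k_dpp` again with the same kernel yields a different subset, which is the behaviour the strategy relies on to vary the questions across newcomers.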
After those questions are asked to the newcomer, a first estimate of the learner ability features can be computed. As the knowledge components are known, this estimate can provide useful feedback to the learner, specifying which KCs need further work.
Sampling several times will give different subsets. This allows balancing over the item bank, and not asking the same bulk of questions to every newcomer, which is interesting for security purposes, and allows calibrating various items from the bank (Zheng and Chang 2014).
Validation
We compared the following strategies for asking the first k questions to a newcomer.
Random
Sample k questions from the item bank at random.
Uncertainty
Sample the k questions for which the probability that the learner will answer them correctly is closest to 0.5 (which means, questions of average difficulty). This method is known as uncertainty sampling in the context of active learning (Settles 2010).
InitialD
Sample k questions according to a determinantal point process, as described above.
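The two baseline strategies can be sketched in a few lines. The sketch assumes a logistic response model and a newcomer whose ability features are 0, so that the predicted success probability of a question depends only on its easiness; the function names and the easiness values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uncertainty_selection(easiness, k):
    """Indices of the k questions whose predicted success probability is closest to 0.5."""
    p = sigmoid(np.asarray(easiness, dtype=float))  # newcomer ability assumed to be 0
    order = np.argsort(np.abs(p - 0.5), kind="stable")
    return order[:k].tolist()

def random_selection(n, k, rng):
    """Indices of k distinct questions drawn uniformly at random."""
    return sorted(rng.choice(n, size=k, replace=False).tolist())

easiness = [2.0, -0.1, 0.3, -1.5, 0.0]          # illustrative values
picked = uncertainty_selection(easiness, k=2)   # questions 4 and 1
drawn = random_selection(len(easiness), 3, np.random.default_rng(0))
```

Note that uncertainty sampling is deterministic for a given item bank, so every newcomer would receive the same first questions, whereas Random and InitialD vary from one draw to the next.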
For our experiments, we used the following datasets.
TIMSS
Trends in International Mathematics and Science Study (TIMSS) organizes a standardized test in mathematics. The collected data is freely available on their website for researchers, in SPSS and SAS formats. This dataset comes from the 2003 edition of TIMSS. It is a binary matrix of size 757 × 23 that stores the results of 757 8th-grade learners over 23 questions in mathematics. The q-matrix was specified by experts from TIMSS and has 13 knowledge components that are described in Su et al. (2013). It was available in the R package CDM (Robitzsch et al. 2014).
Fraction
This dataset contains the results of 536 middle school students over 20 questions about fraction subtraction. Items and the corresponding q-matrix over 8 knowledge components are described in DeCarlo (2010).
Cross Validation
This benchmark is performed using cross-validation: 80% of the learners from the history are assumed to be known, while 20% are kept to simulate a cold-start situation. The cross-validation is stratified, i.e., students have a similar distribution in the train set and the test set.
Model Calibration for Feature Extraction
This optimization problem has long been difficult to solve for MIRT models and the General Diagnostic Model, but the Metropolis-Hastings Robbins-Monro algorithm (Cai 2010a, 2010b) allows faster calibration. It has been implemented in the R package mirt (Chalmers 2012), which we used.
Experimental Protocol
TrainingStep: Feature Extraction
According to the available response patterns of the train set D[I_{train}], question vector features are extracted, as described above.
PriorInitialization: Initialization of a Newcomer
At the beginning of the test, the newcomer's ability features are set to 0, which means the probability that they answer question j correctly is Φ(δ_{j}), i.e., it depends solely on the easiness parameter (or bias).
FirstBundle: Sampling the First Questions
This is where strategies Random, Uncertainty or InitialD are applied. They return a subset of k questions Y ⊂{1,…,n}.
EstimateParameters: Estimate Learner Ability Features Based on the Collected Answers
Given their answers (r_{i})_{i∈Y}, the ability features 𝜃 of the learner are inferred using logistic regression. If the answers are all right or all wrong, other algorithms are used, defined in Chalmers (2012), Lan et al. (2014), Magis (2015).
As the q-matrix has been specified by an expert, the kth component of the ability feature 𝜃 can be interpreted as a degree of mastery of the learner over the kth knowledge component.
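The estimation step can be sketched as a logistic regression in which the discrimination vectors of the asked questions act as covariates and the easiness parameters as fixed offsets. The sketch below is an illustration under assumptions, not the procedure used in the experiments: it assumes a logistic response model, uses plain Newton iterations, and adds a small ridge penalty (the `ridge` parameter, a choice made here) so that an estimate exists even with few answers or with all-right or all-wrong patterns, where the text above relies on other algorithms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_ability(D, delta, answers, ridge=0.1, n_iter=50):
    """Newton ascent on the ridge-penalized logistic log-likelihood in theta.

    D: (k, K) discrimination vectors of the asked questions
    delta: (k,) easiness parameters of the asked questions
    answers: (k,) outcomes, 1 for right and 0 for wrong
    """
    D = np.asarray(D, dtype=float)
    delta = np.asarray(delta, dtype=float)
    r = np.asarray(answers, dtype=float)
    theta = np.zeros(D.shape[1])              # newcomer prior: ability 0
    for _ in range(n_iter):
        p = sigmoid(D @ theta + delta)
        grad = D.T @ (r - p) - ridge * theta
        hess = (D * (p * (1 - p))[:, None]).T @ D + ridge * np.eye(len(theta))
        theta = theta + np.linalg.solve(hess, grad)
    return theta

# Three asked questions over K = 2 KCs (illustrative values).
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
delta = [0.0, 0.0, 0.0]
theta = estimate_ability(D, delta, answers=[1, 0, 1])
```

In this toy case the learner succeeds on the question involving only the first KC and fails on the one involving only the second, so the first component of `theta` ends up larger than the second.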
PredictPerformance: Based on the Learner Ability Estimate, Compute Performance Over the Remaining Questions
Knowing the learner ability features and the question feature vectors, we can compute the probability that the learner will answer each question correctly by applying formula (1).
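In code, this prediction step reduces to one matrix product. The sketch assumes that the response model is logistic in 𝜃 ⋅ d_{j} + δ_{j}, which is consistent with the newcomer probability Φ(δ_{j}) stated above; `predict_correct` and the parameter values are hypothetical.

```python
import numpy as np

def predict_correct(theta, D, delta):
    """Probability of a correct answer on each question (assumed logistic model)."""
    logits = np.asarray(D, dtype=float) @ np.asarray(theta, dtype=float)
    logits = logits + np.asarray(delta, dtype=float)
    return 1.0 / (1.0 + np.exp(-logits))

D = np.array([[1.0, 0.0], [0.5, 0.5]])   # illustrative discrimination vectors
delta = np.array([0.0, 1.0])             # illustrative easiness parameters

prior = predict_correct(np.zeros(2), D, delta)            # newcomer: depends on delta only
updated = predict_correct(np.array([1.0, -1.0]), D, delta)
```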
EvaluatePerformance: Performance Metrics
We compute two performance metrics: the log loss, as stated in (4), and the distance d(𝜃,𝜃^{∗}) = ∥𝜃 − 𝜃^{∗}∥ to the final diagnosis 𝜃^{∗}, that is, the learner ability estimate when all questions have been asked (Lan et al. 2014).
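Both metrics can be sketched as follows. Formula (4) is not reproduced in this text, so the usual definition of log loss, averaged over the predicted questions, is used here as an assumption; the normalization in the experiments may differ. The distance is taken to be the Euclidean norm.

```python
import numpy as np

def log_loss(outcomes, probabilities, eps=1e-12):
    """Average negative log-likelihood of binary outcomes under the predictions."""
    r = np.asarray(outcomes, dtype=float)
    p = np.clip(np.asarray(probabilities, dtype=float), eps, 1 - eps)
    return float(-np.mean(r * np.log(p) + (1 - r) * np.log(1 - p)))

def distance_to_final(theta, theta_star):
    """Euclidean distance between the current and the final ability estimates."""
    return float(np.linalg.norm(np.asarray(theta, dtype=float)
                                - np.asarray(theta_star, dtype=float)))

uninformed = log_loss([1, 0], [0.5, 0.5])          # equals ln 2
confident = log_loss([1, 0], [0.9, 0.1])           # lower is better
gap = distance_to_final([3.0, 4.0], [0.0, 0.0])    # equals 5
```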
Adding a CAT to the Benchmark
We also added a standard computerized adaptive test to the benchmark, called GenMA, developed in Vie et al. (2016b) and shown to be better at predicting performance than the Rasch and DINA models on a variety of datasets. At each step, we choose the question that maximizes the determinant of the Fisher information (the so-called D-rule, see Chalmers 2016).
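The D-rule can be illustrated with a short sketch. It assumes a logistic response model, for which the Fisher information contributed by question j is p_{j}(1 − p_{j}) d_{j}d_{j}^{T}, and it adds a small multiple of the identity (the `ridge` parameter, an assumption made here) so that the determinant is nonzero before K questions have been asked. This is an illustration of the selection criterion, not the code used in the benchmark.

```python
import numpy as np

def d_rule_next(D, delta, theta, asked, ridge=1e-3):
    """Index of the unasked question that maximizes the determinant of the
    accumulated Fisher information at the current ability estimate."""
    D = np.asarray(D, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(D @ np.asarray(theta, dtype=float)
                              + np.asarray(delta, dtype=float))))
    weights = p * (1 - p)
    info = ridge * np.eye(D.shape[1])
    for j in asked:
        info = info + weights[j] * np.outer(D[j], D[j])
    best, best_det = None, -np.inf
    for j in range(D.shape[0]):
        if j in asked:
            continue
        det = np.linalg.det(info + weights[j] * np.outer(D[j], D[j]))
        if det > best_det:
            best, best_det = j, det
    return best

# Questions 0 and 1 measure the same KC, question 2 measures the other one.
D = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
next_question = d_rule_next(D, delta=[0.0, 0.0, 0.0], theta=[0.0, 0.0], asked=[0])
```

After question 0 has been asked, the rule prefers question 2, which brings information along a direction not yet measured.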
Results
Log loss values obtained (and precision rates) for the dataset Fraction
Strategy | After 3 questions | After 8 questions | After 15 questions
CAT | 0.757 ± 0.082 (67%) | 0.515 ± 0.06 (82%) | 0.355 ± 0.05 (88%)
Uncertainty | 0.882 ± 0.095 (72%) | 0.761 ± 0.086 (76%) | 0.517 ± 0.067 (86%)
InitialD | 0.608 ± 0.055 (74%) | 0.376 ± 0.027 (82%) | 0.302 ± 0.023 (86%)
Random | 0.842 ± 0.09 (70%) | 0.543 ± 0.07 (80%) | 0.387 ± 0.051 (86%)
Distance to the final diagnosis obtained for the dataset Fraction
Strategy | After 3 questions | After 8 questions | After 15 questions
CAT | 1.446 ± 0.094 | 1.015 ± 0.101 | 0.355 ± 0.103
Uncertainty | 1.495 ± 0.103 | 1.19 ± 0.112 | 0.638 ± 0.119
InitialD | 1.355 ± 0.08 | 0.859 ± 0.058 | 0.502 ± 0.047
Random | 1.467 ± 0.095 | 1.075 ± 0.089 | 0.62 ± 0.083
Log loss values obtained (and precision rates) for the dataset TIMSS
Strategy | After 3 questions | After 12 questions | After 20 questions
CAT | 1.081 ± 0.047 (62%) | 0.875 ± 0.05 (66%) | 0.603 ± 0.041 (75%)
Uncertainty | 1.098 ± 0.048 (58%) | 0.981 ± 0.046 (68%) | 0.714 ± 0.048 (72%)
InitialD | 0.793 ± 0.034 (61%) | 0.582 ± 0.023 (70%) | 0.494 ± 0.015 (74%)
Random | 1.019 ± 0.05 (58%) | 0.705 ± 0.035 (68%) | 0.512 ± 0.017 (74%)
Distance to the final diagnosis obtained for the dataset TIMSS
Strategy | After 3 questions | After 12 questions | After 20 questions
CAT | 1.894 ± 0.05 | 1.224 ± 0.046 | 0.464 ± 0.055
Uncertainty | 1.937 ± 0.049 | 1.48 ± 0.047 | 0.629 ± 0.062
InitialD | 1.845 ± 0.051 | 0.972 ± 0.039 | 0.465 ± 0.034
Random | 1.936 ± 0.052 | 1.317 ± 0.048 | 0.59 ± 0.043
Fraction
In Fig. 3, InitialD performs better than the other strategies, with a narrow confidence interval. Eight questions seem enough to correctly reconstruct 82% of the answers and to converge to a minimal log loss. For k ≤ 9, InitialD converges faster towards the final diagnosis, while for k ≥ 14, CAT converges faster, showing a benefit of adaptation in later stages of the process.
TIMSS
In Fig. 4, InitialD performs better than Random, CAT and Uncertainty. As early as the first question, InitialD clearly has a lower log loss in response pattern reconstruction. This happens because the question of biggest "volume" is the one whose vector has the highest norm, i.e., the most discriminating question, while the other strategies pick a question of average difficulty.
Asking 7 questions out of 23 using the strategy InitialD leads on average to the same estimation error as asking 12 questions at random, or asking 19 questions using a traditional adaptive test. InitialD converges towards the final diagnosis faster than the other strategies. Using our method, 12 questions seem enough to get a diagnosis that correctly predicts 70% of the outcomes.
Discussion
On every dataset we tried, InitialD performed better, and with a narrower confidence interval, than the other strategies. On the TIMSS dataset, the Random strategy performs well compared to a standard adaptive test. This may be because the test is already well balanced, so random questions have a high probability of being diverse.
If the number of questions to ask (k), the number of questions in the bank (n) and the number of measured knowledge components (K) are low, it is possible to simulate every subset of k questions among n. However, in practice, question banks on platforms will be large, so InitialD's complexity, O(nk^{3}) after a preprocessing step in O(n^{3}), will be an advantage. In this paper, we tested our method on datasets of up to 23 questions, but the exact determinantal point process sampling algorithm has already been tried on databases of thousands of items (Kulesza and Taskar 2012). Please note that this work naturally extends to the question cold-start problem: given a new question on the platform, how to identify a group of students to ask it to in order to estimate its discrimination parameters over all knowledge components.
InitialD could be improved by sampling several subsets of questions, and keeping the best of them. Sampling ℓ subsets of k questions has complexity O(ℓnk^{3}), finding the one achieving the biggest volume has complexity O(ℓk^{3}). Drawing several subsets increases the chance to determine the best subset to ask first.
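This improvement amounts to a selection among candidate subsets, which can be sketched as below. The candidate list stands in for ℓ draws from the sampler; the function name, the feature vectors and the candidates are hypothetical.

```python
import numpy as np

def best_of_samples(L, subsets):
    """Return the candidate subset whose kernel submatrix has the largest
    determinant, i.e., the one spanning the biggest volume."""
    def squared_volume(subset):
        idx = np.asarray(subset)
        return np.linalg.det(L[np.ix_(idx, idx)])
    return max(subsets, key=squared_volume)

# Hypothetical question vectors over K = 2 KCs and their linear kernel.
V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
L = V @ V.T

# Stand-in for three draws of k = 2 questions from the sampler.
candidates = [[0, 1], [0, 2], [1, 3]]
best = best_of_samples(L, candidates)
```

Each determinant costs O(k^{3}), which matches the O(ℓk^{3}) selection cost stated above.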
Conclusion and Further Work
We showed, using real data, that our strategy InitialD, based on determinantal point processes, performed better than other cold-start strategies at predicting student performance. As it is fast, this method can be applied to the generation of several diverse worksheets from the item bank of an educational platform: a learner can request a worksheet of k questions, attempt to solve them, receive their ability features as feedback (strong and weak points), then ask for another worksheet. Items already administered to a given student can be removed from the item bank, in their view, so that the same learner does not get the same exercise in two consecutive worksheets.
As further work, we would like to check whether sampling according to a determinantal point process is still useful in later stages of a multistage test, after a first learner ability estimate has been computed.
InitialD relies solely on pairwise similarities between questions: it can be used in conjunction with other response models, using other feature extraction techniques that provide better vector representations of the questions. For example, one could use various information at hand, such as a bag-of-words representation of the problem statement or extra tags specified by a teacher, in order to improve the embedding of items. Such extra information may improve the selection of questions, with the same algorithm InitialD. In this paper, we used a linear kernel for the predictions and for the student model, but nonlinear kernels could be used, possibly performing better but at the cost of interpretability.
For the interpretation of KCs, a q-matrix is useful. Koedinger et al. (2012) have shown that it is possible to combine q-matrices by crowdsourcing in order to improve student models. We would like to see if this also applies to the General Diagnostic Model, and if observing the discrimination parameters can help us determine better q-matrices.
Our strategy provides a fast ability estimate of the learner's knowledge, with a narrow confidence interval, that can be revealed to them in order to help them progress. If the lessons in the course are also mapped to KCs, lessons could for example be recommended based on this initial evaluation. We aim to apply InitialD to the automatic generation of worksheets at the beginning of a MOOC, in order to provide low-stakes formative assessments, and to evaluate them on real students.
Notes
Acknowledgments
This work is supported by the Paris-Saclay Institut de la Société Numérique funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02.
References
Anava, O., Golan, S., Golbandi, N., Karnin, Z., Lempel, R., Rokhlenko, O., & Somekh, O. (2015). Budget-constrained item cold-start handling in collaborative filtering recommenders via optimal design. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee (pp. 45–54).
Cai, L. (2010a). High-dimensional exploratory item factor analysis by Metropolis-Hastings Robbins-Monro algorithm. In Psychometrika 75.1 (pp. 33–57).
Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. In Journal of Educational and Behavioral Statistics 35.3 (pp. 307–335).
Chalmers, R.P. (2012). mirt: A multidimensional item response theory package for the R environment. In Journal of Statistical Software 48.6 (pp. 1–29).
Chalmers, R.P. (2016). Generating Adaptive and Non-Adaptive Test Interfaces for Multidimensional Item Response Theory Applications. In Journal of Statistical Software 71.1 (pp. 1–38).
Chang, H.H. (2014). Psychometrics Behind Computerized Adaptive Testing. In Psychometrika (pp. 1–20).
Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. In Psychometrika 74.4 (pp. 619–632).
Colvin, K.F., Champaign, J., Liu, A., Zhou, Q., Fredericks, C., & Pritchard, D.E. (2014). Learning in an introductory physics MOOC: All cohorts learn equally, including an on-campus class. In The International Review of Research in Open and Distributed Learning 15.4.
Davier, M. (2005). A general diagnostic model applied to language testing data. In ETS Research Report Series 2005.2 (pp. i–35).
De La Torre, J. (2009). DINA model and parameter estimation: A didactic. In Journal of Educational and Behavioral Statistics 34.1 (pp. 115–130).
DeCarlo, L.T. (2010). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. In Applied Psychological Measurement.
Desmarais, M.C., & Baker, R.S.J.D. (2012). A review of recent advances in learner and skill modeling in intelligent learning environments. In User Modeling and User-Adapted Interaction 22.1-2 (pp. 9–38).
Dunlosky, J., Rawson, K.A., Marsh, E.J., Nathan, M.J., & Willingham, D.T. (2013). Improving students' learning with effective learning techniques: promising directions from cognitive and educational psychology. In Psychological Science in the Public Interest 14.1 (pp. 4–58).
Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications Vol. 7. Berlin: Springer Science and Business Media.
Koedinger, K.R., McLaughlin, E.A., & Stamper, J.C. (2012). Automated student model improvement. In International Educational Data Mining Society.
Kulesza, A., & Taskar, B. (2012). Determinantal point processes for machine learning. arXiv preprint, http://arXiv.org/abs/1207.6083.
Lan, A.S., Waters, A.E., Studer, C., & Baraniuk, R.G. (2014). Sparse factor analysis for learning and content analytics. In The Journal of Machine Learning Research 15.1 (pp. 1959–2008).
Leighton, J.P., Gierl, M.J., & Hunka, S.M. (2004). The attribute hierarchy method for cognitive assessment: a variation on Tatsuoka's rule-space approach. In Journal of Educational Measurement 41.3 (pp. 205–237).
Lynch, D., & Howlin, C.P. (2014). Real world usage of an adaptive testing algorithm to uncover latent knowledge.
Magis, D. (2015). Empirical comparison of scoring rules at early stages of CAT.
Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L.J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. In Advances in Neural Information Processing Systems (pp. 505–513).
Reckase, M. (2009). Multidimensional item response theory Vol. 150. Berlin: Springer.
Robitzsch, A., Kiefer, T., George, A.C., & Ünlü, A. (2014). CDM: Cognitive diagnosis modeling. In R Package version 3.
Rosen, Y., Rushkin, I., Ang, A., Federicks, C., Tingley, D., & Blink, M.J. (2017). Designing adaptive assessments in MOOCs. ISBN: 9781450344500. https://doi.org/10.1145/3051457.3053993 (pp. 233–236). Cambridge: ACM.
Settles, B. (2010). Active learning literature survey. In University of Wisconsin, Madison 52.55-66 (p. 11).
Su, Y.L., Choi, K.M., Lee, W.C., Choi, T., & McAninch, M. (2013). Hierarchical cognitive diagnostic analysis for TIMSS 2003 mathematics. In Centre for Advanced Studies in Measurement and Assessment 35 (pp. 1–71).
Thai-Nghe, N., Drumond, L., Tomáš, H., Schmidt-Thieme, L., et al. (2011). Multi-relational factorization models for predicting student performance. In Proceedings of the KDD Workshop on Knowledge Discovery in Educational Data. Citeseer.
Vie, J.J., Popineau, F., Bourda, Y., & Bruillard, É. (2016a). A review of recent advances in adaptive assessment. In Learning analytics: Fundaments, applications, and trends: A view of the current state of the art. Springer, in press.
Vie, J.J., Popineau, F., Bourda, Y., & Bruillard, É. (2016b). Adaptive testing using a general diagnostic model. In European Conference on Technology Enhanced Learning (pp. 331–339). Springer.
Wilson, K.H., Xiong, X., Khajah, M., Lindsey, R.V., Zhao, S., Karklin, Y., Van Inwegen, E.G., Han, B., Ekanadham, C., Beck, J.E., et al. (2016). Estimating student proficiency: Deep learning is not the panacea. In Neural Information Processing Systems, Workshop on Machine Learning for Education.
Yan, D., Davier, A.A., & Lewis, C. (2014). Computerized multistage testing. USA: CRC Press.
Zheng, Y., & Chang, H.H. (2014). Multistage testing, on-the-fly multistage testing, and beyond. In Advancing methodologies to support both summative and formative assessments (pp. 21–39).
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.