Introduction

Computerized adaptive testing (CAT) is a personalized testing paradigm that aims to accurately estimate students’ proficiency by adaptively selecting the best-suited questions [20]. To achieve this goal, CAT executes two modules (i.e., a student proficiency estimation module and a question selector module) alternately step by step, as shown in Fig. 1. Specifically, at each step n, the student proficiency estimation module outputs the current estimated student proficiency \(\theta _n\) based on the previous responses (steps 1 to \(n-1\)); the question selector then selects the next question for the student to answer based on the given \(\theta _n\) and the question traits. Owing to its cost reduction, efficiency improvement, and security, CAT has been widely used by many standardized examination institutions, e.g., the GRE [21] and the Graduate Management Admission Test [27].

Fig. 1 An illustrative example of the CAT procedure

In past decades, considerable efforts have been made on CAT tasks [1]. One of the most typical works uses the 2-parameter Item Response Theory (IRT) model [7, 19] to estimate student proficiency based on previous responses and Maximum Fisher Information (MFI) [15, 20] as the question selector. Equation (1) gives the IRT function, where \(P_k(\theta _i)\) is the probability that the student with proficiency \(\theta _i\) correctly answers question \(e_k\) with two pretrained parameters (i.e., discrimination \(\alpha _k\) and difficulty \(\beta _k\)), and \(\sigma [\cdot ]\) is the logistic function. Given the student’s previous responses (steps 1 to n), the IRT model estimates the student proficiency \({\hat{\theta }}_i^{(n)}\). Equation (2) gives the Fisher information of question \(e_k\) under the current proficiency estimate \({\hat{\theta }}_i^{(n)}\), where \(P_{{k}}^{\prime }({\hat{\theta }})=\partial P_{{k}}({\hat{\theta }}) / \partial {\hat{\theta }}\) is the derivative. The question with the maximum information is selected; that is, the best-suited question at each step is the one with high discrimination and difficulty close to the current estimate of the student’s proficiency. Afterwards, many similar approaches were developed, e.g., the CAT system [6] that uses multidimensional Item Response Theory (MIRT) [11] as the student proficiency estimation module and Kullback–Leibler Information (KLI) [4] as the question selector, and the CAT system [20] that uses Bayesian network models [24] as the student proficiency estimation module and the maximum expected information gain [30] as the question selector.

$$\begin{aligned}{} & {} P_k(\theta _i)=P\left( Y=1|\theta _i, e_k \right) =\sigma \left[ \alpha _{k}\left( \theta _i-\beta _{k}\right) \right] \end{aligned}$$
(1)
$$\begin{aligned}{} & {} I_{k}({\hat{\theta }}_i^{(n)})=\frac{\left[ P_{k}^{\prime }({\hat{\theta }}_i^{(n)})\right] ^{2}}{P_{k}({\hat{\theta }}_i^{(n)})\left[ 1-P_{k}({\hat{\theta }}_i^{(n)})\right] } \end{aligned}$$
(2)
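To make Eqs. (1) and (2) concrete, the following minimal sketch (in Python with NumPy; the function and variable names are ours, chosen purely for illustration) computes the answer probability and the Fisher information of each candidate question for a given proficiency estimate and selects the most informative one, as MFI does.

```python
import numpy as np

def irt_prob(theta, alpha, beta):
    # Eq. (1): probability of a correct answer under the 2-parameter IRT model
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

def fisher_information(theta, alpha, beta):
    # Eq. (2): I_k(theta) = P'(theta)^2 / (P(theta) * (1 - P(theta)))
    # For the 2PL model, P'(theta) = alpha * P * (1 - P), so I = alpha^2 * P * (1 - P).
    p = irt_prob(theta, alpha, beta)
    return alpha ** 2 * p * (1.0 - p)

# Illustrative question bank: pretrained discrimination (alpha) and difficulty (beta)
alphas = np.array([1.2, 0.8, 1.5, 1.0])
betas = np.array([-0.5, 0.3, 0.1, 1.2])
theta_hat = 0.2                       # current proficiency estimate

info = fisher_information(theta_hat, alphas, betas)
next_question = int(np.argmax(info))  # MFI picks the most informative question
print(info, next_question)
```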

Although these traditional question selection methods are effective to a certain extent, they still have significant limitations. Specifically, most question selection algorithms are based on predefined criteria, which carry inherent preferences and cannot effectively capture complex data characteristics [4, 15]. To address the limitations of these heuristic-based question selectors, some researchers have also explored question selectors based on learned selection strategies [10] by reformulating CAT as a bilevel optimization problem [10, 40] or a reinforcement learning problem [5, 14]. These learning-based question selectors have shown advantages over criteria-based question selectors [14]. However, existing learning-based CAT frameworks are not flexible enough, since their two modules (i.e., the student proficiency estimation model and the question selector) are coupled during training. For example, the bilevel optimization-based CAT framework [10] obtains the parameters of the two modules through coupled inner and outer optimizations during the training process, i.e., the result of the outer optimization is used to measure the quality of the inner optimization.

To this end, we propose a novel CAT framework with decoupled learning selector (DL-CAT). To be specific, the main contributions of this paper can be summarized as follows:

  • We propose a novel deep learning-based question selector, which directly outputs the next question given the currently estimated student proficiency and question features. Compared with existing CAT systems, its advantages are twofold. On one hand, it uses deep neural networks to select the next question directly and more intelligently, instead of relying on predefined criteria. On the other hand, the question selector is agnostic to both the model and the training process of the student proficiency estimation module.

  • We are the first to decouple the parameter learning of the question selector from that of the student proficiency estimation model in a learning-based CAT framework. To achieve this goal, we specially design an approximate ground-truth construction module and a tailored pairwise rank loss to update the parameters of the question selector independently.

  • Experimental results on two real-world datasets demonstrate the effectiveness and efficiency of the proposed DL-CAT. In particular, DL-CAT shows clear advantages in effectiveness and significant advantages in efficiency compared with both traditional question selectors and recent learning-based question selectors.

Related work

In this paper, we focus on the question selector component of CAT. Existing question selectors of CAT are mainly divided into the following two categories: heuristic-based question selector and learning-based question selector.

Heuristic-based question selector

Most of the question selectors in this category are model-specific: they are specially designed by experts according to the characteristics of different student proficiency estimation models (i.e., the other component of CAT). Examples include Maximum Fisher Information (MFI) [15], Kullback–Leibler Information (KLI) [4], Shannon entropy, mutual information, error minimization [29, 34, 35, 37], and their extensions, which are specially designed for IRT models [11, 26, 31]. Among them, the core of the MFI [15] selection strategy is to use the current estimate of the student’s ability to compute the information content of the remaining questions in the question bank and then select the question with the largest amount of information as the next test item. Since the Fisher information used by MFI [15] depends only on the current ability estimate, Chang and Ying proposed a question selection method based on global KL information, which selects the question with the largest KL value; the larger the KL value, the better the question distinguishes the current proficiency estimate from other proficiency values within a reasonable fluctuation range. Afterward, when the student proficiency estimation module was extended from unidimensional IRT to multidimensional IRT (MIRT), a variety of question selection strategies were proposed by extending and optimizing MFI and KLI in the multidimensional case. The KLI-based multidimensional extensions mainly include a weighted KL strategy (PWKL) [32] and a KL strategy based on the posterior distribution of ability [22]. The flexibility of these model-specific question selectors is limited, since it is difficult to design a dedicated question selector for neural network-based student proficiency estimation models.

To address the limitations of model-specific question selectors, a model-agnostic question selector, namely MAAT, was proposed to select questions based on uncertainty [2]. Inspired by the idea of active learning, it is designed from the perspective of model change, i.e., the expected model change (EMC) caused by each candidate question is calculated to measure the informativeness of that question [3]. Finally, the question with the largest EMC is selected, and this question selector is independent of the student proficiency estimation model.

In conclusion, these heuristic-based algorithms are designed only from prior knowledge or expert experience, and thus fail to fully exploit the characteristics of the data.

Learning-based question selector

To address the problem that heuristic-based question selection algorithms cannot capture data features, data-driven, learnable question selectors have gradually attracted attention. This category attempts to learn and continuously optimize a question selector from large-scale behavior data, instead of using static question selection algorithms, so as to reduce the error of proficiency estimation as much as possible. Representatives of this category include bilevel optimization-based computerized adaptive testing (BOBCAT) [10] and neural computerized adaptive testing (NCAT) [40]. Both models recast the CAT problem as a bilevel optimization problem. In this framework, the outer-level optimization learns both the response model parameters and a data-driven question selection algorithm by explicitly maximizing the predictive likelihood of student responses on a held-out meta question set, while the inner-level optimization adapts the outer-level response model to each student by maximizing the predictive likelihood of their responses on an observed training question set. To solve this bilevel optimization problem, BOBCAT employs a biased approximate estimator of the gradient with respect to the question selection algorithm parameters, while NCAT formally transforms the problem into an equivalent reinforcement learning problem [5, 14].

While the above learning-based question selectors have achieved great success, the inflexibility of the bilevel optimization-based framework is still apparent, since the inner and outer optimizations (i.e., the parameter learning of the two components) are mutually coupled in the training process, i.e., the inner (outer) optimization is used to measure the quality of the outer (inner) optimization. This coupled training process requires the question selector to be retrained from scratch whenever new questions are added.

Therefore, in this paper, on one hand, we design a deep learning-based question selector that automatically predicts question selection probabilities, so as to capture the data characteristics more comprehensively; on the other hand, to address the training inflexibility caused by coupling the parameter learning of the question selector with that of the student proficiency estimation model, we specially design a ground-truth construction module to train the question selector independently. With these designs, when new questions are added to the question bank, the question selector can directly produce selection scores for the new question bank without retraining (only the student proficiency estimation model needs to be retrained to obtain the parameters of the new questions), which makes the framework more applicable to real CAT education scenarios.

Table 1 A list of major notations used in this work

Preliminaries and problem formalization

This section consists of two parts: preliminaries and problem formalization. The preliminaries introduce the terminology, goals, and structure of traditional computerized adaptive testing (CAT). The problem formalization describes the definition and extension of the CAT task considered in this paper. The major notations used in this paper are listed in Table 1.

Preliminaries

In an intelligent education system, suppose there are L students and M questions, represented by the student set \(S=\{s_1,s_2,\ldots ,s_L\}\) and the question bank \( E=\left\{ e_{1}, e_{2}, \ldots , e_{M}\right\} \), respectively. Each element of the response logs R is a triplet (s, e, r), where \(s\in S\), \(e\in E\), and \(r \in \{0,1\}\) is the response score of student s on question e; \(r=1\) indicates a correct answer and \(r=0\) otherwise. A typical CAT system has two components, namely the proficiency estimation model M and the question selector \(\pi \): the former is trained on the response logs R from students, and the latter dynamically selects the next question e based on the student’s behavior record R and proficiency estimate \(\theta \). After multiple rounds of question selection based on the student’s interaction data, a test sequence tailored to student s is formed when the test ends. Our goal is to design a strategy \(\pi \) that selects a question set of size N, denoted by \(E_{i} = \left\{ e_1^*, e_2^*, \ldots , e_N^*\right\} \), step by step according to the performance of \(s_{i}\), so as to test the examinee more accurately.
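As a rough illustration of this alternating procedure, the sketch below (in Python; the names are hypothetical, and the estimator, selector, and response oracle are placeholders rather than the actual modules) stores response logs as (s, e, r) triplets and runs N selection steps.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[int, int, int]  # (student id, question id, response in {0, 1})

def run_cat_session(student: int,
                    question_bank: List[int],
                    estimate_theta: Callable[[List[Triplet]], float],
                    select_question: Callable[[float, List[int]], int],
                    answer: Callable[[int], int],
                    n_steps: int) -> float:
    """One CAT session: alternately re-estimate proficiency and pick the next question."""
    logs: List[Triplet] = []
    remaining = list(question_bank)
    theta = estimate_theta(logs)               # initial proficiency estimate
    for _ in range(n_steps):
        e = select_question(theta, remaining)  # question selector pi
        remaining.remove(e)
        r = answer(e)                          # student's observed response
        logs.append((student, e, r))
        theta = estimate_theta(logs)           # proficiency estimation module M
    return theta
```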

As mentioned in the related work, some recent data-driven methods formalize the CAT problem as a bilevel optimization problem in a meta-learning setting, where the parameter learning of the proficiency estimation model M and of the question selector \(\pi \) are coupled together. Specifically, in the outer-level optimization, the parameters of the two modules are learned by maximizing the prediction accuracy on the meta set. In the inner-level optimization, the parameters learned at the outer level are adapted to each student’s proficiency assessment so as to maximize the prediction accuracy on the training set.

Fig. 2 The question selector and its training methodology

Problem formalization

In this paper, our goal is to decouple the training of the proficiency estimation model M and the question selector \(\pi \). Therefore, we focus on designing a learnable question selector \(\pi \) that selects one question e from E for a student \(s_i\) at each step k, given the current estimated student proficiency \({\hat{\theta }}_i^{(k-1)}\) and the question parameters. After receiving the response r of student \(s_i\) on question e, the model M updates the estimate to a new proficiency \({\hat{\theta }}_i^{(k)}\). It is worth noting that the update of the student proficiency \({\hat{\theta }}_i\) is based only on the previously selected questions and the corresponding responses, and is independent of the question selector itself. This procedure is repeated n times, so as to accurately estimate the student proficiency, i.e., \({\hat{\theta }}_i^{(n)} \rightarrow \theta _i^0\), where \(\theta _i^0\) is the true (usually unknown) proficiency of student \(s_i\).

The proposed approach

Overall framework

The main idea of the proposed approach is to select the most suitable question for a student at each step through a question selector, according to the student’s current knowledge mastery (or knowledge proficiency) on each knowledge concept. Obviously, two components largely determine the quality of the selected question: the student proficiency estimation model, which diagnoses the student’s current knowledge mastery, and the question selector, which recommends a suitable question. Since the focus of this paper is on designing an effective question selector, we directly employ existing student proficiency estimation models (such as IRT [7] or MIRT [11]) to obtain the student’s knowledge proficiency on each knowledge concept. However, due to the particularity of the task, there is no ground truth to measure the quality of the selected question, and thus the general training procedure cannot be directly applied. To address this issue, we devise a ground-truth construction module that generates reasonable labels so that the question selector can be trained properly. In addition, we further propose a pairwise rank loss function to stabilize the training of the model and thus achieve better final performance.

For better understanding, the overall framework of the proposed DL-CAT is summarized in Fig. 2, and it mainly consists of four steps. First, a classical cognitive diagnosis model is trained on all students’ response logs; this student proficiency estimation model obtains students’ knowledge proficiency by modeling their exercising scores. Second, the student ability parameter and the latent question parameters obtained from the student proficiency estimation model are combined as the input of the question selector, which then predicts the student’s scores on all candidate questions. Third, the ground-truth construction module is employed to generate the corresponding labels for the given student. Fourth, the loss is computed from the predicted outputs and the generated labels to update the weight parameters of the question selector, where the proposed pairwise rank loss is adopted. The above procedure is repeated over all students until the question selector converges. After that, the question selector recommends the question with the best score to each student.

Question selector

The question selector \(\pi \) selects a sequence of suitable questions for a student, based on the student’s current knowledge proficiency state, by exploring the relationship between the student and the questions. Intuitively, the more accurately the student’s current knowledge proficiency is diagnosed, the more suitable the question selected for the student will be. Therefore, it is important to obtain accurate diagnosis results for each student through a cognitive diagnosis model \(M_{C}\) before mining the potential relationship between the student and new questions that the student has never attempted.

For this aim, we first train a cognitive diagnosis model \(M_{C}\) based on all students’ response logs in the training dataset D by minimizing the following loss:

$$\begin{aligned} {\text {loss}}=-\frac{1}{n} \sum _{(s_{i}, e_{j}, r_{ij}) \in D} r_{ij} \log \hat{r_{ij}}+(1- r_{ij}) \log (1-\hat{r_{ij}}), \end{aligned}$$
(3)

where \(\hat{r_{ij}} = \ {M}_C(s_i,e_j)\) denotes the model \(M_{C}\)’s predicted probability of student \(s_i\) correctly answering question \(e_{j}\).
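For reference, Eq. (3) is a standard binary cross-entropy over the response logs; a minimal PyTorch sketch (illustrative only, with names of our choosing) is:

```python
import torch
import torch.nn.functional as F

def response_loss(predicted_probs: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
    # Eq. (3): negative log-likelihood of the observed binary responses r_ij,
    # where predicted_probs are the outputs of M_C(s_i, e_j) and responses are in {0, 1}
    return F.binary_cross_entropy(predicted_probs, responses.float())
```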

Then, we directly extract the student-related parameters from the trained model \(M_{C}\) as the student’s current knowledge proficiency. Besides, to obtain the representation of a new question \(e_j\) that the student has never attempted, we also extract the question-related parameters from model \(M_{C}\). Note that, in the following, \(e_{j}\) denotes a question in the question bank that student \(s_i\) has never attempted.

Take the Item Response Theory (IRT [7]) model as an example, whose forward pass process can be denoted as follows:

$$\begin{aligned} \left\{ \begin{array}{l} {\textbf{h}}_S = {\textbf{x}}_i^S\times W_S, \ W_S \in R^{L\times D}\\ {\textbf{h}}_E = {\textbf{x}}_j^E\times W_E, \ W_E \in R^{M\times D}\\ \theta _i = {\textbf{h}}_S \times W_{\theta }, \ W_{\theta } \in R^{D\times 1}\\ \alpha _j = {\textbf{h}}_E \times W_{\alpha }, \ W_{\alpha } \in R^{D\times 1}\\ \beta _j = {\textbf{h}}_E \times W_{\beta }, \ W_{\beta } \in R^{D\times 1}\\ {\hat{r}}_{ij} = \hbox {Sigmoid}(\alpha _j\cdot (\theta _i-\beta _j))\\ \end{array} \right. , \end{aligned}$$
(4)

where D is the embedding dimension, \({\textbf{x}}_i^S \in \{0,1\}^{1\times L}\) is the student one-hot vector for \(s_i\), \({\textbf{x}}_j^E \in \{0,1\}^{1\times M}\) is the question one-hot vector for \(e_j\), and \(W_S\), \(W_E\), \(W_{\theta }\), \(W_{\alpha }\), and \(W_{\beta }\) are trainable matrices in the embedding layers. As a result, \({\hat{r}}_{ij} = {M}_C(s_i,e_j|{\hat{\theta }}_{i}, \alpha _{j},\beta _j)\). Here, we can extract a student-related parameter \(\theta _i\) for student \(s_i\) and two question-related parameters for a new question \(e_j\), i.e., the question discrimination \(\alpha _j\) and the question difficulty \(\beta _j\).
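A minimal PyTorch sketch of the embedding-based IRT forward pass in Eq. (4) is given below (the class and attribute names are ours, chosen for illustration); the student parameter \(\theta _i\) and the question parameters \(\alpha _j\) and \(\beta _j\) can then be read out of the trained embeddings as described above.

```python
import torch
import torch.nn as nn

class IRTModel(nn.Module):
    def __init__(self, n_students: int, n_questions: int, dim: int = 8):
        super().__init__()
        self.student_emb = nn.Embedding(n_students, dim)     # plays the role of W_S
        self.question_emb = nn.Embedding(n_questions, dim)   # plays the role of W_E
        self.w_theta = nn.Linear(dim, 1, bias=False)          # W_theta
        self.w_alpha = nn.Linear(dim, 1, bias=False)          # W_alpha
        self.w_beta = nn.Linear(dim, 1, bias=False)           # W_beta

    def forward(self, student_ids: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        h_s = self.student_emb(student_ids)                   # h_S
        h_e = self.question_emb(question_ids)                 # h_E
        theta = self.w_theta(h_s)                             # theta_i
        alpha = self.w_alpha(h_e)                             # alpha_j
        beta = self.w_beta(h_e)                               # beta_j
        return torch.sigmoid(alpha * (theta - beta)).squeeze(-1)  # r_hat_ij

    def question_params(self, question_ids: torch.Tensor):
        # Extract the discrimination alpha_j and difficulty beta_j of new questions
        h_e = self.question_emb(question_ids)
        return self.w_alpha(h_e), self.w_beta(h_e)
```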

Afterward, the proposed question selector \(\pi \) takes the student’s current knowledge proficiency and question parameters as its inputs. To integrate the two types of input features, a simple yet effective concatenation operation is used to get the formal input of \(\pi \) as

$$\begin{aligned} X_{ij} = [ \theta _{i},\alpha _{j}, \beta _{j}]. \end{aligned}$$
(5)

As a result, for a new question \(e_j\), the proposed question selector \(\pi \) predicts the score \(p_{ij}\) of student \(s_{i}\) on question \(e_{j}\) by an h-layer multi-layer perceptron (MLP) [25], whose forward pass is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} y_{1}=\sigma \left( {W}_{1} \times {X}_{ {ij}}+ {b}_{1}\right) \\ y_{i} = \sigma (W_{i}\times y_{i-1}+b_{i}),\ 2 \le i \le h-1 \\ p_{ij}={\sigma }\left( {W}_{h} \times {y}_{h-1}+b_{h}\right) \\ \end{array} \right. , \end{aligned}$$
(6)

where \(\sigma (\cdot )\) is the activation function, \({y}_{i}\) is the output of the ith layer, \(W_i\) and \(b_i\) are the weights and bias of the ith layer.
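A corresponding sketch of the selector \(\pi \) defined by Eqs. (5) and (6), assuming the 3-layer setting with hidden size 8 used later in the experiments (the class name is hypothetical), could look as follows:

```python
import torch
import torch.nn as nn

class QuestionSelector(nn.Module):
    """Predicts a selection score p_ij from the concatenated input [theta_i, alpha_j, beta_j]."""
    def __init__(self, in_dim: int = 3, hidden: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),   # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),   # layer 2
            nn.Linear(hidden, 1), nn.Sigmoid(),     # output layer of Eq. (6)
        )

    def forward(self, theta: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        x = torch.cat([theta, alpha, beta], dim=-1)  # Eq. (5): X_ij = [theta_i, alpha_j, beta_j]
        return self.net(x).squeeze(-1)               # p_ij
```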

Ground-truth construction

As the question selector \(\pi \) is independent of model \({M}_C\), it is infeasible to directly employ model \({M}_C\) to determine whether a selected question is suitable for a student. To measure the quality of a selected question, we propose an approximate ground-truth construction module to obtain an effective ground truth.

Specifically, this module generates a ground truth that measures the effectiveness of a question selected by \(\pi \). To this end, the RMSE (root mean squared error) is used to quantify the difference between the student’s real proficiency \(\theta _i^{0}\) and the updated proficiency \({\hat{\theta }}_i\) after a question is selected for student \(s_i\); since \(\theta _i^{0}\) is unknown, it is approximated through the student’s response logs on the meta set. First, we initialize the knowledge proficiency \(\theta _{i}\) of student \(s_{i}\), select a set of questions \(\{e_{1}, e_{2},\ldots , e_{q}\}\) as the support set \(D_{T}^{i}\), and obtain the updated proficiency \({\hat{\theta }}_i\) based on \({M}_C\) and \(D_{T}^{i}\). Then we calculate \(Rmse_{ij}\) of \({\hat{\theta }}_{i}\) by

$$\begin{aligned} Rmse_{ij}=\sqrt{\frac{1}{|D_{V}^{i}|} \sum _{f=1}^{|D_{V}^{i}|}\left( r_{if}- {M}_C(s_i,e_f|{\hat{\theta }}_{i}, \alpha _{f},\beta _f)\right) ^{2}}, \end{aligned}$$
(7)

where \(D_{V}^{i}\) denotes the meta set, \(|D_{V}^{i}|\) is the number of questions in \(D_{V}^{i}\), \(r_{if}\) is the real response score on question \( e_{f}\), and \({M}_C(s_i,e_f|{\hat{\theta }}_{i}, \alpha _{f},\beta _f)\) is the probability predicted by model \({M}_C\) for question \(e_{f}\) under the updated student proficiency \({\hat{\theta }}_{i}\).

The ground-truth construction module simulates the real CAT estimation process and measures the quality of a selected question through the results on the meta (validation) set. In this paper, the approximation generated by this simulation is used as the label for the predicted score, thereby overcoming the problem that there is no direct ground truth reflecting how suitable a selected question is for a student.
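In code, Eq. (7) amounts to the prediction RMSE of the updated proficiency on the meta set; a simplified sketch (the update of \({\hat{\theta }}_{i}\) itself is performed by the trained model \({M}_C\) and is omitted here, and the function name is ours) is:

```python
import numpy as np

def meta_set_rmse(responses_meta, predicted_probs_meta) -> float:
    """Eq. (7): RMSE between the real meta-set responses r_if and the probabilities
    predicted by M_C under the updated proficiency theta_hat_i."""
    r = np.asarray(responses_meta, dtype=float)
    p = np.asarray(predicted_probs_meta, dtype=float)
    return float(np.sqrt(np.mean((r - p) ** 2)))
```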

Algorithm 1

Training with pairwise rank loss

Because the ground-truth values generated by the ground-truth construction module differ only slightly from one another, they cannot accurately reflect the ordering among candidate questions, and traditional loss functions therefore handle this poorly [38]. To tackle this problem, inspired by learning to rank [17] and learning loss [38], we propose a novel pairwise rank loss, which captures the ordering of the selected questions well through partial-order relations over question pairs. Given the sorted prediction score sequence \(P_{i}=\{p_{i1},\ldots ,p_{iq}\}\) and the computed RMSE sequence \({RMSE}_{i} = \{{rmse}_{i1},\ldots ,{rmse}_{iq} \}\), the pairwise rank loss \({L}_i\) of student \(s_{i}\) over the q questions is computed by

$$\begin{aligned} \begin{aligned}&{L}_{i}=\sum _{j=1}^{q} \sum _{k=1}^{q-1} \max \left( 0, A\left( {rmse}_{ij}, \ {rmse}_{ik}\right) \cdot \left( p_{ij}-p_{ik}\right) \right) \\&A\left( {rmse}_{ij}, {rmse}_{ik}\right) = \left\{ \begin{array}{ll}-a, &{} \text{ if }\ {rmse}_{ij} < {rmse}_{ik} \\ +a, &{} \text{ otherwise } \end{array}\right. , \end{aligned} \end{aligned}$$
(8)

where the parameter a is a penalty coefficient that represents the weight of the position: the higher the position, the greater the value of a.
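A direct sketch of Eq. (8) in PyTorch is given below; for simplicity the penalty coefficient a is passed as a constant scalar here, whereas in the paper it can depend on the positions j and k (see the settings compared in the experiments). The function name is illustrative.

```python
import torch

def pairwise_rank_loss(scores: torch.Tensor, rmses: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    """Eq. (8): hinge loss over question pairs. A question with a smaller RMSE should
    receive a larger predicted score p_ij than a question with a larger RMSE."""
    p_diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # p_ij - p_ik for all pairs (j, k)
    sign = torch.where(rmses.unsqueeze(1) < rmses.unsqueeze(0),
                       torch.full_like(p_diff, -a),
                       torch.full_like(p_diff, a))            # A(rmse_ij, rmse_ik)
    return torch.clamp(sign * p_diff, min=0.0).sum()
```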

With the tailored loss function, the proposed question selector can be trained effectively; the training process is summarized in Algorithm 1. First, the knowledge proficiency of each student in S is initialized, and the basic cognitive diagnosis model M is pre-trained on the historical data until convergence, after which the question-related parameters can be extracted from model M (Lines 1–2). Second, when step = 0, the initial knowledge proficiency parameter is used directly; when step > 0, the knowledge proficiency parameter is updated (Lines 4–6). After that, the record data \(D^s\) of a student s is divided into a support set \(D_T^s\) and a meta set \(D_V^s\) (Lines 8–9). Next, the predicted scores of the questions in \(D_T^s\) are computed from the concatenated features, and the RMSE on the meta set \(D_V^s\) is used to evaluate the updated \({\hat{\theta }}\) (Lines 10–15). Finally, Eq. (8) is used to compute the loss from \(P_{i}\) and \(RMSE_{i}\), and the selector parameters \(f(\varphi )\) are updated for the next iteration (Lines 16–17).

Experiment

In the experiment part, we mainly focus on answering the following questions:

  • (RQ1) How does the proposed DL-CAT perform compared with the state-of-the-art approaches?

  • (RQ2) How does DL-CAT perform in the simulation experiment?

  • (RQ3) Is the designed pairwise rank loss more effective than traditional listwise loss methods?

  • (RQ4) Is there any difference in the impact of different penalty coefficient settings?

  • (RQ5) What are the advantages of DL-CAT in terms of efficiency?

Experimental settings

Datasets. To validate the effectiveness of the proposed DL-CAT for computerized adaptive testing, two real-world educational datasets were used in the following experiments, including the public dataset ASSIST [8] and a private dataset Math, whose statistics are summarized in Table 2 and whose descriptions are as follows:

  • ASSISTments (ASSISTments 2009–2010 skill builder; Feng et al. [8]) is an openly available dataset created in 2009 by the ASSISTments online tutoring service. Here we adopt the publicly corrected version [36], which does not contain duplicated data. It comprises more than 4 thousand students, nearly 18 thousand questions, and over 300 thousand response logs.

  • Math is a private dataset consisting of real behavior records of students on mathematics exercises from an online education platform. It mainly contains behavioral data of students in grades 1 to 6, with more than 10,000 students and more than 1 million response records in total.

Data processing. To ensure the reliability of the experimental results, for both ASSIST and Math, we first filtered out the knowledge concepts with fewer than ten related questions. Moreover, in ASSIST, we also filtered out the questions answered fewer than 50 times and the students who answered fewer than 10 questions.
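As a rough illustration of this filtering (using pandas; the column names are assumptions on our part, not taken from the original datasets), the preprocessing could be implemented as:

```python
import pandas as pd

def preprocess(logs: pd.DataFrame) -> pd.DataFrame:
    """Filter the response logs as described above. Assumed columns:
    'student_id', 'question_id', 'concept_id', 'correct'."""
    # Keep knowledge concepts with at least 10 related questions
    q_per_concept = logs.groupby("concept_id")["question_id"].nunique()
    logs = logs[logs["concept_id"].isin(q_per_concept[q_per_concept >= 10].index)]
    # Keep questions answered at least 50 times (ASSIST only)
    q_counts = logs["question_id"].value_counts()
    logs = logs[logs["question_id"].isin(q_counts[q_counts >= 50].index)]
    # Keep students who answered at least 10 questions (ASSIST only)
    s_counts = logs["student_id"].value_counts()
    return logs[logs["student_id"].isin(s_counts[s_counts >= 10].index)]
```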

Table 2 Statistics of the datasets

Benchmark methods. For a fair comparison, we adopted two ability assessment models, IRT [7] and MIRT [11], as the proficiency estimation models; their descriptions are as follows:

  • IRT [7]: the most typical cognitive diagnosis model, which uses a simple logistic function to integrate the input vectors and represents student mastery as a unidimensional continuous latent trait for predicting the probability of a student correctly answering questions;

  • MIRT [11]: As the successor of IRT, MIRT extends IRT’s unidimensional student and question latent traits into multidimensional space to enhance the learnt representation for the demands of multidimensional data.

In addition, we compared DL-CAT with five state-of-the-art question selection algorithms: heuristic-based question selectors including MFI, MAAT, and its variant MAAT Cov; a learning-based question selector, BOBCAT; and a random selector named Random. These comparison algorithms are elaborated as follows:

  • Random: selects questions randomly from the question bank and serves as a reference point for the other methods.

  • MFI (Linden and Pashley [15]): the most classical and most widely used selection algorithm, which selects the question with the maximum Fisher information; this method is only applicable to the IRT model.

  • MAAT (Bi et al. [2]; Cai et al. [3]): based on the idea of active learning, this method works from the perspective of model change. The expected model change (EMC) caused by each candidate question is calculated to measure its informativeness, and the question with the largest EMC is selected; this method is independent of the cognitive diagnosis model.

  • MAAT Cov (Bi et al. [2]): the full version of MAAT, in which a quality module first quantifies the informativeness of questions and generates a candidate subset with the highest quality, and a diversity module then selects at each step the question that maximizes knowledge concept coverage.

  • BOBCAT (Ghosh and Lan [10]): leverages the bilevel optimization framework to learn a data-driven question selection algorithm directly from training data; it is agnostic to the underlying student response model and computationally efficient during the adaptive testing process.

Evaluation metrics. We used each question selector \(\pi \) to select questions \(e_{j}^{*}\) for students from the question bank E, and used the proficiency estimation model M to update the students’ proficiency \( {\hat{\theta }}\). Finally, we evaluate the accuracy (ACC) and the area under the curve (AUC) of the predictions made with \({\hat{\theta }}\) on the test set. Higher ACC (AUC) values indicate that more suitable questions have been selected for the students.

Training details. In the question selector module, a 3-layer (i.e., \(h=3\)) MLP is used, whose hidden size is set to 8. In the ground-truth construction module, the training set is randomly divided into the support set (\(D_{T}^{s}\)) and the meta set (\(D_{V}^{s}\)) in a 50%/50% ratio. In the training process, we adopted the Adam optimizer [12] with a learning rate of 0.001. To reduce the computational cost, we selected the top 40% of the questions predicted by the selector to calculate the RMSE values.

Experimental results

Overall performance (RQ1)

To verify the effectiveness of the proposed algorithm, Table 3 summarizes the comparison results of the proposed DL-CAT and the compared methods in terms of ACC and AUC values, where the number of question selection steps is set to 5 and 10, respectively. Three observations can be made from the table. Firstly, the proposed DL-CAT enjoys a larger performance increase than the other methods when the number of steps increases from 5 to 10, and the improvement is significant on the ASSIST dataset. Secondly, all methods achieve better AUC and ACC values when the IRT model is adopted for cognitive diagnosis, and the gap is significant on the ASSIST dataset. Thirdly, the proposed DL-CAT exhibits better AUC and ACC values than all compared methods on both the ASSIST and Math datasets under either cognitive diagnosis model, with an improvement of more than \(0.7\%\) over the second best algorithm at step 10 on ASSIST.

For deeper insight into the results, Fig. 3a presents the AUC values obtained by the proposed approach and all comparison methods on the ASSIST dataset from the 0th to the 20th step, where the ground-truth construction module is used to obtain the minimum RMSE value as an upper bound of the selection algorithm. We can observe that the curve labeled real (the upper bound) always holds the best performance, and that the proposed DL-CAT is always better than the other compared methods. Besides, the AUC gaps between the proposed approach and the other compared methods (except BOBCAT) gradually increase as the number of steps increases, which implies that the proposed approach is more effective when the number of question selection steps is large. In summary, the effectiveness of the proposed ground-truth module and DL-CAT is validated.

Table 3 The performance comparison of the proposed DL-CAT and compared methods on both ASSIST and Math datasets in terms of AUC and ACC values, under two settings of the number of steps (5 and 10)
Fig. 3 Illustration of the AUC and RMSE value curves of all methods as the number of steps increases

The performance of \(\theta \) estimation (RQ2)

Since the goal of CAT is to estimate the student’s proficiency \(\theta \), in addition to the above evaluation of the students’ score predictions, we designed a simulation experiment to measure the gap between the student’s true proficiency \(\theta \) and the diagnosed proficiency \(\hat{\theta _{i}^{t}}\) obtained after t steps of question selection and the corresponding responses of student \(s_i\). Specifically, we manually constructed the student’s proficiency \(\theta _{0}\) and generated the corresponding responses for proficiency estimation. Figure 3b shows the RMSE profiles of all methods, computed as RMSE = \(|\hat{\theta _{i}^{T}} -\theta _{0}|\) [16]. We can observe that the proposed DL-CAT achieves the best performance on the student’s proficiency estimation, especially when the number of steps is larger than 10, where the proposed DL-CAT exhibits a faster convergence speed and better final RMSE values.

Effectiveness of pairwise rank loss (RQ3)

To verify the effectiveness of the devised pairwise rank loss, we built variants of the proposed approach with traditional loss functions for comparison, including the ListMLE loss [13], the ListNet loss [18], and SetRank [23]. Figure 4a presents the AUC values obtained by the proposed approach and the three variants on the ASSIST dataset, where four settings for the number of steps (i.e., 5, 10, 15, and 20) are adopted for a comprehensive comparison. The devised pairwise rank loss clearly enables the proposed approach to achieve the best performance under every number of steps. Besides, the performance lead of the pairwise rank loss over the other loss functions does not change significantly across different numbers of steps, which demonstrates that the proposed loss is more robust, especially compared to the ListMLE loss [13] and SetRank [23]. As a result, we conclude that the proposed pairwise rank loss effectively improves the overall performance of DL-CAT.

Impact of different penalty parameter a (RQ4)

The value of a in Eq. (8), as a penalty term in the designed loss function, may affect the training of the question selector. To investigate the effect of different settings of a on the final performance of the proposed approach, we consider four classical settings for the parameter a, namely \(a=1\), \(a=(q-j)\), \(a=k\), and \(a = (q-j) * k\). It is worth noting that \(a=1\) means that the positional relationship is not considered, \(a=(q-j)\) means that the higher the position, the larger the penalty coefficient, \(a=k\) means that the comparison position is considered, and \(a = (q-j) * k\) means that both situations are considered. Figure 4b summarizes the AUC values obtained by the proposed approach with these four settings of a on the ASSIST dataset. We can observe that the proposed approach with \(a = (q-j) * k\) achieves higher AUC values than the other settings, and its performance lead over \(a=1\) is significant. Therefore, the penalty coefficient should fully take the positional relationship into account.

Fig. 4 Effectiveness validation of the proposed pairwise rank loss and the setting of parameter a

Evaluation on efficiency (RQ5)

The above experiments have demonstrated the effectiveness of the proposed approach as well as of its devised strategies. To further show its superiority, we compared the efficiency of the proposed approach and the comparison methods in terms of their runtime for training, testing, and adding questions. Table 4 summarizes the time cost of the different models in the training and test stages on the ASSIST dataset. It can be seen that the proposed DL-CAT is much more efficient than BOBCAT, MAAT, and MAAT Cov in the testing phase. Furthermore, to verify that the proposed DL-CAT is decoupled during the training phase, we simulated a CAT system scenario in which 50 new questions are added to the question bank E. It can be seen that the training efficiency of the proposed DL-CAT in the scenario of adding new questions is much better than that of BOBCAT.

Table 4 Efficiency experiment on ASSIST

Conclusions and future work

In this paper, we proposed a novel CAT framework with a Decoupled Learning selector (DL-CAT), which uses a deep neural network to select the next question given the current estimated student proficiency. To decouple the parameter learning of the question selector from that of the student proficiency estimation module, a ground-truth construction strategy was devised, and a pairwise rank loss function was suggested so that the question selector can be trained independently and stably. Extensive experiments show that the DL-CAT framework has significant advantages in both performance and efficiency. Besides, the effectiveness of the ground-truth construction strategy and the pairwise rank loss function was also verified in the experiments.

In the future, on one hand, we will attempt to further improve the performance and efficiency of the DL-CAT framework by designing a student ability initialization strategy inspired by meta-learning [9, 28, 33, 39]; on the other hand, we would like to apply the proposed model to other fields (e.g., psychological assessment) and explore new problems therein.