Introduction

Computerized adaptive testing (CAT) is a personalized testing paradigm that aims to accurately estimate students’ proficiency by adaptively selecting the best-suited questions [20]. To achieve this goal, CAT executes two modules (i.e., a student proficiency estimation module and a question selector module) alternately step by step, as shown in Fig. 1. Specifically, at each step n, the student proficiency estimation module outputs the current estimated student proficiency \(\theta _n\) based on the previous responses (steps 1 to \(n-1\)); the question selector then selects the next question for the student to answer based on the given \(\theta _n\) and the question traits. Owing to its cost reduction, efficiency improvement, and security, CAT has been widely used by many standardized examination institutions, e.g., the GRE [21] and the Graduate Management Admission Test [27].

Fig. 1 An illustrative example of the CAT procedure

In past decades, considerable efforts have been made on CAT tasks [1]. One of the most typical works uses the 2-parameter Item Response Theory (IRT) model [7, 19] to estimate student proficiency based on previous responses and Maximum Fisher Information (MFI) [15, 20] as the question selector. Equation (1) gives the IRT function, where \(P_k(\theta _i)\) is the probability that the student with proficiency \(\theta _i\) correctly answers question \(e_k\) with two pretrained parameters (i.e., discrimination \(\alpha _k\) and difficulty \(\beta _k\)), and \(\sigma [\cdot ]\) is the logistic function. Given the student’s previous responses (steps 1 to n), the IRT model estimates the student proficiency \({\hat{\theta }}_i^{(n)}\). Equation (2) gives the Fisher information of question \(e_k\) under the current proficiency estimate \({\hat{\theta }}_i^{(n)}\), where \(P_{{k}}^{\prime }({\hat{\theta }})=\partial P_{{k}}({\hat{\theta }}) / \partial {\hat{\theta }}\) is the derivative. The question with the maximum information is selected; that is, the best-suited question at each step is the one with high discrimination and difficulty close to the current estimate of the student’s proficiency. Afterwards, many similar approaches were developed, e.g., the CAT system [6] that uses multidimensional Item Response Theory (MIRT) [11] as the student proficiency estimation module and Kullback–Leibler Information (KLI) [4] as the question selector, and the CAT system [20] that uses Bayesian network models [24] as the student proficiency estimation module and the maximum expected information gain [30] as the question selector.

$$\begin{aligned}{} & {} P_k(\theta _i)=P\left( Y=1|\theta _i, e_k \right) =\sigma \left[ \alpha _{k}\left( \theta _i-\beta _{k}\right) \right] \end{aligned}$$
(1)
$$\begin{aligned}{} & {} I_{k}({\hat{\theta }}_i^{(n)})=\frac{\left[ P_{k}^{\prime }({\hat{\theta }}_i^{(n)})\right] ^{2}}{P_{k}({\hat{\theta }}_i^{(n)})\left[ 1-P_{k}({\hat{\theta }}_i^{(n)})\right] } \end{aligned}$$
(2)
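To make Eqs. (1) and (2) concrete, the following minimal sketch (in Python with NumPy; the function and variable names are ours, chosen purely for illustration) computes the answer probability and the Fisher information of each candidate question for a given proficiency estimate and selects the most informative one, as MFI does.

```python
import numpy as np

def irt_prob(theta, alpha, beta):
    # Eq. (1): probability of a correct answer under the 2-parameter IRT model
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

def fisher_information(theta, alpha, beta):
    # Eq. (2): I_k(theta) = P'(theta)^2 / (P(theta) * (1 - P(theta)))
    # For the 2PL model, P'(theta) = alpha * P * (1 - P), so I = alpha^2 * P * (1 - P).
    p = irt_prob(theta, alpha, beta)
    return alpha ** 2 * p * (1.0 - p)

# Illustrative question bank: pretrained discrimination (alpha) and difficulty (beta)
alphas = np.array([1.2, 0.8, 1.5, 1.0])
betas = np.array([-0.5, 0.3, 0.1, 1.2])
theta_hat = 0.2                       # current proficiency estimate

info = fisher_information(theta_hat, alphas, betas)
next_question = int(np.argmax(info))  # MFI picks the most informative question
print(info, next_question)
```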

Although these traditional question selection methods are effective to a certain extent, they still have significant limitations. Specifically, most question selection algorithms are based on predefined criteria, which carry inherent preferences and cannot effectively capture complex data characteristics [4, 15]. To address the limitations of these heuristic-based question selectors, some researchers have also explored question selectors based on learned selection strategies [10] by reformulating CAT as a bilevel optimization problem [10, 40] or a reinforcement learning problem [5, 14]. These learning-based question selectors have shown advantages over criteria-based question selectors [14]. However, existing learning-based CAT frameworks are not flexible enough, since their two modules (i.e., the student proficiency estimation model and the question selector) are coupled during training. For example, the bilevel optimization-based CAT framework [10] obtains the parameters of the two modules through coupled inner and outer optimizations during the training process, i.e., the result of the outer optimization is used to measure the quality of the inner optimization.

To this end, we propose a novel CAT framework with decoupled learning selector (DL-CAT). To be specific, the main contributions of this paper can be summarized as follows:

  • We propose a novel deep learning-based question selector, which directly outputs the next question given the currently estimated student proficiency and question features. Compared with existing CAT systems, its advantages are twofold. On one hand, it uses deep neural networks to select the next question directly and more intelligently, instead of relying on predefined criteria. On the other hand, the question selector is agnostic to both the model and the training process of the student proficiency estimation module.

  • We are the first to decouple the parameter learning of the question selector from that of the student proficiency estimation model in a learning-based CAT framework. To achieve this goal, we specially design an approximate ground-truth construction module and a tailored pairwise rank loss to update the parameters of the question selector independently.

  • Experimental results on two real-world datasets demonstrate the effectiveness and efficiency of the proposed DL-CAT. In particular, DL-CAT shows clear advantages in effectiveness and significant advantages in efficiency compared with both traditional question selectors and recent learning-based question selectors.

Related work

In this paper, we focus on the question selector component of CAT. Existing question selectors of CAT are mainly divided into the following two categories: heuristic-based question selector and learning-based question selector.

Heuristic-based question selector

Most of the question selectors in this category are model-specific: they are specially designed by experts according to the characteristics of different student proficiency estimation models (i.e., the other component of CAT). Examples include Maximum Fisher Information (MFI) [15], Kullback–Leibler Information (KLI) [4], Shannon entropy, mutual information, error minimization [29, 34, 35, 37], and their extensions, which are specially designed for IRT models [11, 26, 31]. Among them, the core of the MFI [15] selection strategy is to use the current estimate of the student’s ability to compute the information content of the remaining questions in the question bank and then select the question with the largest amount of information as the next test item. Since the Fisher information used by MFI [15] depends only on the current ability estimate, Chang and Ying proposed a question selection method based on global KL information, which selects the question with the largest KL value; the larger the KL value, the better the question distinguishes the current proficiency estimate from other proficiency values within a reasonable fluctuation range. Afterward, when the student proficiency estimation module was extended from unidimensional IRT to multidimensional IRT (MIRT), a variety of question selection strategies were proposed by extending and optimizing MFI and KLI in the multidimensional case. The KLI-based multidimensional extensions mainly include a weighted KL strategy (PWKL) [32] and a KL strategy based on the posterior distribution of ability [22]. The flexibility of these model-specific question selectors is limited, since it is difficult to design a dedicated question selector for neural network-based student proficiency estimation models.

To address the limitations of model-specific question selectors, a model-agnostic question selector, namely MAAT, was proposed to select questions based on uncertainty [2]. Inspired by the idea of active learning, it is designed from the perspective of model change, i.e., the expected model change (EMC) caused by each candidate question is calculated to measure the informativeness of that question [3]. Finally, the question with the largest EMC is selected, and this question selector is independent of the student proficiency estimation model.

In conclusion, these heuristic-based algorithms are designed only from prior knowledge or expert experience, and thus fail to fully exploit the characteristics of the data.

Learning-based question selector

To address the problem that heuristic-based question selection algorithms cannot capture data features, data-driven, learnable question selectors have gradually attracted attention. This category attempts to learn and continuously optimize a question selector from large-scale behavior data, instead of using static question selection algorithms, so as to reduce the error of proficiency estimation as much as possible. Representatives of this category include bilevel optimization-based computerized adaptive testing (BOBCAT) [10] and neural computerized adaptive testing (NCAT) [40]. Both models recast the CAT problem as a bilevel optimization problem. In this framework, the outer-level optimization learns both the response model parameters and a data-driven question selection algorithm by explicitly maximizing the predictive likelihood of student responses on a held-out meta question set, while the inner-level optimization adapts the outer-level response model to each student by maximizing the predictive likelihood of their responses on an observed training question set. To solve this bilevel optimization problem, BOBCAT employs a biased approximate estimator of the gradient with respect to the question selection algorithm parameters, while NCAT formally transforms the problem into an equivalent reinforcement learning problem [5, 14].

While the above learning-based question selectors have achieved great success, the inflexibility of the bilevel optimization-based framework is still apparent, since the inner and outer optimizations (i.e., the parameter learning of the two components) are mutually coupled in the training process, i.e., the inner (outer) optimization is used to measure the quality of the outer (inner) optimization. This coupled training process requires the question selector to be retrained from scratch whenever new questions are added.

Therefore, in this paper, on one hand, we design a deep learning-based question selector that automatically predicts question selection probabilities, so as to capture the data characteristics more comprehensively; on the other hand, to address the training inflexibility caused by coupling the parameter learning of the question selector with that of the student proficiency estimation model, we specially design a ground-truth construction module to train the question selector independently. With these designs, when new questions are added to the question bank, the question selector can directly produce selection scores for the new question bank without retraining (only the student proficiency estimation model needs to be retrained to obtain the parameters of the new questions), which makes the framework more applicable to real CAT education scenarios.

Table 1 A list of major notations used in this work

Preliminaries and problem formalization

This section consists of two parts: preliminaries and problem formalization. The preliminaries introduce the terminology, goals, and structure of traditional computerized adaptive testing (CAT). The problem formalization describes the definition and extension of the CAT task considered in this paper. The major notations used in this paper are listed in Table 1.

Preliminaries

In an intelligent education system, suppose there are L students and M questions, represented by the student set \(S=\{s_1,s_2,\ldots ,s_L\}\) and the question bank \( E=\left\{ e_{1}, e_{2}, \ldots , e_{M}\right\} \), respectively. Each element of the response logs R is a triplet (s, e, r), where \(s\in S\), \(e\in E\), and \(r \in \{0,1\}\) is the response score of student s on question e; \(r=1\) indicates a correct answer and \(r=0\) otherwise. A typical CAT system has two components, namely the proficiency estimation model M and the question selector \(\pi \): the former is trained on the response logs R from students, and the latter dynamically selects the next question e based on the student’s behavior record R and proficiency estimate \(\theta \). After multiple rounds of question selection based on the student’s interaction data, a test sequence tailored to student s is formed when the test ends. Our goal is to design a strategy \(\pi \) that selects a question set of size N, denoted by \(E_{i} = \left\{ e_1^*, e_2^*, \ldots , e_N^*\right\} \), step by step according to the performance of \(s_{i}\), so as to test the examinee more accurately.
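As a rough illustration of this alternating procedure, the sketch below (in Python; the names are hypothetical, and the estimator, selector, and response oracle are placeholders rather than the actual modules) stores response logs as (s, e, r) triplets and runs N selection steps.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[int, int, int]  # (student id, question id, response in {0, 1})

def run_cat_session(student: int,
                    question_bank: List[int],
                    estimate_theta: Callable[[List[Triplet]], float],
                    select_question: Callable[[float, List[int]], int],
                    answer: Callable[[int], int],
                    n_steps: int) -> float:
    """One CAT session: alternately re-estimate proficiency and pick the next question."""
    logs: List[Triplet] = []
    remaining = list(question_bank)
    theta = estimate_theta(logs)               # initial proficiency estimate
    for _ in range(n_steps):
        e = select_question(theta, remaining)  # question selector pi
        remaining.remove(e)
        r = answer(e)                          # student's observed response
        logs.append((student, e, r))
        theta = estimate_theta(logs)           # proficiency estimation module M
    return theta
```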

As mentioned in the related work, some recent data-driven methods formalize the CAT problem as a bilevel optimization problem in a meta-learning setting, where the parameter learning of the proficiency estimation model M and of the question selector \(\pi \) are coupled together. Specifically, in the outer-level optimization, the parameters of the two modules are learned by maximizing the prediction accuracy on the meta set. In the inner-level optimization, the parameters learned at the outer level are adapted to each student’s proficiency assessment so as to maximize the prediction accuracy on the training set.

Fig. 2 The question selector and its training methodology

Problem formalization

In this paper, our goal is to decouple the training of the proficiency estimation model M and the question selector \(\pi \). Therefore, we focus on designing a learnable question selector \(\pi \) that selects one question e from E for a student \(s_i\) at each step k, given the current estimated student proficiency \({\hat{\theta }}_i^{(k-1)}\) and the question parameters. After receiving the response r of student \(s_i\) on question e, the model M updates the estimate to a new proficiency \({\hat{\theta }}_i^{(k)}\). It is worth noting that the update of the student proficiency \({\hat{\theta }}_i\) is based only on the previously selected questions and the corresponding responses, and is independent of the question selector itself. This procedure is repeated n times, so as to accurately estimate the student proficiency, i.e., \({\hat{\theta }}_i^{(n)} \rightarrow \theta _i^0\), where \(\theta _i^0\) is the true (usually unknown) proficiency of student \(s_i\).

The proposed approach

Overall framework

The main idea of the proposed approach is to select the most suitable question for a student at each step through a question selector, according to the student’s current knowledge mastery (or knowledge proficiency) on each knowledge concept. Obviously, two components largely determine the quality of the selected question: the student proficiency estimation model, which diagnoses the student’s current knowledge mastery, and the question selector, which recommends a suitable question. Since the focus of this paper is on designing an effective question selector, we directly employ existing student proficiency estimation models (such as IRT [7] or MIRT [11]) to obtain the student’s knowledge proficiency on each knowledge concept. However, due to the particularity of the task, there is no ground truth to measure the quality of the selected question, and thus the general training procedure cannot be directly applied. To address this issue, we devise a ground-truth construction module that generates reasonable labels so that the question selector can be trained properly. In addition, we further propose a pairwise rank loss function to stabilize the training of the model and thus achieve better final performance.

For better understanding, the overall framework of the proposed DL-CAT is summarized in Fig. 2, and it mainly consists of four steps. First, a classical cognitive diagnosis model is trained on all students’ response logs; this student proficiency estimation model obtains students’ knowledge proficiency by modeling their exercising scores. Second, the student ability parameter and the latent question parameters obtained from the student proficiency estimation model are combined as the input of the question selector, which then predicts the student’s scores on all candidate questions. Third, the ground-truth construction module is employed to generate the corresponding labels for the given student. Fourth, the loss is computed from the predicted outputs and the generated labels to update the weight parameters of the question selector, where the proposed pairwise rank loss is adopted. The above procedure is repeated over all students until the question selector converges. After that, the question selector recommends the question with the best score to each student.

Question selector

The question selector \(\pi \) selects a sequence of suitable questions for a student, based on the student’s current knowledge proficiency state, by exploring the relationship between the student and the questions. Intuitively, the more accurately the student’s current knowledge proficiency is diagnosed, the more suitable the question selected for the student will be. Therefore, it is important to obtain accurate diagnosis results for each student through a cognitive diagnosis model \(M_{C}\) before mining the potential relationship between the student and new questions that the student has never attempted.

For this aim, we first train a cognitive diagnosis model \(M_{C}\) based on all students’ response logs in the training dataset D by minimizing the following loss:

$$\begin{aligned} {\text {loss}}=-\frac{1}{n} \sum _{(s_{i}, e_{j}, r_{ij}) \in D} r_{ij} \log \hat{r_{ij}}+(1- r_{ij}) \log (1-\hat{r_{ij}}), \end{aligned}$$
(3)

where \(\hat{r_{ij}} = \ {M}_C(s_i,e_j)\) denotes the model \(M_{C}\)’s predicted probability of student \(s_i\) correctly answering question \(e_{j}\).
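For reference, Eq. (3) is a standard binary cross-entropy over the response logs; a minimal PyTorch sketch (illustrative only, with names of our choosing) is:

```python
import torch
import torch.nn.functional as F

def response_loss(predicted_probs: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
    # Eq. (3): negative log-likelihood of the observed binary responses r_ij,
    # where predicted_probs are the outputs of M_C(s_i, e_j) and responses are in {0, 1}
    return F.binary_cross_entropy(predicted_probs, responses.float())
```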

Then, we directly extract the student-related parameters from the trained model \(M_{C}\) as the student’s current knowledge proficiency. Besides, to obtain the representation of a new question \(e_j\) that the student has never attempted, we also extract the question-related parameters from model \(M_{C}\). Note that, in the following, \(e_{j}\) denotes a question in the question bank that student \(s_i\) has never attempted.

Take the Item Response Theory (IRT [7]) model as an example, whose forward pass process can be denoted as follows:

$$\begin{aligned} \left\{ \begin{array}{l} {\textbf{h}}_S = {\textbf{x}}_i^S\times W_S, \ W_S \in R^{L\times D}\\ {\textbf{h}}_E = {\textbf{x}}_j^E\times W_E, \ W_E \in R^{M\times D}\\ \theta _i = {\textbf{h}}_S \times W_{\theta }, \ W_{\theta } \in R^{D\times 1}\\ \alpha _j = {\textbf{h}}_E \times W_{\alpha }, \ W_{\alpha } \in R^{D\times 1}\\ \beta _j = {\textbf{h}}_E \times W_{\beta }, \ W_{\beta } \in R^{D\times 1}\\ {\hat{r}}_{ij} = \hbox {Sigmoid}(\alpha _j\cdot (\theta _i-\beta _j))\\ \end{array} \right. , \end{aligned}$$
(4)

where D is the embedding dimension, \({\textbf{x}}_i^S \in \{0,1\}^{1\times L}\) is the student one-hot vector for \(s_i\), \({\textbf{x}}_j^E \in \{0,1\}^{1\times M}\) is the question one-hot vector for \(e_j\), and \(W_S\), \(W_E\), \(W_{\theta }\), \(W_{\alpha }\), and \(W_{\beta }\) are trainable matrices in the embedding layers. As a result, \({\hat{r}}_{ij} = {M}_C(s_i,e_j|{\hat{\theta }}_{i}, \alpha _{j},\beta _j)\). Here, we can extract a student-related parameter \(\theta _i\) for student \(s_i\) and two question-related parameters for a new question \(e_j\), i.e., the question discrimination \(\alpha _j\) and the question difficulty \(\beta _j\).
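A minimal PyTorch sketch of the embedding-based IRT forward pass in Eq. (4) is given below (the class and attribute names are ours, chosen for illustration); the student parameter \(\theta _i\) and the question parameters \(\alpha _j\) and \(\beta _j\) can then be read out of the trained embeddings as described above.

```python
import torch
import torch.nn as nn

class IRTModel(nn.Module):
    def __init__(self, n_students: int, n_questions: int, dim: int = 8):
        super().__init__()
        self.student_emb = nn.Embedding(n_students, dim)     # plays the role of W_S
        self.question_emb = nn.Embedding(n_questions, dim)   # plays the role of W_E
        self.w_theta = nn.Linear(dim, 1, bias=False)          # W_theta
        self.w_alpha = nn.Linear(dim, 1, bias=False)          # W_alpha
        self.w_beta = nn.Linear(dim, 1, bias=False)           # W_beta

    def forward(self, student_ids: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        h_s = self.student_emb(student_ids)                   # h_S
        h_e = self.question_emb(question_ids)                 # h_E
        theta = self.w_theta(h_s)                             # theta_i
        alpha = self.w_alpha(h_e)                             # alpha_j
        beta = self.w_beta(h_e)                               # beta_j
        return torch.sigmoid(alpha * (theta - beta)).squeeze(-1)  # r_hat_ij

    def question_params(self, question_ids: torch.Tensor):
        # Extract the discrimination alpha_j and difficulty beta_j of new questions
        h_e = self.question_emb(question_ids)
        return self.w_alpha(h_e), self.w_beta(h_e)
```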

Afterward, the proposed question selector \(\pi \) takes the student’s current knowledge proficiency and question parameters as its inputs. To integrate the two types of input features, a simple yet effective concatenation operation is used to get the formal input of \(\pi \) as

$$\begin{aligned} X_{ij} = [ \theta _{i},\alpha _{j}, \beta _{j}]. \end{aligned}$$
(5)

As a result, for a new question \(e_j\), the proposed question selector \(\pi \) predicts the score \(p_{ij}\) of student \(s_{i}\) on question \(e_{j}\) by an h-layer multi-layer perceptron (MLP) [25], whose forward pass is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} y_{1}=\sigma \left( {W}_{1} \times {X}_{ {ij}}+ {b}_{1}\right) \\ y_{i} = \sigma (W_{i}\times y_{i-1}+b_{i}),\ 2 \le i \le h-1 \\ p_{ij}={\sigma }\left( {W}_{h} \times {y}_{h-1}+b_{h}\right) \\ \end{array} \right. , \end{aligned}$$
(6)

where \(\sigma (\cdot )\) is the activation function, \({y}_{i}\) is the output of the ith layer, \(W_i\) and \(b_i\) are the weights and bias of the ith layer.
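A corresponding sketch of the selector \(\pi \) defined by Eqs. (5) and (6), assuming the 3-layer setting with hidden size 8 used later in the experiments (the class name is hypothetical), could look as follows:

```python
import torch
import torch.nn as nn

class QuestionSelector(nn.Module):
    """Predicts a selection score p_ij from the concatenated input [theta_i, alpha_j, beta_j]."""
    def __init__(self, in_dim: int = 3, hidden: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),   # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),   # layer 2
            nn.Linear(hidden, 1), nn.Sigmoid(),     # output layer of Eq. (6)
        )

    def forward(self, theta: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        x = torch.cat([theta, alpha, beta], dim=-1)  # Eq. (5): X_ij = [theta_i, alpha_j, beta_j]
        return self.net(x).squeeze(-1)               # p_ij
```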

Ground-truth construction

As the question selector \(\pi \) is independent of model \({M}_C\), it is infeasible to directly employ model \({M}_C\) to determine whether a selected question is suitable for a student. To measure the quality of a selected question, we propose an approximate ground-truth construction module to obtain an effective ground truth.

Specifically, this module generates a ground truth that measures the effectiveness of a question selected by \(\pi \). To this end, the RMSE (root mean squared error) is used to quantify the difference between the student’s real proficiency \(\theta _i^{0}\) and the updated proficiency \({\hat{\theta }}_i\) after a question is selected for student \(s_i\); since \(\theta _i^{0}\) is unknown, it is approximated through the student’s response logs on the meta set. First, we initialize the knowledge proficiency \(\theta _{i}\) of student \(s_{i}\), select a set of questions \(\{e_{1}, e_{2},\ldots , e_{q}\}\) as the support set \(D_{T}^{i}\), and obtain the updated proficiency \({\hat{\theta }}_i\) based on \({M}_C\) and \(D_{T}^{i}\). Then we calculate \(Rmse_{ij}\) of \({\hat{\theta }}_{i}\) by

$$\begin{aligned} Rmse_{ij}=\sqrt{\frac{1}{|D_{V}^{i}|} \sum _{f=1}^{|D_{V}^{i}|}\left( r_{if}- {M}_C(s_i,e_f|{\hat{\theta }}_{i}, \alpha _{f},\beta _f)\right) ^{2}}, \end{aligned}$$
(7)

where \(D_{V}^{i}\) denotes the meta set, \(|D_{V}^{i}|\) is the number of questions in \(D_{V}^{i}\), \(r_{if}\) is the real response score on question \( e_{f}\), and \({M}_C(s_i,e_f|{\hat{\theta }}_{i}, \alpha _{f},\beta _f)\) is the probability predicted by model \({M}_C\) for question \(e_{f}\) under the updated student proficiency \({\hat{\theta }}_{i}\).

The ground-truth construction module simulates the real CAT estimation process and measures the quality of a selected question through the results on the meta (validation) set. In this paper, the approximation generated by this simulation is used as the label for the predicted score, thereby overcoming the problem that there is no direct ground truth reflecting how suitable a selected question is for a student.
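In code, Eq. (7) amounts to the prediction RMSE of the updated proficiency on the meta set; a simplified sketch (the update of \({\hat{\theta }}_{i}\) itself is performed by the trained model \({M}_C\) and is omitted here, and the function name is ours) is:

```python
import numpy as np

def meta_set_rmse(responses_meta, predicted_probs_meta) -> float:
    """Eq. (7): RMSE between the real meta-set responses r_if and the probabilities
    predicted by M_C under the updated proficiency theta_hat_i."""
    r = np.asarray(responses_meta, dtype=float)
    p = np.asarray(predicted_probs_meta, dtype=float)
    return float(np.sqrt(np.mean((r - p) ** 2)))
```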

Algorithm 1

Training with pairwise rank loss

Because the ground-truth values generated by the ground-truth construction module differ only slightly from one another, they cannot accurately reflect the ordering among candidate questions, and traditional loss functions therefore handle this poorly [38]. To tackle this problem, inspired by learning to rank [17] and learning loss [38], we propose a novel pairwise rank loss, which captures the ordering of the selected questions well through partial-order relations over question pairs. Given the sorted prediction score sequence \(P_{i}=\{p_{i1},\ldots ,p_{iq}\}\) and the computed RMSE sequence \({RMSE}_{i} = \{{rmse}_{i1},\ldots ,{rmse}_{iq} \}\), the pairwise rank loss \({L}_i\) of student \(s_{i}\) over the q questions is computed by

$$\begin{aligned} \begin{aligned}&{L}_{i}=\sum _{j=1}^{q} \sum _{k=1}^{q-1} \max \left( 0, A\left( {rmse}_{ij}, \ {rmse}_{ik}\right) \cdot \left( p_{ij}-p_{ik}\right) \right) \\&A\left( {rmse}_{ij}, {rmse}_{ik}\right) = \left\{ \begin{array}{ll}-a, &{} \text{ if }\ {rmse}_{ij} < {rmse}_{ik} \\ +a, &{} \text{ otherwise } \end{array}\right. , \end{aligned} \end{aligned}$$
(8)

where the parameter a is a penalty coefficient that represents the weight of the position: the higher the position, the greater the value of a.
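A direct sketch of Eq. (8) in PyTorch is given below; for simplicity the penalty coefficient a is passed as a constant scalar here, whereas in the paper it can depend on the positions j and k (see the settings compared in the experiments). The function name is illustrative.

```python
import torch

def pairwise_rank_loss(scores: torch.Tensor, rmses: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    """Eq. (8): hinge loss over question pairs. A question with a smaller RMSE should
    receive a larger predicted score p_ij than a question with a larger RMSE."""
    p_diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # p_ij - p_ik for all pairs (j, k)
    sign = torch.where(rmses.unsqueeze(1) < rmses.unsqueeze(0),
                       torch.full_like(p_diff, -a),
                       torch.full_like(p_diff, a))            # A(rmse_ij, rmse_ik)
    return torch.clamp(sign * p_diff, min=0.0).sum()
```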

With the tailored loss function, the proposed question selector can be trained effectively; the training process is summarized in Algorithm 1. First, the knowledge proficiency of each student in S is initialized, and the basic cognitive diagnosis model M is pre-trained on the historical data until convergence, after which the question-related parameters can be extracted from model M (Lines 1–2). Second, when step = 0, the initial knowledge proficiency parameter is used directly; when step > 0, the knowledge proficiency parameter is updated (Lines 4–6). After that, the record data \(D^s\) of a student s is divided into a support set \(D_T^s\) and a meta set \(D_V^s\) (Lines 8–9). Next, the predicted scores of the questions in \(D_T^s\) are computed from the concatenated features, and the RMSE on the meta set \(D_V^s\) is used to evaluate the updated \({\hat{\theta }}\) (Lines 10–15). Finally, Eq. (8) is used to compute the loss from \(P_{i}\) and \(RMSE_{i}\), and the selector parameters \(f(\varphi )\) are updated for the next iteration (Lines 16–17).

Experiment

In the experiment part, we mainly focus on answering the following questions:

  • (RQ1) How does the proposed DL-CAT perform compared with the state-of-the-art approaches?

  • (RQ2) How does DL-CAT perform in the simulation experiment?

  • (RQ3) Is the designed pairwise rank loss more effective than traditional listwise loss methods?

  • (RQ4) Is there any difference in the impact of different penalty coefficient settings?

  • (RQ5) What are the advantages of DL-CAT in terms of efficiency?

Experimental settings

Datasets. To validate the effectiveness of the proposed DL-CAT for computerized adaptive testing, two real-world educational datasets were used in the following experiments, including the public dataset ASSIST [8] and a private dataset Math, whose statistics are summarized in Table 2 and whose descriptions are as follows:

  • ASSISTments (ASSISTments 2009–2010 skill builder; Feng et al. [8]) is an openly available dataset created in 2009 by the ASSISTments online tutoring service. Here we adopt the publicly corrected version [36], which does not contain duplicated data. It comprises more than 4 thousand students, nearly 18 thousand questions, and over 300 thousand response logs.

  • Math is a private dataset consisting of real behavior records of students on mathematics exercises from an online education platform. It mainly contains behavioral data of students in grades 1 to 6, with more than 10,000 students and more than 1 million response records in total.

Data processing. To ensure the reliability of the experimental results, for both ASSIST and Math, we first filtered out the knowledge concepts with fewer than ten related questions. Moreover, in ASSIST, we also filtered out the questions answered fewer than 50 times and the students who answered fewer than 10 questions.
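As a rough illustration of this filtering (using pandas; the column names are assumptions on our part, not taken from the original datasets), the preprocessing could be implemented as:

```python
import pandas as pd

def preprocess(logs: pd.DataFrame) -> pd.DataFrame:
    """Filter the response logs as described above. Assumed columns:
    'student_id', 'question_id', 'concept_id', 'correct'."""
    # Keep knowledge concepts with at least 10 related questions
    q_per_concept = logs.groupby("concept_id")["question_id"].nunique()
    logs = logs[logs["concept_id"].isin(q_per_concept[q_per_concept >= 10].index)]
    # Keep questions answered at least 50 times (ASSIST only)
    q_counts = logs["question_id"].value_counts()
    logs = logs[logs["question_id"].isin(q_counts[q_counts >= 50].index)]
    # Keep students who answered at least 10 questions (ASSIST only)
    s_counts = logs["student_id"].value_counts()
    return logs[logs["student_id"].isin(s_counts[s_counts >= 10].index)]
```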

Table 2 Statistics of the datasets

Benchmark methods. For a fair comparison, we adopted two ability assessment models, IRT [7] and MIRT [11], as the proficiency estimation models; their descriptions are as follows:

  • IRT [7]: the most typical cognitive diagnosis model, which uses a simple logistic function to integrate the input vectors and represents student mastery as a unidimensional continuous latent trait for predicting the probability of a student correctly answering questions;

  • MIRT [11]: As the successor of IRT, MIRT extends IRT’s unidimensional student and question latent traits into multidimensional space to enhance the learnt representation for the demands of multidimensional data.

In addition, we compared DL-CAT with five state-of-the-art question selection algorithms: heuristic-based question selectors including MFI, MAAT, and its variant MAAT Cov; a learning-based question selector, BOBCAT; and a random selector named Random. These comparison algorithms are elaborated as follows:

  • Random: selects questions randomly from the question bank and serves as a reference point for the other methods.

  • MFI (Linden and Pashley [15]): the most classical and most widely used selection algorithm, which selects the question with the maximum Fisher information; this method is only applicable to the IRT model.

  • MAAT (Bi et al. [2]; Cai et al. [3]): based on the idea of active learning, this method works from the perspective of model change. The expected model change (EMC) caused by each candidate question is calculated to measure its informativeness, and the question with the largest EMC is selected; this method is independent of the cognitive diagnosis model.

  • MAAT Cov (Bi et al. [2]): the full version of MAAT, in which a quality module first quantifies the informativeness of questions and generates a candidate subset with the highest quality, and a diversity module then selects at each step the question that maximizes knowledge concept coverage.

  • BOBCAT (Ghosh and Lan [10]): leverages the bilevel optimization framework to learn a data-driven question selection algorithm directly from training data; it is agnostic to the underlying student response model and computationally efficient during the adaptive testing process.

Evaluation metrics. We used each question selector \(\pi \) to select questions \(e_{j}^{*}\) for students from the question bank E, and used the proficiency estimation model M to update the students’ proficiency \( {\hat{\theta }}\). Finally, we evaluate the accuracy (ACC) and the area under the curve (AUC) of the predictions made with \({\hat{\theta }}\) on the test set. Higher ACC (AUC) values indicate that more suitable questions have been selected for the students.

Training details. In the question selector module, a 3-layer (i.e., \(h=3\)) MLP is used, whose hidden size is set to 8. In the ground-truth construction module, the training set is randomly divided into the support set (\(D_{T}^{s}\)) and the meta set (\(D_{V}^{s}\)) in a 50%/50% ratio. In the training process, we adopted the Adam optimizer [12] with a learning rate of 0.001. To reduce the computational cost, we selected the top 40% of the questions predicted by the selector to calculate the RMSE values.

Experimental results

Overall performance (RQ1)

To verify the effectiveness of the proposed algorithm, Table 3 summarizes the comparison results of the proposed DL-CAT and the compared methods in terms of ACC and AUC values, where the number of question selection steps is set to 5 and 10, respectively. Three observations can be made from the table. Firstly, the proposed DL-CAT enjoys a larger performance increase than the other methods when the number of steps increases from 5 to 10, and the improvement is significant on the ASSIST dataset. Secondly, all methods achieve better AUC and ACC values when the IRT model is adopted for cognitive diagnosis, and the gap is significant on the ASSIST dataset. Thirdly, the proposed DL-CAT exhibits better AUC and ACC values than all compared methods on both the ASSIST and Math datasets under either cognitive diagnosis model, with an improvement of more than \(0.7\%\) over the second best algorithm at step 10 on ASSIST.

For deeper insight into the results, Fig. 3a presents the AUC values obtained by the proposed approach and all comparison methods on the ASSIST dataset from the 0th to the 20th step, where the ground-truth construction module is used to obtain the minimum RMSE value as an upper bound of the selection algorithm. We can observe that the curve labeled real (the upper bound) always holds the best performance, and that the proposed DL-CAT is always better than the other compared methods. Besides, the AUC gaps between the proposed approach and the other compared methods (except BOBCAT) gradually increase as the number of steps increases, which implies that the proposed approach is more effective when the number of question selection steps is large. In summary, the effectiveness of the proposed ground-truth module and DL-CAT is validated.

Table 3 The performance comparison of the proposed DL-CAT and compared methods on both ASSIST and Math datasets in terms of AUC and ACC values, under two settings of the number of steps (5 and 10)
Fig. 3 Illustration of the AUC and RMSE value curves of all methods as the number of steps increases

The performance of \(\theta \) estimation (RQ2)

Since the goal of CAT is to estimate the student’s proficiency \(\theta \), in addition to the above evaluation of the students’ score predictions, we designed a simulation experiment to measure the gap between the student’s true proficiency \(\theta \) and the diagnosed proficiency \(\hat{\theta _{i}^{t}}\) obtained after t steps of question selection and the corresponding responses of student \(s_i\). Specifically, we manually constructed the student’s proficiency \(\theta _{0}\) and generated the corresponding responses for proficiency estimation. Figure 3b shows the RMSE profiles of all methods, computed as RMSE = \(|\hat{\theta _{i}^{T}} -\theta _{0}|\) [16]. We can observe that the proposed DL-CAT achieves the best performance on the student’s proficiency estimation, especially when the number of steps is larger than 10, where the proposed DL-CAT exhibits a faster convergence speed and better final RMSE values.

Effectiveness of pairwise rank loss (RQ3)

To verify the effectiveness of the devised pairwise rank loss, we built variants of the proposed approach with traditional loss functions for comparison, including the ListMLE loss [13], the ListNet loss [18], and SetRank [23]. Figure 4a presents the AUC values obtained by the proposed approach and the three variants on the ASSIST dataset, where four settings for the number of steps (i.e., 5, 10, 15, and 20) are adopted for a comprehensive comparison. The devised pairwise rank loss clearly enables the proposed approach to achieve the best performance under every number of steps. Besides, the performance lead of the pairwise rank loss over the other loss functions does not change significantly across different numbers of steps, which demonstrates that the proposed loss is more robust, especially compared to the ListMLE loss [13] and SetRank [23]. As a result, we conclude that the proposed pairwise rank loss effectively improves the overall performance of DL-CAT.

Impact of different penalty parameter a (RQ4)

The value of a in Eq. (8), as a penalty term in the designed loss function, may affect the training of the question selector. To investigate the effect of different settings of a on the final performance of the proposed approach, we consider four classical settings for the parameter a, namely \(a=1\), \(a=(q-j)\), \(a=k\), and \(a = (q-j) * k\). It is worth noting that \(a=1\) means that the positional relationship is not considered, \(a=(q-j)\) means that the higher the position, the larger the penalty coefficient, \(a=k\) means that the comparison position is considered, and \(a = (q-j) * k\) means that both situations are considered. Figure 4b summarizes the AUC values obtained by the proposed approach with these four settings of a on the ASSIST dataset. We can observe that the proposed approach with \(a = (q-j) * k\) achieves higher AUC values than the other settings, and its performance lead over \(a=1\) is significant. Therefore, the penalty coefficient should fully take the positional relationship into account.

Fig. 4 Effectiveness validation of the proposed pairwise rank loss and the setting of parameter a

Evaluation on efficiency (RQ5)

The above experiments have demonstrated the effectiveness of the proposed approach as well as of its devised strategies. To further show its superiority, we compared the efficiency of the proposed approach and the comparison methods in terms of their runtime for training, testing, and adding questions. Table 4 summarizes the time cost of the different models in the training and test stages on the ASSIST dataset. It can be seen that the proposed DL-CAT is much more efficient than BOBCAT, MAAT, and MAAT Cov in the testing phase. Furthermore, to verify that the proposed DL-CAT is decoupled during the training phase, we simulated a CAT system scenario in which 50 new questions are added to the question bank E. It can be seen that the training efficiency of the proposed DL-CAT in the scenario of adding new questions is much better than that of BOBCAT.

Table 4 Efficiency experiment on ASSIST

Conclusions and future work

In this paper, we proposed a novel CAT framework with a Decoupled Learning selector (DL-CAT), which uses a deep neural network to select the next question given the current estimated student proficiency. To decouple the parameter learning of the question selector from that of the student proficiency estimation module, a ground-truth construction strategy was devised, and a pairwise rank loss function was suggested so that the question selector can be trained independently and stably. Extensive experiments show that the DL-CAT framework has significant advantages in both performance and efficiency. Besides, the effectiveness of the ground-truth construction strategy and the pairwise rank loss function was also verified in the experiments.

In the future, on one hand, we will attempt to further improve the performance and efficiency of the DL-CAT framework by designing a student ability initialization strategy inspired by meta-learning [9, 28, 33, 39]; on the other hand, we would like to apply the proposed model to other fields (e.g., psychological assessment) and explore new problems therein.