
1 Introduction

With the development of modern technologies, online platforms such as intelligent tutoring systems (ITS) and massive open online courses are becoming increasingly prevalent, and knowledge tracing (KT) is considered critical for personalized learning in ITS. KT is the task of modeling a student's knowledge state, i.e., the mastery level of each piece of knowledge, based on historical interaction data.

One of the well-known methods for the KT problem is deep knowledge tracing (DKT) [5], a model based on recurrent neural networks (RNNs). Although DKT achieves impressive performance on the KT task, its prediction outputs still exhibit vibration [9]. This is unreasonable, as a student's knowledge state is expected to transition gradually over time rather than alternate between mastered and not-yet-mastered.
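
For reference, the following is a minimal sketch of a DKT-style recurrent model, assuming the standard one-hot interaction encoding of [5]; the hidden size and the use of an LSTM cell are illustrative choices rather than the original implementation.

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    """Sketch of a DKT-style model: an LSTM over one-hot encoded
    (question, correctness) interactions that predicts the probability
    of answering each skill correctly at the next step."""
    def __init__(self, num_skills, hidden_size=200):
        super().__init__()
        # each interaction is a one-hot vector of length 2 * num_skills
        self.lstm = nn.LSTM(2 * num_skills, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_skills)

    def forward(self, x):                  # x: (batch, seq_len, 2 * num_skills)
        h, _ = self.lstm(x)                # h: (batch, seq_len, hidden_size)
        return torch.sigmoid(self.out(h))  # per-skill correctness probabilities
```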

To find the root cause of this problem, we use a finite state automaton (FSA) as an interpretable structure that can be learned from DKT, since an FSA has a more interpretable inner mechanism when processing sequential data [3]. Following [3], we built an FSA for DKT to interpret how the elements of each input sequence affect the hidden state of DKT. When an input item is accepted by the FSA, it has a positive effect on the final prediction outputs of the model, and vice versa. We display the acceptance rate of every input sequence in Fig. 1. From Fig. 1 we can conclude that the longer the input sequence, the higher the proportion of rejected items and the lower the prediction accuracy. This phenomenon is consistent with the observation in [7] that LSTM [2] is weak at capturing features when the input sequence is too long. Accordingly, we propose a model to address the problem of long sequence inputs in KT, and experiments show that it is effective in solving the problem discovered above.
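
To illustrate the analysis behind Fig. 1, the sketch below shows one hypothetical way to compute a per-sequence acceptance rate, assuming FSA states obtained by clustering DKT hidden states in the spirit of [3]; the clustering granularity, the majority-vote acceptance criterion, and the function name acceptance_rate are our own illustrative assumptions, not the exact procedure of [3].

```python
import numpy as np
from sklearn.cluster import KMeans

def acceptance_rate(hidden_states, correct, n_states=10):
    """Hypothetical sketch: cluster DKT hidden vectors into discrete FSA
    states, treat a state as 'accepting' when the model's predictions from
    it are mostly correct, and report the fraction of input items that
    land in accepting states.

    hidden_states: (num_items, hidden_size) DKT hidden vectors, one per
                   input item of a sequence.
    correct:       (num_items,) boolean array, True where the model's
                   prediction for that item was correct.
    """
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(hidden_states)
    # a state accepts if more than half of its items are predicted correctly
    accepting = {s for s in range(n_states)
                 if correct[states == s].mean() > 0.5}
    accepted = np.isin(states, list(accepting))
    return accepted.mean()  # proportion of accepted items in this sequence
```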

Our contributions are three-fold. First, to the best of our knowledge, we are the first to adopt an FSA to provide a deep analysis of the KT task. By interpreting changes in the learning state with an FSA, we obtain a better understanding of the problem of existing RNN-based methods. Second, based on this interpretable analysis, we propose a multi-head attention model to handle the problem of long sequence inputs in KT. Third, we evaluate our model on real-world datasets, and the results show that it improves over state-of-the-art baselines.

Fig. 1. Accept/Reject states of DKT. The values above each bar represent the proportion of rejected items in an input sequence.

Fig. 2. An illustration of our KTA model.

2 Proposed Models

In this section, we briefly describe KTA. The overall structure of the model is shown in Fig. 2. (1) Embedding Layer: The tuples containing the questions and the corresponding answers are first projected into real-valued vectors, namely one-hot embeddings. (2) Feature Extraction: The vectors are then fed into a feature extractor, which aims to capture the latent dependency relationships among the inputs. The feature extractor consists of N identical blocks. Each block has two sub-layers: the first is a multi-head self-attention mechanism [8], the critical element of the extractor, and the second is a fully connected feed-forward network [8]. Self-attention extracts global relationships by computing the similarity between the items of the input sequence using scaled dot-product attention [8]. The attention is computed h times, which allows the model to learn relevant information in different representation sub-spaces and makes it multi-head. (3) Prediction and Loss: At the prediction stage, only the topmost outputs of the attention sub-layer are passed to a sigmoid function to make the final decision. The prediction and optimization processes are the same as in [9], so we do not elaborate on them here.
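
To make the pipeline concrete, the following is a minimal PyTorch sketch of such an architecture, assuming learned embeddings of (question, answer) tuple ids and standard Transformer-style residual blocks; the dimensions, number of heads, normalization details, and the omission of positional encodings and attention masks are illustrative assumptions rather than the exact KTA configuration.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One feature-extraction block: multi-head self-attention (scaled
    dot-product attention computed h times) followed by a position-wise
    feed-forward network, each with a residual connection and layer norm."""
    def __init__(self, d_model=128, n_heads=8, d_ff=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)           # self-attention over the sequence
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))

class KTA(nn.Module):
    """Sketch of the overall pipeline: embedding of (question, answer)
    tuples, N identical attention blocks, and a sigmoid prediction layer."""
    def __init__(self, num_skills, d_model=128, n_blocks=2):
        super().__init__()
        # 2 * num_skills distinct (question, correctness) tuple ids
        self.embed = nn.Embedding(2 * num_skills, d_model)
        self.blocks = nn.ModuleList([AttentionBlock(d_model)
                                     for _ in range(n_blocks)])
        self.out = nn.Linear(d_model, num_skills)

    def forward(self, interactions):        # (batch, seq_len) integer ids
        x = self.embed(interactions)
        for block in self.blocks:
            x = block(x)
        return torch.sigmoid(self.out(x))   # per-skill correctness probabilities
```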

3 Experiments

AUC Results. We evaluate our model on four popular datasets, which are also used in [9]. We also select four popular methods for comparison: PFA [4], BKT [1], DKT [6], and DKT+ [9]. Table 1 displays the AUC results on all the datasets.

Table 1. AUC results and F1 scores for all tested datasets.

According to Table 1, our proposed model achieves excellent results on both evaluation metrics across the datasets, except for Simulated-5. For example, KTA exceeds DKT+ by more than 10% on ASSIST2015 in terms of AUC. The same holds for the F1 score, where our model achieves a notable improvement over the other models. Moreover, we notice that the performance of our model on the Simulated-5 dataset is less impressive. One reason is that this dataset contains no long sequences, so our model cannot exploit its advantage in capturing long sequences. Another reason is that every sequence contains the same number of questions and every question appears only once, so the dependencies among items are weaker than in the other datasets.

Prediction Visualization. To give a better sense of the effect of self-attention on the prediction results, we also provide a prediction visualization, as shown in Fig. 3. The figure displays how the prediction for a single skill, e.g., s33, changes with the number of answered questions. Concretely, the predictions of our model evolve more smoothly than those of DKT.

Fig. 3. Line plot of the skill 33 predictions of three models. The student interactions are extracted from ASSISTments 2009, and the probability of correctly answering skill 33 is predicted by the trained models.

4 Conclusion

In this paper, we applied an FSA to interpret DKT and, through this analysis, discovered that DKT cannot handle long sequence inputs. We therefore introduced a self-attention model, KTA, which directly captures global dependency relationships by computing the similarity between the items of the input, regardless of the length of the input sequence. The experimental results show that our proposed model provides better predictions than existing models.